Friday, November 6, 2009

Is Race Genetic? Part III - Witherspoon et al. Bring it Together

Previous parts: Part I, Part II

In a 2007 paper, Witherspoon et al. [1] addressed the problem of reconciling three facts:

1. Genetic variation between major human populations (races) accounts for only a small fraction of allele frequency variation.

2. Multilocus statistics are capable of accurately assigning most individuals to the correct population of origin.

3. Individuals from different populations can still be more similar than individuals from the same population.

It's possible that these facts emerged only because previous studies didn't look at the same populations. To determine whether the facts have merit or are just a product of the sample populations, Witherspoon et al. undertook a massive amount of data analysis. They used three different data sets, each focusing on a different kind of polymorphic locus, with the number of loci ranging from 175 to well into the thousands. The individuals were already divided into populations based on race. To measure the genetic similarities/dissimilarities between individuals, they calculated the "pairwise genetic distance." The pairwise genetic distance compares two individuals (hence the "pairwise") and determines the "genetic distance" between them based on their shared alleles. The more alleles two individuals share at a particular locus, the smaller their genetic distance at that locus. The overall pairwise genetic distance between two individuals is the average of their per-locus distances.
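To make the calculation concrete, here is a minimal Python sketch of a pairwise genetic distance. It rests on assumptions of mine rather than on the paper's exact formula: genotypes are diploid and stored as pairs of allele labels, and each locus is scored as 1 minus half the number of shared alleles. The function names and the two sample genotypes are made up for illustration.

```python
from collections import Counter

def locus_distance(genotype1, genotype2):
    """Distance at one locus: 0 if both alleles are shared, 1 if none are.

    Assumed scoring (not taken from the paper): 1 - shared_alleles / 2.
    """
    # Multiset intersection counts how many alleles the two genotypes share.
    shared = sum((Counter(genotype1) & Counter(genotype2)).values())
    return 1 - shared / 2

def pairwise_distance(individual1, individual2):
    """Average of the per-locus distances between two individuals."""
    per_locus = [locus_distance(a, b) for a, b in zip(individual1, individual2)]
    return sum(per_locus) / len(per_locus)

# Hypothetical individuals genotyped at three loci.
alice = [("A", "A"), ("C", "T"), ("G", "G")]
bob = [("A", "G"), ("C", "T"), ("T", "T")]

print(pairwise_distance(alice, bob))  # per-locus distances are 0.5, 0.0 and 1.0
```

The two individuals share one allele at the first locus, both at the second and none at the third, so their overall distance is the average of 0.5, 0 and 1.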

Witherspoon et al. were interested in three values*:

1. Dissimilarity Fraction (DF): The DF is a measurement of "the probability that a pair of individuals randomly chosen from different populations is genetically more similar than an independent pair chosen from any single population." To determine it, they paired up all the individuals in all possible combinations and determined the pairwise genetic distance for each combination. These pairwise genetic distances could be classified as within- or between-population distances depending on how the individuals had been divided into populations based on race. They then calculated the frequency with which the within-population distances are greater than the between-population distances. If the frequency is very low, then individuals are usually (if not always) more similar to members of their own population than to members of another population. If the frequency is high, the opposite is true.

2. Centroid Misclassification Rate (CMR): The CMR makes use of a modified pairwise genetic distance calculation, comparing each individual to the "centroid" of each population rather than to every other individual. The centroid is a pretend individual whose genetic makeup is the average of all the genetic information of the population. Essentially, it's like calculating the pairwise genetic distance to the average of all other individuals of the target population. Individuals get assigned to the population with the most similar centroid. The CMR is the proportion of individuals who get misclassified.

3. Population Trait Value Misclassification Rate (PMR): The PMR doesn't make use of the pairwise genetic distance calculation. Instead, it is based on the simplified model proposed by Edwards [2] which I discussed in the previous post. In order to divide individuals into separate populations, you take two populations (say, A and B) and look at any particular locus. The allele which is more frequent in Population A is given a value of 0 while the allele which is more frequent in Population B is given a value of 1. An individual gets assigned a value at that locus based on the average of the values of his/her alleles at that locus. To determine the "population trait" value of the individual, you take an average of all the values at all the loci. A cutoff point is calculated based on the average population trait values of all individuals in Population A and Population B. If the individual is above the cutoff point, he/she is assigned to Population B. If the individual is below the cutoff point, he/she is assigned to Population A. Since this method is limited in that it can only divide individuals into two populations, Witherspoon et al. only considered an individual assigned to a particular population if, in all combinations where populations were paired off and population traits calculated, the individual was always assigned to the same population. The PMR, then, is the rate at which individuals were either misclassified or not classified at all.
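The three measures above can be sketched in a few lines of Python. This is my own simplified reading of the descriptions, not the code or exact formulas from the paper. The assumptions: loci are biallelic and each individual is a list of allele counts (0, 1 or 2 copies of one arbitrarily chosen allele per locus); the per-locus distance is half the difference in counts; the centroid leaves out the individual being classified; the PMR cutoff is the midpoint between the two population means; and an individual sitting exactly on the cutoff counts as unclassified. All function names and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean

def distance(x, y):
    """Average per-locus distance between two allele-count vectors."""
    return mean(abs(a - b) / 2 for a, b in zip(x, y))

def centroid(individuals):
    """The 'pretend individual': mean allele count at each locus."""
    return [mean(locus) for locus in zip(*individuals)]

def dissimilarity_fraction(pops):
    labelled = [(name, ind) for name, inds in pops.items() for ind in inds]
    within, between = [], []
    for (p1, i1), (p2, i2) in combinations(labelled, 2):
        (within if p1 == p2 else between).append(distance(i1, i2))
    # Simplification: every within pair is compared with every between pair,
    # even when the two pairs share an individual.
    hits = sum(w > b for w in within for b in between)
    return hits / (len(within) * len(between))

def centroid_misclassification_rate(pops):
    wrong = total = 0
    for true_pop, inds in pops.items():
        for k, ind in enumerate(inds):
            best = None
            for name, members in pops.items():
                if name == true_pop:  # leave the individual out of its own centroid
                    members = members[:k] + members[k + 1:]
                d = distance(ind, centroid(members))
                if best is None or d < best[0]:  # ties go to the first population seen
                    best = (d, name)
            wrong += best[1] != true_pop
            total += 1
    return wrong / total

def population_trait_misclassification_rate(pops):
    # An individual stays "correct" only if every pairing involving its own
    # population assigns it to that population.
    correct = {(name, k): True for name, inds in pops.items() for k in range(len(inds))}
    for a, b in combinations(pops, 2):
        freq_a, freq_b = centroid(pops[a]), centroid(pops[b])
        # True where the counted allele is the one more frequent in b (value 1).
        b_allele = [fb > fa for fa, fb in zip(freq_a, freq_b)]

        def trait(ind):
            return mean(c / 2 if is_b else 1 - c / 2 for c, is_b in zip(ind, b_allele))

        cutoff = (mean(trait(i) for i in pops[a]) + mean(trait(i) for i in pops[b])) / 2
        for k, ind in enumerate(pops[a]):
            if trait(ind) >= cutoff:
                correct[(a, k)] = False
        for k, ind in enumerate(pops[b]):
            if trait(ind) <= cutoff:
                correct[(b, k)] = False
    return 1 - sum(correct.values()) / len(correct)

# Toy data: four loci, three individuals per population.
pops = {
    "A": [[0, 0, 0, 0], [0, 1, 0, 0], [2, 2, 2, 1]],  # third person resembles B
    "B": [[2, 2, 2, 2], [2, 1, 2, 2], [1, 2, 2, 2]],
}
print(dissimilarity_fraction(pops))
print(centroid_misclassification_rate(pops))
print(population_trait_misclassification_rate(pops))
```

In this toy data set, one member of Population A looks like a member of Population B. That single outlier is the one individual out of six that both classifiers get wrong, and it is also what makes some within-population distances larger than some between-population distances.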

As you may have noticed, PMR and CMR are both methods for determining how accurately individuals get assigned to populations, while DF is a measure of how distinct those populations are from each other. You would expect all three values to behave in the same way. That is, if DF goes down, so should CMR and PMR; if PMR goes up, so should CMR and DF, etc. So, what did Witherspoon et al. find?

First, they found that DF, CMR and PMR did follow the same trends. However, they discovered that DF decreases far more slowly than CMR and PMR. This is because neither CMR nor PMR compares individuals to individuals; instead, they compare individuals to aggregated information. CMR depends on a method which compares individuals to centroids rather than real people, while PMR depends on a method which relies on allele frequency comparisons between populations. A lot of variation gets smoothed out. DF, on the other hand, is based on a method which compares individuals directly and is affected by the high amount of variation within groups. In other words, when comparing individuals to each other, you pick up on a lot of their similarities as well as differences, and those can count for a lot. When comparing individuals to aggregated population information, only the trends matter.

Second, the number of loci used is important. The more loci used, the lower the values of DF, CMR and PMR. This is because more information gives you a better chance of accurately correlating data or finding differences between individuals. In certain circumstances (more on that below), if you get enough data from enough loci, DF, CMR and PMR approach 0%. In other words, with enough information, it is possible to always accurately assign someone to a population, and that individual will never be more similar to someone from a different population.

Third, the closeness of the populations matters. When populations are geographically distinct and there is little chance of breeding between populations, DF, CMR and PMR can become negligible, though in the case of DF you will need to have information from thousands of loci. When populations are more closely related to each other (due to less geographical distinctiveness and more historical breeding between populations), then DF, CMR and PMR never reach 0%, even when over 10,000 loci are used.
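The effect of the number of loci and of population closeness can be illustrated with a small simulation. This is a toy model of my own, not the paper's data or analysis: every locus is biallelic with the same made-up allele frequency in a given population, loci are independent, and individuals are classified with an Edwards-style trait value and a midpoint cutoff. One caveat is that this toy model has no admixture, so even very close populations would eventually separate with enough loci. It only shows how much more slowly the error rate falls when populations are similar.

```python
import random

def simulate_population(rng, n_individuals, n_loci, allele_freq):
    """Each individual is a list of allele counts (0, 1 or 2) per locus."""
    return [
        [(rng.random() < allele_freq) + (rng.random() < allele_freq)
         for _ in range(n_loci)]
        for _ in range(n_individuals)
    ]

def trait_misclassification_rate(pop_a, pop_b):
    """Assumes the counted allele is the one more frequent in pop_b."""
    def trait(ind):
        return sum(ind) / (2 * len(ind))

    traits_a = [trait(ind) for ind in pop_a]
    traits_b = [trait(ind) for ind in pop_b]
    # Cutoff: midpoint between the two population means (an assumption).
    cutoff = (sum(traits_a) / len(traits_a) + sum(traits_b) / len(traits_b)) / 2
    wrong = sum(t >= cutoff for t in traits_a) + sum(t <= cutoff for t in traits_b)
    return wrong / (len(pop_a) + len(pop_b))

def run(freq_a, freq_b, n_loci, n_individuals=100, seed=0):
    """Simulate two populations and return the misclassification rate."""
    rng = random.Random(seed)
    pop_a = simulate_population(rng, n_individuals, n_loci, freq_a)
    pop_b = simulate_population(rng, n_individuals, n_loci, freq_b)
    return trait_misclassification_rate(pop_a, pop_b)

# Hypothetical allele frequencies: "distinct" populations differ by 0.2,
# "close" populations by only 0.02.
for n_loci in (5, 50, 500):
    distinct = run(0.4, 0.6, n_loci)
    close = run(0.49, 0.51, n_loci)
    print(n_loci, distinct, close)
```

With the distinct pair, the error rate should drop quickly as loci are added. With the close pair, it should still be substantial at 500 loci.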

In summary, Witherspoon et al. show that it's not contradictory for individuals from different populations to sometimes be more similar than individuals from the same population while individuals can still be accurately assigned to populations. Within-population diversity can be exceptionally high, and distinct populations really do exist.

So, does that mean that race really is genetic? Nope. Or, I should say, not exactly.

Tune in for Part IV - Teasing Apart Race, Populations and Genetics

* These aren't the symbols/abbreviations that are used in the paper, but I think the ones that I'm using here are a little easier to follow.

1. Witherspoon, DJ, Wooding, S, Rogers, AR, Marchani, EE, Watkins, WS, Batzer MA, Jorde, LB. 2007. Genetic similarities within and between human populations. Genetics 176: 351-359.

2. Edwards, AW. 2003. Human genetic diversity: Lewontin's fallacy. BioEssays 25: 798-801.


  1. Hi, are you going to continue the series?

  2. Thanks for asking! Life has kicked me in the teeth, but I've just finished banging my head against the last post of the series. :)

    You can find it here:

    Thanks for reading!