Genetics Made Complicated: November 2009

Wednesday, November 25, 2009

Is Race Genetic? Part IV: Teasing Apart Race, Populations and Genetics

Many apologies for the lateness of this post.

Previous parts: Part I, Part II, Part III

You may have noticed in previous parts of this series that I use the term "population" more often than "race." This is deliberate: if you want to determine if race is genetic, you can't assume that the genetic populations into which you divide individuals are synonymous with race. You have to first see if it's possible to divide individuals based on genetic characteristics and then determine if those populations correspond to what we call races.

In Part I, I said that a population "is a distinct group of interbreeding people, often associated with geography. For instance, a population might be the group of people living in a single town." This is correct but not the full story. What the individuals of a population actually share is common ancestry. This isn't to say that we can pinpoint precisely the identities of these ancestors or that the individuals of a population are related in an appreciable fashion (by which I mean in a context that matters for social or cultural reasons). Nor does it mean that the humans of the world can be separated into more than one population. The reasoning behind the belief that it's possible to separate people into genetic populations (for lack of a better term) is that geographic isolation and the difficulty of travel made it more likely that certain groups of people would breed amongst themselves, leading to allele frequencies being skewed in some direction or another. Natural selection might also work to skew allele frequencies. The question being answered by Witherspoon et al.'s study (not to mention many other studies) is whether or not there has been enough non-random breeding that certain populations have common ancestors that make them distinct from other populations.

With that in mind, let's break down the results that Witherspoon et al. got.

1. It's possible to separate groups of people into populations.

This is the obvious one. If you look at enough loci, it is possible to separate individuals into distinct populations. Furthermore, it is possible to separate individuals so that most individuals belong to only one population.

2. Some individuals belonged to more than one population.

Witherspoon et al. talked about individuals who had been misclassified because the methods for separating individuals placed them in a different population than what was expected. You may recall that the "expected" population was tied to the individual's race. If you look at the data, what you find is that the individual wasn't classified into the wrong population so much as the individual was a fringe case who could have potentially belonged to more than one population. The sorting method arbitrarily put the individual into one population or the other because the sorting method didn't allow for the possibility that the individual could belong to more than one population. If populations are caused by common ancestry, however, then an individual who gets sorted into more than one population is simply an individual who shares common ancestors with individuals from more than one population. When you consider that there were almost always individuals who could not be sorted into a single population in groups of people living very close to each other (where, presumably, breeding between groups could occur), this should come as no surprise.

3. The study looked at hundreds of loci, many of which have no obvious effect.

In order to make the "misclassification" rate as low as possible, Witherspoon et al. often looked at hundreds of different loci. Many of those loci (I even venture to say most, though I can't confirm that) have no physical effect whatsoever. I don't want to get into excessive detail about basic biology, but it relies on the idea that any difference in the DNA is heritable, whether it has a physical effect or not. Changing a single base in the DNA sequence may have no effect whatsoever (because of the redundant nature of the DNA code, or because it's in an area which doesn't code for anything) but it's still a change that can be called a different allele. It's kind of like the difference between "colour" and "color." The spelling is slightly different, but they mean the same thing.
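To make the "colour"/"color" comparison concrete, here is a tiny sketch in Python. The two sequences are made up for illustration, and the codon table is a small excerpt of the standard genetic code (GAA and GAG both code for glutamate), not a full translation table.

```python
# Excerpt of the standard genetic code, written as DNA codons.
CODONS = {"ATG": "Met", "GAA": "Glu", "GAG": "Glu", "GAT": "Asp", "GAC": "Asp"}

def translate(dna):
    """Read the sequence three bases at a time and look up each codon."""
    return [CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3)]

allele_1 = "ATGGAAGAT"  # hypothetical sequence
allele_2 = "ATGGAGGAT"  # same sequence with a single base changed (A -> G)

print(allele_1 != allele_2)                        # True: two different alleles
print(translate(allele_1) == translate(allele_2))  # True: same protein product
```

The two alleles can be told apart by sequencing, and so can be used to track ancestry, even though they spell the same thing.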

4. Populations are not divided by the absence or presence of a few traits but, rather, by frequencies of alleles.

This is tied closely to 3. If it took simply the presence or absence of a few traits/alleles to differentiate populations, Witherspoon et al. would not have had to look at the frequency of hundreds of alleles in order to separate populations which were very far away from each other. This means that it is nigh on impossible to find an allele which is only ever found in one population and never found in another population. It may be true that 99% of the individuals of Population A will have allele X while only 1% of Population B will have allele X, but allele X still persists in Population B and its existence isn't sufficient to sort an individual into one population or another. This also ties to the idea that sometimes you can find individuals between populations who are more similar than individuals within a population.

So, now we know lots about populations. The next question is whether those populations correspond to races. Since Witherspoon et al. could classify individuals into the expected populations defined by race, it would be fair to say that there is a correlation between race and genetic population. It would also be fair, however, to say that the correlation is not perfect. If we're just talking about genetic populations, then "misclassification" isn't an issue. It was, however, in Witherspoon et al.'s study, since the individuals were identified as part of a distinct race, and some could be classified into the wrong race. It could, of course, be argued that these individuals were "mixed-race" and were not misclassified at all - they were simply misidentified from the start. That would make the correlation between populations and races very strong, wouldn't it?

Of course, if genetic populations and races are the same thing, then it would mean that all the things we learned about populations apply to races. Races would be defined by hundreds of differences which have no discernible effect whatsoever on an individual's actual physical characteristics; the differences would be based on shared ancestry. The presence or absence of a few traits would be insufficient to classify an individual into a particular race, and no trait "belongs" to one race or another - they would be present in all races. Individuals between races could be more similar than individuals within the same race.

That actually doesn't sound too bad to me. The problem, of course, is that the question "Is race genetic?" isn't actually about finding out if people can genetically be split into populations defined by allele frequencies. It's about whether or not we can find a biological basis for concretely separating people into different groups which fit with the ways we've already decided to use to separate people. We want to be able to look at a person and say, "He belongs to race A because he has blue skin. She belongs to race B because she has orange skin." Instead, the data shows us that all we can do is say, "Well, he probably belongs to race A because there are more people with blue skin there. And she probably belongs to race B because there are more people with orange skin there. And, actually, they could both belong to race C. Or maybe skin colour has nothing to do with it. Let me check a hundred more loci...."

Race may correlate with genetic population, but a genetic population is not a race. Race transcends the biological, involving culture, language, upbringing, place of origin, and a host of other factors. I'd even say that these factors are more important than the biological, mainly because the biological says that we're really not all that different.

Reference
1. Witherspoon, DJ, Wooding, S, Rogers, AR, Marchani, EE, Watkins, WS, Batzer MA, Jorde, LB. 2007. Genetic similarities within and between human populations. Genetics 176: 351-359.

Friday, November 6, 2009

Is Race Genetic? Part III - Witherspoon et al. Bring it Together

Previous parts: Part I, Part II

In a 2007 paper, Witherspoon et al.¹ addressed the problem of reconciling three facts:

1. Genetic variation between major human populations (races) accounts for only a small fraction of allele frequency variation.

2. Multilocus statistics are capable of accurately assigning most individuals to the correct population of origin.

3. Individuals from different populations can still be more similar than individuals from the same population.

It's possible that these facts emerged only because previous studies didn't look at the same populations. To determine if there's merit to these facts or if they are a product of the sample populations, Witherspoon et al. undertook a massive amount of data analysis. They used three different data sets, each focusing on a different kind of polymorphic locus, with loci numbers ranging from 175 to well into the thousands. The individuals were already divided into populations based on race. To measure the genetic similarities/dissimilarities between individuals, they calculated the "pairwise genetic distance." The pairwise genetic distance compares two individuals (hence the "pairwise") and determines the "genetic distance" between them based on their shared alleles. If, at a particular locus, two individuals share more alleles, they will have a smaller genetic distance. The overall pairwise genetic distance between two individuals is the average of their per-locus distances.
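For the programmers in the audience, here's a minimal Python sketch of that calculation. The per-locus scoring (one minus the fraction of alleles shared) is my assumption for illustration and is not necessarily the exact measure used in the paper. The function names and the two individuals are made up.

```python
from collections import Counter

def locus_distance(g1, g2):
    """Distance at one locus: 1 minus the fraction of alleles that two
    diploid genotypes share (assumed scoring, for illustration only)."""
    shared = sum((Counter(g1) & Counter(g2)).values())  # 0, 1 or 2 alleles in common
    return 1 - shared / 2

def pairwise_distance(ind1, ind2):
    """Overall distance: the average of the per-locus distances."""
    per_locus = [locus_distance(a, b) for a, b in zip(ind1, ind2)]
    return sum(per_locus) / len(per_locus)

# Two hypothetical individuals typed at three loci.
alice = [("A", "A"), ("C", "T"), ("G", "G")]
bob = [("A", "A"), ("C", "C"), ("T", "T")]

print(pairwise_distance(alice, bob))  # (0 + 0.5 + 1) / 3 = 0.5
```

Sharing both alleles at a locus contributes 0, sharing one contributes 0.5, and sharing none contributes 1, so more sharing means a smaller distance.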

Witherspoon et al. were interested in three values*:

1. Dissimilarity Fraction (DF): The DF is a measurement of "the probability that a pair of individuals randomly chosen from different populations is genetically more similar than an independent pair chosen from any single population." To determine it, they paired up all the individuals in all possible combinations and determined the pairwise genetic distance for each combination. These pairwise genetic distances could be classified as within- or between-population distances depending on how the individuals had been divided into populations based on race. They then calculated the frequency with which a within-population distance is greater than a between-population distance. If the frequency is very low, then it would mean that individuals are usually (if not always) more similar to members of their own population than to members of another population. If the frequency is high, the opposite is true.

2. Centroid Misclassification Rate (CMR): The CMR makes use of a modified pairwise genetic distance calculation, comparing each individual to the "centroid" of each population rather than to every other individual. The centroid is a pretend individual whose genetic makeup is the average of all the genetic information of the population. Essentially, it's like calculating the pairwise genetic distance to the average of all other individuals of the target population. Individuals get assigned to the population with the most similar centroid. The CMR is the proportion of individuals who get misclassified.

3. Population Trait Value Misclassification Rate (PMR): The PMR doesn't make use of the pairwise genetic distance calculation. Instead, it is based on the simplified model proposed by Edwards² which I discussed in the previous post. In order to divide individuals into separate populations, you take two populations (say, A and B) and look at any particular locus. The allele which is more frequent in Population A is given a value of 0 while the allele which is more frequent in Population B is given a value of 1. An individual gets assigned a value at that locus based on the average of the values of his/her alleles at that locus. To determine the "population trait" value of the individual, you take an average of all the values at all the loci. A cutoff point is calculated based on the average population trait values of all individuals in Population A and Population B. If the individual is above the cutoff point, he/she is assigned to Population B. If the individual is below the cutoff point, he/she is assigned to Population A. Since this method is limited in that it can only divide individuals into two populations, Witherspoon et al. only considered an individual assigned to a particular population if, in all combinations where populations were paired off and population traits calculated, the individual was always assigned to the same population. The PMR, then, is the rate at which individuals were either misclassified or not classified at all.
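The first two values can be sketched in a few lines of Python. This is a toy under stated assumptions: each individual is a list of allele counts (0, 1 or 2) per locus, the distance is the mean absolute difference in those counts, and the data are invented. None of the names below come from the paper.

```python
from itertools import combinations

def dist(a, b):
    """Mean absolute difference in allele counts across loci (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def dissimilarity_fraction(pops):
    """Fraction of (within-pair, between-pair) comparisons in which the
    within-population pair is the more distant of the two."""
    labelled = [(name, ind) for name, inds in pops.items() for ind in inds]
    within, between = [], []
    for (p1, a), (p2, b) in combinations(labelled, 2):
        (within if p1 == p2 else between).append(dist(a, b))
    hits = sum(w > b for w in within for b in between)
    return hits / (len(within) * len(between))

def centroid(inds):
    """The 'pretend individual': the per-locus average of a group."""
    return [sum(col) / len(inds) for col in zip(*inds)]

def centroid_misclassification_rate(pops):
    """Assign each individual to the nearest centroid, leaving that individual
    out of its own population's centroid; return the share assigned wrongly."""
    wrong = total = 0
    for name, inds in pops.items():
        for i, ind in enumerate(inds):
            centroids = {
                other: centroid(members if other != name
                                else members[:i] + members[i + 1:])
                for other, members in pops.items()
            }
            guess = min(centroids, key=lambda c: dist(ind, centroids[c]))
            wrong += guess != name
            total += 1
    return wrong / total

# Invented data: the second member of B is a "fringe" individual.
pops = {
    "A": [[0, 0, 0, 0], [0, 1, 0, 1]],
    "B": [[2, 2, 2, 2], [1, 1, 0, 1]],
}
print(dissimilarity_fraction(pops))           # 0.375
print(centroid_misclassification_rate(pops))  # 0.25
```

In this toy data set the fringe individual sits closer to A's centroid than to the rest of B, so one of the four individuals is misclassified, and three of the eight within/between comparisons go the "wrong" way.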

As you may have noticed, PMR and CMR are both methods for determining how accurately individuals get assigned to populations while DF is a measure of how distinctive those populations are from each other. You would expect that all three values would act in the same way. That is, if DF goes down, so should CMR and PMR; if PMR goes up, so should CMR and DF, etc. So, what did Witherspoon et al. find?
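The population trait procedure behind the PMR can be sketched the same way, restricted to the two-population case. I'm assuming the cutoff is the midpoint between the two populations' average trait values, and I count anyone landing exactly on the cutoff as unclassified. The function names and genotypes are invented, and the extra step of pairing off more than two populations is left out.

```python
def allele_freq(inds, locus, allele):
    """Frequency of one allele at one locus within a group of individuals."""
    alleles = [a for ind in inds for a in ind[locus]]
    return alleles.count(allele) / len(alleles)

def trait_scores(pop_a, pop_b):
    """Score everyone from 0 (A-like) to 1 (B-like)."""
    n_loci = len(pop_a[0])
    b_alleles = []  # per locus, the allele that is more frequent in B (scores 1)
    for locus in range(n_loci):
        seen = sorted({a for ind in pop_a + pop_b for a in ind[locus]})
        b_alleles.append(max(
            seen,
            key=lambda al: allele_freq(pop_b, locus, al) - allele_freq(pop_a, locus, al),
        ))

    def score(ind):
        # Average the 0/1 values of the two alleles at each locus, then across loci.
        per_locus = [sum(a == b_alleles[i] for a in geno) / 2
                     for i, geno in enumerate(ind)]
        return sum(per_locus) / n_loci

    return [score(ind) for ind in pop_a], [score(ind) for ind in pop_b]

def pairwise_pmr(pop_a, pop_b):
    """Share of individuals on the wrong side of (or exactly on) the cutoff."""
    scores_a, scores_b = trait_scores(pop_a, pop_b)
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    cutoff = (mean_a + mean_b) / 2  # assumed: midpoint of the two averages
    wrong = sum(s >= cutoff for s in scores_a) + sum(s <= cutoff for s in scores_b)
    return wrong / (len(scores_a) + len(scores_b))

# Invented genotypes at two loci; the last member of each population is identical.
pop_a = [[("x", "x"), ("p", "p")], [("x", "x"), ("p", "q")], [("x", "y"), ("q", "q")]]
pop_b = [[("y", "y"), ("q", "q")], [("y", "y"), ("p", "q")], [("x", "y"), ("q", "q")]]
print(round(pairwise_pmr(pop_a, pop_b), 3))  # 0.167 (one of six individuals)
```

The third member of Population A has exactly the same genotype as the third member of Population B, so the method has no choice but to put both on the same side of the cutoff and get one of them wrong.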

First, they found that DF, CMR and PMR did follow the same trends. However, they discovered that DF decreases far more slowly than CMR and PMR. This is because both CMR and PMR don't compare individuals to individuals; instead, they compare individuals to aggregated information. CMR depends on a method which compares individuals to centroids rather than real people while PMR depends on a method which relies on allele frequency comparisons between populations. A lot of variation gets smoothed out. DF, on the other hand, is based on a method which compares individuals directly and is affected by the high amount of variation within groups. In other words, when comparing individuals to each other, you pick up on a lot of their similarities as well as differences, and those can count for a lot. When comparing individuals to aggregated population information, only the trends matter.

Second, the number of loci used is important. The more loci used, the lower the values of DF, CMR and PMR. This is because as you get more information, you get a better chance of accurately correlating data or finding differences between individuals. In certain circumstances (more on that below), if you get enough data from enough loci, DF, CMR and PMR approach 0%. In other words, with enough information, it is possible to always accurately assign someone to a population and that individual will never be more similar to someone from a different population.

Third, the closeness of the populations matters. When populations are geographically distinct and there is little chance of breeding between populations, DF, CMR and PMR can become negligible, though in the case of DF you will need to have information from thousands of loci. When populations are more closely related to each other (due to less geographical distinctiveness and more historical breeding between populations), then DF, CMR and PMR never reach 0%, even when over 10,000 loci are used.
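These three findings can be poked at with a small simulation. This is a sketch under heavy assumptions, not a reproduction of the paper's analysis: two populations of 20 haploid individuals, every locus independent, one allele at frequency `freq_a` in the first population and `freq_b` in the second, and a simple mismatch-count distance. All names and numbers are mine.

```python
import random
from itertools import combinations

def simulate(n_loci, freq_a, freq_b, n_per_pop=20, seed=0):
    """Return (DF, CMR) for two simulated haploid populations."""
    rng = random.Random(seed)

    def draw(freq):
        return [[int(rng.random() < freq) for _ in range(n_loci)]
                for _ in range(n_per_pop)]

    pops = {"A": draw(freq_a), "B": draw(freq_b)}

    def dist(x, y):
        return sum(abs(p - q) for p, q in zip(x, y)) / n_loci

    # DF: how often a within-population pair is farther apart than a between pair.
    labelled = [(k, ind) for k, inds in pops.items() for ind in inds]
    within, between = [], []
    for (k1, x), (k2, y) in combinations(labelled, 2):
        (within if k1 == k2 else between).append(dist(x, y))
    df = sum(w > b for w in within for b in between) / (len(within) * len(between))

    # CMR: nearest centroid, leaving each individual out of its own centroid.
    def centroid(inds):
        return [sum(col) / len(inds) for col in zip(*inds)]

    wrong = 0
    for k, inds in pops.items():
        for i, ind in enumerate(inds):
            cents = {j: centroid(m if j != k else m[:i] + m[i + 1:])
                     for j, m in pops.items()}
            wrong += min(cents, key=lambda j: dist(ind, cents[j])) != k
    return df, wrong / (2 * n_per_pop)

for n_loci in (5, 50, 500):
    df, cmr = simulate(n_loci, 0.4, 0.6)
    print(n_loci, "loci: DF =", round(df, 3), "CMR =", round(cmr, 3))
```

Moving `freq_a` and `freq_b` closer together stands in for more closely related populations, and moving them apart stands in for more distinct ones. The exact numbers depend on the seed, so treat any single run as illustrative only.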

In summary, Witherspoon et al. show that it's not contradictory for individuals between populations to be more similar than individuals within populations while still having accurate population assignment of individuals. Within-population diversity can be exceptionally high, and distinct populations really do exist.

So, does that mean that race really is genetic? Nope. Or, I should say, not exactly.

Tune in for Part IV - Teasing Apart Race, Populations and Genetics

* These aren't the symbols/abbreviations that are used in the paper, but I think the ones that I'm using here are a little easier to follow.

1. Witherspoon, DJ, Wooding, S, Rogers, AR, Marchani, EE, Watkins, WS, Batzer MA, Jorde, LB. 2007. Genetic similarities within and between human populations. Genetics 176: 351-359.

2. Edwards, AW. 2003. Human genetic diversity: Lewontin's fallacy. BioEssays 25: 798-801.

Wednesday, November 4, 2009

Is Race Genetic? Part II - Lewontin's Fallacy

Part I here

First, a quick reminder of the last post: Lewontin's statistical analysis involved looking at the genetic diversity of 17 different loci and he concluded that most of the genetic diversity was due to individuals being individuals, not because of any racial grouping. Therefore, racial groupings are obsolete.

In the 2003 paper Human genetic diversity: Lewontin's fallacy¹, A. W. F. Edwards wrote:

It is not true that "racial classification is... of virtually no genetic or taxonomic significance". It is not true, as Nature claimed, that "two random individuals from any one group are almost as different as any two random individuals from the entire world", and it is not true, as the New Scientist claimed, that "two individuals are different because they are individuals, not because they belong to different races" and that "you can't predict someone's race by their genes".

Them's fightin' words! Let's see what guns he brings to the duel.

Edwards wrote, "There is nothing wrong with Lewontin's statistical analysis of variation, only with the belief that it is relevant to classification." In other words, he doesn't dispute Lewontin's findings but rather how they should be interpreted. Edwards claims that, while diversity can be important, it's the correlations between the genes that differentiate races, not the diversity. If you correlate multiple genes, it becomes possible to reliably determine to which population a person belongs, and those populations can correlate with what we call race. In other words, you really could look at someone's genes and figure out their race.

Let's make up a simple example. In Population A, half the people have blue hair while the other half have green.* In Population B, this is true as well. This means that if you've got someone standing in front of you with green hair, you can't tell to which population he belongs. Now, let's add in eye colour (red and purple). Half the people of Population A have red eyes while the other half have purple eyes. This also holds true for Population B. To which population does a red-eyed, green-haired person belong? Based on Lewontin's analysis, you wouldn't be able to tell. Edwards points out that you can look at how eye colour and hair colour correlate with each other. If red eyes and green hair tended to correlate together in Population A while purple eyes and green hair correlate in Population B, then you could guess that the red-eyed, green-haired person belongs to Population A.
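Here's that example with numbers attached, as a small Python sketch. The head counts are invented: 100 people per population, with each hair colour and each eye colour split 50/50 in both, but paired up differently. The guess assumes you're equally likely to meet someone from either population.

```python
# Number of people with each (hair, eyes) combination, out of 100 per population.
pop_a = {("green", "red"): 40, ("green", "purple"): 10,
         ("blue", "red"): 10, ("blue", "purple"): 40}
pop_b = {("green", "red"): 10, ("green", "purple"): 40,
         ("blue", "red"): 40, ("blue", "purple"): 10}

def count_with(pop, position, value):
    """How many people have a given hair colour (position 0) or eye colour (1)."""
    return sum(n for traits, n in pop.items() if traits[position] == value)

def prob_from_a(traits):
    """Chance that a person with these traits is from A (equal-sized populations)."""
    return pop_a[traits] / (pop_a[traits] + pop_b[traits])

# One trait at a time tells you nothing: both populations are split 50/50.
print(count_with(pop_a, 0, "green"), count_with(pop_b, 0, "green"))  # 50 50
print(count_with(pop_a, 1, "red"), count_with(pop_b, 1, "red"))      # 50 50

# The combination is informative.
print(prob_from_a(("green", "red")))  # 0.8
```

Looking at hair or eyes alone gives a coin flip, while looking at how they go together gives an 80% guess in favour of Population A.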

Statistical methods allow you to factor in allele frequencies as well as how alleles correlate with each other. As you add more genes with more alleles into the mix, it becomes increasingly likely that you'll be able to accurately predict to which population a person belongs. Edwards calculates that, using data which simulates Lewontin's level of within- and between-population diversity and assuming only two different alleles at each locus, you need to correlate about 20 traits before the chance of misclassifying someone into the wrong population becomes negligible (below 1%). Studies using actual data from individuals have shown "genetic affinities that have unsurprising geographic, linguistic and cultural parallels." In other words, researchers could look at a pool of individuals and, by using statistics, cluster the individuals into populations based on various polymorphic loci, and those clusters could correspond to what we call "race."
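To get a feel for why piling up loci helps, here is a simplified calculation in Python. It is not Edwards's own calculation: I'm assuming one allele per person per locus, independent loci, an allele frequency of 0.3 in one population and 0.7 in the other, and assignment by simple majority vote. Because of those assumptions, the numbers shouldn't be expected to match the 20-trait figure above; only the downward trend is the point.

```python
from math import comb

def misclassification_rate(n_loci, p=0.3):
    """Chance of a wrong assignment when each locus independently shows the
    other population's typical allele with probability p, and the person is
    assigned by majority vote across loci (ties are split evenly)."""
    error = 0.0
    for k in range(n_loci + 1):  # k = number of loci "voting" for the wrong population
        prob = comb(n_loci, k) * p**k * (1 - p)**(n_loci - k)
        if 2 * k > n_loci:
            error += prob
        elif 2 * k == n_loci:
            error += prob / 2
    return error

for n in (1, 5, 20, 100):
    print(n, "loci:", round(misclassification_rate(n), 4))
```

With a single locus the error rate is just p, which is no better than the single-trait guess in the hair colour example. It shrinks steadily as loci are added, even though no individual locus becomes any more informative.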

So, despite the high amount of within-population diversity, it looks like there really is a genetic basis for race. But hold on! In 2004, Bamshad et al.² published a paper using multilocus statistics (that is, the kind of correlation statistics that Edwards advocates) and nearly 400 polymorphic loci that showed that two individuals from different populations could often be more similar than two individuals from the same population. What gives?

Check out Part III - Witherspoon et al. Bring it Together

*I'm picking physical characteristics for this example because they're easily visualized, not because they make good loci.

1. Edwards, AW. 2003. Human genetic diversity: Lewontin's fallacy. BioEssays 25: 798-801.

2. Bamshad, M, Wooding, S, Salisbury, BA and Stephens, JC. 2004. Deconstructing the relationship between genetics and race. Nature Reviews Genetics 5: 598-609.

Tuesday, November 3, 2009

Is Race Genetic? Part I - Lewontin 1972

The idea that there is more diversity within races than between races has been floating around for a few decades. Its origin is in a 1972 paper published by Richard Lewontin called "The Apportionment of Human Diversity"¹. Let's have a go at what he actually did.

First, he divided people into a hierarchical scheme:

population--> race--> species (Homo sapiens)

A population is a distinct group of interbreeding people, often associated with geography. For instance, a population might be the group of people living in a single town. A race would therefore be a larger group of people to which several populations belong. For his paper, Lewontin classified populations into 7 races: Caucasian, African, Mongoloid, S. Asian Aborigines, Amerinds, Oceanians, and Australian Aborigines. The final level was Homo sapiens itself. The hierarchical setup is necessary to make the "within race/between race" distinction. Arguably, individuals within a population will be more genetically similar to each other than to another population. This makes sense since they are living in the same environment, breeding with each other, etc. If race is an important factor, then populations will be more genetically similar to other populations within the same racial group than to populations in a different racial group.

Looking at it from another direction, let's say I have Population A in Race X. It has a genetic diversity of 1 (this is a made-up number that doesn't mean anything real; assume that the lower the number, the lower the genetic diversity). Population B of Race X has a genetic diversity of 1 as well. When I mix the data from Population A and Population B together, I get a genetic diversity of 4. This makes sense if Population A and Population B are genetically similar within the populations but there are differences between the populations. If racial groupings can be determined using genetics, then I would expect that if Population A was from Race X but Population B was from Race Y, then the genetic diversity would increase a lot (to, say, 87).

Clearly, then, everything hinges on what we mean by genetic diversity. In Lewontin's paper, he looked at 17 different polymorphic loci. Essentially, this means that he looked at different regions of DNA which can show differences between people. Note that I didn't say "genes." A polymorphic locus does not need to be connected to any gene (though it can be). It simply must be a stretch of DNA which is sometimes different in different people. Each different version of the locus is called an allele. Looking at many different loci makes it more likely that you'll be able to find real differences or similarities, rather than an anomaly that might exist at a single locus. To determine diversity, he looked at the frequency of the alleles at each locus. The more alleles there are, the more diverse it is (because there are more versions of the locus); the more equal the frequencies of the alleles, the more diverse the population (because if one allele was at a very high frequency, then more people would be the same, leading to less diversity).
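One measure that behaves this way is the Shannon information measure, which, as far as I know, is the kind of measure Lewontin used. Treat the exact formula as an assumption of this Python sketch; the allele frequencies are made up.

```python
from math import log2

def diversity(freqs):
    """Shannon diversity (in bits) of a locus, given its allele frequencies."""
    return sum(-p * log2(p) for p in freqs if p > 0)

print(diversity([1.0]))                     # one allele only: 0.0
print(round(diversity([0.9, 0.1]), 3))      # two alleles, very unequal: 0.469
print(diversity([0.5, 0.5]))                # two alleles, equal: 1.0
print(diversity([0.25, 0.25, 0.25, 0.25]))  # four alleles, equal: 2.0
```

More alleles push the number up, and so do more even frequencies, which is exactly the pair of properties described above.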

After some number crunching, Lewontin found:

The mean proportion of the total species diversity that is contained within populations is 85.4%.... Less than 15% of all human genetic diversity is accounted for by differences between human groups! Moreover, the difference between populations within a race accounts for an additional 8.3%, so that only 6.3% is accounted for by racial classification.

What does that mean? When you start off with a single population, there will be certain allele frequencies for each gene, and there will be a certain number of alleles in the population. When you pool another population with the first, you would expect a change to the allele frequencies and the number of alleles unless the populations are very similar. As you add more populations, you would expect more diversity. Once you pooled all of the populations together, that should be the maximum diversity. So, when Lewontin found that 85.4% of total species diversity was found within a population, it means that when you look at a single population, 85.4% of the maximum diversity was already there. Add in more populations within the same race, and that accounted for another 8.3%. If you pooled all the races together, that accounted for the last 6.3%. Lewontin goes on to conclude:
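The pooling logic can be sketched in Python for a single locus and two equally sized populations. The frequencies are invented, the diversity formula is the Shannon-style measure (an assumption on my part), and Lewontin's actual analysis involved many populations, three levels and 17 loci, so this only illustrates the idea.

```python
from math import log2

def diversity(freqs):
    """Shannon diversity (in bits) of a locus, given its allele frequencies."""
    return sum(-p * log2(p) for p in freqs if p > 0)

def within_share(pop_freqs):
    """Share of the pooled diversity that is already present, on average,
    inside a single population (equal population sizes assumed)."""
    n = len(pop_freqs)
    pooled = [sum(col) / n for col in zip(*pop_freqs)]
    mean_within = sum(diversity(f) for f in pop_freqs) / n
    return mean_within / diversity(pooled)

similar = [[0.6, 0.4], [0.4, 0.6]]       # allele frequencies differ a little
distinct = [[0.95, 0.05], [0.05, 0.95]]  # allele frequencies differ a lot

print(round(within_share(similar), 3))   # about 0.971
print(round(within_share(distinct), 3))  # about 0.286
```

When the populations are similar, almost all of the pooled diversity is already there inside each one. When they are very different, pooling adds most of it. Lewontin's 85.4% is the real-data counterpart of the first case.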

It is clear that our perception of relatively large differences between human races and subgroups, as compared to the variation within these groups, is indeed a biased perception and that, based on randomly chosen genetic differences, human races and populations are remarkably similar to each other, with the largest part by far of human variation being accounted for by the differences between individuals. Human racial classification is of no social value and is positively destructive of social and human relations. Since such racial classification is now seen to be of virtually no genetic or taxonomic significance either, no justification can be offered for its continuance.

Though this seems cut and dried, it's not the entire picture. Check out Part II - Lewontin's Fallacy.

1. Lewontin RC. The apportionment of human diversity. In: Dobzhansky T, Hecht MK, Steere WC, editors. Evolutionary Biology 6. New York: Appleton-Century-Crofts. 1972. p 381–398.

Monday, November 2, 2009

Is Race Genetic? Part 0 - Preview

Over the next week or so, I'll be making posts about the genetic basis of race. My original intention was to make a post about the science and then go into the implications, but the subject turned out to have far more meat than I expected. The more you know....

As the posts come out, I will edit this post with links to each post so this one can act as a sort of index.

Parts:
Part I - Lewontin 1972
Part II - Lewontin's Fallacy
Part III - Witherspoon et al. Bring it Together
Part IV - Teasing Apart Race, Populations and Genetics