Friday, September 5, 2014

Worldwide Population Y-DNA Collated (Xu et al.) [Review]

Approximately one week has passed since a new paper by Xu et al. was indexed by PubMed and made available online ahead of printing:

"The Y chromosome is one of the best genetic materials to explore the evolutionary history of human populations. Global analyses of Y chromosomal short tandem repeats (STRs) data can reveal very interesting world population structures and histories. However, previous Y-STR works tended to focus on small geographical ranges or only included limited sample sizes. In this study, we have investigated population structure and demographic history using 17 Y chromosomal STRs data of 979 males from 44 worldwide populations. The largest genetic distances have been observed between pairs of African and non-African populations. American populations with the lowest genetic diversities also showed large genetic distances and coancestry coefficients with other populations, whereas Eurasian populations displayed close genetic affinities. African populations tend to have the oldest time to the most recent common ancestors (TMRCAs), the largest effective population sizes and the earliest expansion times, whereas the American, Siberian, Melanesian, and isolated Atayal populations have the most recent TMRCAs and expansion times, and the smallest effective population sizes. This clear geographic pattern is well consistent with serial founder model for the origin of populations outside Africa. The Y-STR dataset presented here provides the most detailed view of worldwide population structure and human male demographic history, and additionally will be of great benefit to future forensic applications and population genetic studies."

This paper showcases a staggering 979 distinct 17-STR Y-DNA haplotypes across 44 populations from around the world. These haplotypes are soon to be uploaded to the Y-STR Haplotype Resource Database (YHRD). The authors have made all the haplotypes, together with a slew of additional information (raw haplotypes, Y-DNA haplogroup predictions), publicly available independently of the official article.

In this entry, the collated results of all populations are reviewed, together with some cursory inferences intended to aid their interpretation.

All 979 haplotypes were retrieved through the above link. Each population dataset was run through Vadim Urasin's YPredictor (v1.5.0). A 70% prediction strength threshold was applied. All nomenclature was reduced to the haplogroup level to avoid confusion for future readers should it change in time. These haplotypes formed the collated population results.

877 haplotype predictions met the 70% threshold. Although I have not had access to the original study, it is apparent that the authors also used Urasin's YPredictor, given the identical predictions.
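The filtering and collation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, their prediction strengths and the function names are hypothetical, YPredictor's actual output format is not reproduced, and the nomenclature reduction is not modelled (labels are counted as given).

```python
from collections import Counter, defaultdict

THRESHOLD = 70.0  # prediction strength cut-off (percent) used in this entry

# Hypothetical records: (population, predicted label, prediction strength in %).
# These values are invented for illustration and are not taken from the dataset.
predictions = [
    ("Irish", "R-M269", 98.2),
    ("Irish", "R-M269", 91.0),
    ("Irish", "H-M82", 74.5),
    ("Irish", "I-M223", 55.3),  # below the threshold, so it is dropped
    ("Yakut", "N-Tat", 88.0),
    ("Yakut", "I-P37.2", 71.2),
]


def collate(records, threshold=THRESHOLD):
    """Count predicted labels per population, keeping only confident predictions."""
    counts = defaultdict(Counter)
    for population, label, strength in records:
        if strength >= threshold:
            counts[population][label] += 1
    return counts


def frequencies(counter):
    """Convert raw counts into proportions of the retained haplotypes."""
    total = sum(counter.values())
    return {label: n / total for label, n in counter.items()}


collated = collate(predictions)
irish = frequencies(collated["Irish"])

# Share of all haplotypes that passed the threshold in this entry: 877 of 979.
retained_share = round(877 / 979 * 100, 1)
```

With the figures quoted in this entry, 877 of 979 predictions corresponds to roughly 89.6% of haplotypes being retained, so about one in ten samples does not appear in the collated results.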

The collated population results have been organised by continent or region of sampling and can be found in the Data Sink. Direct links to each section, accompanied by the list of populations sampled, are given below for the reader's convenience, with a brief run-through of some interesting findings under each.

1. Europe: Adygei (Russia), Chuvash (Russia), Danes (Denmark), Finns (Finland), Hungarians (Hungary), Irish (Ireland), Khanty (Russia), Komi (Russia), Russians (Archangelsk), Russians (Vologda), Yakut (Russia)

The Adygei present as expected; they are predominantly G-P15 and J-L26 with various subclades of haplogroup R. Various subclades of haplogroups N and R define the Chuvash, with an additional appearance by J-L26 and Q-MEH2. Ethnic Russian populations appear to have their own regionalised diversity against the backdrop of being predominantly R-M198 and downstream subclades (particularly R-M458). The Irish are predominantly (~81%) R-M269, although the presence of a single man with H-M82 is surprising. Finally, the Yakut belong overwhelmingly to haplogroup N (~78%), with a single man being predicted as I-P37.2.

2. Middle-East: Druze (Israel), Samaritans (Israel), Yemenite Jews (Yemen)

The Druze are one of the better-sampled populations in this study, where they are mostly represented by various subclades of haplogroups E and G, together with R-M269 and T-L162. The Samaritans are defined (in order of decreasing frequency) exclusively by J-L26, J-P58 and E-V22. Finally, the Yemenite Jews present with a similar (though more restricted) spectrum as the Druze with some differences in frequency.

3. East Asia: Ami (Taiwan), Atayal (Taiwan), Cambodians (Cambodia), Chinese (USA), Chinese (Taiwan), Hakka (Taiwan), Japanese (USA), Koreans (S. Korea), Laotians (Laos)

The Ami are unsurprisingly defined mostly by downstream subclades of haplogroup O, although there does appear to be an I-M223 and L-M317 among them. The Atayal, also of Taiwan, are exclusively O-MSY2.2. The Cambodians appear to have even more lineages which are typically expected further west. The Japanese boast the highest frequency of D-M55 out of all the populations sampled (21.1%). The Korean results contrast with this through the presence of men with N*-LLY22g(xM128,P43,Tat) and Q-MEH2. The Laotians appear to have one man with DE*-M1, although this will require SNP testing to definitively confirm.

4. Africa: Ashkenazi Jews (S. Africa), Biaka Pygmies (CAR), Chagga (Tanzania), Ethiopian Jews (Ethiopia), Hausa (Nigeria), Ibo (Nigeria), Masai (Tanzania-Kenya), Mbuti Pygmies (Congo R.), Sandawe (Tanzania), Yoruba (Nigeria)

The Ashkenazi Jews of South Africa appear to have a Y-DNA spectrum that is completely typical of Southwest Asians (please compare with the Druze). The Bagandu are largely defined by subclades of haplogroups B and E. Tanzanians here belong entirely to haplogroups E and T. The presence of G-M15, J-L26 and R-M269 among the Hausa is surprising and may be attributed to a colonial European presence or some other form of interaction. The Sandawe have some rather unusual results given their geographical position (I-P37.2 and Q-MEH2), raising the possibility that these haplotypes were predicted incorrectly.

5. Australasia: Micronesians (Micronesia), Nasioi Melanesians (Solomon Islands)

Both the Micronesians and Melanesians have an unusually diverse spectrum. It is difficult to ascertain whether the parahaplogroups shown are genuine or, as described above, a result of incorrect predictions. A recent paper revealing the presence of newly discovered offshoots from haplogroup K in Southeast Asia [1] raises the possibility that some of these may be genuine.

6. Americas: Karitiana (Brazil), African Americans (USA), European Americans (USA), Maya (Mexico), Pima (USA), Rondonian Surui (Brazil), Ticuna (Brazil)

The Karitiana are predominantly Q-MEH2 but appear to have some non-American admixture through E-U175. African Americans are represented as an approximately 4:6 mix of R-M269 against various haplogroup E subclades. The Maya population, like the Karitiana, are Q-MEH2 with additional markers from outside the Americas, as are the Pima. The trend continues with the Quechua people, although C-M217 and T-L162 make their first appearance here. Finally, the Rondonian Surui and Ticuna are completely Q-MEH2.

There are at least three areas of the authors' methodology which I consider to be drawbacks and which prevent this study from being exceptionally informative.

Firstly, the authors evidently used the YFiler sampling array to complete this investigation. In an era where commercial testees can enjoy upwards of 111 Y-STRs, the long-term usefulness of this paper's extensive worldwide sampling is cut short. Another recent paper presenting Y-STRs worldwide has done so using 23 rather than just 17 [2].

Secondly, and more critically, there is the authors' sampling strategy. More data is never strictly a burden in the world of population genetics, but the informativeness of groups such as "European Americans", "Irish" and Chinese born in the USA is questionable. These groups are already richly represented, be it in the current literature or in FTDNA Project groups. The issue would have been rectified if each of these samples had simply been obtained from a single area, providing regional specificity which may prove useful in better establishing genetic variation within Ireland, for example.

Finally, the haplotypes could have also received a "backbone" SNP test each to definitively place them within the current phylogeny. The drawbacks of STR-alone testing became readily apparent with some of the African samples. I can only speculate it is the highly divergent nature of certain uniquely African haplotypes from Eurasian ones which produced these spurious results.

On Mutation Rates (Quick Discussion)
In this study, both BATWING and the average squared distance (ASD) method were used. Within each, four different mutation rates were implemented. On initial inspection the resulting ages appear to vary wildly. On closer examination, however, all the most recent common ancestor (MRCA) ages calculated by BATWING are approximately twice as old as those generated by the ASD method. Even within each technique there is substantial variation; the ages obtained with the evolutionary rate appear approximately three times greater than those obtained with the others. Furthermore, these "other" mutation rates do tend to congregate around a similar value (e.g. through BATWING, the calculated global age of their Y-DNA R-M198 haplotypes was 5.5k, 6.1k and 6.2kya), which would intuitively suggest the "actual" value lies somewhere within this range, whether through BATWING or ASD. The discrepancy here cannot be overstated and calls into question why some researchers are still utilising a "blanket" mutation rate across several loci which are shown to have significantly different tendencies to mutate (colloquially described as "slow", "medium" and "fast" mutators). I am uncertain whether the authors are in fact doing this, but the implications are apparent, as such discrepancies prevent any rational "fitting" of these numbers into candidate prehistoric narratives. This entire topic will likely be explored in a future entry.
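To make the sensitivity to the chosen mutation rate concrete, here is a minimal sketch of an ASD-style age estimate in Python. It rests on several assumptions of mine rather than the authors' procedure: the toy haplotypes are invented, the modal haplotype stands in for the founder, a single "blanket" rate is applied to every locus, a generation is taken as 25 years, and the two rates are illustrative figures of the order commonly quoted for "evolutionary" and "genealogical" rates, not the values used in the paper.

```python
from collections import Counter
from statistics import mean

# Invented toy haplotypes: repeat counts at four hypothetical loci.
haplotypes = [
    [14, 13, 30, 24],
    [14, 13, 30, 24],
    [15, 13, 30, 24],
    [14, 12, 30, 25],
    [14, 13, 31, 24],
]


def modal_haplotype(haps):
    """Most common repeat count at each locus, used as a stand-in for the founder."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*haps)]


def average_squared_distance(haps, founder):
    """Mean squared repeat-count difference from the founder, over all loci and men."""
    return mean((a - f) ** 2 for hap in haps for a, f in zip(hap, founder))


def tmrca_years(asd, rate_per_generation, years_per_generation=25):
    """Under a stepwise mutation model, ASD grows as rate * generations."""
    return asd / rate_per_generation * years_per_generation


founder = modal_haplotype(haplotypes)
asd = average_squared_distance(haplotypes, founder)

# Illustrative blanket rates per locus per generation (assumptions, see above).
rates = {"evolutionary": 0.00069, "genealogical": 0.0021}
ages = {name: tmrca_years(asd, rate) for name, rate in rates.items()}
ratio = ages["evolutionary"] / ages["genealogical"]
```

Because the estimated age is simply the ASD divided by the rate, the ratio between two ages equals the inverse ratio of the rates. With the illustrative rates above, the slower evolutionary rate yields an age roughly three times older than the genealogical one from exactly the same haplotypes.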

Although at least three drawbacks (four including the MRCA calculations) are identified here, this study provides researchers worldwide with a plethora of data from populations that are either poorly represented in the current literature or have been entirely absent until present. The majority of the results outline the wide Y-chromosomal diversity across the world, whilst also revealing specific trends that have been established in both the current literature and in online discussion boards. An mtDNA counterpart of this paper would be a wonderful addition to see sometime in the near future.

There is a bountiful amount of data to be interpreted with pre-existing ideas and models, and compared with prior studies which place a premium on each population's area. I welcome any form of dialogue regarding the results. There is, for many of us, plenty to elucidate. The discussion does not end here; I encourage as much further investigation and thought by readers as the data permits.

[Addendum @ 05/09/2014]: Error regarding Karitiana data. Modified and updated.

1. Karafet TM, Mendez FL, Sudoyo H, Lansing JS, Hammer MF. Improved phylogenetic resolution and rapid diversification of Y-chromosome haplogroup K-M526 in Southeast Asia. [Last Retrieved 03/09/2014]: 

2. Purps J, Siegert S, Willuweit S, Nagy M, Alves C, Salazar R et al. A global analysis of Y-chromosomal haplotype diversity for 23 STR loci. [Last Retrieved 05/09/2014]: