Friday, September 5, 2014

Worldwide Population Y-DNA Collated (Xu et al.) [Review]

Approximately one week has passed since a new paper by Xu et al. was indexed by PubMed and made available online ahead of print:

"The Y chromosome is one of the best genetic materials to explore the evolutionary history of human populations. Global analyses of Y chromosomal short tandem repeats (STRs) data can reveal very interesting world population structures and histories. However, previous Y-STR works tended to focus on small geographical ranges or only included limited sample sizes. In this study, we have investigated population structure and demographic history using 17 Y chromosomal STRs data of 979 males from 44 worldwide populations. The largest genetic distances have been observed between pairs of African and non-African populations. American populations with the lowest genetic diversities also showed large genetic distances and coancestry coefficients with other populations, whereas Eurasian populations displayed close genetic affinities. African populations tend to have the oldest time to the most recent common ancestors (TMRCAs), the largest effective population sizes and the earliest expansion times, whereas the American, Siberian, Melanesian, and isolated Atayal populations have the most recent TMRCAs and expansion times, and the smallest effective population sizes. This clear geographic pattern is well consistent with serial founder model for the origin of populations outside Africa. The Y-STR dataset presented here provides the most detailed view of worldwide population structure and human male demographic history, and additionally will be of great benefit to future forensic applications and population genetic studies."

This paper showcases a staggering 979 distinct 17-STR Y-DNA haplotypes from 44 populations across the world. These haplotypes are soon to be uploaded to the Y Chromosome Haplotype Reference Database (YHRD). The authors have made all the haplotypes, together with a slew of additional information, publicly available independent of the official article (raw haplotypes, Y-DNA haplogroup predictions).

In this entry, the collated results of all populations are reviewed, together with some cursory inferences intended to aid their interpretation.

All 979 haplotypes were retrieved through the above link. Each population dataset was run through Vadim Urasin's YPredictor (v1.5.0). A 70% prediction strength threshold was implemented. All nomenclature was reduced to the haplogroup level to avoid confusing future readers should it change over time. These predictions formed the collated population results.

877 haplotype predictions met the 70% threshold established. Without having access to the original study, it is apparent that the authors also used Urasin's YPredictor, given the identical predictions.
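The filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout, the `collate` helper and every value in `toy_predictions` are invented for the example and are not YPredictor output or data from the paper. Only the 70% threshold and the 877 and 979 counts come from the text.

```python
from collections import Counter

THRESHOLD = 0.70  # the 70% prediction strength threshold used in this entry

# Hypothetical (population, predicted haplogroup, prediction strength) rows.
# These values are invented for illustration only.
toy_predictions = [
    ("PopA", "R-M269", 0.98),
    ("PopA", "R-M269", 0.91),
    ("PopA", "H-M82", 0.74),
    ("PopA", "I-M223", 0.55),  # below threshold, so it is discarded
    ("PopB", "N", 0.88),
]

def collate(predictions, threshold=THRESHOLD):
    """Keep predictions meeting the threshold and return per-population frequencies."""
    counts = {}
    for population, haplogroup, strength in predictions:
        if strength >= threshold:
            counts.setdefault(population, Counter())[haplogroup] += 1
    return {
        population: {hg: n / sum(tally.values()) for hg, n in tally.items()}
        for population, tally in counts.items()
    }

frequencies = collate(toy_predictions)

# Share of the 979 haplotypes whose prediction met the threshold (877 of them).
retained_share = 877 / 979
print(f"{retained_share:.1%} of haplotypes retained")
```

Frequencies are computed over the retained haplotypes only, so a population with many weak predictions ends up with a smaller effective sample than its headline count suggests.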

The collated population results have been organised by the location of sampling by continent or region and can be found in the Data Sink. Direct links to each section accompanied by the list of populations sampled are listed below for the reader's convenience with a brief runthrough of some interesting findings under each.

1. Europe: Adygei (Russia), Chuvash (Russia), Danes (Denmark), Finns (Finland), Hungarians (Hungary), Irish (Ireland), Khanty (Russia), Komi (Russia), Russians (Archangelsk), Russians (Vologda), Yakut (Russia)

The Adygei present as expected; they are predominantly G-P15 and J-L26 with various subclades of haplogroup R. Various subclades of haplogroups N and R define the Chuvash, with an additional appearance by J-L26 and Q-MEH2. Ethnic Russian populations appear to have their own regionalised diversity on the backdrop of being predominantly R-M198 and downstream subclades (particularly R-M458). The Irish are predominantly (~81%) R-M269, although the presence of a single man with H-M82 is surprising. Finally, the Yakut too belong overwhelmingly to haplogroup N (~78%) with a single man being predicted as I-P37.2.

2. Middle-East: Druze (Israel), Samaritans (Israel), Yemenite Jews (Yemen)

The Druze are one of the better-sampled populations in this study, where they are mostly represented by various subclades of haplogroups E and G, together with R-M269 and T-L162. The Samaritans are defined (in order of decreasing frequency) exclusively by J-L26, J-P58 and E-V22. Finally, the Yemenite Jews present with a similar (though more restricted) spectrum as the Druze with some differences in frequency.

3. East Asia: Ami (Taiwan), Atayal (Taiwan), Cambodians (Cambodia), Chinese (USA), Chinese (Taiwan), Hakka (Taiwan), Japanese (USA), Koreans (S. Korea), Laotians (Laos)

The Ami are unsurprisingly defined mostly by downstream subclades of haplogroup O, although there does appear to be an I-M223 and L-M317 among them. The Atayal, also of Taiwan, are exclusively O-MSY2.2. The Cambodians appear to have even more lineages which are typically expected further west. The Japanese boast the highest frequency of D-M55 out of all the populations sampled (21.1%). The Korean results contrast with this through the presence of men with N*-LLY22g(xM128,P43,Tat) and Q-MEH2. The Laotians appear to have one man with DE*-M1, although this will require SNP testing to definitively confirm.

4. Africa: Ashkenazi Jews (S. Africa), Biaka Pygmies (CAR), Chagga (Tanzania), Ethiopian Jews (Ethiopia), Hausa (Nigeria), Ibo (Nigeria), Masai (Tanzania-Kenya), Mbuti Pygmies (Congo R.), Sandawe (Tanzania), Yoruba (Nigeria)

The Ashkenazi Jews of South Africa appear to have a Y-DNA spectrum that is completely typical of Southwest Asians (please compare with the Druze). The Bagandu are largely defined by subclades of haplogroups B and E. Tanzanians here are completely haplogroup E and T. The presence of G-M15, J-L26 and R-M269 among the Hausa is surprising and may be attributed to a colonial European presence or some other form of interaction. The Sandawe have some rather unusual results given their geographical position (I-P37.2 and Q-MEH2), raising the possibility that these haplotypes were predicted incorrectly.

5. Australasia: Micronesians (Micronesia), Nasioi Melanesians (Solomon Islands)

Both the Micronesians and Melanesians have an unusually diverse spectrum. It is difficult to ascertain whether the parahaplogroups shown are genuine or, as described above, a result of incorrect predictions. A recent paper revealing the presence of newly discovered offshoots from haplogroup K in Southeast Asia [1] raises the possibility that some of these may be genuine.

6. Americas: Karitiana (Brazil), African Americans (USA), European Americans (USA), Maya (Mexico), Pima (USA), Rondonian Surui (Brazil), Ticuna (Brazil)

The Karitiana are predominantly Q-MEH2 but appear to have some non-American admixture through E-U175. African Americans are represented as an approximately 4:6 mix of R-M269 against various haplogroup E subclades. The Maya population, like the Karitiana, are Q-MEH2 with additional markers from outside the Americas, as are the Pima. The trend continues with the Quechua people, although C-M217 and T-L162 make their first appearance here. Finally, the Rondonian Surui and Ticuna are completely Q-MEH2.

There are at least three areas of the authors' methodology which are deemed to be drawbacks and prevent this study from being exceptionally informative.

Firstly, the authors evidently used the YFiler sampling array to complete this investigation. In an era where commercial testees can enjoy upwards of 111 Y-STRs, the long-term usefulness of this paper's extensive worldwide sampling is cut short. Another recent paper presenting worldwide Y-STRs has done so using 23 rather than just 17. [2]

Secondly, I am more critical of the authors' sampling strategy. More data is never strictly a burden in the world of population genetics, but the informativeness of groups such as "European Americans", "Irish" and Chinese born in the USA is questionable. For instance, these groups are already richly represented, be it in the current literature or FTDNA Project groups. The apparent issue with these samples would have been rectified if they were simply obtained from a single area, providing regional specificity which may prove useful in better establishing genetic variation within Ireland, for example.

Finally, the haplotypes could have also received a "backbone" SNP test each to definitively place them within the current phylogeny. The drawbacks of STR-alone testing became readily apparent with some of the African samples. I can only speculate it is the highly divergent nature of certain uniquely African haplotypes from Eurasian ones which produced these spurious results.

On Mutation Rates (Quick Discussion)
In this study, both BATWING and the average squared distance (ASD) method were used. Within each, four different mutation rates were implemented. On initial inspection the results appear to vary wildly. However, on closer examination, it appears all the most recent common ancestor (MRCA) ages calculated through BATWING are approximately twice as old as those generated by the ASD method. Even within each technique there is substantial variation; the ages obtained with the evolutionary rate appear approximately three times greater than those obtained with the others. Furthermore, these "other" mutation rates do tend to congregate around a similar value (e.g. through BATWING, the calculated global ages of their Y-DNA R-M198 haplotypes were 5.5, 6.1 and 6.2 kya), which would intuitively suggest the "actual" value lies somewhere within this range, whether through BATWING or ASD. The discrepancy here cannot be overstated and calls into question why some researchers are still utilising a "blanket" mutation rate across several loci which are shown to have significantly different tendencies to mutate (colloquially described as "slow", "medium" and "fast" mutators). I am uncertain whether the authors are in fact doing this, but the implications are apparent, as such discrepancies prevent any rational "fitting" of these numbers into candidate prehistoric narratives. This entire topic will likely be explored in a future entry.
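To make the dependence on the chosen mutation rate concrete, here is a minimal sketch of an ASD-style age estimate. It assumes the common formulation in which the average squared difference in repeat counts from a founder haplotype, averaged over loci, is divided by a per-locus, per-generation mutation rate. The toy haplotypes, the median-based founder, the 25-year generation and the two rates (roughly the often-quoted "evolutionary" and "genealogical" values) are all assumptions for illustration and are not the settings used by Xu et al.

```python
from statistics import mean, median

def asd_to_founder(haplotypes, founder=None):
    """Average squared distance (in repeat units) between haplotypes and a founder.

    If no founder is supplied, the per-locus median is used as a stand-in
    (an assumption of this sketch, not necessarily the authors' choice).
    """
    n_loci = len(haplotypes[0])
    if founder is None:
        founder = [median(h[i] for h in haplotypes) for i in range(n_loci)]
    per_locus = [
        mean((h[i] - founder[i]) ** 2 for h in haplotypes) for i in range(n_loci)
    ]
    return mean(per_locus)

def tmrca_years(asd, mutation_rate, generation_years=25):
    """Convert ASD into an age: generations = ASD / rate, then scale to years."""
    return asd / mutation_rate * generation_years

# Invented three-locus haplotypes (repeat counts), for illustration only.
toy_haplotypes = [
    [14, 12, 30],
    [15, 12, 30],
    [14, 13, 31],
    [14, 12, 30],
]

# Illustrative per-locus, per-generation rates (assumed values).
EVOLUTIONARY_RATE = 6.9e-4
GENEALOGICAL_RATE = 2.1e-3

asd = asd_to_founder(toy_haplotypes)
age_evolutionary = tmrca_years(asd, EVOLUTIONARY_RATE)
age_genealogical = tmrca_years(asd, GENEALOGICAL_RATE)
print(f"ASD = {asd:.2f}")
print(f"age ratio (evolutionary / genealogical) = {age_evolutionary / age_genealogical:.2f}")
```

Because the age is simply ASD divided by the rate, the ratio between two estimates from the same haplotypes is exactly the inverse ratio of the rates, which is why a single "blanket" rate choice can shift every date by a constant factor.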

Although at least three drawbacks (four including the MRCA calculations) are identified here, this study provides researchers worldwide with a plethora of data from populations that are either poorly represented in the current literature or have been entirely absent until present. The majority of the results outline the wide Y-chromosomal diversity across the world, whilst also revealing specific trends that have been established in both the current literature and in online discussion boards. An mtDNA counterpart of this paper would be a wonderful addition to see sometime in the near future.

There is a bountiful amount of data to be interpreted with pre-existing ideas/models and compared with prior studies which place a premium on each population's area. I welcome any form of dialogue regarding the results. There is, for many of us, plenty to elucidate. The discussion does not end here; I encourage as much further investigation and thought by the readers as the data permits.

[Addendum @ 05/09/2014]: Error regarding Karitiana data. Modified and updated.

1. Karafet TM, Mendez FL, Sudoyo H, Lansing JS, Hammer MF. Improved phylogenetic resolution and rapid diversification of Y-chromosome haplogroup K-M526 in Southeast Asia. [Last Retrieved 03/09/2014]: 

2. Purps J, Siegert S, Willuweit S, Nagy M, Alves C, Salazar R et al. A global analysis of Y-chromosomal haplotype diversity for 23 STR loci. [Last Retrieved 05/09/2014]:

Wednesday, August 6, 2014

Anchored in Armenia: An Exercise in Genetic Relativity [Original Work]


Location of the Armenian Highlands in West Asia
As is the case with many groups in the region, the Armenians are, anthropologically speaking, a unique modern ethnicity. Situated in the Armenian Highlands (an expansive area straddling the Zagros and Caucasus ranges) with a settlement history dating to the Neolithic, the modern Armenian people have maintained a distinct culture both shaped and shielded by the mountainous territory they inhabit. [1] One unique aspect of the Armenian people is their language; Modern Armenian is an Indo-European language belonging to its own branch. There has long been scholarly debate regarding its linguistic exodus from the Proto-Indo-European homeland (commonly accepted by modern linguists as the Pontic-Caspian steppe) [2] through to its historical seat in the South Caucasus. As is evident from the attested Urartian and Hurrian loanwords in later forms of the language, Armenian must have been spoken by the forebears of its current speakers since at least before 500 B.C. [3] Various genetics enthusiasts (including myself) have on differing occasions cited this as an indication of an aboriginal West Asian genetic layer accompanying the Urartian-Hurrian vocabulary substratum.

Presumably due to the ongoing political instability in West Asia, there has been an unfortunate lack of ancient DNA (aDNA) recovery in the areas adjacent to the Armenian Highlands. Alongside the Armenians, West Asia proper is also home to Anatolian Turks, numerous Kurdish groups, the Assyrians, several Jewish minorities and various ethnic groups within Iran. The inter-relation of all these groups to differing extents has been demonstrated in both published studies [4] and open-source projects. [5,6]

Mount Ararat - A symbolic item in Armenian culture
Although they have most likely experienced their own demic events in prehistoric times, the insular nature of the Armenians relative to their neighbours allows them to be used as a stand-in for the aDNA we currently lack in this part of the world. In this blog entry, the Armenians will therefore be considered as a surrogate for autochthonous West Asian ancestry. They will be treated as a primary donor population (PDP) for several other West Asian groups, in an attempt to flesh out the degree of mutual shared ancestry, as well as the directions of added affinities beyond the region. This is by no means an authoritative attempt to purport a particular image of the West Asian genetic landscape, but an attempt instead to provoke discussion and explore the underlying structure of the region through a manner that should hopefully yield fruitful results in the glaring absence of aDNA in the region.

Working Hypotheses

1. Given the demonstrated similarity in autosomal DNA profiles (here and here), modern Armenians will serve as a reasonable PDP for all tested populations.

2. Furthermore, the genetic difference (GD) will likely be dictated by geographical proximity to the Armenians, or a (lack of) history of admixture with them.

3. Finally, the other donor populations will be anticipated either by virtue of geography or language.


The Dodecad K12b Oracle was used to undertake this small project (please visit link for technical information). When executed through R, the program was set to Mixed Mode and fixed to 500 results for every iteration per population. The command entered therefore remained the same each time:


Samples consist of nine location-specific populations (Iranians, Kurds_Y, Azerbaijan_Jews, Iraq_Jews, Iran_Jews, Turks, Turks_Aydin*, Turks_Kayseri*, Turks_Istanbul*) and four Dodecad participant averages (Iranian_D, Kurd_D, Assyrian_D, Turkish_D). A total of thirteen populations were therefore included.

From the output, only those combinations expressing an Armenian population as a PDP were selected. In this context, the Armenians will be considered a PDP if their "ancestral" percentage exceeds 50%. A maximum of ten were collected per population. In the event the number of combinations exceeded this, the subsequent combination lists are terminated with an ellipsis.

* Although not included in the original Dodecad K12b Oracle dataset, Dienekes has conveniently shared the population averages for these samples here. These were manually inserted into the command.
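The selection procedure described above can be sketched as follows. This is not the Dodecad Oracle itself, nor the command that was entered; it is a minimal Python illustration that assumes GD is the Euclidean distance between the target's admixture proportions and the best two-population blend. The function names, the population labels and the three-component toy vectors are all hypothetical.

```python
import math
from itertools import combinations

def best_mix(target, pop_a, pop_b):
    """Return (weight of pop_a, GD) for the blend of two populations closest to target."""
    diff = [a - b for a, b in zip(pop_a, pop_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        weight = 1.0  # identical references: any split gives the same blend
    else:
        # Least-squares weight, clipped so both shares stay within 0-100%.
        weight = sum((t - b) * d for t, b, d in zip(target, pop_b, diff)) / denom
        weight = min(1.0, max(0.0, weight))
    blend = [weight * a + (1 - weight) * b for a, b in zip(pop_a, pop_b)]
    gd = math.sqrt(sum((t - m) ** 2 for t, m in zip(target, blend)))
    return weight, gd

def mixed_mode(target, references, k=500):
    """Rank every pair of reference populations by GD and keep the k best."""
    rows = []
    for (name_a, pop_a), (name_b, pop_b) in combinations(references.items(), 2):
        weight, gd = best_mix(target, pop_a, pop_b)
        rows.append((gd, name_a, weight, name_b, 1 - weight))
    rows.sort()
    return rows[:k]

def armenian_pdp(rows, prefix="Armenian", limit=10):
    """Keep combinations where an Armenian reference contributes more than 50%."""
    hits = []
    for gd, name_a, share_a, name_b, share_b in rows:
        if name_a.startswith(prefix) and share_a > 0.5:
            hits.append((gd, name_a, share_a, name_b, share_b))
        elif name_b.startswith(prefix) and share_b > 0.5:
            hits.append((gd, name_b, share_b, name_a, share_a))
    return hits[:limit]

# Invented three-component percentages, for illustration only.
toy_references = {
    "Armenian_toy": [50.0, 40.0, 10.0],
    "Balochi_toy": [10.0, 30.0, 60.0],
    "Saudi_toy": [20.0, 70.0, 10.0],
}
toy_target = [38.0, 37.0, 25.0]  # constructed as 70% Armenian_toy + 30% Balochi_toy

results = armenian_pdp(mixed_mode(toy_target, toy_references))
for gd, primary, share, secondary, rest in results:
    print(f"{share:.1%} {primary} + {rest:.1%} {secondary} (GD={gd:.2f})")
```

Clipping the weight keeps each blend a genuine mixture, and the more-than-50% filter is applied only after ranking, mirroring the order of the steps described above.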


Iranian and Kurdish Oracle results
Unsurprisingly, the Iranians and Kurds all display similar results; specifically, either the Makrani or the Balochi are adopted as the secondary donors when Armenians are fixed as a PDP. The proportions are also comparable between all. The Iranians appear to fit the Armenian + Balochi/Makrani combination slightly better than the Kurds (GD=4.04-5.16 vs. 5.03-6.65 to 2 d.p. respectively). It is also worth observing that both Iranians and Kurds, irrespective of sampling strategy (location-specific or Dodecad average), do not have more than ten qualifying Mixed Mode results.

Assyrian and select Near-Eastern Jewish Oracle results
The Assyrians are one of the groups of interest, given the demonstrated autosomal similarity between them and Armenians (here). As anticipated, their Mixed Mode results well exceed ten and the best fits (GD=1.66-1.82 to 2 d.p.) are all, coincidentally, with the Near-Eastern Jewish groups studied here. Subsequent matches include additional populations (e.g. Saudi, Bedouin, Syrian) where the GD remains relatively small compared to the Iranian and Kurdish values (>3.15 to 2 d.p.).

The Near-Eastern Jewish groups largely mirror the Assyrian results, although some key differences should be outlined:

  • The Azerbaijani Jews have a GD similar to the Assyrians in range, setting them apart from the Iraqi and Iranian Jews. This seems to fit geography. However, if the association was strictly geographical, one would expect the Assyrians to lie between the Azerbaijani Jews on one side and the Iraqi and Iranian Jews on the other. This may be genetic evidence of additional and direct ancestry between Armenians and Assyrians at some (or various) point(s) after the Near-Eastern Jewish groups had formalised their identities.
  • Saudis appear as a secondary donor population in all groups. Interestingly, they appear to have an inverse relationship with geographic proximity to the Armenian Highlands; Iraqi, Iranian and Azerbaijani Jews are 20.4%, 16.1% and 7.8% "Saudi" respectively. The Assyrians too fall on this cline despite the point raised above.

Anatolian Turkish Oracle results
Finally, the Anatolian Turks provide us with another set of interesting values and pairs:

  • Mixed Mode results from Western Turkey (Aydin, Istanbul) largely exhibit a combination of Armenian with various European ethnic groups or nationalities, which can be predominantly ascribed to geography. Please note the comparatively large GD among the Aydin average (>9.93 to 2 d.p.), which contrasts with Istanbul. I suspect the cosmopolitan nature of Istanbul has resulted in an artefactual lowering of the GD, given Anatolian Turks from across the country have moved there for employment purposes. [7]
  • In contrast, the samples listed as "Turks" in Dodecad K12b (from the Behar et al. dataset, located in Central-South Turkey) model well as a combination of Armenian with either the Chuvash, Nogay, Uzbek or Uyghur. European secondary donors do make an appearance once more. Please also note their GD is the smallest out of the Turkish averages investigated (4.20 to 2 d.p.).
  • The Kayseri average (Central Turkey) yielded no results matching the criteria outlined in "Method". However, the Assyrians instead made a frequent appearance as primary donors from GD=6.17 onwards. Given the genetic affinity between Assyrians and Armenians (refer above), and the consistency displayed by the Armenians as a PDP for other Turkish averages, this result can be considered anomalous. A close inspection of the Dodecad K12b proportions reveals the Kayseri Turks were on average approximately 1.5% more Southwest Asian than all other Turkish populations, explaining why Assyrians took preferential placing over Armenians as the PDP. The cause of this slight increase is unknown at present.
  • The Turkish_D average best resembled that of Istanbul, albeit with slightly more Armenian and less European proportions. This would suggest that, overall, the Dodecad Turkish participants map somewhere just east of Istanbul despite the presumably diverse backgrounds. 
  • Finally, all averages produced Mixed Mode results which exceeded ten in number.

IBD Segment Indications

To corroborate the findings of this investigation with additional genetic data, I refer to the Dodecad Project's fastIBD analysis of Italy/Balkans/Anatolia and fastIBD analysis of several Jewish and non-Jewish groups. As the analyses do not completely encompass those groups studied here, the results cannot be accepted wholesale. However, there does appear to be a broad agreement with some of the results in this investigation. For example, the Armenians and Assyrians have a demonstrated level of "warmth" to one another beyond background sharing.

Further Work

This investigation would have benefited from Azeri Turkish samples via the Republic of Azerbaijan. Additionally, a better breakdown of Kurdish, Iranian and Assyrian samples, akin to the site-specific sampling seen here in the Anatolian Turks, would have been ideal. Finally, as stated above, this investigation would have benefited from the inclusion of IBD segment analysis specific to the studied groups. Should time permit and the desired samples be made available in the future, this would be a natural line of inquiry to further what has been explored here.


Addressing the three hypotheses stated at the beginning in order:

1. Armenians certainly have behaved as a reasonable proxy for an autochthonous West Asian PDP in most of the populations tested (sole exception being the Kayseri Turks although this appears to be an anomalous response to slightly more Southwest Asian scores). The scores vary depending on the presence of the secondary donors, but Assyrians and Jewish populations from Azerbaijan, Iran and Iraq appear to have the largest proportion of this (occasionally surpassing 90%). All Iranians and Kurds, on the other hand, scored the least overall (approximately 65-75%). The Turkish range lies in-between these two.

2. Unfortunately, this isn't clear. The lack of regional results for Kurds and Iranians, together with a lack of samples specifically from Eastern Turkey, prevents any conclusion being reached on this point. The Near-Eastern Jewish populations studied here certainly do form a cline of Armenian "admixture" that is fully in line with geography. Furthermore, the large GD observed in Aydin Turks does support this idea, leading me to cautiously propose geography does indeed play a role. The second point also provides us with a partial answer, as the Assyrians demonstrate more of this than one would expect given their geographical placement based on GD, as well as fastIBD evidence from elsewhere.

3. With the exception of the Assyrians and Near-Eastern Jewish groups, the secondary donors overwhelmingly matched my expectations regarding their placement with whichever group that was studied (e.g. Iranians and Kurds towards South-Central Asia, Turks towards either Europe or Central Asia proper).

Over the coming years, with the availability of more data, we should hopefully move away from the population averages that have been used by various open-source projects. It has been empirically demonstrated here that regional results will differ significantly from nationwide averages (e.g. Aydin Turks vs. Turkish_D).

This also holds true on an individual basis; the best Oracle match for one Iranian via the described methodology was 56.4% Armenians_15_Y + 43.6% Tajiks_Y (GD=5.44 to 2 d.p.), differing significantly from both the Iranian and Kurdish averages.

I suspect the gentlemen running the numerous open-source projects are aware of this caveat and are, justifiably so in my opinion, making do with currently available data.

In closing, this investigation has also determined that, on the basis of the presumption of an Armenian-like autochthonous West Asian substrate, the studied populations as a whole have an apparent degree of inter-relatedness by virtue of this common South Caucasian autosomal heritage, albeit with the presence of highly significant affinities to elsewhere in Eurasia, be it population-wide, regional or even individual.


The first topic concerns the Iranians and Kurds; why were their average secondary donors always the Balochi and Makrani, rather than more northern groups, such as the Tajiks? I suspect, when applied to population averages, the Oracle program effectively minimises intra-population variation to the point where only the broadest of affinities are indicated. In the case of Iranians, the secondary donor would therefore be one with genetic features that tend to emphasise the difference between Armenians and Iranians (e.g. additional South Asian and Gedrosian admixture). A similar conclusion can be reached with respect to the Turks.

Another interesting point is the demonstrated close relationship between the Assyrians and various Near-Eastern Jewish groups. This has been speculated upon in various discussion forums in the past. More precise tools will be required to elucidate whether these populations share legitimate ancestry with one another, or whether the affinity is happenstance, instead reflecting the mixture of similar Near-Eastern groups with (again) similar Caucasus-derived groups at some point in history.

[Addendum I, 07/08/2014]: For a continuation on this with a fellow genome blogger, please read the Comments below.


Full credit for both the generation of raw population data and the Oracle program goes to Dienekes Pontikos (Dodecad Ancestry Project).

Map of Armenian Highlands from Photo of Mount Ararat courtesy of

Finally, I must refer all visitors interested in understanding the genetic constitution of the Armenian people to the FTDNA Armenian DNA Project. For a more interactive learning experience, two of the administrators (Messrs. Simonian and Hrechdakian) recently delivered a lecture on this topic, garnishing it with a deeper description of the anthropological and geographical aspects described here.


1. Samuelian TJ. Armenian Origins: An Overview of Ancient and Modern Sources and Theories. [Last Accessed 3/08/2014]:

2. Clackson J. Indo-European Linguistics: An Introduction. Cambridge Textbooks in Linguistics [Last Accessed 4/08/2014]:

3. Greppin JAC. The Urartian Substratum in Armenian. [Last Accessed 4/08/2014]:

4. Grugni V, Battaglia V, Hooshiar Kashani B, Parolo S, Al-Zahery N et al. Ancient migratory events in the Middle East: new clues from the Y-chromosome variation of modern Iranians. PLoS One. 2012;7(7):e41252.

5. Dodecad Ancestry Project: ChromoPainter/fineSTRUCTURE Analysis of Balkans/West Asia [Last Accessed 4/08/2014]:

6. Eurogenes Genetic Ancestry Project: Updated Eurogenes K13 and K15 population averages [Last Accessed 4/08/2014]:

7. Filiztekin A, Gokhan A. The Determinants of Internal Migration In Turkey. [Last Accessed 05/08/2014]: