Saturday, March 31, 2012

North European Component Variation within the Eurasian Heartland

As studies of DNA variation across Asia have progressed over the years (Wells et al., Xing et al., teaser mtDNA results from Burger et al.'s upcoming analysis of prehistoric Eurasian steppe remains), ancestral markers with origins in Europe have remained a recurring theme, particularly with regard to the expansion of Bronze Age semi-pastoral nomads from the Pontic-Caspian steppe bearing the Indo-European languages.

David W. of the Eurogenes Genetic Ancestry Project has recently posted data online from a new Intra-European run using ADMIXTURE (K=12) with the intention of breaking up the North European component that often arises through the program. Spreadsheet results here.

This brief investigation seeks to identify the North European-derived component patterns within Asia by first mapping out the frequencies and then correlating with Eurogenes' release notes on each.


As many samples as possible from immediately identifiable populations were obtained from the spreadsheet results (link above). No sample restrictions were implemented. Averages were calculated for each population, except where n=1. No modifications were made to population labels except for Eurogenes population averages, which are denoted by the suffix _Eg. Populations were then allocated to arbitrary regional groups, allowing results to be displayed more coherently.
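The averaging step can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used on the spreadsheet: the function name, the data layout, and the population labels and values are all hypothetical.

```python
from collections import defaultdict


def population_averages(samples):
    """Average component proportions per population.

    `samples` is a list of (population_label, {component: proportion}) pairs.
    Populations with a single sample (n=1) are passed through unaveraged.
    """
    grouped = defaultdict(list)
    for label, components in samples:
        grouped[label].append(components)

    averages = {}
    for label, rows in grouped.items():
        if len(rows) == 1:
            # n=1: report the lone individual's proportions as-is
            averages[label] = dict(rows[0])
            continue
        averages[label] = {
            name: sum(row[name] for row in rows) / len(rows)
            for name in rows[0]
        }
    return averages


# Hypothetical values for illustration only; not taken from the spreadsheet.
samples = [
    ("Pop_A", {"North_Sea": 0.10, "South_Baltic": 0.08}),
    ("Pop_A", {"North_Sea": 0.20, "South_Baltic": 0.12}),
    ("Pop_B", {"North_Sea": 0.05, "South_Baltic": 0.00}),
]
averages = population_averages(samples)
```

Here Pop_A is averaged over its two individuals, while Pop_B (n=1) keeps its single set of proportions.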


Tabulated results can be found in the Data Sink. Autosomal variation per Regional Group can be found below:

The North European-derived components, despite their exceptionally close Fst distances relative to the other components, do seem to reveal a few interesting trends:

  • Northeast European appears to (at least partially) be the result of allele sharing with populations further east, as evidenced by its predominance in East-Central Asian groups and its extension even further eastwards into the Siberian Selkup (n=1). This component has a circumstantial correlation with the craniometric and ancient mtDNA evidence suggestive of a "migration corridor" between Eastern Europe and Siberia (Malyarchuk et al.'s On the Origin of Mongoloid Component in the Mitochondrial Gene Pool of Slavs, Newton's Ancient Mitochondrial DNA From Pre-historic Southeastern Europe: The Presence of East Eurasian Haplogroups Provides Evidence of Interactions with South Siberians Across the Central Asian Steppe Belt). While it also explains this component's abundance in North Caucasian populations (which lie en route between Ukraine and Siberia), the same cannot be said with absolute certainty of South-Central Asia. That said, the 0.021 Fst distance with West European, despite the markedly different distributions, suggests both are the result of prehistoric (possibly Paleolithic?) hunter-gatherer migration paths across large swathes of Eurasia.
  • West European has a sporadic appearance across Asia, with a peak in the North Caucasus. This implies that, true to its assigned label, it is a generic West Eurasian component that has reached a maximum in Western Europe, with the North Caucasus representing the closest point of reference to there. Indeed, this inference is made independently by Eurogenes, albeit using different parameters:
"I used samples of Scottish, Irish and Western English ancestry to create this cluster. Not surprisingly, it peaks in individuals of Western Irish descent. However, it also peaks in Basques and many Iberians, which is fascinating, because that makes it the autosomal equivalent of Y-chromosome haplgroup R1b in Europe."
  • North Sea and South Baltic accompany one another at similar frequencies across much of Asia, especially in populations with an Indo-Iranian-speaking heritage (observe the ~0.8-1:1 ratio among Kurds, Iranians, the Turkmen, Uzbeks, Tajiks, Brahmins, Kshatriyas and Kyrgyz as examples of this). It is interesting to note that, of the two, only the North Sea component is readily present in East-Central Asians. The only other likely migration path along this trajectory is that of the proto-Tocharians, who (under the Eurasian steppe theory) split off from the Proto-Indo-European homeland several millennia prior to the Proto-Indo-Iranians that eventually formed the Andronovo archaeological horizon from Sintashta/Pit Grave (E Kuz'mina, The Origin of the Indo-Iranians, pg.451). Perhaps this near-solitary North Sea component within the Altaians, Mongolians and Uyghurs is attributable to early speakers of Tocharian? Perhaps the elevated presence of the North Sea component in South-Central Asia (Jatts, Pathans, Kyrgyz) is a relic of the Kushans, nomads supposedly part of the Yuezhi confederacy, who may have been Tocharian speakers themselves?
  • One curious phenomenon is the similar West European-North Sea-Northeast European component proportions across the Turkmen, Uzbeks, Kyrgyz, Pathans, Uttar Pradesh Brahmins, Altaians and the Uyghur. Whether this reflects a genuine shared signal or is simply an anomalous association produced by non-uniform sample sizes cannot be determined here, so no firm conclusion can be made.
  • North European-derived frequencies among Southwest Asian Semitic-speaking groups shown here seldom exceed 1% apiece and are either the result of recent, inconsistent small-scale admixture events or are simply background noise generated by ADMIXTURE.
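The North Sea to South Baltic comparison noted above amounts to a simple ratio check. The sketch below assumes the ratio is read as North Sea divided by South Baltic and treats 0.8 to 1.0 as the band of interest; the function names and the example values are hypothetical and are not taken from the Data Sink.

```python
def component_ratio(north_sea, south_baltic):
    """Return North Sea / South Baltic, or None when South Baltic is zero."""
    if south_baltic == 0:
        return None
    return north_sea / south_baltic


def within_band(ratio, low=0.8, high=1.0):
    """True when the ratio falls inside the (assumed) ~0.8-1:1 band."""
    return ratio is not None and low <= ratio <= high


# Hypothetical population averages, for illustration only.
paired = within_band(component_ratio(0.09, 0.10))     # components track each other
unpaired = within_band(component_ratio(0.12, 0.02))   # North Sea dominates
```

A population where South Baltic is absent returns None rather than a ratio, which is the situation described for the East-Central Asian groups.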

The Northeast European and West European components appear to have a distribution independent of any significant migration events since the Neolithic, instead being associated with either the "migration corridor" across Eurasia or simply being the result of mutual West Eurasian heritage. North Sea and South Baltic, on the other hand, do seem to correlate with one another and support (rather than contradict) the eastward movement of Bronze Age semi-pastoral nomads speaking early dialects of Proto-Indo-European.

Edit I [31/03/2012]: Correction of erroneous Brahmin results due to Google Spreadsheet lag.

Thursday, March 29, 2012

Showcasing of Y-DNA Variation Among Afghan Ethnic Groups

This very recent paper on Afghan Y-Chromosomes was released by M Haber et al. and provides us with an insight into the paternally-determined genetic structure of several Afghan populations.

Afghanistan's Ethnic Groups Share a Y-Chromosomal Heritage Structured by Historical Events
Haber M, Platt DE, Ashrafian Bonab M, Youhanna SC, Soria-Hernanz DF, et al. (2012) Afghanistan's Ethnic Groups Share a Y-Chromosomal Heritage Structured by Historical Events. PLoS ONE 7(3): e34288. doi:10.1371/journal.pone.0034288

"Afghanistan has held a strategic position throughout history. It has been inhabited since the Paleolithic and later became a crossroad for expanding civilizations and empires. Afghanistan's location, history, and diverse ethnic groups present a unique opportunity to explore how nations and ethnic groups emerged, and how major cultural evolutions and technological developments in human history have influenced modern population structures. In this study we have analyzed, for the first time, the four major ethnic groups in present-day Afghanistan: Hazara, Pashtun, Tajik, and Uzbek, using 52 binary markers and 19 short tandem repeats on the non-recombinant segment of the Y-chromosome. A total of 204 Afghan samples were investigated along with more than 8,500 samples from surrounding populations important to Afghanistan's history through migrations and conquests, including Iranians, Greeks, Indians, Middle Easterners, East Europeans, and East Asians. Our results suggest that all current Afghans largely share a heritage derived from a common unstructured ancestral population that could have emerged during the Neolithic revolution and the formation of the first farming communities. Our results also indicate that inter-Afghan differentiation started during the Bronze Age, probably driven by the formation of the first civilizations in the region. Later migrations and invasions into the region have been assimilated differentially among the ethnic groups, increasing inter-population genetic differences, and giving the Afghans a unique genetic diversity in Central Asia."

[PDF] [Supplementary Data]

Tabulated Y-DNA Haplogroup frequencies of the 204 individuals sampled distinguished by ethno-linguistic affiliation (ISOGG 2011 Nomenclature utilised) can be found in the Data Sink.

Results (populations with a sample count of ~50 only)

- Haplogroup B-M60, a marker that would normally be expected among African populations, makes a surprising appearance in the Afghan Hazara. Superficial STR analysis (17/19 haplotype match between all) suggests a recent common paternal ancestor, although the timeframe and ultimate origin of this common ancestor are another question.

- Haplogroup C3-M217 has invariably been associated with the expansion of Altaic-/Mongolic steppe populations since medieval times. The greater frequency (33.9%) in the Hazara relative to the Tajiks and Pashtuns appears to support this, as well as the commonly-held belief they partially descend from Mongolian tribes.

- The Hazara E1b1b1c1-M34 also stems from a common ancestor (all three share the exact 19 STR haplotype).

- The single man belonging to Haplogroup G1-M285 is of Tajik descent. It is possible this man's paternal line arrived with eastward migrating Persians following the Sassanid collapse in 651 A.D.

- As shown in previous studies, the Pashtun Haplogroup G men are again G2c-M377 (entirely this time, in contrast with Lacau et al.).

- Paragroups H*-M69, J2a*-M410, Q*-M242 and R*-M207 all indicate that Afghanistan played an important role in the demic development of their downstream subclades, or was at the very least a geographic nexus. It is worth noting that the Hazara Q* men belong to a different haplotype to their Pashtun and Tajik compatriots, again indicating genetic drift has taken place since the formation of the Hazara ethnic group (or, instead, paternal consistency through the presumed Mongolic layer that eventually formed modern Hazaras).

- In previous studies (Sengupta et al., Lacau et al.), several haplotypes without backbone SNP testing were found to belong to Haplogroup I, which is frequently considered a lineage specific to Europe. For the first time we have evidence of an I clade (I2b1-M223) in South-Central Asia, specifically among the Hazara and Tajik. The following is a recent exchange with Professor Ken Nordtvedt regarding the I2b1-M223 samples:

"The two Hazara seem related.  Both haplotypes look like M223+, with the Tajik one like Continental2 characteristic of central Europe.
The Hazara haplotype looks more like M223+ Roots.  But both have some problems with being considered close matches to European haplotypes.
I don’t think such tmrcas would be worth much.  I still don’t have a firm subclade of M223 to work with for either haplotype."

Due to the limited STRs it is not possible to cleanly place these I2b1 haplotypes into any of the existing clusters/subclades. However, Haplogroup I2b itself appears to be thousands of years old (Nordtvedt's I tree, final page). This opens up the possibility of an endogenous form of Haplogroup I existing in South-Central Asia.

- A single Tajik belonging to J1c3-P58 was postulated to potentially be of Arabian origin. As the (minuscule) Afghan Arab sample did not yield any J1c3, other possibilities should be considered, such as contacts with the Iranian plateau over the past few millennia.

- The Tajiks were the only population to boast the presence of all major subclades within Haplogroup L (L1a-M27, L1b-M317, L1c-M357). In line with their greater frequency relative to the Tajiks and Hazaras, several Pashtun L1c-M357 samples share similar (exact-to-2-step mutation) matches, suggesting another example of genetic drift.

- Although the Laghman Pashtuns share a similar L1c-M357 haplotype (16-17/19 match), so does the sole Tajik L1c from the same location, providing us with genetic evidence of recent mutual origins between Pashtuns and Tajiks in certain parts of Afghanistan.

- The Tajik population is more paternally diverse than all others sampled. Explanations include a less endogamous cultural character or the more recent imposition of the "Tajik" identity, which arrived with the medieval Turks.

- R1b1a*-P297 (xM269) and R1b1a2*-M269 (xU106) both appear in Uzbek and Tajik populations. Both the R1b1a*-P297 haplotypes are identical and belong to a Tajik and an Uzbek, again showing there is some recent paternal overlap between Central Asian ethnic groups. I discovered the haplotype does not generally correspond with any of the established clusters in the R1b1a1-M73 Project, although there is a 13/15 match with a Tajik from Cluster B1. Although the limited STRs are unfavourable, I am of the opinion the match is substantial and the R1b1a*-P297 reported in this study is in fact R1b1a1-M73 and belongs to Cluster B1, whose membership also consists of other Tajiks, Uzbeks and an Anatolian Turk.

- It is very interesting to note that all the locations showing R1b1a*-P297 (xM269) and R1b1a2*-M269 (xU106) (Badakhshan, Herat, Takhar and Mazar-e-Sharif) lie on a horizontal plane that runs across the north of Afghanistan, particularly as the Bactria-Margiana Archaeological Complex (BMAC) was situated here.
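The haplotype comparisons quoted throughout these results (17/19, 16-17/19, 13/15, "exact-to-2-step") boil down to counting matching markers and summing repeat differences. A minimal sketch of that comparison follows. It assumes single-copy markers with integer repeat counts, and the marker names and repeat values are hypothetical placeholders rather than data from Haber et al.

```python
def str_match(hap_a, hap_b):
    """Compare two STR haplotypes given as {marker: repeat_count} dicts.

    Only markers typed in both haplotypes are compared.
    Returns (matching_markers, markers_compared, total_step_difference).
    """
    shared = sorted(set(hap_a) & set(hap_b))
    matches = sum(1 for m in shared if hap_a[m] == hap_b[m])
    steps = sum(abs(hap_a[m] - hap_b[m]) for m in shared)
    return matches, len(shared), steps


# Hypothetical 19-marker haplotypes; marker names and values are placeholders.
hap_a = {f"marker_{i}": 12 for i in range(1, 20)}
hap_b = dict(hap_a)
hap_b["marker_3"] = 13    # one-step difference
hap_b["marker_11"] = 10   # two-step difference

matches, compared, steps = str_match(hap_a, hap_b)
```

With these placeholder values the pair is a 17/19 match separated by three mutational steps in total. A raw match count says nothing about marker-specific mutation rates, so it is only a first-pass screen.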

Criticisms of Paper

- Haplogroup R2a-M124 has been erroneously correlated with aboriginal Subcontinental populations when results from the R2 WTY Project indicate places like India are a "sink" rather than a "source" (most Indian R2a is R2a1-L295, which has a spotty distribution across the rest of Eurasia).

- Haplogroup L is, much like R2a, an understudied lineage, presumably due to its paucity in Europe. The once-common assumption in the population genetics and genealogical world that the frequency of a given lineage in a region/population signifies its antiquity there has been proven to be inherently false through STR and SNP analysis. Haplogroup L may enjoy greater frequencies in India according to the sources at their disposal, but the presence of different L subclades in Central and West Asia should have at least given the authors the initiative to investigate the lineage's deeper structure rather than relying on a population genetics tagline from at least 2006 (Sengupta et al.).

- Despite the recent boom in research on Haplogroup R1a1a-M17's structure by independent genetic genealogists and projects (such as the R1a1a and Subclades Y-DNA Project), Haber et al. failed to include any of the pivotal SNPs that have been discovered since Underhill et al. (2009), thus preventing observers from drawing any meaningful conclusions from the current findings, particularly in the context of the Indo-European migrations (generally accepted to have originated from the Eurasian steppes).

- When divided into ethno-linguistic lines, this study showcases 3 Arabs, 13 Balochis, 59 Hazaras, 5 Nurestanis, 49 Pashtuns, 56 Tajiks, 1 Turkmen and 17 Uzbeks. The most immediate criticism is inadequate testing of the Arabs, Nurestanis, Balochis, Turkmens and Uzbeks in particular.
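The sample-size criticism can be made concrete with a quick filter over the per-group counts listed above. The cutoff of 30 is a hypothetical threshold chosen purely for illustration; it is not a figure from Haber et al. or from this post.

```python
# Per-group sample counts as listed in the criticism above.
sample_counts = {
    "Arab": 3,
    "Balochi": 13,
    "Hazara": 59,
    "Nurestani": 5,
    "Pashtun": 49,
    "Tajik": 56,
    "Turkmen": 1,
    "Uzbek": 17,
}

MIN_N = 30  # hypothetical cutoff for a usable frequency estimate

# Groups whose sample count falls below the assumed cutoff.
undersampled = sorted(g for g, n in sample_counts.items() if n < MIN_N)
```

Under this assumed cutoff the flagged groups are the Arabs, Balochis, Nurestanis, Turkmen and Uzbeks, the same five singled out above.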


Despite several glaring flaws in methodology, Haber et al. have provided us with a much-needed insight into the deeper genetic structure of Afghanistan's Y-Chromosome diversity. There is clear evidence of genetic drift (particularly among the Pashtun Q*-M242/L1c-M357 or Hazara C3-M217), as well as evidence of recent line sharing between populations (the situation of L1c-M357 in Laghman).

However, Haber et al. have thrown out some very interesting surprises (T1-M70 among Tajiks only) as well as validating results from earlier studies that had been questioned (I2b1-M223 and R1b1a2-M269 particularly). How did these lineages arrive in Central Asia? Is recent colonial admixture a possibility? For the time being, we will have to contend with these questions steadfastly.

Addendum I [30/03/2012]: Determination of R1b1a*-P297 furthered with regard to it potentially being R1b1a1-M73.
Addendum II [30/03/2012]: Insertion of Nordtvedt correspondence.