Thursday, March 29, 2012

Showcasing of Y-DNA Variation Among Afghan Ethnic Groups

This very recent paper on Afghan Y-chromosomes by M. Haber et al. provides us with an insight into the paternally inherited genetic structure of several Afghan populations.

Afghanistan's Ethnic Groups Share a Y-Chromosomal Heritage Structured by Historical Events
Haber M, Platt DE, Ashrafian Bonab M, Youhanna SC, Soria-Hernanz DF, et al. (2012) Afghanistan's Ethnic Groups Share a Y-Chromosomal Heritage Structured by Historical Events. PLoS ONE 7(3): e34288. doi:10.1371/journal.pone.0034288

"Afghanistan has held a strategic position throughout history. It has been inhabited since the Paleolithic and later became a crossroad for expanding civilizations and empires. Afghanistan's location, history, and diverse ethnic groups present a unique opportunity to explore how nations and ethnic groups emerged, and how major cultural evolutions and technological developments in human history have influenced modern population structures. In this study we have analyzed, for the first time, the four major ethnic groups in present-day Afghanistan: Hazara, Pashtun, Tajik, and Uzbek, using 52 binary markers and 19 short tandem repeats on the non-recombinant segment of the Y-chromosome. A total of 204 Afghan samples were investigated along with more than 8,500 samples from surrounding populations important to Afghanistan's history through migrations and conquests, including Iranians, Greeks, Indians, Middle Easterners, East Europeans, and East Asians. Our results suggest that all current Afghans largely share a heritage derived from a common unstructured ancestral population that could have emerged during the Neolithic revolution and the formation of the first farming communities. Our results also indicate that inter-Afghan differentiation started during the Bronze Age, probably driven by the formation of the first civilizations in the region. Later migrations and invasions into the region have been assimilated differentially among the ethnic groups, increasing inter-population genetic differences, and giving the Afghans a unique genetic diversity in Central Asia."

[PDF] [Supplementary Data]

Tabulated Y-DNA Haplogroup frequencies of the 204 individuals sampled distinguished by ethno-linguistic affiliation (ISOGG 2011 Nomenclature utilised) can be found in the Data Sink.

Results (populations with a sample count of ~50 only)

- Haplogroup B-M60, a marker that would normally be expected among African populations, makes a surprising appearance in the Afghan Hazara. Superficial STR analysis (17/19 haplotype match between all) suggests a recent common paternal ancestor, although the timeframe and ultimate origin of this common ancestor are another question.

- Haplogroup C3-M217 has invariably been associated with the expansion of Altaic-/Mongolic steppe populations since medieval times. The greater frequency (33.9%) in the Hazara relative to the Tajiks and Pashtuns appears to support this, as well as the commonly-held belief they partially descend from Mongolian tribes.

- The Hazara E1b1b1c1-M34 also stems from a common ancestor (all three share the exact 19 STR haplotype).

- The single man belonging to Haplogroup G1-M285 is of Tajik descent. It is possible this man's paternal line arrived with eastward migrating Persians following the Sassanid collapse in 651 A.D.

- As shown in previous studies, the Pashtun Haplogroup G men are again G2c-M377 (entirely this time, in contrast with Lacau et al.)

- Paragroups H*-M69, J2a*-M410, Q*-M242 and R*-M207 all indicate that Afghanistan played an important role in the demic development of their downstream subclades, or was at the very least a geographic nexus. It is worth noting that the Hazara Q* men belong to a different haplotype to their Pashtun and Tajik compatriots, again indicating genetic drift has taken place since the formation of the Hazara ethnic group (or, instead, paternal consistency through the presumed Mongolic layer that eventually formed modern Hazaras).

- In previous studies (Sengupta et al., Lacau et al.), several haplotypes without backbone SNP testing were found to belong to Haplogroup I, which is frequently considered a lineage specific to Europe. For the first time we have evidence of an I clade (I2b1-M223) in South-Central Asia, specifically among the Hazara and Tajik. The following is a recent exchange with Professor Ken Nordtvedt regarding the I2b1-M223 samples:

"The two Hazara seem related.  Both haplotypes look like M223+, with the Tajik one like Continental2 characteristic of central Europe.
The Hazara haplotype looks more like M223+ Roots.  But both have some problems with being considered close matches to European haplotypes.
I don’t think such tmrcas would be worth much.  I still don’t have a firm subclade of M223 to work with for either haplotype."

Due to the limited number of STRs it is not possible to cleanly place these I2b1 haplotypes into any of the existing clusters/subclades. However, Haplogroup I2b itself appears to be thousands of years old (Nordtvedt's I tree, final page). This opens up the possibility of an endogenous form of Haplogroup I existing in South-Central Asia.

- A single Tajik belonging to J1c3-P58 was postulated to potentially be of Arabian origin. As the (minuscule sample of) Afghan Arabs did not yield any J1c3, other possibilities should be considered, such as contacts with the Iranian plateau over the past few millennia.

- The Tajiks were the only population to boast the presence of all major subclades within Haplogroup L (L1a-M27, L1b-M317, L1c-M357). In line with their greater frequency relative to the Tajiks and Hazaras, several Pashtun L1c-M357 samples share similar (exact-to-2-step mutation) matches, suggesting another example of genetic drift.

- Although the Laghman Pashtuns share a similar L1c-M357 haplotype (16-17/19 match), so does the sole Tajik L1c from the same location, providing us with genetic evidence of recent mutual origins between Pashtuns and Tajiks in certain parts of Afghanistan.

- The Tajik population is more paternally diverse than all others sampled. Explanations include a less endogamous cultural character or the more recent imposition of the "Tajik" identity, which arrived with the medieval Turks.

- R1b1a*-P297 (xM269) and R1b1a2*-M269 (xU106) both appear in Uzbek and Tajik populations. Both the R1b1a*-P297 haplotypes are identical and belong to a Tajik and an Uzbek, again showing there is some recent paternal overlap between Central Asian ethnic groups. I discovered the haplotype does not generally correspond with any of the established clusters in the R1b1a1-M73 Project, although there is a 13/15 match with a Tajik from Cluster B1. Although the limited number of STRs is unfavourable, I am of the opinion the match is substantial and the R1b1a*-P297 reported in this study is in fact R1b1a1-M73 and belongs to Cluster B1, whose membership also consists of other Tajiks, Uzbeks and an Anatolian Turk.

- It is very interesting to note that all the locations showing R1b1a*-P297 (xM269) and R1b1a2*-M269 (xU106) (Badakhshan, Herat, Takhar and Mazar-e-Sharif) lie on a horizontal plane that runs across the north of Afghanistan, particularly as the Bactria-Margiana Archaeological Complex (BMAC) was situated here.
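The match counts quoted in the results above (e.g. 17/19 or 13/15) come down to a marker-by-marker comparison of STR repeat counts. A minimal sketch in Python, assuming each haplotype is stored as a mapping from marker name to repeat count. The marker names are real Y-STR loci, but the repeat values are hypothetical placeholders rather than data from Haber et al., and `compare_haplotypes` is an illustrative helper, not a published tool.

```python
def compare_haplotypes(a, b):
    """Compare two STR haplotypes given as {marker: repeat_count} dicts.

    Only markers typed in both haplotypes are compared.
    Returns (matching_markers, compared_markers, total_repeat_steps).
    """
    shared = sorted(set(a) & set(b))
    matches = sum(1 for m in shared if a[m] == b[m])
    total_steps = sum(abs(a[m] - b[m]) for m in shared)
    return matches, len(shared), total_steps


# Hypothetical repeat counts for illustration only (not from the paper).
hap_1 = {"DYS19": 15, "DYS389I": 13, "DYS390": 24,
         "DYS391": 10, "DYS392": 11, "DYS393": 13}
hap_2 = {"DYS19": 15, "DYS389I": 13, "DYS390": 25,
         "DYS391": 10, "DYS392": 11, "DYS393": 12}

matches, compared, steps = compare_haplotypes(hap_1, hap_2)
print(f"{matches}/{compared} match, {steps} mutational steps apart")
```

Counting absolute repeat differences assumes a simple stepwise mutation model; multi-copy markers such as DYS385a/b would need separate handling.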

Criticisms of Paper

- Haplogroup R2a-M124 has been erroneously correlated with aboriginal Subcontinental populations when results from the R2 WTY Project indicate places like India are a "sink" rather than a "source" (most Indian R2a is R2a1-L295, which has a spotty distribution across the rest of Eurasia).

- Haplogroup L is, much like R2a, an understudied lineage, presumably due to its paucity in Europe. The once-common assumption in the population genetics and genealogical world that the frequency of a given lineage in a region/population signifies its antiquity there has been proven to be inherently false through STR and SNP analysis. Haplogroup L may enjoy greater frequencies in India according to the sources at their disposal, but the presence of different L subclades in Central and West Asia should have at least given the authors the initiative to investigate the lineage's deeper structure rather than relying on a population genetics tagline from at least 2006 (Sengupta et al.).

- Despite the recent boom in research on Haplogroup R1a1a-M17's structure by independent genetic genealogists and projects (such as the R1a1a and Subclades Y-DNA Project), Haber et al. failed to include any of the pivotal SNPs that have been discovered since Underhill et al. (2009), thus preventing observers from drawing any meaningful conclusions from the current findings, particularly in the context of the Indo-European migrations (generally accepted to have originated from the Eurasian steppes).

- When divided into ethno-linguistic lines, this study showcases 3 Arabs, 13 Balochis, 59 Hazaras, 5 Nurestanis, 49 Pashtuns, 56 Tajiks, 1 Turkmen and 17 Uzbeks. The most immediate criticism is inadequate testing of the Arabs, Nurestanis, Balochis, Turkmens and Uzbeks in particular.
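To put rough numbers on the sample-size criticism, the sketch below shows how little a "not observed" result constrains the true frequency of a lineage at these sample sizes. It uses the Wilson score interval, which is chosen here purely for illustration and is not a method used by Haber et al.; `wilson_interval` is a hypothetical helper, and the only inputs taken from the paper are the per-group sample counts listed above.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion successes/n."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Sample counts per group as listed in the criticism above.
sample_sizes = {"Turkmen": 1, "Arab": 3, "Nurestani": 5, "Balochi": 13,
                "Uzbek": 17, "Pashtun": 49, "Tajik": 56, "Hazara": 59}

# Upper bound on the true frequency of a lineage that was NOT observed (0 of n).
for group, n in sample_sizes.items():
    low, high = wilson_interval(0, n)
    print(f"{group:10s} n={n:2d}  0 observed -> frequency could be up to {high:.0%}")
```

For example, 0 carriers out of 3 Arabs still leaves room for a true frequency above 50%, so the absence of a lineage in such a small sample is weak evidence at best.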


Despite several glaring flaws in methodology, Haber et al. have provided us with a much-needed insight into the deeper genetic structure of Afghanistan's Y-chromosome diversity. There is clear evidence of genetic drift (particularly among the Pashtun Q*-M242/L1c-M357 or Hazara C3-M217), as well as evidence of recent line sharing between populations (the situation of L1c-M357 in Laghman).

However, Haber et al. have thrown out some very interesting surprises (T1-M70 among Tajiks only) as well as validating results from earlier studies that had previously been questioned (I2b1-M223 and R1b1a2-M269 particularly). How did these lineages arrive in Central Asia? Is recent colonial admixture a possibility? For the time being, we will have to contend with these questions steadfastly.

Addendum I [30/03/2012]: Determination of R1b1a*-P297 furthered with regard to it potentially being R1b1a1-M73.
Addendum II [30/03/2012]: Insertion of Nordtvedt correspondence.


  1. Excellent assessment. What's interesting here is that the Y-chromosomal haplogroup frequencies for the Pashtuns from this study are quite comparable to the results for northern Pakistani Pashtuns (Pathans) from the Kurram Valley, FATA (see coordinates here) from previous studies.

    Polarity and Temporality of High-Resolution Y-Chromosome Distributions in India Identify Both Indigenous and Exogenous Expansions and Reveal Minor Genetic Influence of Central Asian Pastoralists - Sengupta et al. (2006-07);

    Pathan (North Pakistan, Indo-European speaking, n=20) -
    1/20 = 5% C3-M217
    1/20 = 5% G2a-P15
    1/20 = 5% G2c-M377
    1/20 = 5% H1*-M52
    1/20 = 5% H*-M69
    1/20 = 5% L1-M76
    1/20 = 5% L3-M357
    2/20 = 10% Q1a3-M346
    8/20 = 40% R1a1-M17
    2/20 = 10% R1b1b2-M269
    1/20 = 5% R*-M207

    Y-chromosomal evidence for a limited Greek contribution to the Pathan population of Pakistan - Firasat et al. (2007)

    Pathan (North Pakistan, Indo-European speaking, n=96) -
    E3b1-M78 - 2.1% (2/96)
    F-M89 (xM201, M52, Apt, M170, 12f2, M9)- 2.1% (2/96)
    G-M201 - 11.5% (11/96)
    H1-M52 - 4.2% (4/96)
    J1-M267 - 1.0% (1/96)
    J2-M172(xM92) - 5.2% (5/96)
    K2-M70 - 1.0% (1/96)
    L1-M27 - 5.2% (5/96)
    L3-M357(xPK3) - 7.3% (7/96)
    O2a1a-PK4 - 4.2% (4/96)
    O3-M122(xL1Y) - 1.0% (1/96)
    Q-M242 - 5.2% (5/96)
    R-M207(xM173, M124) - 1.0% (1/96)
    R1-M173(xM17) - 4.2% (4/96)
    R1a1-M17(xPK5) - 44.8% (43/96)

    Compared to Afghanistan's Ethnic Groups Share a Y-Chromosomal Heritage Structured by Historical Events - Haber et al. (2012)

    Pashtun (SE Afghanistan, Indo-European speaking, n=49)
    C3 – 2.04% (1/49)
    G2c – 6.12% (3/49)
    H* – 2.04% (1/49)
    H1a – 4.08% (2/49)
    J2a – 2.04% (1/49)
    L1c – 12.24% (6/49)
    Q* – 16.33% (8/49)
    Q1a3 – 2.04% (1/49)
    R1a1a – 51.02% (25/49)
    R2a – 2.04% (1/49)

    I can't help but wonder, in light of this y-DNA similarity, whether this would translate into a reasonably similar autosomal DNA similarity between the Pashtuns of the two countries.
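The comparison in the comment above can be checked by recomputing the frequencies of a few shared lineages from the quoted counts. A minimal sketch, using only counts transcribed from the three lists above; the dictionary layout and the `frequency_table` helper are illustrative assumptions. Grouping by defining SNP is a simplification: M357 is labelled L3 in the older studies and L1c in Haber et al., the Q-M242 row lumps Q* and Q1a3 together, and the studies did not all type the same downstream markers.

```python
# Counts transcribed from the lists above, grouped by defining SNP.
# The grouping itself is an illustrative simplification.
counts = {
    "Sengupta et al. (N Pakistan)": {"n": 20, "R1a1-M17": 8, "M357": 1, "Q-M242 (all)": 2},
    "Firasat et al. (N Pakistan)": {"n": 96, "R1a1-M17": 43, "M357": 7, "Q-M242 (all)": 5},
    "Haber et al. (Afghanistan)": {"n": 49, "R1a1-M17": 25, "M357": 6, "Q-M242 (all)": 9},
}


def frequency_table(counts):
    """Return {study: {lineage: percentage}} rounded to two decimals."""
    table = {}
    for study, row in counts.items():
        n = row["n"]
        table[study] = {
            lineage: round(100 * carriers / n, 2)
            for lineage, carriers in row.items()
            if lineage != "n"
        }
    return table


for study, freqs in frequency_table(counts).items():
    print(study, freqs)
```

With samples of 20 to 96 men per study, differences of a few percentage points between these rows should not be over-interpreted.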

  2. There is another flaw in the study, i.e. the under-testing of the Pashtun and Tajik populations (Tajik: 53 samples for a population proportion of 27%; Pashtun: 49 samples for a population proportion of 42%) and the over-testing of Hazaras (60 samples for a population proportion of 9%). Also worth noting is the fact that no Pashtun from a core Pashtun region like Kandahar has been tested, and most of the Pashtun samples have been taken from Afghanistan's north (very few samples from the Afghan south, where most Pashtuns are concentrated).

  3. @Barak,

    Kandahar and Mazar-e-Sharif were already heavily sampled by Lacau et al.; whether or not this was the intention of the authors, Haber et al.'s sampling of the following locations has given us a country-wide perspective on Afghan Pashtun Y-DNA:

    Kabul, Kabul
    Wardak, Maidan Shar
    Herat, Herat
    Kunduz, Kunduz
    Nangarhar, Jalalabad
    Kapisa, Tagap

    I agree partially on your point concerning the disproportionate testing among the ethnicities, but two other studies had been published over the past year on Afghan Pashtun Y-DNA. We are fortunate to have a wealth of much-needed haplotypes from the other ethnic groups in the country. The last time "Tajiks" had received the limelight was through Zerjal et al., which basically re-examined Dr. Wells' seminal paper on Central Asia a couple of years later.


    Whichever deep ancestral criterion we use to project autosomal DNA results from leads us to the same conclusion.

    Based on the Afghan Pashtun Y-DNA presented in this paper, they will likely turn out to be more West Eurasian than their Pakistani counterparts based on elevated frequencies of R1a1a-M17 and less/no L1-M27 and O-M175-derived subclades. Please note the inflated frequencies of L1c-M357 and Q*-M242 among the Afghan Pashtuns are the result of genetic drift (I concluded this safely with the latter but did observe greater STR variation in the former).

    Although no maternal (mtDNA) data was supplied with this paper, we can predict the Afghan Pashtuns will again turn out to be more West Eurasian based on previous studies, where geography appears to be (one of) the driving factors.

  4. So you think R1b was a BMAC Y-DNA lineage and present during the time the BMAC existed? Does this apply to both M269 and M73? Where does G2a (G2c/J2b) fit into this larger picture?

    Is there a chance that Turks picked up certain West Eurasian lineages on their way to Central Asia (so they aren't just Turkified Iranian lineages)?

  5. L1-M27(L1a) is West Eurasian. How does having less of it matter?

  6. newtoboard, this is the final warning regarding multi-posting. I will remove future consecutive messages without hesitation after you've made one per entry. This is a research blog and not a forum.

    I am contemplating the possibility that the R1b1a2-M269 in Central Asia may be attributed to BMAC farmers and *happened* to not reach the same prominence as farming communities further west.

    There is indeed a chance that the early Turks picked up certain West Eurasian lineages. The anthropological data I've seen thus far suggests they were Mongoloid-Caucasoid intermediates of some flavour. Combined with the penetration of West Eurasian lineages even into Central China, I'd say it's most probable they carried West Eurasian Y-DNA/mtDNA.

    Haplogroup L is poorly researched because it does not fit the scope of interest currently held by academia, who are preoccupied with Y-DNA R1a, R1b, I and J at present.

  7. My bad.

    I do think they picked up certain West Eurasian lineages, I just think they picked them up in Central Asia and not elsewhere. I think every West Eurasian lineage in Central Asia was there already and the Turks didn't bring any new ones. Certain lineages like R1b-M73 just expanded for whatever reason with the Turks and increased in frequency, for the same reason R1a expanded with Indo-Iranians and became so strong in places like Kyrgyzstan/Afghanistan, when without that Y-DNA being selected for they would have a more even mix of Neolithic and Indo-European Y-DNA. I am curious if I or N ever existed in Central Asia though.

    R1b-M269 in Iran is interesting too. When did it get there? I have heard everything from an origin there, to being Mesolithic, Neolithic or even being attributed to Hurrian/Syrian/Assyrian/Armenian admixture. The theory that R1b originated in Anatolia and moved east during the Neolithic makes the most sense to me.