Sunday, August 19, 2012

Introducing the ACD Tool [Original Work]

It is with satisfaction that I announce the release of my first ever population genetics spreadsheet for fellow researchers. The Ancestral Component Dissection (ACD) Tool is a piece of freeware I have developed to give those with a similar knack for fiddling with ADMIXTURE, Y-SNP and mtDNA frequency data a better means to flesh out inter-population differences.

ACDTool (v1.0)
How Does The ACD Tool Work?

The ACD Tool relies on the frequencies of "ancestral components", a general catch-all term for uniparental markers (Y-SNPs, mtDNA) and autosomal DNA (auDNA). These form the mainstay of much of the work that has been done in population genetics for the past few decades. The advent of "genome blogger" projects has brought the immediacy of these techniques to those who have tested with personal genetics companies, such as Family Tree DNA (FTDNA) and 23andMe. The ACD Tool should therefore be considered a supplementary item by those interested in these results, as well as data procured from current literature.

The level of commonality that occurs between many populations and ethnic groups poses a problem for those interested in investigating what differences arise between them.

To solve this, the ACD Tool works by removing the component frequencies that are shared between sample averages within a region. The idea is to lessen the amount of regional similarity and intentionally exaggerate those differences that exist between neighbours.

This is achieved by removing congruent component values across all populations (using the lowest value as a benchmark), leaving only the differences behind.
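
The subtraction described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the spreadsheet's actual macro: the function name `dissect`, the population labels and the component values are all made up for the example, and I am assuming the residuals are left as they are rather than rescaled to 100%.

```python
def dissect(frequencies):
    """Subtract, for each component, the lowest value seen across all populations.

    frequencies: dict mapping population -> dict of component -> percentage.
    Returns the same structure holding only the residual differences.
    """
    components = sorted({c for row in frequencies.values() for c in row})
    # The lowest value of each component is the shared "benchmark" to remove.
    floor = {c: min(row.get(c, 0.0) for row in frequencies.values())
             for c in components}
    return {
        pop: {c: round(row.get(c, 0.0) - floor[c], 6) for c in components}
        for pop, row in frequencies.items()
    }


# Hypothetical averages (not taken from any project) for three populations.
averages = {
    "Pop_A": {"K1": 60.0, "K2": 30.0, "K3": 10.0},
    "Pop_B": {"K1": 50.0, "K2": 35.0, "K3": 15.0},
    "Pop_C": {"K1": 55.0, "K2": 25.0, "K3": 20.0},
}

result = dissect(averages)
for pop, row in result.items():
    print(pop, row)
```

Every component ends up at zero in at least one population, which is what leaves only the differences on the chart.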

What Experiments Are Ideal?

As the ACD Tool is intended for finer inter-population analysis, it is best applied in a regional context. It serves the purpose of better revealing genetic differences which may account for linguistic or micro-regional trends.

Example #1: Northeast Europeans (Dodecad)

Once the Polish, Russian and Finnish Dodecad cohort averages were run through the ACD Tool, I simply used Excel to create the charts. The "Before-After" feature is used to highlight that the tool has achieved its desired goal of amplifying the genetic differences between them:

NE European auDNA (Dodecad) through the ACD Tool

Example #2: West Asians (Harappa)
Using the Harappa Ancestry Project this time, I ran the data of Armenians, Assyrians, Kurds and Iranians (mostly from the Harappa cohort) into the ACD Tool once more and presented the differences as above:

W Asian auDNA (Harappa) through the ACD Tool

Example #3: South-Central Asians (Eurogenes)
A final example pits Pathans, Jatts, the Burusho, Balochis and Brahuis against one another:

SC Asian auDNA (Eurogenes) through the ACD Tool

Are There Any Drawbacks?
The efficacy of the ACD Tool depends on the number of populations, cohort size and cohort specificity. As the examples above show, the level of inter-population component sharing may decrease greatly if groups that are from more genetically diverse regions are compared.

In addition, using the ACD Tool on populations that are too different (e.g. Han Chinese and Yoruba) will not work, given that the genetic overlap through ADMIXTURE, Y-SNPs or mtDNA is negligible. Of course, this defeats the point of the tool in the first place.

Lastly, the tool requires Macros to be enabled for the instructions to work.


The ACD Tool is an open-source, free-to-use spreadsheet. Those wishing to modify the spreadsheet for their personal use are welcome to do so. However, anyone modifying the ACD Tool with the intent of subsequent redistribution is kindly asked to contact the creator (myself) before doing so, out of common courtesy.

Please also note the ACD Tool is a first attempt at giving back to the genealogy world I have been a part of for several years. Though functional (as shown above), it is not without bugs. In light of this, I am not responsible for any loss of data that may occur from its use.

Finally, I hope the genealogy world finds some use for this nifty piece of kit.


With thanks to the Dodecad Ancestry Project, Harappa Ancestry Project and Eurogenes Genetic Ancestry Project (auDNA used in Examples).

Addendum I [20/08/2012]: ACDTool v1.1 replaces v1.0, with Macros smoothed and instructions refined. Eurogenes South-Central Asian example also added.

Saturday, August 4, 2012

West Asian Y-DNA Haplogroup Q - Turkish or Autochthonous Origins? [Original Work]

Genographic Project Y-DNA Q Migration Route

Y-DNA Haplogroup Q is defined by the M242 marker and is downstream of Haplogroup P-M45, making it the sister haplogroup of R-M207, which populates much of West Eurasia. According to the Genographic Project, Haplogroup Q-M242 is between 15,000 and 20,000 years old, with the location invariably being placed around North Eurasia.

The frequency of Haplogroup Q largely matches the migration path outlined in the maps shown opposite. However, the presence of haplogroup Q in more southwestern portions of Asia has sparked the curiosity of genealogists and observers alike. In current literature, the presence of Haplogroup Q1a2-M25 specifically in Iran is cited as "Central Asian" influence. [1]

In an attempt to conclusively uncover the origins of Haplogroup Q-M242 in West Asia, the Y-STR haplotype variation of West, Central and South Asian Q1a-MEH2 and Q1b-M378 is visualised and analysed with genealogical tools.

The data for this investigation are gathered from various Family Tree DNA (FTDNA) projects and studies, [1,2,6-11] with the concise list shown in the References section below.

Only results presenting at least 16 Y-STRs were considered. Modifications were made as necessary on certain STR markers (particularly Y-GATA H4) to correct nomenclature differences. Urasin's YPredictor was used when the Y-SNP information from studies was inadequate (e.g. no SNPs downstream of Q-M242 tested).

Samples follow a consistent naming convention, with _n and _yQP_n suffixes indicating they were obtained from studies and FTDNA Projects respectively. The following populations were included:

FTDNA Y-DNA Q Migration Route
Irn = Iranian (Unspecified ethnicity), Azr_Tal = Talysh from the Republic of Azerbaijan, Trk/Tur = Anatolian Turkish, Ptn = Pashtun from Afghanistan, Ind = Indian (Unspecified ethnicity/caste), Irq = Iraqi (Unspecified ethnicity), Kzk = Kazakh, Pak = Pakistani (Unspecified ethnicity), Uzb = Uzbek, Tjk = Tajik, Haz = Hazara, Npl = Nepali, Arm = Armenian, Geo = Georgian, UAE = Emirati Arab, Irn_Arab = Iranian Arab (Khuzestan), Irn_Mzn = Iranian Mazandarani (Mazandaran), Irn_Bkt = Iranian Bakhtiari
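
For anyone wishing to sort the samples programmatically, the naming convention can be unpacked with a small helper. This is a sketch under assumptions: the function name `parse_sample_id` is hypothetical, and I am assuming that two-part population codes such as Irn_Mzn are followed directly by the sample number (e.g. an illustrative "Irn_Mzn_2").

```python
def parse_sample_id(sample_id):
    """Split a sample label into population code, data source and sample number.

    Assumes labels of the form <Pop>_<n> (study) or <Pop>_yQP_<n> (FTDNA Project),
    where <Pop> may itself contain an underscore (e.g. Irn_Mzn).
    """
    parts = sample_id.split("_")
    number = int(parts[-1])
    if len(parts) >= 3 and parts[-2] == "yQP":
        # The yQP tag marks a sample taken from an FTDNA Project.
        return {"population": "_".join(parts[:-2]),
                "source": "FTDNA project",
                "number": number}
    return {"population": "_".join(parts[:-1]),
            "source": "study",
            "number": number}


for label in ("Kzk_1", "Tur_yQP_3", "Irn_Mzn_2"):
    print(label, parse_sample_id(label))
```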

Once collation was complete, haplotrees were created and clusters were inferred from them, with the modal haplotype of each inferred cluster found where necessary. The time to the Most Recent Common Ancestor (tMRCA) of select clusters was calculated by comparing two modals from the first pair of intra-cluster branches. Due to the STR panels tested in the concerned papers (Y-Filer order 1), McGee's Y-Utility was the only immediately viable choice (infinite allele mutation model, 75% probability, 25 years/generation).
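
For readers unfamiliar with the term, a modal haplotype is simply the most common repeat value at each marker within a cluster. A minimal sketch follows; the function name, the three haplotypes and the tie-breaking rule (lowest allele wins) are my own assumptions for illustration and are not taken from the dataset or from any of the tools named above.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Return the most frequent allele at each marker across a list of haplotypes.

    haplotypes: list of dicts mapping marker name -> repeat count.
    Ties are broken by taking the lowest allele (an arbitrary assumption).
    """
    modal = {}
    for marker in haplotypes[0]:
        counts = Counter(h[marker] for h in haplotypes)
        top = max(counts.values())
        modal[marker] = min(allele for allele, n in counts.items() if n == top)
    return modal


# Three made-up haplotypes over three markers.
cluster = [
    {"DYS19": 15, "DYS390": 24, "DYS391": 10},
    {"DYS19": 15, "DYS390": 23, "DYS391": 10},
    {"DYS19": 14, "DYS390": 24, "DYS391": 11},
]

modal = modal_haplotype(cluster)
print(modal)
```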

Working Hypothesis
An indeterminable mix of recent (<1500 ybp) and prehistoric Y-DNA Q1a-MEH2 and Q1b-M378 lines exists in the region, with some instances of close haplotype sharing between West, South and Central Asia.

Limitations Of This Investigation
  • Although the number of STR panels tested has increased gradually over the past decade, 16 is not considered a "confident sell" in the genealogy world. 
  • Additionally, the difference in STR panels used meant some informative populations, such as the Makrani, Baloch, Burusho and Parsis of Pakistan, were not included due to an overlap of only 12 STRs.
  • Y-STRs from several crucial populations, such as the Qashqai, Iraqi Turkoman and Azeris from the Republic of Azerbaijan, could not be found.
  • There is, of course, the great debate concerning STR mutation rates. At the time of writing I have not observed any clear consensus in the genealogy community regarding this topic. The applicability of Nordtvedt's Generations series to this entry is minimal due to an STR overlap issue, hence the decision to use McGee's tool instead.
  • As discussed later, the number of Y-SNPs tested across the cited studies is insufficient to draw firm conclusions.
  • Finally, sample size is an issue. The dataset is dominated by Iranian or Afghan samples because these papers were released at a time (i.e. 2008-present) when the 17 STR Y-Filer panels had become mainstream.

Y-DNA Q1a Phylogenetic Tree
Haplogroup Q1a STR Results
Four informative clusters were inferred:

  • Cluster A (DYS19=15, DYS389i=12) is largely restricted to Afghan Pashtuns, with Ptn_1-4 all sharing an MRCA with their modal (and therefore likely founding) haplotype between 900 and 450 ybp. This result is consistent with the dominance of Turkic-speaking dynasties in this time period.
  • Cluster B (DYS385a=14) has a large geographical spread from Turkey through to Iran, the United Arab Emirates, Afghanistan, Nepal and Kazakhstan. The most immediate observation is the close haplotype sharing (3-step mutation, 14/17) between Kzk_1 and Irn_4, with an estimated MRCA at 900 ybp. This result, together with the general area covered, again indicates this cluster should at the very least be broadly associated with Central Asian Turks.
  • Cluster C (DYS392=16, DYS389ii=28, DYS448=22) is interesting because its members are exclusively Iranian and all come from Haber et al.'s Influences of history, geography, and religion on genetic structure: the Maronites in Lebanon. [2] Most of the Iranians bearing Haplogroup Q-M242 in their sample were from West Iran, where Iran's Azeri population happens to dominate the northern region. The regional exclusivity of this cluster combined with the very recent MRCA (900 ybp) leads me to suspect Haber and his associates sampled a locale in West Iran that underwent genetic drift, explaining the +10% Q-M242 that is otherwise not seen in other studies. [1] However, the MRCA also suggests these Iranian men's paternal ancestor was associated with Medieval Turks, even though the result in its entirety does not represent West Iran sufficiently.
  • Cluster D (DYS439=11, DYS437=15) mirrors Cluster B's distribution across the region but the divisions are more consistent with geography than other variables (e.g. Anatolian Turk and Armenian, Hazara together).
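
Figures such as "3-step mutation, 14/17" come from comparing two haplotypes marker by marker. The sketch below shows one way to produce both numbers; the function name and the five-marker haplotypes are invented for illustration, and multi-copy markers such as DYS385a/b are treated like any other marker, which is a simplification.

```python
def compare_haplotypes(a, b):
    """Compare two haplotypes over the markers they have in common.

    Returns how many markers were compared, how many match exactly,
    how many differ, and the total stepwise distance (sum of the
    absolute repeat differences).
    """
    shared = sorted(set(a) & set(b))
    differing = [m for m in shared if a[m] != b[m]]
    steps = sum(abs(a[m] - b[m]) for m in differing)
    return {
        "compared": len(shared),
        "matching": len(shared) - len(differing),
        "differing_markers": len(differing),
        "stepwise_distance": steps,
    }


# Two made-up haplotypes, not drawn from the dataset.
x = {"DYS19": 15, "DYS389i": 12, "DYS390": 24, "DYS391": 10, "DYS392": 16}
y = {"DYS19": 15, "DYS389i": 13, "DYS390": 22, "DYS391": 10, "DYS392": 16}

comparison = compare_haplotypes(x, y)
print(comparison)
```

Under an infinite allele model each differing marker counts once, whereas a stepwise count weights a two-repeat difference twice, so the two figures only coincide when every difference is a single repeat.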

Haplogroup Q1b STR Results
Five informative clusters were inferred:

Y-DNA Q1b Phylogenetic Tree
  • Cluster A (DYS385a=12, DYS439=11, DYS437=15) is, relative to the others, an early offshoot that is highly localised in South-Central Asia. 
  • Cluster B (DYS385a=14) is also localised, found specifically in Iraq and Iran.
  • Cluster C (DYS385a=14, DYS448=20) is twinned with B but appears to have a younger MRCA (925 ybp). Of interest is the wide geographic distribution across Turkey, Iran, India and Kazakhstan. Central Asian Turks once more provide a convenient historical narrative for both the predicted MRCA and spread.
  • Cluster D (DYS385a=15) is again geographically localised, this time in the greater Near-East (Turkey, Iran and Syria). 
  • Cluster E (DYS385a=12, DYS437=15) once more displays geographic localisation in South-Central Asia, specifically among Afghan Pashtuns and an FTDNA Project Pakistani.

SNPs - What Do They Tell Us?
Tabulated Y-DNA Q SNPs for select populations from several studies [1, 3-5] can be viewed in the Vaêdhya Data Sink.

There is, unfortunately, a two-pronged incompatibility issue between the Y-STR analysis and the Y-SNPs provided here. Not only is there poor overlap between the populations covered in both sets, but the SNP selections in the four studies do not provide us with a clear picture regarding the presence of Q*-M242(xQ1a-MEH2,xQ1b-M378), Q1a*-MEH2(xQ1a2-M25), Q1a2-M25 and Q1b-M378.

However, the distribution of Q1a3-M346 and Q1b-M378 across the Iranian plateau, in contrast with the specificity of Q1a2-M25 in Azeri Iranians and Turkmen (1.6% and 42.6% respectively, although the latter is likely due to genetic drift as discussed here), suggests a strain of the first two lineages is linguistically neutral and preceded the millennium of Turkish dynastic dominance in Iran.

Fortunately, such an inference is indeed supported by the Q1a and Q1b phylogenetic trees shown in this entry. One will note (particularly with Q1b-M378) the distribution is largely geographical rather than covering large swathes of Asian land through a "recent" paternal ancestor.

A comment on Assyrian Q-M242
Although the number of STR markers tested does not allow their inclusion in this research piece, I took the liberty of comparing the sole Assyrian Y-DNA Haplogroup Q-M242 individual from the FTDNA Assyrian Heritage DNA Project to elaborate on their paternal ancestor's ultimate origins.

The Assyrian people are a Neo-Aramaic-speaking ethnic minority native to the land intersecting between Turkey, Iran and Iraq as well as the Mesopotamian basin. Modern Assyrians have (due to their Christian faith and recent historical events) practiced endogamous relationships, making them a genetically distinct group minimally affected by demic movements in the surrounding populations.

The Assyrian Y-DNA Q belongs to the Q1b1a-L245 subclade. As we have observed already, haplogroup Q1b-M378 tends to have a distribution governed more by geography with deeper cluster branches, implying greater diversification time in a given region.

At present, based on the available 10 overlapping STR's, the Assyrian Q1b1a-L245 individual matches Tur_yQP_3 best with a one-step mutation (9/10), placing them deep within Cluster C, the only one without a region-specific distribution. This preliminary evaluation indicates this Assyrian man's paternal ancestor shares Medieval genetic links with Anatolian Turkish, Iranian, Indian and Kazakh men, making a Central Asian Turkish connection likely once more.

Due to the limitations described above, the identification of clusters is more relevant based on their geographic spread. The MRCA calculations shown are simply an extremely rough estimate of the age of a cluster.

However (and fortunately once more), it is very clear that some clusters are determined by geography rather than the sort of "genealogical boon" observed in a few (e.g. Q1a Cluster C's extensive branching despite being young relative to the others).

If one takes the MRCA calculations as a very rough approximation, whilst considering a cluster's ability to supersede regional boundaries, one can estimate that 75.5% (40/53) of the Y-DNA Haplogroup Q1a-MEH2 and 31.4% (11/35) of Y-DNA Haplogroup Q1b-M378 in West, Central and South Asia can be attributed to the Turkish migrations.
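
For transparency, the percentages follow directly from the sample counts quoted above. The short check below recomputes them; the helper name `share` is my own, and the counts are the only inputs.

```python
def share(attributed, total):
    """Percentage of samples attributed to a source, to one decimal place."""
    return round(100 * attributed / total, 1)


q1a_share = share(40, 53)  # Q1a-MEH2 samples in clusters linked to Turkish migrations
q1b_share = share(11, 35)  # Q1b-M378 samples in clusters linked to Turkish migrations

print(f"Q1a-MEH2: {q1a_share}%")  # 75.5
print(f"Q1b-M378: {q1b_share}%")  # 31.4
```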

In summary, Y-DNA Haplogroup Q1a-MEH2 (likely Q1a2-M25 based on anecdotal SNP evidence) is a convincing Medieval Central Asian Turkish genetic marker based specifically on its ability to form multi-ethnic clusters in regions with a historical Turkish connection. Q1b-M378, on the other hand, generally displays enough regionalisation and cluster depth to make such an association doubtful at best, with the sole exception being those who belong to the genetic group highlighted in this entry (Cluster C) with DYS385a=14 and DYS448=20.

South Central Asian Q1b-M378 appears to be autochthonous whereas any form of Q1a-MEH2 in the region has a strong association with regions intimately connected with the Medieval Turks. The Anatolian highlands and the Iranian plateau, however, appear to be a complicated mix between the two based on the lack of clear distinctions.

The slim presence of Haplogroup Q in India on the other hand, as far as the current data indicates, is almost entirely of Medieval Turkic input, although the Subcontinent's position as a geographic nexus (much like Iran and Turkey) certainly opens the possibility for exotic para-haplogroups to also exist there.

  • Gratitude is extended to the FTDNA Projects for making their data publicly available. Independent research ventures such as my own would not be possible without their generosity.
  • I would also like to thank Mr. Paul Givargidze, administrator of the Assyrian Heritage, Aramaic and Y-DNA J1* DNA Projects at FTDNA for providing his esteemed support on this research entry.
  • The Y-DNA Haplogroup Q migration route maps are courtesy of the Genographic Project and FTDNA.

Addendum I [5/08/2012]: It has been brought to my attention that Tur_yQP_3, the Assyrian Q1b1a's best match, is in fact an Armenian individual. Although this does not compromise the conclusions reached above, it does serve as a reminder that not everyone in the Republic of Turkey is an ethnic Turk!
Addendum II [6/08/2012]: A recent exchange on a forum highlighted the likelihood of several Tur_yQP samples being Armenian rather than Anatolian Turkish. As above, this shouldn't impinge too greatly on what has been discussed in this entry.

References

1. Grugni V, Battaglia V, Hooshiar Kashani B, Parolo S, Al-Zahery N, et al. (2012) Ancient Migratory Events in the Middle East: New Clues from the Y-Chromosome Variation of Modern Iranians. PLoS ONE 7(7): e41252. doi:10.1371/journal.pone.

2. Haber M, Platt DE, Badro DA, Xue Y, El-Sibai M, Bonab MA, Youhanna SC, Saade S, Soria-Hernanz DF, Royyuru A, Wells RS, Tyler-Smith C, Zalloua PA; Genographic Consortium. Influences of history, geography, and religion on genetic structure: the Maronites in Lebanon. Eur J Hum Genet. 2011 Mar;19(3):334-40. Epub 2010 Dec 1.

3. Al-Zahery N, Semino O, Benuzzi G, Magri C, Passarino G, Torroni A, Santachiara-Benerecetti AS. Y-chromosome and mtDNA polymorphisms in Iraq, a crossroad of the early human dispersal and of post-Neolithic migrations. Mol Phylogenet Evol. 2003 Sep;28(3):458-72.

4. Abu-Amero KK, Hellani A, González AM, Larruga JM, Cabrera VM, Underhill PA. Saudi Arabian Y-Chromosome diversity and its relationship with nearby regions. BMC Genet. 2009 Sep 22;10:59.

5. Cinnioğlu C, King R, Kivisild T, Kalfoğlu E, Atasoy S, Cavalleri GL, Lillie AS, Roseman CC, Lin AA, Prince K, Oefner PJ, Shen P, Semino O, Cavalli-Sforza LL, Underhill PA. Excavating Y-chromosome haplotype strata in Anatolia. Hum Genet. 2004 Jan;114(2):127-48. Epub 2003 Oct 29.

6. Gokcumen Ö, Gultekin T, Alakoc YD, Tug A, Gulec E, Schurr TG. Biological ancestries, kinship connections, and projected identities in four central Anatolian settlements: insights from culturally contextualized genetic anthropology. Am Anthropol. 2011;113(1):116-31.

7. Roewer L, Willuweit S, Stoneking M, Nasidze I. A Y-STR database of Iranian and Azerbaijanian minority populations. Forensic Sci Int Genet. 2009 Dec;4(1):e53-5. Epub 2009 Jun 5.

8. Dulik MC, Osipova LP, Schurr TG. Y-chromosome variation in Altaian Kazakhs reveals a common paternal gene pool for Kazakhs and the influence of Mongolian expansions. PLoS One. 2011 Mar 11;6(3):e17548.

9. Haber M, Platt DE, Ashrafian Bonab M, Youhanna SC, Soria-Hernanz DF, et al. (2012) Afghanistan's Ethnic Groups Share a Y-Chromosomal Heritage Structured by Historical Events. PLoS ONE 7(3): e34288. doi:10.1371/journal.pone.0034288

10. Tenzin Gayden, Alicia M. Cadenas, Maria Regueiro, Nanda B. Singh, Lev A. Zhivotovsky, Peter A. Underhill, Luigi L. Cavalli-Sforza, and Rene J. Herrera. The Himalayas as a Directional Barrier to Gene Flow. Am J Hum Genet. 2007 May; 80(5): 884–894.

11. Lacau H, Bukhari A, Gayden T, La Salvia J, Regueiro M, Stojkovic O, Herrera RJ. Y-STR profiling in two Afghanistan populations. Leg Med (Tokyo). 2011 Mar;13(2):103-8. Epub 2011 Jan 14.