Tuesday, July 28, 2015

Comparison of Online Y-STR Predictors (Petrejcíková et al.) [Review]

An interesting study based on Slovak samples typed for 12 Y-STR microsatellite markers was published in 2014. The main aim of the paper appears to be assessing the accuracy of three publicly available Y-STR haplogroup predictors (Athey, Cullen and YPredictor, in alphabetical order) using these 12 Y-STRs. The citation and abstract are shown below.

Y-SNP analysis versus Y-haplogroup predictor in the Slovak population.
Petrejcíková E, Carnogurská J, Hronská D, Bernasovská J, Boronová I, Gabriková D, Bôziková A, Maceková S. Anthropol Anz. 2014;71(3):275-85.
Human Y-chromosome haplogroups are important markers used mainly in population genetic studies. The haplogroups are defined by several SNPs according to the phylogeny and international nomenclature. The alternative method to estimate the Y-chromosome haplogroups is to predict Y-chromosome haplotypes from a set of Y-STR markers using software for Y-haplogroup prediction. The purpose of this study was to compare the accuracy of three types of Y-haplogroup prediction software and to determine the structure of Slovak population revealed by the Y-chromosome haplogroups. We used a sample of 166 Slovak males in which 12 Y-STR markers were genotyped in our previous study. These results were analyzed by three different software products that predict Y-haplogroups. To estimate the accuracy of these prediction software, Y-haplogroups were determined in the same sample by genotyping Y-chromosome SNPs. Haplogroups were correctly predicted in 98.80% (Whit Athey's Haplogroup Predictor), 97.59% (Jim Cullen's Haplogroup Predictor) and 98.19% (YPredictor by Vadim Urasin 1.5.0) of individuals. The occurrence of errors in Y-chromosome haplogroup prediction suggests that the validation using SNP analysis is appropriate when high accuracy is required. The results of SNP based haplotype determination indicate that 39.15% of the Slovak population belongs to R1a-M198 lineage, which is one of the main European lineages.
[Abstract] [Direct Link]

Are They Really Comparable?
Although all three predictors returned similar accuracy rates (roughly 97.6-98.8%), it should be noted that the authors' chief divisions of interest appear to be the conventional subclade designations currently used in both the literature and the genetic genealogy community (e.g. R1a1a-M198). The authors correctly state that Y-SNP testing is paramount for definitively assigning subclades, especially for lineages substantially downstream in a given haplogroup's phylogeny.
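To put those percentages in perspective, the short Python sketch below converts the accuracy figures quoted in the abstract into approximate counts of individuals. It assumes every one of the 166 samples was scored by each predictor; the function name is made up for illustration, and the counts are back-calculated from the published rates rather than taken from the paper.

```python
# Accuracy rates (percent) as quoted in the abstract; n = 166 Slovak males.
SAMPLE_SIZE = 166
reported_accuracy = {
    "Athey": 98.80,
    "Cullen": 97.59,
    "YPredictor": 98.19,
}

def implied_counts(percent, n=SAMPLE_SIZE):
    """Return (correct, mispredicted) counts implied by a percentage.

    Assumption: all n samples received a prediction from the tool.
    """
    correct = round(percent / 100 * n)
    return correct, n - correct

for name, pct in reported_accuracy.items():
    correct, wrong = implied_counts(pct)
    print(f"{name}: {correct} correct, {wrong} mispredicted")
```

Under that assumption, the gap between the best and worst predictor comes down to a couple of individuals, which is worth bearing in mind before reading too much into the ranking.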

The rest of this entry examines whether these calculators offer any other features that might give aspiring researchers reason to choose one over another.

Subclade Coverage
A substantial difference is observed between the three. Athey's output is organised around 21 categories spread across most of the major clades and subclades, although haplogroups not commonly found in West Eurasia (e.g. A-D) are unrepresented. Cullen improves on this significantly with 86 subclades; Y-DNA I receives the most attention (R1b to a lesser extent), and there are other improvements such as the inclusion of "A&B". YPredictor has the highest count, hosting over 100 subclades, the majority of them in Y-DNA haplogroups E, G, J, N and R. With the exception of Y-DNA M and S, all major haplogroups are accounted for here.

STR Count
Athey is capable of handling 111 Y-STRs (21- and 27-STR versions are also available), with the markers listed in either numerical or Family Tree DNA (FTDNA) order. Cullen accepts a maximum of 67 STRs. YPredictor accommodates approximately 82 STRs. As such, all three can handle a considerable number of markers.

All three predictors permit the use of batched data and provide different means of categorising the data as the user sees fit. Adequate instructions are provided for all three as well. As a research utility, however, YPredictor stands out through its custom YFiler iterations (a format widely used in population genetics publications concerning Y-STRs) and the debug feedback it gives before the calculator makes its predictions.

Computational Time
This varies with the user's CPU speed, as well as with whether they are entering STR values manually or inserting batched data. As such, it probably shouldn't be a deciding factor in which calculator to use.

Output Information
All three produce similar information (subclade prediction with probability expressed as a percentage).

Before summarising these findings, it is worth noting that Athey's predictor predates Cullen's and YPredictor. As such, any perceived deficiencies in subclade breakdown or functionality are likely a result of age. Athey's predictor was widely used in the past, whatever its current level of use.

All three predictors are of use to genetic genealogists. This entry concludes with the following "idealised" purposes for each:

  • Athey - For users keen to utilise up to 111 FTDNA Y-STRs as cross-validation against the other two
  • Cullen - Best for those seeking refined Y-DNA I or R1b subclade predictions
  • YPredictor - Most versatile and research-friendly, best worldwide coverage of Y-DNA subclades
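
The cross-validation idea in the first bullet can be sketched in a few lines of Python. This is only an illustration: the function name and the sample calls are hypothetical, and it assumes the three predictions have already been reduced by hand to the same level of the tree (for example major haplogroup), since each tool reports subclades at a different resolution.

```python
from collections import Counter

def consensus_call(predictions):
    """Compare the top call from each predictor for one haplotype.

    predictions: dict mapping predictor name -> haplogroup label,
    with labels already normalised to a common level (an assumption).
    Returns (label, unanimous); label is None when no strict majority exists.
    """
    counts = Counter(predictions.values())
    label, votes = counts.most_common(1)[0]
    if votes * 2 <= len(predictions):
        return None, False  # no strict majority: a clear candidate for SNP testing
    return label, votes == len(predictions)

# Hypothetical calls for a single haplotype, for illustration only.
example = {"Athey": "R1a", "Cullen": "R1a", "YPredictor": "I2a"}
label, unanimous = consensus_call(example)
print(label, unanimous)
```

Any haplotype on which the predictors disagree would be a natural candidate for the Y-SNP confirmation the authors recommend.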

As such, the three calculators certainly are comparable for making basic Y-STR predictions for West Eurasians, but obvious differences exist with respect to non-West Eurasian subclade coverage.

If compelled to make a single choice, I would recommend Cullen first to genetic genealogists of Northwest European paternal heritage (given the high frequencies of Y-DNA I and R1b there). YPredictor would be the best choice for those belonging to subclades more common outside Europe, which also explains why it has been used extensively on this blog to date. Athey's role has otherwise been superseded by the other two.
