Saturday, July 2, 2016

4Mix Ancients for PuntK12 Calculator

Overview

4Mix is a nifty supplementary tool run on GEDMatch calculator or ADMIXTURE outputs to estimate the genetic distance and ancestral proportions for a given number of population combinations. Originally conceived by "DESEUK1" (a Eurogenes Ancestry Project participant), it has been implemented numerous times across the wider genetic genealogy community.
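As a rough illustration of the idea, the fitting step can be thought of as a brute-force search: for each combination of source populations, try every split of 100% in fixed increments, average the sources' component scores with those weights, and keep the mix with the smallest Euclidean distance to the target. The Python sketch below is not DESEUK1's R script. It assumes this brute-force, Euclidean-distance approach, and the function names, the step size and the toy populations (Pop_A to Pop_C) are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def compositions(total, parts, step):
    """Yield tuples of `parts` non-negative percentages that sum to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(0, total + 1, step):
        for rest in compositions(total - first, parts - 1, step):
            yield (first,) + rest


def mix(sources, percents):
    """Weighted average of the sources' component scores (percents sum to 100)."""
    n = len(sources[0])
    return [sum(p * s[i] for p, s in zip(percents, sources)) / 100 for i in range(n)]


def distance(a, b):
    """Euclidean distance between two component vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_fit(target, populations, k=4, step=5):
    """Return (distance, names, percents) of the closest k-way mix to `target`."""
    best = None
    for names in combinations(sorted(populations), k):
        vectors = [populations[name] for name in names]
        for percents in compositions(100, k, step):
            d = distance(target, mix(vectors, percents))
            if best is None or d < best[0]:
                best = (d, names, percents)
    return best


if __name__ == "__main__":
    # Hypothetical three-component scores, not real calculator output.
    toy_populations = {
        "Pop_A": [100, 0, 0],
        "Pop_B": [0, 100, 0],
        "Pop_C": [0, 0, 100],
    }
    print(best_fit([50, 30, 20], toy_populations, k=3, step=10))
```

The number of splits grows quickly with finer steps and more source populations, so a coarse step is used here to keep the sketch fast.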

Following Lazaridis et al. 2016's recent "The genetic structure of the world's first farmers" [Link], crucial aDNA from the Near East has been published and put to use by citizen scientists.

This brief entry provides users with an immediate means of assessing their ancestral proportions against the newly released samples through the PuntDNAL K12 calculator.

The R script, an example target file, the population source data and the ReadMe files (DESEUK1's original and my own contribution outlining the "sink" version's procedure) can be found in the link below:



Purpose of the Package

This modification was simply designed to give the wider genetic genealogy community an easy and informal means of manipulating this recent data to explore ethnogenesis or personal ancestries at their own discretion. This is not a formal assessment of the above.

Limitations

Those intending to use this 4Mix package must be aware of the following:

1) The Iran_N, Iran_ChL and Levant_N samples here are GEDMatch contributions by genome bloggers "Kurd" and "Srkz". These currently number one, two and one respectively.

2) Using these samples as references is a short-term convenience and should not be considered equivalent to ADMIXTURE runs that include these samples. The methodology described above opens the potential for Davidski's "Calculator Effect" to manifest.

3) Due to the continued absence of Ancestral South Indian (ASI) aDNA, the Paniya were used as a "last resort" surrogate for the ancestry that South/South-Central Asian samples would otherwise leave unaccounted for. Furthermore, additional modern reference populations (e.g. Yoruba, Nganasan) were used to compensate for aDNA gaps elsewhere in the world. These populations were chosen based on their peak modal status in the K's determined by the PuntDNAL K12 calculator.
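Picking a population by "peak modal status" amounts to finding, for each component, the population with the highest average score. A minimal sketch of that lookup follows, assuming the population averages are available as a nested dictionary; the population names, component names and values are hypothetical placeholders, not PuntDNAL K12 figures.

```python
def modal_populations(averages):
    """Map each component to the population with the highest average for it."""
    components = next(iter(averages.values())).keys()
    return {
        component: max(averages, key=lambda pop: averages[pop][component])
        for component in components
    }


# Hypothetical averages (percent per component), for illustration only.
example_averages = {
    "Pop_A": {"K1": 80.0, "K2": 5.0},
    "Pop_B": {"K1": 10.0, "K2": 60.0},
}
print(modal_populations(example_averages))
```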

Contributions

A very special thank you to the users "jesus" and "khanabadoshi" from Anthrogenica for their guidance and assistance in modifying the package for your use. Another extended thank you to the user "surbakhunWessste" (also from Anthrogenica) for outlining the "sink" procedure here.

Saturday, March 19, 2016

Identifying Bias in Cohorts: IBD and Life Stage Effect [Review]

A very interesting paper, published barely one week ago, investigates how sampling can introduce bias into population genetics cohorts:

Reducing bias in population and landscape genetic inferences: the effects of sampling related individuals and multiple life stages.
Peterman W, Brocato ER, Semlitsch RD, Eggert LS.
PeerJ. 2016 Mar 14;4:e1813. doi: 10.7717/peerj.1813. eCollection 2016.

"In population or landscape genetics studies, an unbiased sampling scheme is essential for generating accurate results, but logistics may lead to deviations from the sample design. Such deviations may come in the form of sampling multiple life stages. Presently, it is largely unknown what effect sampling different life stages can have on population or landscape genetic inference, or how mixing life stages can affect the parameters being measured. Additionally, the removal of siblings from a data set is considered best-practice, but direct comparisons of inferences made with and without siblings are limited. In this study, we sampled embryos, larvae, and adult Ambystoma maculatum from five ponds in Missouri, and analyzed them at 15 microsatellite loci. We calculated allelic richness, heterozygosity and effective population sizes for each life stage at each pond and tested for genetic differentiation (F_ST and D_C) and isolation-by-distance (IBD) among ponds. We tested for differences in each of these measures between life stages, and in a pooled population of all life stages. All calculations were done with and without sibling pairs to assess the effect of sibling removal. We also assessed the effect of reducing the number of microsatellites used to make inference. No statistically significant differences were found among ponds or life stages for any of the population genetic measures, but patterns of IBD differed among life stages. There was significant IBD when using adult samples, but tests using embryos, larvae, or a combination of the three life stages were not significant. We found that increasing the ratio of larval or embryo samples in the analysis of genetic distance weakened the IBD relationship, and when using D_C, the IBD was no longer significant when larvae and embryos exceeded 60% of the population sample. Further, power to detect an IBD relationship was reduced when fewer microsatellites were used in the analysis."
[Abstract]

How relevant is the above to human population genetics? Quite, for two reasons:
  1. Given the accepted phenomenon underpinning the IBD model, the study offers a unique angle with respect to sampling methods. IBD status differed by life stage: it was significant for adults but not once A. maculatum larvae and embryos were considered. This suggests that mobility plays a role in obscuring intra-species IBD measurements. The effect is clearly mitigated in human settlements with extreme geographical isolation.
  2. More microsatellite markers are usually better. Genetic genealogists or researchers familiar with Y-chromosomal analyses are already aware of this mantra, so it is no surprise that the authors found their statistical power was highest when the maximum number of markers was employed.
The abstract, rather unhelpfully, does not reveal the outcome of the comparison made with and without sibling pairs.

A full read of the paper at some point should hopefully address the above, as well as the raw data produced through the statistical calculations.
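For readers who have not run an IBD test themselves, one common approach is a Mantel test: correlate pairwise genetic distances with pairwise geographic distances, then shuffle the sample labels many times to see how often chance alone matches the observed correlation. The abstract does not say which test the authors used, so the Python sketch below is only an assumption of how such a test might look. The function names are my own, and it expects two symmetric distance matrices with non-constant values.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def upper_triangle(matrix, order):
    """Pairwise values above the diagonal, with rows/columns relabelled by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(genetic, geographic, permutations=999, seed=0):
    """One-sided Mantel test; returns (observed correlation, p-value)."""
    rng = random.Random(seed)
    identity = list(range(len(genetic)))
    geo = upper_triangle(geographic, identity)
    observed = pearson(upper_triangle(genetic, identity), geo)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # relabel the sites in the genetic matrix
        if pearson(upper_triangle(genetic, order), geo) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)
```

Calling `mantel(genetic, geographic)` with two 5 x 5 matrices (one per pond pairing) returns the correlation and a permutation p-value. With only five sites there are just 120 distinct relabellings, so the p-value is coarse.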