Motivation: The presence of present-day human contaminating DNA fragments is one of the challenges defining ancient DNA (aDNA) research. This is especially relevant to the ancient human DNA field, where it is difficult to distinguish endogenous molecules from human contaminants due to their genetic similarity. Recently, with the advent of high-throughput sequencing and new aDNA protocols, hundreds of ancient human genomes have become available. Contamination in those genomes has been measured with computational methods often developed specifically for these empirical studies. Consequently, some of these methods have not been implemented and tested, while few are aimed at low-depth data, a common feature in aDNA datasets.
Results: We develop a new X-chromosome-based maximum likelihood method for estimating present-day human contamination in low-depth sequencing data. We implement our method for general use, assess its performance under conditions typical of ancient human DNA research, and compare it to previous nuclear data-based methods through extensive simulations. For low-depth data, we show that existing methods can produce unusable estimates or substantially underestimate contamination. In contrast, our method provides accurate estimates for a depth of coverage as low as 0.5× on the X-chromosome when contamination is below 25%. Moreover, our method still yields meaningful estimates in very challenging situations, i.e., when the contaminant and the target come from closely related populations or with increased error rates. With a running time below five minutes, our method is applicable to large-scale aDNA genomic studies.

contamination via the incorporation of the intrinsic characteristics of endogenous aDNA fragments into the model (Renaud et al., 2015).

Autosomes-based methods
Sequencing high-depth ancient nuclear genomes remains challenging. Therefore, mtDNA-based contamination estimates have been used as a proxy for overall contamination (Allentoft et al., 2015). Yet, different mitochondrial-to-nuclear DNA ratios in the endogenous source and the human contaminant(s) may lead to inaccurate conclusions (Furtwängler et al., 2018). While the source of this difference has yet to be identified, accurate methods based on nuclear data are needed to estimate the level of human contamination, which may have an impact on downstream analyses (Renaud et al., 2016). Indeed, most studies rely on nuclear data to answer key biological questions. A recent method (DICE) aims at estimating present-day human contamination for nuclear data (Racimo et al., 2016). It does so by co-estimating contamination, sequencing error, and demography based on autosomal data. This method generally requires an intermediate depth of coverage (at least 3×) and produces more accurate results when the sample and the contaminant are genetically distant (e.g. different species or highly differentiated populations).

X-chromosome-based methods and a no...