2021
DOI: 10.1101/2021.01.24.427982
Preprint

Profile hidden Markov model sequence analysis can help remove putative pseudogenes from DNA barcoding and metabarcoding datasets

Abstract: Background: Pseudogenes are non-functional copies of protein coding genes that typically follow a different molecular evolutionary path as compared to functional genes. The inclusion of pseudogene sequences in DNA barcoding and metabarcoding analysis can lead to misleading results. None of the most widely used bioinformatic pipelines used to process marker gene (metabarcode) high throughput sequencing data specifically accounts for the presence of pseudogenes in protein-coding marker genes. The purpose of this s…

Cited by 5 publications (10 citation statements)
References 80 publications
“…Our results indicate that the impacts of NUMTs can be greatly reduced by targeting longer amplicons because most NUMTs are short. Moreover, NUMTs <300 bp are less likely to contain diagnostic features such as inappropriate amino acid substitutions that can be identified via bioinformatic pipelines (Porter & Hajibabaei, 2021). Accordingly, NUMTs pose the highest risk to eDNA studies where amplicons typically range from 50 to 200 bp (Langlois et al, 2021) to dietary analyses where amplicons are 70–230 bp (Berry et al, 2017; da Silva et al, 2019), and to processed fish samples which commonly use amplicons of 127–314 bp (Shokralla et al, 2015).…”
Section: Discussion; citation type: mentioning
confidence: 99%
“…The impact of unrecognized NUMTs on diversity estimates will differ among such approaches. Whether OTUs are or are not used, sequence arrays containing NUMTs can still accurately represent beta diversity if NUMTs are either uncommon or consistently recovered across samples (Porter & Hajibabaei, 2021).…”
Section: Discussion; citation type: mentioning
confidence: 99%
“…Nevertheless it is a key utility that can easily be included in metabarcode data processing pipelines and importantly provides detailed results to evaluate data set characteristics. We were pleased to recently see another contribution to the literature arriving for peer‐review that uses a combination of open reading frame length and hidden Markov model profile analysis for numt removal (Porter & Hajibabaei, 2021). Those authors encourage the submission of verified cytochrome oxidase subunit I (COI) pseudogenes to public databases to facilitate future studies.…”
Section: Discussion; citation type: mentioning
confidence: 99%
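
The statement above mentions combining open reading frame (ORF) length with profile hidden Markov model analysis. As a rough illustration of the ORF-length half only, the Python sketch below flags sequences whose longest stop-free reading frame is unusually short. It is not the method of Porter & Hajibabaei (2021): the function names, the forward-frames-only scan, the interquartile-range cutoff, and the choice of stop codons (those of the invertebrate mitochondrial code) are all assumptions made for this example.

```python
from statistics import quantiles

# Assumption: stop codons of the invertebrate mitochondrial code (TAA, TAG).
# Other markers or taxa would need a different set.
STOP_CODONS = frozenset({"TAA", "TAG"})


def longest_orf_length(seq, stops=STOP_CODONS):
    """Length (nt) of the longest stop-free codon run over the three forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in stops:
                run = 0  # a stop codon ends the current open run
            else:
                run += 3
                best = max(best, run)
    return best


def flag_short_orfs(seqs, k=1.5):
    """Return ids whose longest ORF falls below Q1 - k * IQR (hypothetical cutoff)."""
    lengths = {sid: longest_orf_length(s) for sid, s in seqs.items()}
    q1, _, q3 = quantiles(lengths.values(), n=4)
    cutoff = q1 - k * (q3 - q1)
    return sorted(sid for sid, n in lengths.items() if n < cutoff)
```

Calling `flag_short_orfs` on a dictionary mapping sequence ids to nucleotide strings returns the ids to inspect further. Sequences flagged this way are only candidates: a short pseudogene can lack any premature stop codon, which is why the cited work pairs ORF length with profile-based screening.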
“…As we were using protein coding markers in this study, we also screened out obvious pseudogenes to try to reduce noise in the dataset and avoid inflating richness estimates [103]. For rbcL, we removed putative pseudogenes using removal method 1:…”
Section: Bioinformatic Processing; citation type: mentioning
confidence: 99%