FASTCAR: Rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models

James, Benjamin; Luczak, Brian B; Girgis, Hani Z.

doi:10.1101/380824

Cited by 4 publications

(6 citation statements)

References 57 publications

“…The labels in the case at hand are the locations of known MS. A self-supervised algorithm can generate its own labels. Earlier, we successfully invented self-supervised systems for predicting enhancers 27 , masking repeats 21 , clustering DNA sequences 28 , and predicting the identity score of two sequences in linear time 29 . Next, we discuss (i) how the labels are generated, (ii) how the HMM is trained, and (iii) how the module auto-calibrates itself.…”

Section: The Training Module (mentioning)

confidence: 99%

“…GLMs have been applied broadly in bioinformatics. Previously, we have applied GLMs to ranking the quality of predicted protein structures [31][32][33] , filtering out spurious MS 22 , and predicting the similarity between two DNA sequences 28,29 . We have devised a similar GLM-based classifier in MeShClust 28 , which is a tool for clustering DNA sequences, and a similar GLM-based regression model in FASTCAR 29 , which is a tool for approximating the identity score between two DNA sequences in linear time.…”

Section: Predicting Identity Scores Using K-mer Features (mentioning)

confidence: 99%

“…The feature selection algorithm uses a greedy approach that selects the best feature at each step. This feature-selection strategy is the same approach we utilized in FASTCAR 29 , whereas MeShClust's classifier has four predetermined features and does not utilize this feature selection algorithm. Doing these steps accumulates features that improve the mean error (the absolute difference between the predicted identity score and that due to the alignment algorithm) at every step on the testing set, which is different from the training set.…”

Section: Selecting Features (mentioning)

confidence: 99%
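
The greedy strategy described in the passage above (add the single best feature at each step, and keep going only while the mean absolute error on a held-out test set improves) can be sketched roughly as follows. This is an illustrative sketch, not FASTCAR's code: the function names, the plain least-squares fit standing in for the GLM, and the synthetic data in `demo` are all assumptions made for the example.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    Assumes A is non-singular (features are not perfectly collinear)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit(rows, y):
    """Least-squares fit with an intercept, via the normal equations."""
    A = [[1.0] + list(r) for r in rows]
    p = len(A[0])
    AtA = [[sum(a[i] * a[j] for a in A) for j in range(p)] for i in range(p)]
    Aty = [sum(a[i] * t for a, t in zip(A, y)) for i in range(p)]
    return solve(AtA, Aty)

def predict(coef, rows):
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]

def mean_abs_error(pred, y):
    return sum(abs(p - t) for p, t in zip(pred, y)) / len(y)

def greedy_select(X_train, y_train, X_test, y_test):
    """Forward selection: stop when no remaining feature lowers the test error."""
    selected, best = [], float("inf")
    remaining = list(range(len(X_train[0])))
    while remaining:
        trials = []
        for j in remaining:
            cols = selected + [j]
            train = [[row[c] for c in cols] for row in X_train]
            test = [[row[c] for c in cols] for row in X_test]
            err = mean_abs_error(predict(fit(train, y_train), test), y_test)
            trials.append((err, j))
        err, j = min(trials)
        if err >= best:
            break
        selected.append(j)
        remaining.remove(j)
        best = err
    return selected, best

def demo(seed=0):
    """Hypothetical data: only features 0 and 2 carry signal."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(300)]
    y = [2 * r[0] - r[2] + 0.01 * rng.gauss(0, 1) for r in X]
    return greedy_select(X[:200], y[:200], X[200:], y[200:])
```

Calling `demo()` returns the selected feature indices in the order they were added, together with the final test error. In a real setting the columns would be k-mer statistics and the targets would be identity scores, which this sketch does not attempt to reproduce.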

Look4TRs: A de-novo tool for detecting simple tandem repeats using self-supervised hidden Markov models

Velasco, James, Wells, et al. 2018

Preprint

Self Cite

Simple tandem repeats, microsatellites in particular, have regulatory functions, links to several diseases, and applications in biotechnology. Sequences of thousands of species will be available soon. There is an immediate need for an accurate tool for detecting microsatellites in the new genomes. The currently available tools have limitations. As a remedy, we proposed Look4TRs, which is the first application of self-supervised hidden Markov models to discovering microsatellites. It adapts itself to the input genomes, balancing high sensitivity and low false positive rate. It auto-calibrates itself, freeing the user from adjusting the parameters manually, leading to consistent results across different studies. We evaluated Look4TRs on eight genomes. Based on F-measure, which combines sensitivity and false positive rate, Look4TRs outperformed TRF and MISA, the most widely used tools, by 106% and 82%. Look4TRs outperformed the second best tool, MsDetector or Tantan, by 11%. Look4TRs represents technical advances in the annotation of microsatellites.

“…Horner's rule is used for efficiently converting the number from its quaternary to its decimal representation [56]. We have used similar data structures successfully in other software tools [57][58][59][60]. For example, the 5-mer ACCTG is transformed to 01132 (base 4) and then to 94 (base 10), mapping it to the 94 th cell in the array.…”

Section: Scoring the Input Sequence (mentioning)

confidence: 99%
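
The conversion in the quoted passage can be illustrated with a short sketch. It assumes the digit encoding implied by the example (A=0, C=1, G=2, T=3); the function names are made up for illustration and are not taken from the cited tools.

```python
# Assumed encoding, consistent with the quoted example ACCTG -> 01132 (base 4).
DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_to_index(kmer: str) -> int:
    """Convert a k-mer to its array index using Horner's rule.

    Each step multiplies the running value by 4 and adds the next digit,
    so the base-4 number is evaluated with one multiply and one add per base.
    """
    index = 0
    for base in kmer:
        index = index * 4 + DIGIT[base]
    return index

def kmer_counts(sequence: str, k: int) -> list:
    """Count every k-mer of a sequence in a flat array of 4**k cells."""
    counts = [0] * (4 ** k)
    for i in range(len(sequence) - k + 1):
        counts[kmer_to_index(sequence[i:i + k])] += 1
    return counts
```

For ACCTG the running value goes 0, 1, 5, 23, 94, which matches the index given in the quote. Because the index is computed directly, the array needs no hashing or collision handling, at the cost of 4^k cells.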

LtrDetector: A modern tool-suite for detecting long terminal repeat retrotransposons de-novo on the genomic scale

2018

Preprint

Long terminal repeat retrotransposons are the most abundant transposons in plants. They play important roles in alternative splicing, recombination, gene regulation, and genomic evolution. Large-scale sequencing projects for plant genomes are currently underway. Software tools are important for annotating long terminal repeat retrotransposons in these newly available genomes. However, the available tools are not very sensitive to known elements and perform inconsistently on different genomes. Some are hard to install or obsolete. They may struggle to process large plant genomes. None are concurrent or have features to support manual review of new elements. To overcome these limitations, we developed LtrDetector, which uses signal-processing techniques. LtrDetector is easy to install and use. It is not species specific. It utilizes multi-core processors available in personal computers. It is more sensitive than other tools by 14.4%-50.8% while maintaining a low false positive rate on six plant genomes.

“…This means that the identity score can be calculated without actually calculating the alignment itself. We have applied this idea successfully in FASTCAR, which is a search tool for approximating the alignment identity score in linear time 27 . This alignment-free adaptation will allow for much faster clustering with comparable accuracy while extending the ability of MeShClust to cluster long DNA sequences.…”

Section: Introduction (mentioning)

confidence: 99%

MeShClust²: Application of alignment-free identity scores in clustering long DNA sequences

2018

Preprint

Self Cite

Grouping sequences into similar clusters is an important part of sequence analysis. Widely used clustering tools sacrifice quality for speed. Previously, we developed MeShClust, which utilizes k-mer counts in an alignment-assisted classifier and the mean-shift algorithm for clustering DNA sequences. Although MeShClust outperformed related tools in terms of cluster quality, the alignment algorithm used for generating training data for the classifier was not scalable to longer sequences. In contrast, MeShClust 2 generates semi-synthetic sequence pairs with known mutation rates, avoiding alignment algorithms. MeShClust 2 clustered 3600 bacterial genomes, providing a utility for clustering long sequences using identity scores for the first time.
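
As a rough illustration of how sequence pairs with known mutation rates can serve as self-generated training labels, the sketch below mutates a template by substitutions only and records the resulting identity. The abstract does not describe the tool's actual mutation model, so the substitution-only scheme, the function names, and the template are assumptions made for this example.

```python
import random

BASES = "ACGT"

def mutate(sequence: str, rate: float, rng: random.Random) -> str:
    """Substitute round(rate * len) distinct positions with a different base."""
    n_mut = round(rate * len(sequence))
    positions = rng.sample(range(len(sequence)), n_mut)
    seq = list(sequence)
    for p in positions:
        seq[p] = rng.choice([b for b in BASES if b != seq[p]])
    return "".join(seq)

def make_pairs(template: str, rates, seed: int = 0):
    """Return (original, mutated, identity) triples.

    The identity label is known by construction (fraction of unchanged
    positions), so no alignment is needed to obtain it.
    """
    rng = random.Random(seed)
    pairs = []
    for rate in rates:
        mutated = mutate(template, rate, rng)
        identity = 1.0 - round(rate * len(template)) / len(template)
        pairs.append((template, mutated, identity))
    return pairs
```

Each triple can then be turned into a feature vector and a target for a regression model. Insertions and deletions would need a more careful definition of the label than this substitution-only sketch provides.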
