2018
DOI: 10.1101/380824
Preprint

FASTCAR: Rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models

Abstract: Pairwise alignment has been the predominant algorithm in the field of bioinformatics since its beginning. Several applications have been made in order to speed up this algorithm using heuristics, but almost all of these methods still depend on the slow quadratic alignment algorithm. Many applications utilize sequence identity scores without the corresponding alignments, e.g. scanning a database for similar sequences to a query sequence or sequence clustering. For these applications, we propose FASTCAR, which i…

Cited by 4 publications (6 citation statements)
References 57 publications
“…The labels in the case at hand are the locations of known MS. A self-supervised algorithm can generate its own labels. Earlier, we successfully invented self-supervised systems for predicting enhancers 27, masking repeats 21, clustering DNA sequences 28, and predicting the identity score of two sequences in linear time 29. Next, we discuss (i) how the labels are generated, (ii) how the HMM is trained, and (iii) how the module auto-calibrates itself.…”
Section: The Training Module
Citation type: mentioning
confidence: 99%
“…GLMs have been applied broadly in bioinformatics. Previously, we have applied GLMs to ranking the quality of predicted protein structures [31][32][33], filtering out spurious MS 22, and predicting the similarity between two DNA sequences 28,29. We have devised a similar GLM-based classifier in MeShClust 28, which is a tool for clustering DNA sequences, and a similar GLM-based regression model in FASTCAR 29, which is a tool for approximating the identity score between two DNA sequences in linear time.…”
Section: Predicting Identity Scores Using K-mer Features
Citation type: mentioning
confidence: 99%
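The statement above describes a regression model that predicts an identity score from k-mer features rather than from an alignment. The sketch below illustrates that general idea only; it is not FASTCAR's implementation. The two statistics (`manhattan`, `intersection_similarity`), the function names, and the `linear_predict` helper are assumptions chosen for illustration, and any weights passed to `linear_predict` would have to be learned from labeled sequence pairs.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (one linear pass)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def manhattan(a: Counter, b: Counter) -> int:
    """Sum of absolute count differences over all k-mers seen in either histogram."""
    return sum(abs(a[key] - b[key]) for key in a.keys() | b.keys())


def intersection_similarity(a: Counter, b: Counter) -> float:
    """Shared k-mer mass divided by the smaller histogram's total, in [0, 1]."""
    shared = sum(min(a[key], b[key]) for key in a.keys() & b.keys())
    total = min(sum(a.values()), sum(b.values()))
    return shared / total if total else 0.0


def linear_predict(features, weights, bias):
    """Linear predictor: bias plus weighted sum of features.

    The weights and bias are hypothetical inputs here; a real model
    would fit them to sequence pairs with known identity scores.
    """
    return bias + sum(w * x for w, x in zip(weights, features))


if __name__ == "__main__":
    x = kmer_counts("ACGTACGTAC", 3)
    y = kmer_counts("ACGTTCGTAC", 3)
    # Alignment-free features for the pair; no alignment is computed.
    print(manhattan(x, y), intersection_similarity(x, y))
```

Both statistics are computed in time linear in the sequence lengths, which is the property the citing text attributes to this family of methods.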
“…Horner's rule is used for efficiently converting the number from its quaternary to its decimal representation [56]. We have used similar data structures successfully in other software tools [57][58][59][60]. For example, the 5-mer ACCTG is transformed to 01132 (base 4) and then to 94 (base 10), mapping it to the 94th cell in the array.…”
Section: Scoring the Input Sequence
Citation type: mentioning
confidence: 99%
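The k-mer indexing described in the statement above can be sketched as follows. The digit assignment A=0, C=1, G=2, T=3 is an assumption inferred from the quoted example (ACCTG to 01132); the function names are illustrative and not taken from any of the cited tools.

```python
# Assumed base-to-digit encoding, inferred from the quoted ACCTG example.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_index(kmer: str) -> int:
    """Convert a k-mer to its array index using Horner's rule.

    Each step multiplies the running value by 4 and adds the next
    base-4 digit, so no explicit powers of 4 are computed.
    """
    index = 0
    for base in kmer:
        index = index * 4 + CODE[base]
    return index


def count_kmers(seq: str, k: int) -> list:
    """Fill a direct-address table of size 4**k with k-mer counts."""
    counts = [0] * (4 ** k)
    for i in range(len(seq) - k + 1):
        counts[kmer_index(seq[i:i + k])] += 1
    return counts


if __name__ == "__main__":
    # 0*256 + 1*64 + 1*16 + 3*4 + 2 = 94
    print(kmer_index("ACCTG"))  # 94
```

Because every k-mer maps to a unique cell, a lookup or increment costs constant time with no hashing; the trade-off is a table of 4^k cells, which is practical only for small k.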
“…This means that the identity score can be calculated without actually calculating the alignment itself. We have applied this idea successfully in FASTCAR, which is a search tool for approximating the alignment identity score in linear time 27. This alignment-free adaptation will allow for much faster clustering with comparable accuracy while extending the ability of MeShClust to cluster long DNA sequences.…”
Section: Introduction
Citation type: mentioning
confidence: 99%