Longest Common Subsequence in k Length Substrings

Benson, Gary; Levy, Avivit; Shalom, B. Riva

doi:10.1007/978-3-642-41062-8_26

Cited by 19 publications

(20 citation statements)

References 18 publications

“…For this reason, a subset of anchors that satisfy the monotonicity condition needs to be selected. The problem of identifying such a subset can be expressed as finding the Longest Common Subsequence in k Length Substrings [27] (LCSk). Note that this is distinct from just finding the longest common subsequence as that ignores the information determined in the anchors and can favour alignments that have many more indels.…”

Section: Methods (mentioning)

confidence: 99%

Fast and sensitive mapping of nanopore sequencing reads with GraphMap

et al. 2016

Realizing the democratic promise of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. Here we present GraphMap, a mapping algorithm designed to analyse nanopore sequencing reads, which progressively refines candidate alignments to robustly handle potentially high-error rates and a fast graph traversal to align long reads with speed and high precision (>95%). Evaluation on MinION sequencing data sets against short- and long-read mappers indicates that GraphMap increases mapping sensitivity by 10–80% and maps >95% of bases. GraphMap alignments enabled single-nucleotide variant calling on the human genome with increased sensitivity (15%) over the next best mapper, precise detection of structural variants from length 100 bp to 4 kbp, and species and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.
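
As a rough illustration of the LCSk problem named in the citation statement above, the following is a minimal Python sketch of the standard quadratic-time dynamic program. It counts the largest number of non-overlapping, order-preserving pairs of matching k-length substrings in two strings. The function name `lcsk` and the table names `run` and `dp` are illustrative assumptions; this is not GraphMap's implementation, which applies the idea to seed hits (anchors) rather than to full strings.

```python
def lcsk(a: str, b: str, k: int) -> int:
    """Return the number of k-length substring pairs in an optimal LCSk of a and b."""
    if k <= 0:
        raise ValueError("k must be positive")
    n, m = len(a), len(b)
    # run[i][j]: length of the longest common suffix of a[:i] and b[:j]
    run = [[0] * (m + 1) for _ in range(n + 1)]
    # dp[i][j]: LCSk value for the prefixes a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
            # Option 1: do not end a matched k-substring pair at (i, j).
            best = max(dp[i - 1][j], dp[i][j - 1])
            # Option 2: the last k characters of both prefixes match,
            # so they can be appended as one more matched pair.
            if run[i][j] >= k:
                best = max(best, dp[i - k][j - k] + 1)
            dp[i][j] = best
    return dp[n][m]


print(lcsk("ABXCD", "ABYCD", 2))  # → 2 (the pairs "AB" and "CD")
```

With k = 1 the function reduces to the ordinary longest common subsequence length; larger k rewards runs of consecutive matches, which is why the quoted passage distinguishes LCSk from plain LCS.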

“…It is clearly evident from Figure. 2 that the proposed algorithm requires the least possible computational time to compute LCSS of any two data sets, with a constant length and varying dimensionality, against field proven schemes i.e., sequential approach [9] and dynamic programming based algorithms [8], [27], [37]. Additionally, if similarity indexes of any two data sets, both real and benchmark, is high then performance of the proposed scheme is exceptionally well as shown in Figure. 3 such as if S_i and T_j are completely similar then the proposed approach computes their LCSS in O(m) time, where m represents the data set with maximum length.…”

Section: Results (mentioning)

confidence: 91%

A Heuristic Approach for Finding Similarity Indexes of Multivariate Data Sets

et al. 2020

Multivariate data sets (MDSs), with enormous size and certain ratio of noise/outliers, are generated routinely in various application domains. A major issue, tightly coupled with these MDSs, is how to compute their similarity indexes with available resources in presence of noise/outliers, which is addressed with the development of both classical and non-metric based approaches. However, classical techniques are sensitive to outliers and most of the non-classical approaches are either problem/application specific or overly complex. Therefore, the development of an efficient and reliable algorithm for MDSs, with minimum time and space complexity, is highly encouraged by the research community. In this paper, a non-metric based similarity measure algorithm, for MDSs, is presented that solves the aforementioned issues, particularly, noise and computational time, successfully. This technique finds the similarity indexes of noisy MDSs, of both equal and variable sizes, through utilizing minimum possible resources i.e., space and time. Experiments were conducted with both benchmark and real time MDSs for evaluating the proposed algorithm's performance against its rival algorithms, which are traditional dynamic programming based and sequential similarity measure algorithms. Experimental results show that the proposed scheme performs exceptionally well, in terms of time and space, than its counterpart algorithms and effectively tolerates a considerable portion of noisy data.

INDEX TERMS Similarity index, multivariate data set, outliers, the longest common subsequence.

I. INTRODUCTION

Recent technological advancements, particularly in sensors and actuators, lead to the generation of enormous multivariate data sets (MDSs) in different application areas i.e., wireless sensor networks, internet of things (IoT), scientific experiments, industrial control processes, educational purpose testbeds, web and databases [1]. An MDS is defined as a set of related numbers or values associated with a specific entity in an organization. In other words, a group of univariate data sets in columns form is known as MDS [2]. Mathematically, it is represented as a matrix X_{m,n}, where m and n correspond to the rows and columns respectively. These MDSs are thoroughly examined, using various classical and non-classical approaches, to discover valuable information that is used to determine the correlating or distinguishing factor of entities. One of the major issues, closely linked with MDS, is to find their similarity indexes in the presence of noise/outliers that is not possible with existing techniques. Generally, two MDSs, X_{i,j} and Y_{m,n}, are believed similar if most of their elements are highly correlated [3]. MDSs similarity problem is an active research area, both in computer science and mathematics, that is due to its existence in different real world application environments i.e., DNA analysis, sensors-based real...

The associate editor coordinating the review of this manuscript and approving it for publication was Chongsheng Zhang.

“…The current implementation uses a hardcoded seed that is 12 bases long with an indel/mismatch allowed in the middle (6 matching bases, 1 indel/mismatch base, followed by 6 matching bases). GraphMap then collects seed hits, using them for finding the longest common subsequence in k-length substrings (Benson et al., 2013). The output from this step is then filtered to find collinear chains of seeds (private correspondence with Ivan Sović).…”

Section: Long Read Overlap Methodologies (mentioning)

confidence: 99%

Innovations and challenges in detecting long read overlaps: an evaluation of the state-of-the-art

et al. 2016

Identifying overlaps between error-prone long reads, specifically those from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PB), is essential for certain downstream applications, including error correction and de novo assembly. Though akin to the read-to-reference alignment problem, read-to-read overlap detection is a distinct problem that can benefit from specialized algorithms that perform efficiently and robustly on high error rate long reads. Here, we review the current state-of-the-art read-to-read overlap tools for error-prone long reads, including BLASR, DALIGNER, MHAP, GraphMap and Minimap. These specialized bioinformatics tools differ not just in their algorithmic designs and methodology, but also in their robustness of performance on a variety of datasets, time and memory efficiency and scalability. We highlight the algorithmic features of these tools, as well as their potential issues and biases when utilizing any particular method. To supplement our review of the algorithms, we benchmarked these tools, tracking their resource needs and computational performance, and assessed the specificity and precision of each. In the versions of the tools tested, we observed that Minimap is the most computationally efficient, specific and sensitive method on the ONT datasets tested; whereas GraphMap and DALIGNER are the most specific and sensitive methods on the tested PB datasets. The concepts surveyed may apply to future sequencing technologies, as scalability is becoming more relevant with increased sequencing throughput. Supplementary information: Supplementary data are available at Bioinformatics online.
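
The 6-1-6 seed shape described in the citation statement above can be read as two 6-base flanks around a middle position that is free to differ. The Python sketch below is one hypothetical way to turn that shape into lookup keys: it joins the two flanks while skipping 0, 1, or 2 middle bases, so that sequences differing by a single mismatch, deletion, or insertion at the middle still share a key. The function name `gapped_keys`, the `flank` parameter, and the choice of three gap sizes are assumptions made for illustration, not GraphMap's actual code or API.

```python
def gapped_keys(seq: str, pos: int, flank: int = 6) -> set:
    """Return lookup keys for a gapped seed starting at pos.

    Each key is the left flank joined to a right flank that starts
    after 0, 1, or 2 skipped bases (hypothetical indexing scheme).
    """
    keys = set()
    left = seq[pos:pos + flank]
    if len(left) < flank:
        return keys  # not enough sequence for the left flank
    for gap in (0, 1, 2):
        start = pos + flank + gap
        right = seq[start:start + flank]
        if len(right) == flank:  # skip shapes that run off the end
            keys.add(left + right)
    return keys


reference = "ACGTACGTTGCAA"   # flanks ACGTAC / TTGCAA around a middle G
mismatch = "ACGTACTTTGCAA"    # middle base changed to T
deletion = "ACGTACTTGCAA"     # middle base missing
insertion = "ACGTACGGTTGCAA"  # one extra base in the middle

for read in (mismatch, deletion, insertion):
    shared = gapped_keys(reference, 0) & gapped_keys(read, 0)
    print(sorted(shared))
```

In each of the three cases the reference and the read share the key formed by the two flanks, which is the property a seed with one tolerated indel/mismatch position needs before hits are passed on to a chaining step such as LCSk.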
