An efficient similarity search based on indexing in large DNA databases

Jeong, In-Seon; Park, Kyoung-Wook; Kang, Seung‐Ho; Lim, Hyeong-Seok

doi:10.1016/j.compbiolchem.2010.03.007

Cited by 7 publications

(5 citation statements)

References 8 publications

“…Literature [ 4 ] studies the hash index structure of the one-way hash function and the index retrieval method to search for specific fragments and similar sequences. Literature [ 5 ] also proposed a novel solution for searching for specific DNA sequences. For the construction of the hash index structure, in DNA sequence matching [ 6 ], the commonly used fixed sequences are stored in the DNA database, and the similarity is used to evaluate whether the sequences are matched successfully.…”

Section: Introduction (mentioning)

confidence: 99%

[Retracted] Gene Position Index Mutation Detection Algorithm Based on Feedback Fast Learning Neural Network

Zuo, Tang, et al. 2021

Computational Intelligence and Neuroscience

In the detection of genome variation, the research on the internal correlation of reference genome is deepening; the detection of variation in genome sequence has become the focus of research, and it has also become an effective path to find new genes and new functional proteins. The targeted sequencing sequence is used to sequence the exon region of a specific gene in cancer gene detection, and the sequencing depth is relatively large. Traditional alignment algorithms will lose some sequences, which will lead to inaccurate mutation detection. This paper proposes a mutation detection algorithm based on feedback fast learning neural network position index. By establishing a position index relationship for ACGT in the DNA sequence, the subsequence is decomposed into the position relationship of different subsequences corresponding to the main sequence. The positional relationship of the subsequence in the main sequence is determined by the positional relationship. Analyzing SNP and InDel mutations, even structural mutations, through the position correlation of sequences has the advantages of high precision and easy implementation by personal computers. The feedback fast learning neural network is used to verify whether there is a linear relationship between two or more positions. Experimental results show that the mutation points detected by position index are more than those detected by Bcftools, Freebye, Vanscan2, and Gatk.

“…Increasing N can increase the amount of information stored in the vectors, but it also increases the computational cost of generating the vectors and calculating the similarity. According to a suggestion in [1], and based on our own experiments selecting N=1,2,3,4, we has been found that N=2 gives good results for comparisons and is also computationally efficient. The pseudocode for Algorithm 1 outlines the steps to transform an input DNA subsequence into a 48-dimensional numerical vector based on the formulas (1), ( 2), (3), and (4).…”

Section:  N-grams Selectionmentioning

confidence: 99%
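
The statement above describes turning a DNA subsequence into a numerical vector built from N-grams with N=2. The formulas (1) to (4) behind the 48-dimensional vector are not reproduced on this page, so the following is only a minimal sketch of the underlying idea: counting overlapping 2-grams over the alphabet A, C, G, T, which gives 16 counts. The function name `ngram_counts` and the choice to skip windows containing other symbols are assumptions made for illustration, not the citing paper's Algorithm 1.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed nucleotide alphabet


def ngram_counts(seq: str, n: int = 2) -> list[int]:
    """Count overlapping n-grams of seq in lexicographic order (AA, AC, ...).

    Hypothetical helper: this is a plain frequency vector, not the
    48-dimensional vector described in the quoted statement.
    """
    keys = ["".join(p) for p in product(ALPHABET, repeat=n)]
    index = {key: i for i, key in enumerate(keys)}
    vec = [0] * len(keys)
    for i in range(len(seq) - n + 1):
        gram = seq[i:i + n]
        # Windows with symbols outside ACGT (for example N) are skipped.
        if gram in index:
            vec[index[gram]] += 1
    return vec


if __name__ == "__main__":
    counts = ngram_counts("ACGTAC")
    print(len(counts), sum(counts))
```

Raising N grows the vector as 4^N entries, which is consistent with the trade-off between stored information and computational cost that the statement mentions.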

“…The similarity value of two vectors is determined by the distance between the two vectors. To calculate this distance, the algorithm presented in [1] is employed. This algorithm calculates the distance between two vectors by finding the maximum number of operations required to transform from vector u to vector v. Algorithm 6 will calculate these values for each pair of (u,v) and storing the result in the variables posDis and negDis.…”

Section: B. The Combine Algorithm Transforms DNA Sequences Into Vectors (mentioning)

confidence: 99%
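
The statement above mentions a vector distance computed through two accumulators, posDis and negDis. One common reading, assumed here, is the frequency distance: sum the positive and the negative component differences separately and take the larger of the two. The sketch below follows that reading; the function name `frequency_distance` is hypothetical, and the code is not taken from the citing paper's Algorithm 6 or from the indexed paper.

```python
def frequency_distance(u: list[int], v: list[int]) -> int:
    """Return max(posDis, negDis) for two equal-length count vectors.

    Assumption: posDis sums the components where u exceeds v, and
    negDis sums the components where v exceeds u.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    pos_dis = sum(a - b for a, b in zip(u, v) if a > b)
    neg_dis = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos_dis, neg_dis)


if __name__ == "__main__":
    # Counts of A, C, G, T for "ACGT" and "AAGT".
    print(frequency_distance([1, 1, 1, 1], [2, 0, 1, 1]))
```

For single-nucleotide count vectors, one substitution moves one unit from posDis to negDis, while an insertion or deletion changes only one of them, which is why the maximum rather than the sum is taken in this reading.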

“…In similarity search, the commonly used method to calculate similarity value between two sequences is Edit Distance (ED) (also known as Levenshtein). The ED similarity value between two sequences is the minimum number of steps required to transform one sequence to other, based on three transformations: adding, editing, and deleting each character in the sequence [1]. For example, similarity value between LOVE and MOVIE is ED(LOVE, MOVIE) = 2 because two steps of LOVE → MOVE → MOVIE are needed.…”

Section: Introduction (mentioning)

confidence: 99%
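
The Edit Distance described in the statement above can be sketched with the standard dynamic-programming recurrence, assuming unit cost for each insertion, deletion, and substitution. The function name `edit_distance` is chosen for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("LOVE", "MOVIE"))  # 2
```

The table has (len(a)+1) x (len(b)+1) cells, so the running time is proportional to the product of the two sequence lengths. That quadratic cost is what makes ED slow on long DNA sequences.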

An Algorithm Transform DNA Sequences to Improve Accuracy in Similarity Search

Tung, Quang 2023

IJACSA

Similarity search of DNA sequences is a fundamental problem in bioinformatics, serving as the basis for many other problems. Here, calculating the similarity value between sequences is the most important step, with the Edit Distance (ED) commonly used for its high accuracy despite its slow speed. Transforming the original DNA sequences into numerical vectors that retain unique features based on their properties makes the calculation many times faster than a direct comparison of the original sequences. Additionally, a long DNA sequence typically needs less storage after transformation, which gives good data compression. The challenge is to develop algorithms based on features that maintain biological significance while ensuring search accuracy, which is the problem addressed here. Previous methods often used purely mathematical statistics, such as frequency statistics and matrix transformations, to construct features. In this paper, an improved algorithm based on both biological significance and mathematical statistics is proposed to transform gene data into numerical vectors for ease of storage and to improve accuracy in similarity search between DNA sequences. Based on the experimental results, the new algorithm improves the accuracy of similarity calculations while maintaining good performance.

“…These databases serve as valuable resources for numerous essential bioinformatics tasks, such as DNA similarity search [1] , sequence alignments [2] , gene annotation [3] , [4] , gene prediction [5] , [6] , and motif finding [7] , [8] . However, as these databases store vast volumes of sequences, performing these bioinformatics tasks is becoming increasingly challenging and complex.…”

Section: Introduction (mentioning)

confidence: 99%

NeuralBeds: Neural embeddings for efficient DNA data compression and optimized similarity search

Sarumi, Hahn, Heider 2024

Computational and Structural Biotechnology Journal
