A Review on Text Similarity Technique used in IR and its Application

Pradhan, Nitesh; Gyanchandani, Manasi; Wadhvani, Rajesh

doi:10.5120/21257-4109

Cited by 40 publications

(23 citation statements)

References 12 publications

“…Which means the types that are processed in the proposed system are limited in two types: Words with length ≤5 have the possibility of one error only, whereas words with length >5 allow errors with two letters as the maximum probability. Then, to get the best system performance, the proposed system used the integrated number of similarity measures which

Gives the best result in case of short string and it is fast and best suited for strings similarity (Pradhan, et al, 2015; Patel, 2016)

In case of long string cost of Levenshtein distance is same as the length of string and considered it is not order of sequence of characters while comparing (Pradhan, et al, 2015; Patel, 2016)

Longest common subsequence: Uses the recursion approach which uses stack that takes lots of space (Pradhan, et al, 2015)

Jaro-Winkler: Gives better result in case of hybrid method (Pradhan, et al, 2015)

If the data size is too much large, then Jaro distance similarity not gives efficient results (Pradhan, et al, 2015) …”

Section: Methods (mentioning)

confidence: 99%
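As an illustration of the Levenshtein distance discussed in the statement above, here is a minimal Python sketch. It is not the cited authors' implementation. The function names `levenshtein` and `max_allowed_errors` are hypothetical, and the threshold helper only encodes the length rule quoted in the statement (one error for words of length ≤5, two otherwise).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def max_allowed_errors(word: str) -> int:
    """Hypothetical helper: the length rule quoted in the statement above."""
    return 1 if len(word) <= 5 else 2
```

Only two rows of the dynamic-programming table are kept, so memory grows with the shorter string rather than with the product of both lengths. A candidate word could then be accepted when `levenshtein(query, candidate) <= max_allowed_errors(query)`.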

“…Similarity technique is high (Pradhan, et al, 2015) They are not suitable at multilingual environment, and the accuracy is very less (Pande, et al, 2013; Pradhan, et al, 2015) …”

Section: N-gram (mentioning)

confidence: 99%

“…Obtain satisfactory results and used to consider the sizes of the two words and the similarity score will be normalized into [0,1] (Pradhan, et al, 2015) -…”

Section: Dice Coefficient (mentioning)

confidence: 99%
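The normalization into [0,1] mentioned in the statement above can be sketched with the common bigram form of the Dice coefficient, 2·|A ∩ B| / (|A| + |B|). This is an assumption about the variant meant. The function names `bigrams` and `dice_coefficient` are hypothetical, and the handling of words shorter than two characters is a choice made here for illustration.

```python
from collections import Counter


def bigrams(word: str) -> Counter:
    """Multiset of adjacent character pairs in the word."""
    return Counter(word[i:i + 2] for i in range(len(word) - 1))


def dice_coefficient(a: str, b: str) -> float:
    """Dice similarity over character bigrams, always within [0, 1]."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Neither word has a bigram (length < 2): fall back to exact match.
        return 1.0 if a == b else 0.0
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((grams_a & grams_b).values())
    return 2.0 * overlap / total
```

Dividing by the combined bigram count of both words is what accounts for their sizes: a shared bigram weighs more between two short words than between two long ones.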

A Comparative Study for String Metrics and the Feasibility of Joining them as Combined Text Similarity Measures

Abdul-Jabbar

George

2017

ARO

Abstract: This paper introduces an optimized Damerau-Levenshtein and dice-coefficients using enumeration operations (ODADNEN) to provide a fast string similarity measure while maintaining the accuracy of the results; searching for specific words within a large text is a hard job that takes a lot of time and effort. The string similarity measure plays a critical role in many searching problems. In this paper, different experiments were conducted to handle some spelling mistakes. An enhanced algorithm for string similarity assessment was proposed. This algorithm is a combined set of well-known algorithms with some improvements (e.g. the dice-coefficient was modified to deal with numbers instead of characters using certain conditions). These algorithms were adopted after a number of experimental tests to check their suitability. The ODADNEN algorithm was tested using real data; its performance was compared with the original similarity measure. The results indicated that the most convincing measure is the proposed hybrid measure, which uses the Damerau-Levenshtein and dice-distance based on the n-grams of each word; it also requires less processing time than the standard algorithms. Furthermore, it provides efficient results to assess the similarity between two words without the need to restrict the word length.
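For readers unfamiliar with the Damerau-Levenshtein distance named in this abstract, the sketch below shows its restricted form (optimal string alignment), which adds adjacent transpositions to the three Levenshtein operations. It is the textbook baseline only, not the optimized ODADNEN algorithm, whose details the abstract does not give. The function name `osa_distance` is hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "teh" -> "the".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

A swapped pair of letters, a frequent spelling mistake, costs one edit here instead of the two that plain Levenshtein would charge.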

“…Text similarity is a well-studied problem in information retrieval (Pradhan et al (2015); Nagwani et al (2015)). Over the years, many techniques have been proposed to measure the distance/similarity of documents based on features such as word frequencies, word patterns in sentences, etc.…”

Section: Introduction (mentioning)

confidence: 99%

An information-theoretic approach for measuring the distance of organ tissue samples using their transcriptomic signatures

Manatakis

VanDevender

Μανωλάκος

2020

Preprint

Motivation: Recapitulating aspects of human organ functions using in-vitro (e.g., plates, transwells, etc.), in-vivo (e.g., mouse, rat, etc.), or ex-vivo (e.g., organ chips, 3D systems, etc.) organ models is of paramount importance for precision medicine and drug discovery. It will allow us to identify potential side effects and test the effectiveness of therapeutic approaches early in their design phase and will inform the development of accurate disease models. Developing mathematical methods to reliably compare the "distance/similarity" of organ models from/to the real human organ they represent is an understudied problem with important applications in biomedicine and tissue engineering. Results: We introduce the Transcriptomic Signature Distance, TSD, an information-theoretic distance for assessing the transcriptomic similarity of two tissue samples, or two groups of tissue samples. In developing TSD, we leverage next-generation sequencing data and information retrieved from well-curated databases providing signature gene sets characteristic of human organs. We present the justification and mathematical development of the new distance and demonstrate its effectiveness in different scenarios of practical importance using several publicly available RNA-seq datasets.

“…There are many surveys that review sentence similarity issue [10][11][12][13]. Unlike other surveys, this survey distinguishes between words similarity methods and sentences similarity methods.…”

Section: Introduction (mentioning)

confidence: 99%

Measuring Sentences Similarity: A Survey

Farouk

2019

Indian Journal of Science and Technology

Objective/Methods: This study reviews the approaches used for measuring sentence similarity. Measuring similarity between natural language sentences is a crucial task for many Natural Language Processing applications such as text classification, information retrieval, question answering, and plagiarism detection. This survey classifies approaches to calculating sentence similarity into three categories based on the adopted methodology. Word-to-word based, structure-based, and vector-based approaches are the most widely used for finding sentence similarity. Findings/Application: Each approach measures relatedness between short texts from a specific perspective. In addition, datasets that are commonly used as benchmarks for evaluating techniques in this field are introduced to provide a complete view of the issue. The approaches that combine more than one perspective give better results. Moreover, structure-based similarity, which measures similarity between sentences' structures, needs more investigation.
