2010
DOI: 10.1504/ijdmb.2010.034196

Detecting duplicate biological entities using Shortest Path Edit Distance

Abstract: Duplicate entity detection in biological data is an important research task. In this paper, we propose a novel, context-sensitive Shortest Path Edit Distance (SPED) that extends and supplements our previous work on Markov Random Field-based Edit Distance (MRFED). SPED transforms the edit distance computation into the calculation of the shortest path between two selected vertices of a graph. We produce several modifications of SPED by applying Levenshtein, arithmetic mean, histogram difference and TFIDF…
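To illustrate the idea of recasting edit distance as a shortest-path problem, here is a minimal sketch in Python. It is not the SPED method itself: it uses plain Levenshtein unit costs with no context sensitivity, and the function name `edit_distance_shortest_path` is hypothetical. Vertex (i, j) means "the first i characters of `a` have been turned into the first j characters of `b`", and Dijkstra's algorithm finds the cheapest path from (0, 0) to (len(a), len(b)).

```python
import heapq


def edit_distance_shortest_path(a: str, b: str) -> int:
    """Levenshtein distance computed as a shortest path in the edit graph.

    Illustrative only: unit costs are an assumption, not the context-sensitive
    weights described for SPED.
    """
    source, target = (0, 0), (len(a), len(b))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, (i, j) = heapq.heappop(heap)
        if (i, j) == target:
            return d
        if d > dist.get((i, j), float("inf")):
            continue  # stale heap entry
        neighbours = []
        if i < len(a):
            neighbours.append(((i + 1, j), 1))  # delete a[i]
        if j < len(b):
            neighbours.append(((i, j + 1), 1))  # insert b[j]
        if i < len(a) and j < len(b):
            # match costs 0, substitution costs 1
            neighbours.append(((i + 1, j + 1), 0 if a[i] == b[j] else 1))
        for node, weight in neighbours:
            new_dist = d + weight
            if new_dist < dist.get(node, float("inf")):
                dist[node] = new_dist
                heapq.heappush(heap, (new_dist, node))
    return dist[target]
```

Replacing the unit edge weights with weights that depend on neighbouring characters would be one way to make such a graph context-sensitive, but how SPED assigns its weights is not stated in the abstract above.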

Citations: cited by 16 publications (9 citation statements)
References: 30 publications
“…By analysing the support and confidence of each rule, the method can show the presence of erroneous data. Other approaches use approximate string matching to compute metadata similarity [10,11,12], under the assumption that duplicates have high metadata similarity. Some approaches consider duplicates at the sequence level; they examine sequence similarity and use a similarity threshold to identify duplicates.…”
Section: Duplicate Records (mentioning)
confidence: 99%
“…Thirdly, the extracted noun phrases were compared with GO terms, and the number of matched phrases was stored along with the phrases. The comparison between extracted phrases and GO terms was based on string similarity between the 2, and the shortest path-based edit distance (SPED) technique [12] was used. The SPED technique is a variation of Markov random field-based edit distance (MRFED) and calculates the shortest path between 2 selected vertices of a graph.…”
Section: Datasets and Methods (mentioning)
confidence: 99%
“…By analysing the support and confidence of each rule, the method can show the presence of erroneous data. Other approaches also use approximate string matching to compute metadata similarity [46,52,10]. However, as they focus only on metadata, the underlying interpretation is that duplicates are assumed to have high metadata similarity, or that their sequences are identical.…”
Section: Related Work (mentioning)
confidence: 99%