Block edit models for approximate string matching

Lopresti, Daniel; Tomkins, Andrew

doi:10.1016/s0304-3975(96)00268-x

Cited by 80 publications

(40 citation statements)

References 13 publications

“…Computing the edit distance in the presence of such large-scale operations is typically NP-hard, depending on the exact model [10,11], and authors concentrate on providing approximation algorithms [12][13][14] or algorithms for special cases where editing proceeds from left to right [15]. In general, these problems differ from ours by allowing a richer set of operations, but also by fixing both endpoints of the history, whereas we minimize over all possible initial permutations.…”

Section: (A) Motivation and Related Work (mentioning)

confidence: 99%

Fast Computation of a String Duplication History under No-Breakpoint-Reuse

Brejová

Landau

Vinař

2011

String Processing and Information Retrieval

“…A number of similarity functions for approximately matching strings have been proposed in the research literature. Popular measures include the Jaccard coefficient and Cosine similarity metrics from information retrieval (IR) [19,8], extensions (of Cosine similarity) to use q-grams instead of words [17], and the edit distance family of functions [10,24,18,22]. We use sim_a(u, v) to denote the similarity between strings u and v when u and v are considered as values of the attribute a.…”

Section: Similarity Between Value Pairs (mentioning)

confidence: 99%

“…The edit distance metric works well for typographical errors but it cannot capture word rearrangements, insertions, and deletions. To address this, numerous variants of the edit distance metric have been proposed in the literature like affine gap distance [24] that allows gap mismatches, block edit distance [18] that allows word moves, and a fuzzy match similarity function that allows words to be inserted/deleted with a cost equal to the IDF weight of the word [22]. However, most variants either do not handle word rearrangements well, or are too expensive from a computation perspective.…”

Section: Related Work (mentioning)

confidence: 99%

Exploiting content redundancy for web information extraction

Gulhane

Rastogi

Sengamedu

et al. 2010

Proceedings of the 19th International Conference on World Wide Web

We propose a novel extraction approach that exploits content redundancy on the web to extract structured data from template-based web sites. We start by populating a seed database with records extracted from a few initial sites. We then identify values within the pages of each new site that match attribute values contained in the seed set of records. To match attribute values with diverse representations across sites, we define a new similarity metric that leverages the templatized structure of attribute content. Specifically, our metric discovers the matching pattern between attribute values from two sites, and uses this to ignore extraneous portions of attribute values when computing similarity scores. Further, to filter out noisy attribute value matches, we exploit the fact that attribute values occur at fixed positions within template-based sites. We develop an efficient Apriori-style algorithm to systematically enumerate attribute position configurations with sufficient matching values across pages. Finally, we conduct an extensive experimental study with real-life web data to demonstrate the effectiveness of our extraction approach.
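The Related Work statement quoted above says that plain edit distance handles typographical errors well but cannot capture word rearrangements. The following is a minimal sketch of that limitation, not the method of any cited paper: `levenshtein` is the textbook unit-cost character edit distance, and `word_jaccard` is a hypothetical helper, introduced here only for contrast, that computes the Jaccard coefficient over whitespace-separated words.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost character edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def word_jaccard(a: str, b: str) -> float:
    """Jaccard coefficient over word sets (illustrative helper, an assumption)."""
    wa, wb = set(a.split()), set(b.split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


if __name__ == "__main__":
    # A one-character typo costs a single edit.
    print(levenshtein("john smith", "jon smith"))
    # Swapping the two words is penalized far more by character edits,
    # while the word-set measure treats the strings as identical.
    print(levenshtein("john smith", "smith john"))
    print(word_jaccard("john smith", "smith john"))
```

Under these assumptions, the typo pair is one edit apart, whereas the swapped-word pair needs several character edits even though its word sets coincide. Block edit models address this by treating a moved substring as a single operation, at the cost of the hardness results mentioned in the other citation statements.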

“…Unfortunately, many of the interesting varieties of the block edit problem are NP-complete [19]. An NP-complete block edit problem can be solved optimally for a fixed input size, larger than is feasible with present-day computers, using post- and at-fabrication time computation. Although approximation algorithms may exist for finding suboptimal solutions, we are interested in finding the optimal solution.…”

Section: An Example: Solving the Block Edit Problem (mentioning)

confidence: 99%