Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008
DOI: 10.1145/1390334.1390432

Local text reuse detection

Abstract: Text reuse occurs in many different types of documents and for many different reasons. One form of reuse, duplicate or near-duplicate documents, has been a focus of researchers because of its importance in Web search. Local text reuse occurs when sentences, facts or passages, rather than whole documents, are reused and modified. Detecting this type of reuse can be the basis of new tools for text analysis. In this paper, we introduce a new approach to detecting local text reuse and compare it to other approaches…

Cited by 75 publications (69 citation statements), published between 2009 and 2018. References 15 publications.

“…Many duplicate detection systems for the web use p = 50 or above, which drastically reduces index sizes [6]. This downsampling technique is less effective for local as opposed to global text reuse [10] and can also hurt recall for noisily OCR'd documents, as we see below.…”
Section: Downsampling Document Features (mentioning)
confidence: 99%
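For context on the excerpt above: "p = 50" refers to 0 mod p fingerprint sampling, which indexes only the shingle hashes divisible by p. A minimal Python sketch, assuming hashed word shingles; the shingle width, hash function, and names here are illustrative rather than the cited systems' exact choices:

```python
import hashlib

def shingle_hashes(text, k=5):
    """Yield a 64-bit hash for every k-word shingle of the text."""
    words = text.lower().split()
    for i in range(len(words) - k + 1):
        shingle = " ".join(words[i:i + k])
        digest = hashlib.sha1(shingle.encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big")

def downsample(hashes, p=50):
    """0 mod p sampling: keep only hashes divisible by p, shrinking the
    fingerprint index roughly p-fold."""
    return [h for h in hashes if h % p == 0]
```

With p = 50 only about 2% of shingle hashes survive, so a short reused passage can easily lose every sampled fingerprint; that is why the excerpt notes the technique is less effective for local reuse and for noisy OCR.
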
“…• We are looking for reuse of substantial amounts of text, on the order of 100 words or more, in contrast to work on detecting shorter quotations [9,10,11,2].…”
(mentioning)
confidence: 99%
“…Researchers tackle this issue by omitting some of the grams when building the index. Depending on the techniques used, discarding some grams may [13], [14], [15] or may not [16], [17] miss some similar pairs.…”
Section: Introduction (mentioning)
confidence: 99%
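The contrast this excerpt draws, between gram-omission schemes that may miss similar pairs and those that provably do not, is exemplified by techniques such as winnowing, whose sliding-window invariant supplies the no-miss guarantee. A minimal sketch over a list of shingle hashes; the window size and names are illustrative assumptions:

```python
def winnow(hashes, w=4):
    """Winnowing: from each sliding window of w consecutive hashes, keep
    the minimum (rightmost on ties). Any match long enough to span a
    full window is guaranteed to share at least one selected fingerprint,
    unlike 0 mod p sampling, which offers no such guarantee."""
    picked = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        j = min(range(w), key=lambda x: (window[x], -x))  # rightmost minimum
        picked.add((i + j, window[j]))
    return sorted(picked)
```
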
“…There are many different types of clipping behaviors, such as clipping the entire text, clipping a few phrases, or clipping and then correcting some errors in the text [1]. However, web surfers typically do not want to see redundant documents in search results, and a whole lot of duplicate documents make a system less efficient by consuming considerable resources [2]. For some popular applications such as spam site detection and duplicate web page removal in search engines [3], some duplicate document detection approaches have been proposed as follows.…”
Section: Introduction (mentioning)
confidence: 99%
“…For efficient duplicate document detection, the document fingerprint is generated based on significant words without common words [4], [5], named entities and multi-word terms [6], or shingles indicating contiguous subsequences [7]. Still, these approaches cannot detect the partial duplicates that agglomerate segments of many originals [2], [3], as presented in (b) of Fig. 1. … comparable document, and mark the target document as a duplicate when some segment fingerprints in the target document are retrieved from the hash table.…”
Section: Introduction (mentioning)
confidence: 99%
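A rough sketch of the segment-fingerprint scheme this excerpt describes: hash each segment of the comparable documents into a table, then flag a target document whose segments hit that table. All names and the detection threshold are illustrative assumptions, not the cited papers' exact method:

```python
import hashlib

def fingerprint(segment):
    """Illustrative stable 64-bit fingerprint of a text segment."""
    digest = hashlib.sha1(segment.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def build_segment_index(corpus):
    """Map each segment fingerprint to the documents containing it."""
    index = {}
    for doc_id, segments in corpus.items():
        for seg in segments:
            index.setdefault(fingerprint(seg), set()).add(doc_id)
    return index

def partial_duplicates(target_segments, index, threshold=1):
    """Mark the target as a (partial) duplicate of any document sharing
    at least `threshold` segment fingerprints with it."""
    hits = {}
    for seg in target_segments:
        for doc_id in index.get(fingerprint(seg), ()):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return {doc: n for doc, n in hits.items() if n >= threshold}
```

Because lookups happen per segment, a target that agglomerates passages from several originals still registers hits against each source document, which addresses the partial-duplicate case the excerpt raises.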