Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008
DOI: 10.1145/1390334.1390432

Local text reuse detection

Abstract: Text reuse occurs in many different types of documents and for many different reasons. One form of reuse, duplicate or near-duplicate documents, has been a focus of researchers because of its importance in Web search. Local text reuse occurs when sentences, facts or passages, rather than whole documents, are reused and modified. Detecting this type of reuse can be the basis of new tools for text analysis. In this paper, we introduce a new approach to detecting local text reuse and compare it to other approaches…

Cited by 75 publications (69 citation statements), published between 2009 and 2018. References 15 publications.

“…Many duplicate detection systems for the web use p = 50 or above, which drastically reduces index sizes [6]. This downsampling technique is less effective for local as opposed to global text reuse [10] and can also hurt recall for noisily OCR'd documents, as we see below.…”
Section: Downsampling Document Features (mentioning)
confidence: 99%
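For context on the excerpt above: "p = 50" refers to 0 mod p fingerprint sampling, which indexes only the shingle hashes divisible by p. A minimal Python sketch, assuming hashed word shingles; the shingle width, hash function, and names here are illustrative rather than the cited systems' exact choices:

```python
import hashlib

def shingle_hashes(text, k=5):
    """Yield a 64-bit hash for every k-word shingle of the text."""
    words = text.lower().split()
    for i in range(len(words) - k + 1):
        shingle = " ".join(words[i:i + k])
        digest = hashlib.sha1(shingle.encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big")

def downsample(hashes, p=50):
    """0 mod p sampling: keep only hashes divisible by p, shrinking the
    fingerprint index roughly p-fold."""
    return [h for h in hashes if h % p == 0]
```

With p = 50 only about 2% of shingle hashes survive, so a short reused passage can easily lose every sampled fingerprint; that is why the excerpt notes the technique is less effective for local reuse and for noisy OCR.
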
“…• We are looking for reuse of substantial amounts of text, on the order of 100 words or more, in contrast to work on detecting shorter quotations [9,10,11,2].…”
(mentioning)
confidence: 99%
“…Researchers tackle this issue by omitting some of the grams when building the index. Depending on the techniques used, discarding some grams may [13], [14], [15] or may not [16], [17] miss some similar pairs.…”
Section: Introduction (mentioning)
confidence: 99%
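The contrast this excerpt draws, between gram-omission schemes that may miss similar pairs and those that provably do not, is exemplified by techniques such as winnowing, whose sliding-window invariant supplies the no-miss guarantee. A minimal sketch over a list of shingle hashes; the window size and names are illustrative assumptions:

```python
def winnow(hashes, w=4):
    """Winnowing: from each sliding window of w consecutive hashes, keep
    the minimum (rightmost on ties). Any match long enough to span a
    full window is guaranteed to share at least one selected fingerprint,
    unlike 0 mod p sampling, which offers no such guarantee."""
    picked = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        j = min(range(w), key=lambda x: (window[x], -x))  # rightmost minimum
        picked.add((i + j, window[j]))
    return sorted(picked)
```
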
“…There are many different types of clipping behaviors, such as clipping the entire text, clipping a few phrases, or clipping and then correcting some errors in the text [1]. However, web surfers typically do not want to see redundant documents in search results, and a whole lot of duplicate documents make a system less efficient by consuming considerable resources [2]. For some popular applications such as spam site detection and duplicate web page removal in search engines [3], some duplicate document detection approaches have been proposed as follows.…”
Section: Introduction (mentioning)
confidence: 99%
“…For efficient duplicate document detection, the document fingerprint is generated based on significant words without common words [4], [5], named entities and multi-word terms [6], or shingles indicating contiguous subsequences [7]. Still, these approaches cannot detect the partial duplicates that agglomerate segments of many originals [2], [3], as presented in (b) of Fig. 1. … comparable document, and mark the target document as a duplicate when some segment fingerprints in the target document are retrieved from the hash table.…”
Section: Introduction (mentioning)
confidence: 99%
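A rough sketch of the segment-fingerprint scheme this excerpt describes: hash each segment of the comparable documents into a table, then flag a target document whose segments hit that table. All names and the detection threshold are illustrative assumptions, not the cited papers' exact method:

```python
import hashlib

def fingerprint(segment):
    """Illustrative stable 64-bit fingerprint of a text segment."""
    digest = hashlib.sha1(segment.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def build_segment_index(corpus):
    """Map each segment fingerprint to the documents containing it."""
    index = {}
    for doc_id, segments in corpus.items():
        for seg in segments:
            index.setdefault(fingerprint(seg), set()).add(doc_id)
    return index

def partial_duplicates(target_segments, index, threshold=1):
    """Mark the target as a (partial) duplicate of any document sharing
    at least `threshold` segment fingerprints with it."""
    hits = {}
    for seg in target_segments:
        for doc_id in index.get(fingerprint(seg), ()):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return {doc: n for doc, n in hits.items() if n >= threshold}
```

Because lookups happen per segment, a target that agglomerates passages from several originals still registers hits against each source document, which addresses the partial-duplicate case the excerpt raises.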