Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval 2010
DOI: 10.1145/1835449.1835562

Efficient partial-duplicate detection based on sequence matching

Abstract: With the ever-increasing growth of the Internet, numerous copies of documents have become a serious problem for search engines, opinion mining, and many other web applications. Because partial duplicates contain only a small piece of text taken from other sources, and most existing near-duplicate detection approaches work at the document level, partial duplicates cannot be handled well. In this paper, we propose a novel algorithm for the partial-duplicate detection task. Besides the similarities between documents,…
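
The motivation in the abstract can be illustrated with a small sketch. It is not the algorithm proposed in the paper (the abstract is truncated before the method is described); it only shows, under assumed toy texts, why a document-level score can miss a copied passage. The helper names `shingles`, `jaccard`, and `containment`, the shingle size `k=3`, and the example strings are all hypothetical choices made for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (word n-grams) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Document-level similarity: shared shingles over all shingles."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(part, whole):
    """Fraction of the shingles in `part` that also occur in `whole`."""
    return len(part & whole) / len(part) if part else 0.0


# Hypothetical toy data: a source document, a passage copied from it,
# and a longer host document that embeds the copied passage.
source = "the quick brown fox jumps over the lazy dog near the river bank"
copied_passage = "quick brown fox jumps over the lazy dog"
host = (
    "our weekly newsletter covers many topics and today we note that "
    + copied_passage
    + " which is a sentence often used to test typewriters and fonts"
)

doc_level = jaccard(shingles(source), shingles(host))
passage_level = containment(shingles(copied_passage), shingles(host))

print(f"document-level Jaccard: {doc_level:.2f}")
print(f"passage containment:    {passage_level:.2f}")
```

In this toy case the document-level Jaccard score stays low because most of the host text is unrelated to the source, while every shingle of the copied passage is found in the host. A detector that thresholds only the document-level score would likely overlook this kind of partial duplicate.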

Citations: cited by 39 publications (29 citation statements)
References: 23 publications
“…Thus, the focus of plagiarism detection goes further till the level of passage. This differs greatly from former systems that was used to detect the duplicates of web sites documents and which measured similarity on the whole document levels [164,193]. The definition of plagiarism detection and its task presented before suggests that there are source documents from which the plagiarized passages are taken.…”
Section: Types Of Automatic Plagiarism Detection; citation type: mentioning
confidence: 96%
“…String Based: In these techniques source code is considered as an arrangement of characters/strings/lines and uses string matching techniques to detect duplicate code [2]. Dup tool compares lexemes on behalf of string match and finds partial match [2,3,4]. Ducass et al [5] proposed dynamic matching technique to detect code clones.…”
Section: Related Work; citation type: mentioning
confidence: 99%
“…We see that the baseline Shingling approach performs the best, with an F1 = 0.81. In contrast, both I-Match and SpotSigs performed much worse (0.50, 0.70), in sharp contrast to their performance in near-duplicate detection of web pages (with F1 near 95%) [Theobald et al. 2008; Zhang et al. 2010]. While these approaches work well in news articles and web pages (relatively long text), they do not work well for short text.…”
Section: Message Level; citation type: mentioning
confidence: 99%