2013
DOI: 10.1080/18756891.2013.752657

Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting

Abstract: Duplicate and near-duplicate web pages are the chief concerns for web search engines. In reality, they incur enormous space to store the indexes, ultimately slowing down and increasing the cost of serving results. A variety of techniques have been developed to identify pairs of web pages that are "similar" to each other. The problem of finding near-duplicate web pages has been a subject of research in the database and web-search communities for some years. In order to identify the near-duplicate web pages, we …

Cited by 10 publications (12 citation statements)
References 24 publications
“…Many existing methods have been proposed to resolve this issue and achieved many different results [18]. Generally, we have compared several evaluation metrics of our result to the previous work [1], such as precision, recall, and F-measure. …”
Section: Discussion (mentioning)
confidence: 99%
“…In the next step, the components' sign determines the corresponding bits of the final fingerprint of the document. The working procedure that applies sim-hash to map a document to a 64-bit fingerprint is illustrated in Figure 2 and the pseudocode of the sim-hash algorithm is given in Algorithm 1 [1].…”
Section: Fingerprints Extraction (mentioning)
confidence: 99%
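
The sign-to-bit step described in the statement above can be illustrated with a minimal sim-hash sketch. This is not the Algorithm 1 that the citing paper refers to, which is not reproduced on this page. The whitespace tokenizer, the term-frequency weights, and the use of the first 8 bytes of an MD5 digest as the per-token hash are all assumptions made for illustration, and the function names `simhash64` and `hamming_distance` are hypothetical.

```python
import hashlib
from collections import Counter

FINGERPRINT_BITS = 64  # the quoted statement mentions a 64-bit fingerprint


def token_hash64(token: str) -> int:
    """Stable 64-bit hash of a token (assumption: first 8 bytes of MD5)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash64(text: str) -> int:
    """Return a 64-bit sim-hash fingerprint of `text`.

    Assumptions: tokens are lowercase whitespace-separated words and
    each token is weighted by its term frequency.
    """
    weights = Counter(text.lower().split())
    components = [0] * FINGERPRINT_BITS
    for token, weight in weights.items():
        h = token_hash64(token)
        for i in range(FINGERPRINT_BITS):
            # A set bit adds the token weight, a clear bit subtracts it.
            if (h >> i) & 1:
                components[i] += weight
            else:
                components[i] -= weight
    fingerprint = 0
    for i, value in enumerate(components):
        # The sign of each component decides the corresponding bit.
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two pages would then be treated as near-duplicate candidates when `hamming_distance(simhash64(page_a), simhash64(page_b))` falls below a chosen threshold; the threshold itself is a tuning choice that this page does not specify.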