2009 WRI World Congress on Computer Science and Information Engineering 2009
DOI: 10.1109/csie.2009.771

Webpage Duplicate Detection Using Combined POS and Sequence Alignment Algorithm

Abstract: Syntactic categories and sequence alignment algorithms were combined and used to weed out duplicate and near-duplicate webpages from search engine results. Each webpage was pre-processed with a POS tagger, which converted parts of the page's text into a string of POS tags representing its syntactic structure. The resulting string was then subjected to Longest Common Sequence (LCS) techniques, as is commonly done in computational biology, to detect duplicate and near-duplicate webpages. The process …
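The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes that LCS refers to the standard longest-common-subsequence dynamic program and that each page has already been converted to a list of POS tags by some tagger. The function names, the normalization by the longer sequence, and the 0.9 threshold are hypothetical choices made for illustration; the abstract does not specify them.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two tag sequences."""
    # Standard dynamic program, keeping only the previous row of the table.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def tag_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length divided by the longer sequence length (assumed normalization)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def is_near_duplicate(a: Sequence[str], b: Sequence[str],
                      threshold: float = 0.9) -> bool:
    """Flag two pages as near-duplicates; the threshold is a hypothetical value."""
    return tag_similarity(a, b) >= threshold


# Hypothetical POS-tag strings for two pages (Penn Treebank style tags).
page_a = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
page_b = ["DT", "NN", "VBZ", "DT", "NN"]
print(lcs_length(page_a, page_b), tag_similarity(page_a, page_b))
```

Comparing tag strings rather than raw words means two pages with the same syntactic skeleton score as similar even when individual words differ, which is the property the abstract relies on; how the authors set the decision threshold is not stated in the visible text.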

Cited by 4 publications
References 16 publications