Syntactic categories and sequence alignment algorithms are combined to weed out duplicate and near-duplicate web pages from search engine results. A POS tagger converts parts of a web page's text into a string of POS tags that represents its syntactic structure. The resulting string is then subjected to Longest Common Subsequence (LCS) techniques, as is commonly done in computational biology, to detect duplicate and near-duplicate web pages. Tagging and alignment are performed on a set of sentences extracted from each web page as a representative of that page, with the query keywords used as the basis for sentence extraction. Experimental results show that this combined approach can provide a promising similarity calculation and re-ranking measure, which can be used with reasonable efficiency to detect duplication in results generated by search engines such as Google. The similarity measurements can further serve as a basis for text analysis of the search results, allowing the detection of duplicates and near-duplicates and the clustering of documents in general.
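The tag-then-align idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. The toy lexicon `TOY_LEXICON` stands in for a real POS tagger, the function names are hypothetical, and normalising the LCS length by the longer tag sequence is an assumption made here only to obtain a score between 0 and 1.

```python
from typing import Dict, List, Sequence

# Hypothetical toy lexicon standing in for a real POS tagger (assumption).
TOY_LEXICON: Dict[str, str] = {
    "the": "DT", "a": "DT",
    "cat": "NN", "dog": "NN", "mat": "NN",
    "sat": "VBD", "slept": "VBD",
    "on": "IN", "quickly": "RB",
}


def tag_sentence(sentence: str) -> List[str]:
    """Map each lower-cased, whitespace-separated token to a tag.

    Words missing from the toy lexicon are tagged 'UNK'.
    """
    return [TOY_LEXICON.get(tok, "UNK") for tok in sentence.lower().split()]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Textbook dynamic-programming LCS length, keeping two rows only."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def tag_similarity(s1: str, s2: str) -> float:
    """LCS length of the two tag sequences divided by the longer length.

    The normalisation is an assumption of this sketch, not taken from
    the source.
    """
    t1, t2 = tag_sentence(s1), tag_sentence(s2)
    if not t1 or not t2:
        return 0.0
    return lcs_length(t1, t2) / max(len(t1), len(t2))
```

Because only tags are compared, two sentences with the same syntactic structure but different words (for example "The cat sat on the mat" and "The dog slept on a mat") receive the maximum score in this sketch. Restricting the comparison to sentences selected by the query keywords, as the abstract describes, is one way to keep such matches topical.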
This paper reports on experiments investigating the combined use of Part-of-Speech (POS) tagging and an improved Longest Common Subsequence (LCS) algorithm for analysing and calculating the similarity between texts. The syntactic structure of each text was used as the representation of the document. An improved LCS algorithm was applied to this representation in order to compare and rank the documents according to the similarity of their representative strings. The approach was applied to the detection of duplicate documents within a corpus and to the filtering of search engine results. The results obtained were encouraging.
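As a rough illustration of comparing and ranking documents by the similarity of their representative tag strings, the sketch below scores every pair of documents in a small corpus and returns the pairs above a cut-off, most similar first. It uses the textbook LCS recurrence rather than the improved algorithm, whose details the abstract does not give. The function names, the normalisation by the longer sequence, and the default threshold of 0.8 are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Textbook dynamic-programming LCS length (not the improved variant)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length over the longer sequence length (assumed normalisation)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def find_near_duplicates(
    docs: Dict[str, List[str]], threshold: float = 0.8
) -> List[Tuple[str, str, float]]:
    """Return document pairs scoring at or above `threshold`.

    `docs` maps a document name to its representative POS-tag sequence.
    The threshold value is a hypothetical choice, not one from the source.
    Pairs are sorted from most to least similar.
    """
    pairs = []
    for (name1, tags1), (name2, tags2) in combinations(docs.items(), 2):
        score = similarity(tags1, tags2)
        if score >= threshold:
            pairs.append((name1, name2, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

Comparing every pair costs one quadratic-time LCS computation per pair, so a sketch like this is only practical for small result sets such as a page of search results.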