2009 Fourth International Conference on Computer Sciences and Convergence Information Technology
DOI: 10.1109/iccit.2009.235

Duplicate Detection in Documents and WebPages Using Improved Longest Common Subsequence and Documents Syntactical Structures

Cited by 23 publications (20 citation statements)
References 21 publications

“…Moreover, reduction of text to its syntactical structures reduces the dimensionality of the document, allowing us to deal with much shorter strings instead of the full text. Such reduction minimizes information loss compared with processing based on the mere text of characters or groups of words, as practiced by various n-gram and shingle-based techniques [20,21] and IR in general. Reducing the text representation used for comparison enables the efficient use of sequence comparison algorithms such as LCS and other string approximation methods [22,16].…”
Section: Introduction (mentioning; confidence: 99%)
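
To make the LCS-based comparison cited above concrete, the short Python sketch below computes a normalized longest common subsequence similarity between two documents that have already been reduced to POS-tag sequences. The tag sequences, the helper names, and the normalization by the shorter sequence length are illustrative assumptions, not the paper's exact formulation.

# Minimal sketch: LCS similarity over POS-tag sequences (assumed representation).
# The normalization by the shorter sequence length is an illustrative assumption.

def lcs_length(a, b):
    """Classic dynamic-programming LCS length between two sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def pos_lcs_similarity(tags_a, tags_b):
    """Normalized LCS similarity in [0, 1] between two POS-tag sequences."""
    if not tags_a or not tags_b:
        return 0.0
    return lcs_length(tags_a, tags_b) / min(len(tags_a), len(tags_b))

# Hypothetical POS-tag sequences standing in for two reduced documents.
doc1 = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
doc2 = ["DT", "NN", "VBZ", "DT", "NN"]
print(pos_lcs_similarity(doc1, doc2))  # 1.0: doc2's tags form a subsequence of doc1's
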
“…Before a threshold-based method was implemented in [3], a characteristic-based technique was described in [2] for performing deduplication in databases. Unlike the other approaches, Elhadi M et al. [4] implemented a process based on a combined part of speech and improved longest common subsequence. With reference to the above research, in this paper an artificial neural network based deduplication technique is described.…”
Section: Review Of Related Work (mentioning; confidence: 99%)
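
As a rough illustration of how a combined POS and LCS similarity could drive deduplication, the sketch below flags document pairs whose similarity meets a threshold. The corpus, the 0.8 threshold, and the difflib-based stand-in similarity are all assumptions for illustration; an LCS-based score such as the hypothetical pos_lcs_similarity() from the previous sketch could be passed in its place.

# Minimal sketch of pairwise duplicate flagging with a similarity threshold.
from difflib import SequenceMatcher
from itertools import combinations

def find_duplicates(docs, similarity, threshold=0.8):
    """Return index pairs of documents whose similarity meets the threshold."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(docs), 2)
            if similarity(a, b) >= threshold]

def tag_ratio(a, b):
    """Stand-in similarity over POS-tag sequences via difflib's matcher."""
    return SequenceMatcher(None, a, b).ratio()

corpus = [
    ["DT", "NN", "VBZ", "DT", "JJ", "NN"],
    ["DT", "NN", "VBZ", "DT", "NN"],
    ["PRP", "VBD", "IN", "DT", "NN"],
]
print(find_duplicates(corpus, tag_ratio))  # [(0, 1)]
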
“…The experimental results indicate that the proposed similarity method, which is based on the combination of string and semantic similarity measures, outperforms the individual similarity measures with an F-measure of 99.1% on the Restaurant dataset. Based on the experimental results, semantic similarity should be considered in addition to string similarity in order to detect duplicate records more effectively. Elhadi M et al. [4] have proposed a method reporting on experiments performed to investigate the use of a combined part of speech (POS) and an improved longest common subsequence (LCS) in the analysis and calculation of similarity between texts. The text's syntactical structures were used for the representation of documents.…”
Section: Review Of Related Work (mentioning; confidence: 99%)
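
The combination of string and semantic similarity mentioned in this statement could be sketched, under loose assumptions, as a weighted sum of two scores. Below, difflib's sequence ratio stands in for the string measure and a token-level Jaccard overlap stands in for the semantic measure; both stand-ins, the 0.5 weight, and the sample records are assumptions for illustration only.

# Minimal sketch: weighted combination of two similarity signals.
from difflib import SequenceMatcher

def string_similarity(a, b):
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def token_overlap(a, b):
    """Jaccard overlap of token sets, a crude stand-in for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa or sb else 0.0

def combined_similarity(a, b, weight=0.5):
    """Weighted combination of the two measures; the weight is an assumed value."""
    return weight * string_similarity(a, b) + (1 - weight) * token_overlap(a, b)

r1 = "Art's Deli, 12224 Ventura Blvd., Studio City"
r2 = "Art's Delicatessen, 12224 Ventura Boulevard, Studio City"
print(round(combined_similarity(r1, r2), 3))
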
“…This method queried and ranked the documents using POS tags. Elhadi and Al-Tobi (2009) enhanced the duplicate detection technique of Elhadi and Al-Tobi (2008) by using the longest common subsequence (LCS) to compute the similarity between documents and rank them according to the most relevant retrieved documents. Studies such as Koroutchev and Cebrian (2006) compressed the sentence structure of two texts based on a normalized Lempel-Ziv (LZ) distance technique and computed the similarity from the shared topological information captured by the compressor.…”
Section: Literature Survey (mentioning; confidence: 99%)
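
For the compression-based line of work mentioned last, the sketch below computes a standard normalized compression distance (NCD) using zlib's LZ77-based DEFLATE compressor. This illustrates the general idea of LZ-style compression distances; it is not Koroutchev and Cebrian's exact normalized LZ measure, and the sample strings are assumptions.

# Minimal sketch of a normalized compression distance (NCD) with zlib.
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes after DEFLATE compression at maximum level."""
    return len(zlib.compress(data, 9))

def ncd(x: str, y: str) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)); lower means more similar."""
    bx, by = x.encode(), y.encode()
    cx, cy, cxy = compressed_size(bx), compressed_size(by), compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = "the quick brown fox jumps over the lazy dog " * 20
b = "the quick brown fox jumps over the lazy cat " * 20
print(round(ncd(a, b), 3))                                      # small: texts share most structure
print(round(ncd(a, "unrelated text with different words " * 30), 3))  # larger: little shared structure
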