2009 Fourth International Conference on Computer Sciences and Convergence Information Technology
DOI: 10.1109/iccit.2009.235

Duplicate Detection in Documents and WebPages Using Improved Longest Common Subsequence and Documents Syntactical Structures

Cited by 23 publications (20 citation statements)
References 21 publications

“…Moreover, reduction of text to its syntactical structures reduces the dimensionality of the document, allowing us to deal with much shorter strings instead of the full text. Such reduction minimizes information loss compared with processing based on the mere text of characters or groups of words, as practiced by various n-gram and shingle-based techniques [20,21] and IR in general. Reducing the text representation used for comparison enables the efficient use of sequence comparison algorithms such as LCS and other string approximation methods [22,16].…”
Section: Introduction (mentioning; confidence: 99%)
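
To make the LCS-based comparison cited above concrete, the short Python sketch below computes a normalized longest common subsequence similarity between two documents that have already been reduced to POS-tag sequences. The tag sequences, the helper names, and the normalization by the shorter sequence length are illustrative assumptions, not the paper's exact formulation.

# Minimal sketch: LCS similarity over POS-tag sequences (assumed representation).
# The normalization by the shorter sequence length is an illustrative assumption.

def lcs_length(a, b):
    """Classic dynamic-programming LCS length between two sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def pos_lcs_similarity(tags_a, tags_b):
    """Normalized LCS similarity in [0, 1] between two POS-tag sequences."""
    if not tags_a or not tags_b:
        return 0.0
    return lcs_length(tags_a, tags_b) / min(len(tags_a), len(tags_b))

# Hypothetical POS-tag sequences standing in for two reduced documents.
doc1 = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
doc2 = ["DT", "NN", "VBZ", "DT", "NN"]
print(pos_lcs_similarity(doc1, doc2))  # 1.0: doc2's tags form a subsequence of doc1's
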
“…Before a threshold-based method was implemented in [3], a characteristic-based technique was described in [2] for performing deduplication in databases. Unlike the other approaches, Elhadi M et al. [4] implemented a process based on a combined part of speech and improved longest common subsequence. With reference to the above research, in this paper an artificial neural network based deduplication technique is described.…”
Section: Review Of Related Work (mentioning; confidence: 99%)
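
As a rough illustration of how a combined POS and LCS similarity could drive deduplication, the sketch below flags document pairs whose similarity meets a threshold. The corpus, the 0.8 threshold, and the difflib-based stand-in similarity are all assumptions for illustration; an LCS-based score such as the hypothetical pos_lcs_similarity() from the previous sketch could be passed in its place.

# Minimal sketch of pairwise duplicate flagging with a similarity threshold.
from difflib import SequenceMatcher
from itertools import combinations

def find_duplicates(docs, similarity, threshold=0.8):
    """Return index pairs of documents whose similarity meets the threshold."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(docs), 2)
            if similarity(a, b) >= threshold]

def tag_ratio(a, b):
    """Stand-in similarity over POS-tag sequences via difflib's matcher."""
    return SequenceMatcher(None, a, b).ratio()

corpus = [
    ["DT", "NN", "VBZ", "DT", "JJ", "NN"],
    ["DT", "NN", "VBZ", "DT", "NN"],
    ["PRP", "VBD", "IN", "DT", "NN"],
]
print(find_duplicates(corpus, tag_ratio))  # [(0, 1)]
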
“…The experimental results indicate that the proposed similarity method, which is based on the combination of string and semantic similarity measures, outperforms the individual similarity measures with an F-measure of 99.1% on the Restaurant dataset. Based on the experimental results, semantic similarity should be considered in addition to string similarity in order to detect duplicate records more effectively. Elhadi M et al. [4] have proposed a method reporting on experiments performed to investigate the use of a combined part of speech (POS) and an improved longest common subsequence (LCS) in the analysis and calculation of similarity between texts. The text's syntactical structures were used for the representation of documents.…”
Section: Review Of Related Work (mentioning; confidence: 99%)
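
The combination of string and semantic similarity mentioned in this statement could be sketched, under loose assumptions, as a weighted sum of two scores. Below, difflib's sequence ratio stands in for the string measure and a token-level Jaccard overlap stands in for the semantic measure; both stand-ins, the 0.5 weight, and the sample records are assumptions for illustration only.

# Minimal sketch: weighted combination of two similarity signals.
from difflib import SequenceMatcher

def string_similarity(a, b):
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def token_overlap(a, b):
    """Jaccard overlap of token sets, a crude stand-in for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa or sb else 0.0

def combined_similarity(a, b, weight=0.5):
    """Weighted combination of the two measures; the weight is an assumed value."""
    return weight * string_similarity(a, b) + (1 - weight) * token_overlap(a, b)

r1 = "Art's Deli, 12224 Ventura Blvd., Studio City"
r2 = "Art's Delicatessen, 12224 Ventura Boulevard, Studio City"
print(round(combined_similarity(r1, r2), 3))
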
“…This method queried and ranked the documents using POS tags. Elhadi and Al-Tobi (2009) enhanced the duplicate detection technique of Elhadi and Al-Tobi (2008) by using the longest common subsequence (LCS) to compute the similarity between documents and rank them according to the most relevant retrieved documents. Studies such as Koroutchev and Cebrian (2006) compressed the sentence structure of two texts based on a normalized Lempel-Ziv (LZ) distance technique and computed the similarity from the shared topological information captured by the compressor.…”
Section: Literature Survey (mentioning; confidence: 99%)
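
For the compression-based line of work mentioned last, the sketch below computes a standard normalized compression distance (NCD) using zlib's LZ77-based DEFLATE compressor. This illustrates the general idea of LZ-style compression distances; it is not Koroutchev and Cebrian's exact normalized LZ measure, and the sample strings are assumptions.

# Minimal sketch of a normalized compression distance (NCD) with zlib.
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes after DEFLATE compression at maximum level."""
    return len(zlib.compress(data, 9))

def ncd(x: str, y: str) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)); lower means more similar."""
    bx, by = x.encode(), y.encode()
    cx, cy, cxy = compressed_size(bx), compressed_size(by), compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = "the quick brown fox jumps over the lazy dog " * 20
b = "the quick brown fox jumps over the lazy cat " * 20
print(round(ncd(a, b), 3))                                      # small: texts share most structure
print(round(ncd(a, "unrelated text with different words " * 30), 3))  # larger: little shared structure
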