Paraphrase type identification for plagiarism detection using contexts and word embeddings

Alvi, Faisal; Stevenson, Mark; Clough, Paul

doi:10.1186/s41239-021-00277-8

Cited by 16 publications

(13 citation statements)

References 43 publications

(70 reference statements)

“…Dey et al [9] applied a Support Vector Machine (SVM) classifier to identify semantically similar tweets and other short texts. A very recent work studied word embedding models for paraphrase sentence pairs with word reordering and synonym substitution [1]. In this work, we focus on detecting paraphrases without access to pairs as it represents a realistic scenario without pair information.…”

Section: Related Work (mentioning)

confidence: 99%

Identifying Machine-Paraphrased Plagiarism

Wahle

Ruas

Foltýnek

et al. 2023

Preprint

Employing paraphrasing tools to conceal plagiarized text is a severe threat to academic integrity. To enable the detection of machine-paraphrased text, we evaluate the effectiveness of five pre-trained word embedding models combined with machine learning classifiers and state-of-the-art neural language models. We analyze preprints of research papers, graduation theses, and Wikipedia articles, which we paraphrased using different configurations of the tools SpinBot and SpinnerChief. The best performing technique, Longformer, achieved an average F1 score of 80.99% (F1=99.68% for SpinBot and F1=71.64% for SpinnerChief cases), while human evaluators achieved F1=78.4% for SpinBot and F1=65.6% for SpinnerChief cases. We show that the automated classification alleviates shortcomings of widely-used text-matching systems, such as Turnitin and PlagScan. To facilitate future research, all data, code, and two web applications showcasing our contributions are openly available.

“…Furthermore, Corbeil and Ghadivel have used three BERT models and XLNET for uninterrupted paraphrasing detection to the MSRP corpus, achieving 85.8%-91.5% of F1 results (Corbeil & Ghadivel, 2020). Alvi et al used CS corpus and ConceptNet Numberbatch pre-trained word embeddings (Alvi et al, 2021). They reported an F1 score of 90.6% for identifying word reorderings and an F1 score of 80.2% for identifying synonymous substitutions for the entire dataset.…”

Section: Related Work (mentioning)

confidence: 99%

Comparison study of unsupervised paraphrase detection: Deep learning—The key for semantic similarity detection

Vrbanec

Meštrović

2023

Expert Systems

Automatic detection of concealed plagiarism in the form of paraphrases is a difficult task, and finding a successful unsupervised approach for paraphrase detection is necessary as a precondition to change that. This comparative study identified the most efficient methods for unsupervised paraphrased document detection using similarity measures alone or combined with Deep Learning (DL) models. It proved the hypothesis that some DL models are more successful than the best statistically-based methods in that task. Many experiments were carried out, and their results were compared. The text similarities between documents are obtained from 60 different methods using five paraphrase corpora, including the new one made by authors, as an important original contribution. Some DL models achieved significantly better results than those obtained by the best statistical methods, especially pre-trained transformer-based language models with average values of Accuracy and F1 of 85.8% and 88.3%, respectively, with top values of 99.9% and 98.4% for Accuracy and F1 on some corpora. These results are even better than those of supervised and combined approaches. Therefore, here presented results prove that detecting concealed plagiarism becomes an attainable goal. This study highlighted those language models with the best overall results for paraphrase detection as best suited for further research. The study also discussed the choice of similarity/distance measure paired with embeddings produced by DL models and some advantages of using cosine similarity as the fastest measure. For 60 different methods, complexity has been defined in O notation. Times needed for their implementation have also been presented. The article's results and conclusions are a firm base for future semantic similarity, paraphrasing, and plagiarism detection studies, clearly marking state-of-the-art tools and methods.

“…It helps to determine the cosine angle. When the result is bound to [0, 1], cosine similarity is particularly effective [19]. The cosine similarity of the two vectors in the same orientation is 1, and the relative 90° orientation is 0.…”

Section: Cosine Similarity (mentioning)

confidence: 99%
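
As a rough illustration of the cosine similarity described in the statement above, the following Python sketch computes the cosine of the angle between two vectors. The function name `cosine_similarity` and the example vectors are assumptions made for illustration only; they are not taken from the cited paper.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical example vectors (not from the source).
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # 90° apart
```

Parallel vectors give a value close to 1 and perpendicular vectors give 0. For vectors with non-negative components, such as term counts, the result stays within [0, 1], which matches the bound mentioned in the statement.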

“…They used PAN-PC-11 and PAN-14 datasets for training and testing purposes, respectively. Alvi et al [19] proposed a paraphrase identification approach and plagiarism detection tool based on contexts and word embeddings. Son et al [20] proposed a plagiarism detection approach using feature extraction techniques which is based on multi-layer long-short term memory (LSTM) networks.…”

Section: Introduction (mentioning)

confidence: 99%

An improved extrinsic monolingual plagiarism detection approach of the Bengali text

Ahnaf

Hasan

Sworna

et al. 2023

IJECE

Plagiarism is an act of literary fraud: presenting others’ work or ideas without giving credit to the original work. All published and unpublished written documents fall under this definition. Plagiarism, which has increased significantly over the last few years, is a concerning issue for students, academicians, and professionals. As a result, several plagiarism detection tools are available to detect plagiarism in different languages. Unfortunately, negligible work has been done, and no plagiarism detection software is available, for the Bengali language, even though Bengali is one of the most spoken languages in the world. In this paper, we have proposed a plagiarism detection tool for the Bengali language that mainly focuses on the educational and newspaper domain. We have collected 82 textbooks from the National Curriculum of Textbooks (NCTB), Bangladesh, scraped all articles from 12 reputed newspapers, and compiled our corpus with more than 10 million sentences. The proposed method on the Bengali text corpus shows an accuracy rate of 97.31%.
