A deep network model for paraphrase detection in short text messages

Agarwal, Basant; Ramampiaro, Heri; Langseth, Helge; Ruocco, Massimiliano

doi:10.1016/j.ipm.2018.06.005

Cited by 105 publications

(69 citation statements)

References 20 publications


“…4
Ferreira et al (2018) [44] 74.08 83.1
Agarwal et al (2018) [4] 77.7 84.5
Arora and Kansal (2019) [45] 79.0 −
Our model 78.3 84.8…”

Section: MSRP Dataset (mentioning)

confidence: 78%

“…For instance, the AskUbuntu dataset [27] contains very few annotations, thus limiting the generalization performance of the model [13]. The ability to augment the data with additional sound annotations without requiring human intervention can improve the performance of deep models [4,13]. Such data augmentation has been shown to be fruitful for data analytics when only a piece of limited ordinal information about the pairwise distance between objects is provided [28,29,30].…”

Section: Related Work (mentioning)

confidence: 99%

“…We employ a set of NLP/linguistic features in our experiments as it has been shown that including linguistic features for paraphrase identification in short text can improve the performance of deep learning models [4]. We identify the following linguistic and statistical features to be used alongside learned features in our multi-cascaded model.…”

Section: Linguistic Features (mentioning)

confidence: 99%

“…crowd-sourcing) is costly [16]. Therefore, [4] and [13] add to the training set each labeled pair also in the reversed order. However, this simple data augmentation strategy can be extended in a systematic manner by relying upon set and graph theory.…”

Section: Introduction (mentioning)

confidence: 99%
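
The Introduction excerpt above mentions adding each labeled pair to the training set in reversed order as well. A minimal sketch of that idea follows; the function name, the `(text_a, text_b, label)` tuple layout, and the sample sentences are assumptions made for illustration and are not taken from any of the cited papers.

```python
from typing import List, Tuple

# Hypothetical layout: (text_a, text_b, label), where label 1 means paraphrase.
LabeledPair = Tuple[str, str, int]


def add_reversed_pairs(pairs: List[LabeledPair]) -> List[LabeledPair]:
    """Return the input pairs plus each pair in reversed order, without duplicates."""
    seen = set()
    augmented: List[LabeledPair] = []
    for a, b, label in pairs:
        # Emit the original orientation first, then the swapped one.
        for item in ((a, b, label), (b, a, label)):
            if item not in seen:
                seen.add(item)
                augmented.append(item)
    return augmented


# Illustrative data only.
train = [
    ("how do I reset my password", "password reset steps", 1),
    ("how do I reset my password", "what time is it", 0),
]
augmented = add_reversed_pairs(train)
print(len(augmented))
```

Reversal keeps the label unchanged, which is only reasonable if the paraphrase relation is treated as symmetric.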


A multi-cascaded model with data augmentation for enhanced paraphrase detection in short texts

Shakeel, Karim, Khan (2020). Information Processing & Management


Paraphrase detection is an important task in text analytics with numerous applications such as plagiarism detection, duplicate question identification, and enhanced customer support helpdesks. Deep models have been proposed for representing and classifying paraphrases. These models, however, require large quantities of human-labeled data, which is expensive to obtain. In this work, we present a data augmentation strategy and a multi-cascaded model for improved paraphrase detection in short texts. Our data augmentation strategy considers the notions of paraphrases and non-paraphrases as binary relations over the set of texts. Subsequently, it uses graph theoretic concepts to efficiently generate additional paraphrase and non-paraphrase pairs in a sound manner. Our multi-cascaded model employs three supervised feature learners (cascades) based on CNN and LSTM networks with and without soft-attention. The learned features, together with hand-crafted linguistic features, are then forwarded to a discriminator network for final classification. Our model is both wide and deep and provides greater robustness across clean and noisy short texts. We evaluate our approach on three benchmark datasets and show that it produces a comparable or state-of-the-art performance on all three.

• We present an efficient strategy for augmenting existing paraphrase and non-paraphrase annotations in a consistent manner. This strategy generates additional annotations and enhances the performance of the data-hungry deep learning models.
• We develop a multi-cascaded learning model for robust paraphrase detection in both clean and noisy texts. This model incorporates multiple learned and linguistic features in a wide and deep discriminator network for paraphrase detection.
• We address both clean and noisy texts in our presentation and show that the proposed model matches current best performances on benchmark datasets of both types.
• We analyze the impact of various data augmentation steps and different components of the multi-cascaded model on paraphrase detection performance.
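
The abstract above describes treating paraphrase and non-paraphrase as binary relations and using graph-theoretic concepts to generate extra pairs. One possible reading, sketched below, assumes that the paraphrase relation is symmetric and transitive (so texts fall into connected components) and that a non-paraphrase label between two texts extends to every member of their two components. These assumptions, the function names, and the toy labels are illustrative only; this is not the authors' published algorithm.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Pair = Tuple[str, str]


def find(parent: Dict[str, str], x: str) -> str:
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def augment_annotations(
    paraphrases: List[Pair], non_paraphrases: List[Pair]
) -> Tuple[Set[FrozenSet[str]], Set[FrozenSet[str]]]:
    """Expand annotations under the assumed symmetry/transitivity rules."""
    parent: Dict[str, str] = {}
    for a, b in paraphrases + non_paraphrases:
        parent.setdefault(a, a)
        parent.setdefault(b, b)

    # Merge texts connected by paraphrase edges into components.
    for a, b in paraphrases:
        parent[find(parent, a)] = find(parent, b)

    groups: Dict[str, Set[str]] = {}
    for text in parent:
        groups.setdefault(find(parent, text), set()).add(text)

    # Every unordered pair inside a component is taken as a paraphrase.
    positives = {
        frozenset(pair)
        for members in groups.values()
        for pair in combinations(sorted(members), 2)
    }

    # A non-paraphrase edge between two components is extended to all cross pairs.
    negatives: Set[FrozenSet[str]] = set()
    for a, b in non_paraphrases:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a == root_b:
            continue  # conflicting annotation; this sketch simply skips it
        for x in groups[root_a]:
            for y in groups[root_b]:
                negatives.add(frozenset((x, y)))
    return positives, negatives


# Toy labels: A~B and B~C are paraphrases, A and D are not.
pos, neg = augment_annotations([("A", "B"), ("B", "C")], [("A", "D")])
print(len(pos), len(neg))
```

Pairs are stored as unordered `frozenset`s, so reversed-order duplicates are not counted twice; ordered copies can be generated afterwards if a model needs them.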


“…The model computes the average of vectors of all words and n-grams in a sentence. Several studies have shown successful results produced by Sent2vec [32,33]. The model learns a context embedding v_w and target embedding u_w for each word w in the vocabulary, with h number of embedding dimensions.…”

Section: Sent2vec (mentioning)

confidence: 99%
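
The Sent2vec excerpt above says the sentence representation is the average of the vectors of all words and n-grams in the sentence. The sketch below illustrates only that averaging step, with a tiny hand-made lookup table standing in for learned embeddings. The table values, the function names, and the choice to skip units missing from the table are assumptions; it does not use the actual Sent2vec library or its training procedure.

```python
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return the contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_vector(
    sentence: str, table: Dict[str, List[float]], dim: int, max_n: int = 2
) -> List[float]:
    """Average the vectors of all unigrams up to max_n-grams found in the table."""
    tokens = sentence.lower().split()
    units: List[str] = []
    for n in range(1, max_n + 1):
        units.extend(ngrams(tokens, n))

    # Units without an entry in the table are skipped (an assumption of this sketch).
    vectors = [table[u] for u in units if u in table]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical 2-dimensional embeddings for two words and one bigram.
toy_table = {
    "good": [1.0, 0.0],
    "movie": [0.0, 1.0],
    "good movie": [1.0, 1.0],
}
vec = sentence_vector("Good movie", toy_table, dim=2)
print(vec)
```

Here three units are averaged ("good", "movie", and the bigram "good movie"), so each component comes out to two thirds.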

Deep Contextualized Pairwise Semantic Similarity for Arabic Language Questions

Al-Bataineh, Farhan, Mustafa, et al. (2019). 2019 IEEE 31st International Conference on Tools With Artificial Intelligence (ICTAI)


Question semantic similarity is a challenging and active research problem that is very useful in many NLP applications, such as detecting duplicate questions on community question answering platforms like Quora. Arabic is considered to be an under-resourced language, has many dialects, and is rich in morphology. Together, these challenges make identifying semantically similar questions in Arabic even more difficult. In this paper, we introduce a novel approach to tackle this problem and test it on two benchmarks: one for Modern Standard Arabic (MSA), and another for the 24 major Arabic dialects. We show that our new system outperforms state-of-the-art approaches, achieving a 93% F1-score on the MSA benchmark and 82% on the dialectal one. This is achieved by utilizing contextualized word representations (ELMo embeddings) trained on a text corpus containing MSA and dialectal sentences. This, in combination with a pairwise fine-grained similarity layer, helps our question-to-question similarity model generalize its predictions to different dialects while being trained only on question-to-question MSA data.


A lightweight semantic‐enhanced interactive network for efficient short‐text matching

Xue, Luo, et al. (2022). Asso for Info Science & Tech


Knowledge‐enhanced short‐text matching has been a significant task attracting much attention in recent years. However, the existing approaches cannot effectively balance effectiveness and efficiency. Effective models usually consist of complex network structures, which leads to slow inference and makes them difficult to apply in practice. In addition, most knowledge‐enhanced models try to link the mentions in the text to the entities of knowledge graphs, and the difficulty of entity linking reduces their generalizability across datasets. To address these problems, we propose a lightweight Semantic‐Enhanced Interactive Network (SEIN) model for efficient short‐text matching. Unlike most current research, SEIN employs an unsupervised method to select WordNet's most appropriate paraphrase description as the external semantic knowledge. It focuses on integrating semantic information and interactive information of text while simplifying the structure of other modules. We conduct intensive experiments on four real‐world datasets, that is, Quora, Twitter‐URL, SciTail, and SICK‐E. Compared with state‐of‐the‐art methods, SEIN achieves the best performance on most datasets. The experimental results show that introducing external knowledge can effectively improve the performance of short‐text matching models. The research sheds light on the role of lightweight models in leveraging external knowledge to improve short‐text matching.
