Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020
DOI: 10.18653/v1/2020.emnlp-main.611

PARADE: A New Dataset for Paraphrase Identification Requiring Computer Science Domain Knowledge

Abstract: We present a new benchmark dataset called PARADE for paraphrase identification that requires specialized domain knowledge. PARADE contains paraphrases that overlap very little at the lexical and syntactic level but are semantically equivalent based on computer science domain knowledge, as well as non-paraphrases that overlap greatly at the lexical and syntactic level but are not semantically equivalent based on this domain knowledge. Experiments show that both state-of-the-art neural models and non-expert huma…

Citations: cited by 9 publications (13 citation statements)
References: 34 publications (58 reference statements)
“…We use two public datasets, PARADE (He et al, 2020a) and clinical-STS2019, to evaluate the model performance. PARADE is a computer science domain benchmark dataset for paraphrase identification, while clinicalSTS2019 belongs to the biomedical domain.…”
Section: Datasets (mentioning)
confidence: 99%
“…PARADE is a computer science domain benchmark dataset for paraphrase identification, while clinicalSTS2019 belongs to the biomedical domain. For PARADE dataset, we use the same training, validation, testing splits with He et al (2020a). In clinicalSTS2019, the similarity score of each sentence pair ranges from 0 to 5, where 0 indicates irrelevance, and 5 indicates the equivalence in semantic meanings between the two sentences.…”
Section: Datasets (mentioning)
confidence: 99%
“…Neural network based models have been proposed for the supervised PI task, and achieve decent performance in the single-domain setting (Yin and Schütze, 2015; Wang et al, 2017; Yang et al, 2019). At present, the existing PI corpora are restricted to several particular domains (Dolan et al, 2004; Xu et al, 2014; He et al, 2020), while the practical sentence pair for the paraphrase judgment can be from any unlabeled domain. At the same time, building a PI corpus for a novel domain needs massive human effort and is expensive.…”
Section: Introduction (mentioning)
confidence: 99%