2021
DOI: 10.48550/arxiv.2101.08382
Preprint

ParaSCI: A Large Scientific Paraphrase Dataset for Longer Paraphrase Generation

Abstract: We propose ParaSCI, the first large-scale paraphrase dataset in the scientific field, including 33,981 paraphrase pairs from ACL (ParaSCI-ACL) and 316,063 pairs from arXiv (ParaSCI-arXiv). Digging into the characteristics and common patterns of scientific papers, we construct this dataset through intra-paper and inter-paper methods, such as collecting citations to the same paper or aggregating definitions by scientific terms. To take advantage of sentences paraphrased partially, we put up PDBERT as a general paraph…

Cited by 1 publication (1 citation statement)
References 18 publications
“…All pairs are manually labeled to be either paraphrases or non-paraphrases. ParaSCI (Dong, Wan, and Cao 2021) contains 350K automatically extracted paraphrase candidates from ACL and arXiv papers. The extraction heuristics consider term definitions, citation information, and sentence embedding similarity.…”
Section: Paraphrase Datasets for English
confidence: 99%
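The citation statement describes mining paraphrase candidates by sentence-embedding similarity. A minimal sketch of that heuristic, not the authors' actual pipeline: toy 3-dimensional vectors stand in for a real sentence encoder, and the names `cosine`, `candidate_pairs`, and the 0.8 threshold are illustrative assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def candidate_pairs(sentences, embeddings, threshold=0.8):
    """Return index pairs of sentences whose embeddings are similar
    enough to be treated as paraphrase candidates."""
    pairs = []
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Toy embeddings: the first two sentences point in nearly the same
# direction, the third does not.
sents = ["we propose a new dataset", "a new dataset is proposed", "the sky is blue"]
embs = [[0.9, 0.1, 0.0], [0.85, 0.15, 0.05], [0.0, 0.2, 0.95]]
print(candidate_pairs(sents, embs))  # → [(0, 1)]
```

In the dataset itself this similarity filter is combined with the other two heuristics the statement names (term definitions and citation information), so similarity alone is only one signal, not the full extraction method.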