Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1080

Learning To Split and Rephrase From Wikipedia Edit History

Abstract: Split and rephrase is the task of breaking down a sentence into shorter ones that together convey the same meaning. We extract a rich new dataset for this task by mining Wikipedia's edit history: WikiSplit contains one million naturally occurring sentence rewrites, providing sixty times more distinct split examples and a ninety times larger vocabulary than the WebSplit corpus introduced by Narayan et al. (2017) as a benchmark for this task. Incorporating WikiSplit as training data produces a model with qualitatively better predictions that score 32 BLEU points above the prior best result on the WebSplit benchmark.
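As a quick orientation to what a WikiSplit example looks like in practice, the sketch below reads (complex sentence, split sentences) pairs from a tab-separated file. It assumes the layout of the public WikiSplit release, where each line holds the original sentence and its rewrite separated by a tab, with the individual split sentences joined by a "<::::>" delimiter; the file path is a placeholder and the delimiter should be verified against the actual data files.

```python
# Minimal sketch for reading WikiSplit-style examples (assumed TSV layout:
# "<complex sentence>\t<split sentences joined by ' <::::> '>"). The path is
# a placeholder; verify the delimiter against the actual distribution.
def load_wikisplit(path):
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            complex_sentence, rewrite = line.split("\t", 1)
            splits = [part.strip() for part in rewrite.split("<::::>")]
            examples.append((complex_sentence, splits))
    return examples


if __name__ == "__main__":
    for source, targets in load_wikisplit("wikisplit/train.tsv")[:3]:
        print("ORIGINAL:", source)
        for sentence in targets:
            print("   SPLIT:", sentence)
```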

Cited by 50 publications (77 citation statements)
References 20 publications
“…Afterwards, Aharoni and Goldberg (2018) proposed the state-of-the-art model to date, a sequence-to-sequence model (Bahdanau et al., 2015) with a copy mechanism (Gu et al., 2016; See et al., 2017), based on the observation that most of the text is unchanged during a Split and Rephrase operation. Later, Botha et al. (2018) introduced the WikiSplit corpus as large but noisy training data, which the authors reported to be unsuitable as evaluation data. Also, Sulem et al. (2018) studied the problems of using BLEU as the evaluation metric for this task, while proposing a manually constructed test set called HSplit.…”
Section: Introduction (mentioning)
confidence: 99%
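The copy mechanism cited in the statement above combines a generation distribution over the vocabulary with a copy distribution induced by the decoder's attention over the source tokens. The NumPy sketch below illustrates that mixing step in the pointer-generator style of See et al. (2017); it is a toy illustration with made-up values, not the cited authors' implementation.

```python
# Toy illustration of the pointer-generator mixing step (See et al., 2017):
# final p(w) = p_gen * p_vocab(w) + (1 - p_gen) * (attention mass on source
# positions where token w occurs). Values and shapes are illustrative only.
import numpy as np

def final_distribution(p_vocab, attention, source_ids, p_gen, vocab_size):
    p_copy = np.zeros(vocab_size)
    np.add.at(p_copy, source_ids, attention)  # scatter attention mass onto source token ids
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy

vocab_size = 6
p_vocab = np.full(vocab_size, 1.0 / vocab_size)   # uniform generation distribution
attention = np.array([0.7, 0.2, 0.1])             # attention over 3 source tokens
source_ids = np.array([4, 2, 4])                  # token id 4 occurs twice in the source
mixed = final_distribution(p_vocab, attention, source_ids, p_gen=0.3, vocab_size=vocab_size)
print(mixed, mixed.sum())                         # copying favours id 4; distribution sums to 1.0
```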
“…In order to train the model, we used a corpus from the WikiSplit dataset by Google (Botha et al., 2018). This dataset was constructed automatically from the publicly available Wikipedia revision history.…”
Section: Methods (mentioning)
confidence: 99%
“…Moreover, to demonstrate domain independence, we compared the output generated by our TS approach with that of the various baseline systems on the Newsela corpus (Xu et al., 2015), which is composed of 1,077 sentences from newswire articles. In addition, we assessed the performance of our simplification system using the 5,000 test sentences from the WikiSplit benchmark (Botha et al., 2018), which was mined from Wikipedia edit histories.…”
Section: Methods (mentioning)
confidence: 99%
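For reproducing this kind of benchmark comparison, corpus-level BLEU can be computed with an off-the-shelf package such as sacrebleu, as sketched below. The file paths are placeholders and this is not the evaluation code of any cited system; as noted in the statement further above, Sulem et al. (2018) argue that BLEU alone is a problematic metric for this task.

```python
# Hedged sketch: corpus BLEU between system outputs and references using
# sacrebleu. File names are placeholders (one prediction/reference per line).
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

system_outputs = read_lines("system_output.txt")
references = read_lines("references.txt")

# corpus_bleu takes the hypotheses and a list of reference streams.
score = sacrebleu.corpus_bleu(system_outputs, [references])
print(f"BLEU = {score.score:.2f}")
```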
“…Though outperforming the models used in Narayan et al. (2017), they still perform poorly compared to previous state-of-the-art rule-based syntactic simplification approaches. In addition, Botha et al. (2018) observed that the sentences from the WebSplit corpus contain fairly unnatural linguistic expressions using only a small vocabulary. To overcome this limitation, they present a scalable, language-agnostic method for mining training data from Wikipedia edit histories, providing a rich and varied vocabulary over naturally expressed sentences and their extracted splits.…”
Section: Data-driven Approaches (mentioning)
confidence: 99%
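To make the mining idea concrete, the sketch below pairs a sentence from an older revision with two adjacent sentences from the newer revision whenever their combined text is sufficiently similar to the original. It is a deliberately simplified stand-in: the paper's actual procedure relies on BLEU-based matching criteria and additional filtering, whereas this uses difflib ratios and an arbitrary threshold, and assumes sentence segmentation has already been done.

```python
# Simplified illustration of mining split candidates from two revisions of the
# same page. NOT the authors' exact procedure: difflib similarity and the 0.6
# threshold stand in for the BLEU-based matching described in the paper.
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def mine_splits(old_sentences, new_sentences, threshold=0.6):
    candidates = []
    for original in old_sentences:
        for i in range(len(new_sentences) - 1):
            part1, part2 = new_sentences[i], new_sentences[i + 1]
            combined = part1 + " " + part2
            # Keep the pair if, taken together, the two shorter sentences
            # closely match the single original sentence.
            if (similarity(original, combined) >= threshold
                    and len(part1) < len(original)
                    and len(part2) < len(original)):
                candidates.append((original, [part1, part2]))
    return candidates

old = ["Street Rod is the first in a series of two games released for the PC "
       "and Commodore 64 in 1989."]
new = ["Street Rod is the first in a series of two games.",
       "It was released for the PC and Commodore 64 in 1989."]
for source, parts in mine_splits(old, new):
    print(source, "->", parts)
```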