Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1080

Learning To Split and Rephrase From Wikipedia Edit History

Abstract: Split and rephrase is the task of breaking down a sentence into shorter ones that together convey the same meaning. We extract a rich new dataset for this task by mining Wikipedia's edit history: WikiSplit contains one million naturally occurring sentence rewrites, providing sixty times more distinct split examples and a ninety times larger vocabulary than the WebSplit corpus introduced by Narayan et al. (2017) as a benchmark for this task. Incorporating WikiSplit as training data produces a model with qualitatively better predictions that score 32 BLEU points above the prior best result on the WebSplit benchmark.
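As a quick orientation to what a WikiSplit example looks like in practice, the sketch below reads (complex sentence, split sentences) pairs from a tab-separated file. It assumes the layout of the public WikiSplit release, where each line holds the original sentence and its rewrite separated by a tab, with the individual split sentences joined by a "<::::>" delimiter; the file path is a placeholder and the delimiter should be verified against the actual data files.

```python
# Minimal sketch for reading WikiSplit-style examples (assumed TSV layout:
# "<complex sentence>\t<split sentences joined by ' <::::> '>"). The path is
# a placeholder; verify the delimiter against the actual distribution.
def load_wikisplit(path):
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            complex_sentence, rewrite = line.split("\t", 1)
            splits = [part.strip() for part in rewrite.split("<::::>")]
            examples.append((complex_sentence, splits))
    return examples


if __name__ == "__main__":
    for source, targets in load_wikisplit("wikisplit/train.tsv")[:3]:
        print("ORIGINAL:", source)
        for sentence in targets:
            print("   SPLIT:", sentence)
```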

Cited by 50 publications (77 citation statements)
References 20 publications
“…Afterwards, Aharoni and Goldberg (2018) proposed the state-of-the-art model to date, a sequence-to-sequence model (Bahdanau et al., 2015) with a copy mechanism (Gu et al., 2016; See et al., 2017), based on the observation that most of the text is unchanged during a Split and Rephrase operation. Later, Botha et al. (2018) introduced the WikiSplit corpus as large but noisy training data, which the authors reported to be unsuitable as evaluation data. Also, Sulem et al. (2018) studied the problems of using BLEU as the evaluation metric for this task, while proposing a manually constructed test set called HSplit.…”
Section: Introduction (mentioning)
confidence: 99%
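The copy mechanism cited in the statement above combines a generation distribution over the vocabulary with a copy distribution induced by the decoder's attention over the source tokens. The NumPy sketch below illustrates that mixing step in the pointer-generator style of See et al. (2017); it is a toy illustration with made-up values, not the cited authors' implementation.

```python
# Toy illustration of the pointer-generator mixing step (See et al., 2017):
# final p(w) = p_gen * p_vocab(w) + (1 - p_gen) * (attention mass on source
# positions where token w occurs). Values and shapes are illustrative only.
import numpy as np

def final_distribution(p_vocab, attention, source_ids, p_gen, vocab_size):
    p_copy = np.zeros(vocab_size)
    np.add.at(p_copy, source_ids, attention)  # scatter attention mass onto source token ids
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy

vocab_size = 6
p_vocab = np.full(vocab_size, 1.0 / vocab_size)   # uniform generation distribution
attention = np.array([0.7, 0.2, 0.1])             # attention over 3 source tokens
source_ids = np.array([4, 2, 4])                  # token id 4 occurs twice in the source
mixed = final_distribution(p_vocab, attention, source_ids, p_gen=0.3, vocab_size=vocab_size)
print(mixed, mixed.sum())                         # copying favours id 4; distribution sums to 1.0
```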
“…In order to train the model, we used a corpus from the WikiSplit dataset by Google (Botha et al., 2018). This dataset was constructed automatically from the publicly available Wikipedia revision history.…”
Section: Methods (mentioning)
confidence: 99%
“…Moreover, to demonstrate domain independence, we compared the output generated by our TS approach with that of the various baseline systems on the Newsela corpus (Xu et al., 2015), which is composed of 1,077 sentences from newswire articles. In addition, we assessed the performance of our simplification system using the 5,000 test sentences from the WikiSplit benchmark (Botha et al., 2018), which was mined from Wikipedia edit histories.…”
Section: Methods (mentioning)
confidence: 99%
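For reproducing this kind of benchmark comparison, corpus-level BLEU can be computed with an off-the-shelf package such as sacrebleu, as sketched below. The file paths are placeholders and this is not the evaluation code of any cited system; as noted in the statement further above, Sulem et al. (2018) argue that BLEU alone is a problematic metric for this task.

```python
# Hedged sketch: corpus BLEU between system outputs and references using
# sacrebleu. File names are placeholders (one prediction/reference per line).
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

system_outputs = read_lines("system_output.txt")
references = read_lines("references.txt")

# corpus_bleu takes the hypotheses and a list of reference streams.
score = sacrebleu.corpus_bleu(system_outputs, [references])
print(f"BLEU = {score.score:.2f}")
```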
“…Though outperforming the models used in Narayan et al. (2017), they still perform poorly compared to previous state-of-the-art rule-based syntactic simplification approaches. In addition, Botha et al. (2018) observed that the sentences from the WebSplit corpus contain fairly unnatural linguistic expressions using only a small vocabulary. To overcome this limitation, they present a scalable, language-agnostic method for mining training data from Wikipedia edit histories, providing a rich and varied vocabulary over naturally expressed sentences and their extracted splits.…”
Section: Data-driven Approaches (mentioning)
confidence: 99%
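To make the mining idea concrete, the sketch below pairs a sentence from an older revision with two adjacent sentences from the newer revision whenever their combined text is sufficiently similar to the original. It is a deliberately simplified stand-in: the paper's actual procedure relies on BLEU-based matching criteria and additional filtering, whereas this uses difflib ratios and an arbitrary threshold, and assumes sentence segmentation has already been done.

```python
# Simplified illustration of mining split candidates from two revisions of the
# same page. NOT the authors' exact procedure: difflib similarity and the 0.6
# threshold stand in for the BLEU-based matching described in the paper.
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def mine_splits(old_sentences, new_sentences, threshold=0.6):
    candidates = []
    for original in old_sentences:
        for i in range(len(new_sentences) - 1):
            part1, part2 = new_sentences[i], new_sentences[i + 1]
            combined = part1 + " " + part2
            # Keep the pair if, taken together, the two shorter sentences
            # closely match the single original sentence.
            if (similarity(original, combined) >= threshold
                    and len(part1) < len(original)
                    and len(part2) < len(original)):
                candidates.append((original, [part1, part2]))
    return candidates

old = ["Street Rod is the first in a series of two games released for the PC "
       "and Commodore 64 in 1989."]
new = ["Street Rod is the first in a series of two games.",
       "It was released for the PC and Commodore 64 in 1989."]
for source, parts in mine_splits(old, new):
    print(source, "->", parts)
```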