Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.709
Neural CRF Model for Sentence Alignment in Text Simplification

Abstract: The success of a text simplification system heavily depends on the quality and quantity of complex-simple sentence pairs in the training corpus, which are extracted by aligning sentences between parallel articles. To evaluate and improve sentence alignment quality, we create two manually annotated sentence-aligned datasets from two commonly used text simplification corpora, Newsela and Wikipedia. We propose a novel neural CRF alignment model which not only leverages the sequential nature of sentences in parall…
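The abstract's idea of leveraging the sequential nature of sentences can be illustrated with a toy alignment decoder. This is a minimal sketch, not the paper's model: it runs a Viterbi-style dynamic program over per-pair similarity scores with a bonus for monotonic alignments, loosely mirroring the CRF intuition. The similarity matrix and the `mono_bonus` value are invented for the demo.

```python
# Illustrative sketch only, NOT the authors' implementation: decode the best
# complex-sentence index for each simple sentence, rewarding alignments that
# move forward through the article (monotonicity), as a stand-in for the
# CRF transition scores described in the abstract.

def viterbi_align(sim, mono_bonus=0.5):
    """sim[i][j]: score for aligning simple sentence i to complex sentence j.
    Returns the best complex-sentence index for each simple sentence."""
    n, m = len(sim), len(sim[0])
    best = [[0.0] * m for _ in range(n)]  # best[i][j]: best score ending at (i, j)
    back = [[0] * m for _ in range(n)]    # backpointers for the backtrace
    best[0] = sim[0][:]
    for i in range(1, n):
        for j in range(m):
            # transition: bonus when the alignment does not move backwards
            cands = [best[i - 1][k] + (mono_bonus if j >= k else 0.0)
                     for k in range(m)]
            k_best = max(range(m), key=lambda k: cands[k])
            best[i][j] = sim[i][j] + cands[k_best]
            back[i][j] = k_best
    # backtrace from the best final cell
    j = max(range(m), key=lambda j: best[n - 1][j])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]

# Made-up similarity scores for three simple / three complex sentences.
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.3],
    [0.0, 0.2, 0.7],
]
print(viterbi_align(sim))  # → [0, 1, 2]
```

A real model would produce `sim` from a learned sentence-pair encoder and learn the transition scores jointly; the dynamic program above only shows why sequence-level decoding can recover a coherent alignment that greedy per-sentence matching might miss.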

Cited by 67 publications (52 citation statements)
References 34 publications
“…Preliminary: We establish that our Transformer architecture choice is strong on the more standard Generic TS task, as it performs comparably to the state of the art (Jiang et al., 2020) on the Newsela-Auto corpus, using grade-level tokens as side constraints (Scarton and Specia, 2018).…”
Section: Model Configurations
confidence: 83%
“…For each of these target grades, we obtain ratings of system outputs and the reference from five Amazon Mechanical Turk workers. Following prior annotation protocols (Jiang et al., 2020), we ask workers to rate outputs on three dimensions: a) is the output grammatical? […] We compute the absolute difference ("AbsDiff") in the simplicity ratings between the reference and the system output by the same annotator, and aggregate over all examples and all ratings.…”
Section: Human Evaluation
confidence: 99%
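The AbsDiff aggregation quoted above is a short computation: per annotator, take the absolute difference between the simplicity rating of the reference and of the system output, then average over all examples and raters. A minimal sketch under that reading (the ratings below are invented, and the cited work's exact script may differ):

```python
# Hedged sketch of the aggregated AbsDiff metric described in the quote.
# Assumes each example was rated by the same annotators in the same order
# for both the reference and the system output.

def abs_diff(ref_ratings, sys_ratings):
    """Each argument: list of examples, each a list of per-annotator ratings.
    Returns the mean |reference - system| over all examples and annotators."""
    diffs = [abs(r - s)
             for ref_ex, sys_ex in zip(ref_ratings, sys_ratings)
             for r, s in zip(ref_ex, sys_ex)]
    return sum(diffs) / len(diffs)

ref = [[4, 5, 4], [3, 4, 3]]  # made-up simplicity ratings for references
out = [[3, 5, 2], [3, 2, 4]]  # made-up ratings for system outputs
print(abs_diff(ref, out))     # → 1.0
```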
“…Recently proposed datasets, such as WikiManual (Jiang et al., 2020), as shown in Figure 1f, have an approximately consistent distribution, and their simplifications are less conservative. Based on a visual inspection of the uppermost values of the distribution (≈80%), we can tell that, often, most of the information in the original sentence is removed or the target simplification does not accurately express the original meaning.…”
Section: Edit Distance Distribution
confidence: 99%
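The edit-distance distribution referred to above can be reproduced with a token-level Levenshtein distance normalized by the longer sentence's length; this is a hedged illustration (the exact tokenization and normalization in the cited analysis may differ):

```python
# Token-level Levenshtein distance and a normalized edit ratio between a
# complex sentence and its simplification. Sketch only; the cited work's
# exact normalization is an assumption here.

def edit_distance(a, b):
    """Levenshtein distance between two token sequences (one-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            # prev holds dp[i-1][j-1]; dp[j] still holds dp[i-1][j]
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution/match
    return dp[-1]

def edit_ratio(complex_sent, simple_sent):
    """Edit distance normalized by the longer sentence, in [0, 1]."""
    a, b = complex_sent.split(), simple_sent.split()
    return edit_distance(a, b) / max(len(a), len(b))

print(edit_ratio("the cat sat on the mat", "the cat sat"))  # → 0.5
```

A ratio near 0 indicates a conservative (near-copy) simplification; values near 1, like the ≈80% tail discussed in the quote, flag pairs where most of the original content was dropped or rewritten.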
“…In the last decade, TS research has relied on Wikipedia-based datasets (Zhang and Lapata, 2017; Xu et al., 2016; Jiang et al., 2020), despite their known limitations (Xu et al., 2015; Alva-Manchego et al., 2020a) such as questionable sentence-pair alignments, inaccurate simplifications, and a limited variety of simplification operations. Apart from affecting the reliability of models trained on these datasets, their low quality also influences evaluation with automatic metrics that require gold-standard simplifications, such as SARI (Xu et al., 2016) and BLEU (Papineni et al., 2001).…”
Section: Introduction
confidence: 99%
“…We experimented with three dictionaries: PanLex (Kamholz et al., 2014), MUSE (Conneau et al., 2018), and Wikipedia parallel titles. We extract parallel article titles in Wikipedia based on the inter-language links and the entities based on Wikidata (Jiang et al., 2020). The dictionaries of PanLex, MUSE, and Wikipedia contain 24K, 44K, and 2M entries, respectively, with on average 4.6, 1.4, and 1 translations per entry (English or Arabic).…”
Section: Training Data
confidence: 99%