Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.709
Neural CRF Model for Sentence Alignment in Text Simplification

Abstract: The success of a text simplification system heavily depends on the quality and quantity of complex-simple sentence pairs in the training corpus, which are extracted by aligning sentences between parallel articles. To evaluate and improve sentence alignment quality, we create two manually annotated sentence-aligned datasets from two commonly used text simplification corpora, Newsela and Wikipedia. We propose a novel neural CRF alignment model which not only leverages the sequential nature of sentences in parall…
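The abstract's idea of leveraging the sequential nature of sentences can be illustrated with a toy alignment decoder. This is a minimal sketch, not the paper's model: it runs a Viterbi-style dynamic program over per-pair similarity scores with a bonus for monotonic alignments, loosely mirroring the CRF intuition. The similarity matrix and the `mono_bonus` value are invented for the demo.

```python
# Illustrative sketch only, NOT the authors' implementation: decode the best
# complex-sentence index for each simple sentence, rewarding alignments that
# move forward through the article (monotonicity), as a stand-in for the
# CRF transition scores described in the abstract.

def viterbi_align(sim, mono_bonus=0.5):
    """sim[i][j]: score for aligning simple sentence i to complex sentence j.
    Returns the best complex-sentence index for each simple sentence."""
    n, m = len(sim), len(sim[0])
    best = [[0.0] * m for _ in range(n)]  # best[i][j]: best score ending at (i, j)
    back = [[0] * m for _ in range(n)]    # backpointers for the backtrace
    best[0] = sim[0][:]
    for i in range(1, n):
        for j in range(m):
            # transition: bonus when the alignment does not move backwards
            cands = [best[i - 1][k] + (mono_bonus if j >= k else 0.0)
                     for k in range(m)]
            k_best = max(range(m), key=lambda k: cands[k])
            best[i][j] = sim[i][j] + cands[k_best]
            back[i][j] = k_best
    # backtrace from the best final cell
    j = max(range(m), key=lambda j: best[n - 1][j])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]

# Made-up similarity scores for three simple / three complex sentences.
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.3],
    [0.0, 0.2, 0.7],
]
print(viterbi_align(sim))  # → [0, 1, 2]
```

A real model would produce `sim` from a learned sentence-pair encoder and learn the transition scores jointly; the dynamic program above only shows why sequence-level decoding can recover a coherent alignment that greedy per-sentence matching might miss.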

Cited by 67 publications (52 citation statements)
References 34 publications
“…Preliminary: We establish that our Transformer architecture choice is strong on the more standard Generic TS task, as it performs comparably to the state of the art (Jiang et al., 2020) on the Newsela-Auto corpus, using grade-level tokens as side constraints (Scarton and Specia, 2018).…”
Section: Model Configurations
confidence: 83%
“…For each of these target grades, we obtain ratings of system outputs and the reference from five Amazon Mechanical Turk workers. Following prior annotation protocols (Jiang et al., 2020), we ask workers to rate outputs on three dimensions: a) is the output grammatical? […] We compute the absolute difference ("AbsDiff") in the simplicity ratings between the reference and the system output by the same annotator, and aggregate over all examples and all ratings.…”
Section: Human Evaluation
confidence: 99%
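The AbsDiff aggregation quoted above is a short computation: per annotator, take the absolute difference between the simplicity rating of the reference and of the system output, then average over all examples and raters. A minimal sketch under that reading (the ratings below are invented, and the cited work's exact script may differ):

```python
# Hedged sketch of the aggregated AbsDiff metric described in the quote.
# Assumes each example was rated by the same annotators in the same order
# for both the reference and the system output.

def abs_diff(ref_ratings, sys_ratings):
    """Each argument: list of examples, each a list of per-annotator ratings.
    Returns the mean |reference - system| over all examples and annotators."""
    diffs = [abs(r - s)
             for ref_ex, sys_ex in zip(ref_ratings, sys_ratings)
             for r, s in zip(ref_ex, sys_ex)]
    return sum(diffs) / len(diffs)

ref = [[4, 5, 4], [3, 4, 3]]  # made-up simplicity ratings for references
out = [[3, 5, 2], [3, 2, 4]]  # made-up ratings for system outputs
print(abs_diff(ref, out))     # → 1.0
```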
“…Recently proposed datasets, such as WikiManual (Jiang et al., 2020), as shown in Figure 1f, have an approximately consistent distribution, and their simplifications are less conservative. Based on a visual inspection of the uppermost values of the distribution (≈80%), we can tell that, often, most of the information in the original sentence is removed or the target simplification does not accurately express the original meaning.…”
Section: Edit Distance Distribution
confidence: 99%
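The edit-distance distribution referred to above can be reproduced with a token-level Levenshtein distance normalized by the longer sentence's length; this is a hedged illustration (the exact tokenization and normalization in the cited analysis may differ):

```python
# Token-level Levenshtein distance and a normalized edit ratio between a
# complex sentence and its simplification. Sketch only; the cited work's
# exact normalization is an assumption here.

def edit_distance(a, b):
    """Levenshtein distance between two token sequences (one-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            # prev holds dp[i-1][j-1]; dp[j] still holds dp[i-1][j]
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution/match
    return dp[-1]

def edit_ratio(complex_sent, simple_sent):
    """Edit distance normalized by the longer sentence, in [0, 1]."""
    a, b = complex_sent.split(), simple_sent.split()
    return edit_distance(a, b) / max(len(a), len(b))

print(edit_ratio("the cat sat on the mat", "the cat sat"))  # → 0.5
```

A ratio near 0 indicates a conservative (near-copy) simplification; values near 1, like the ≈80% tail discussed in the quote, flag pairs where most of the original content was dropped or rewritten.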
“…In the last decade, TS research has relied on Wikipedia-based datasets (Zhang and Lapata, 2017; Xu et al., 2016; Jiang et al., 2020), despite their known limitations (Xu et al., 2015; Alva-Manchego et al., 2020a) such as questionable sentence-pair alignments, inaccurate simplifications, and a limited variety of simplification operations. Apart from affecting the reliability of models trained on these datasets, their low quality also influences evaluation with automatic metrics that require gold-standard simplifications, such as SARI (Xu et al., 2016) and BLEU (Papineni et al., 2001).…”
Section: Introduction
confidence: 99%
“…We experimented with three dictionaries: PanLex (Kamholz et al., 2014), MUSE (Conneau et al., 2018), and Wikipedia parallel titles. We extract parallel article titles in Wikipedia based on the inter-language links and the entities based on Wikidata (Jiang et al., 2020). The dictionaries of PanLex, MUSE, and Wikipedia contain 24K, 44K, and 2M entries, respectively, with on average 4.6, 1.4, and 1 translations per entry (English or Arabic).…”
Section: Training Data
confidence: 99%