SemEval 2016 Task 11: Complex Word Identification

Paetzold, Gustavo Henrique; Specia, Lucia

doi:10.18653/v1/s16-1085

Cited by 140 publications

(231 citation statements)

References 6 publications

Supporting

Mentioning

220

Contrasting

Unclassified

“…Previous competitive approaches to complex word identification are many times based on word frequency thresholding as we implement here (see (Wrobel, 2016) who obtained the best F-score in the recent Complex Word Identification task (Paetzold and Specia, 2016))…”

Section: Discussion (mentioning)

confidence: 99%

Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems

2017

“…We also discuss results here for a system which used Simple English Wikipedia word frequencies, though we did not submit it to the challenge (for consistency, we denote it SimpleBag). The task consisted of a training data set (N = 2,237), which was available during development, and a test data set (N = 88,221) on which the competition was scored and the labels were only released after the competition (Paetzold and Specia, 2016). We use both data sets here to analyze the performance of the classifiers.…”

Section: Methods (mentioning)

confidence: 99%

Pomona at SemEval-2016 Task 11: Predicting Word Complexity Based on Corpus Frequency

Kauchak

2016

Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

We introduce a word frequency-based classifier for the SemEval 2016 complex word identification task (#11). Words with lower frequency are predicted as complex based on a threshold optimized for G-score. We examine three different corpora for calculating frequencies and find English Wikipedia to perform best (ranked 13th on the SemEval task), followed by the Google Web Corpus and lastly Simple English Wikipedia. Bagging is also shown to slightly improve the performance of the classifier. Overall, we find word frequency to be a strong predictor of complexity. On the SemEval "test" set, a frequency classifier that uses the optimal frequency threshold performs on par with the best submitted system, and a system trained using only 500 labeled examples split from the test set achieves results that are only slightly below the best submitted system.
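
The thresholding idea in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the G-score is assumed here to be the harmonic mean of accuracy and recall (the abstract does not spell it out), and the function names, the toy frequencies, and the toy labels are all hypothetical.

```python
from typing import List, Tuple


def g_score(gold: List[int], pred: List[int]) -> float:
    """Harmonic mean of accuracy and recall (assumed G-score definition)."""
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    n_complex = sum(gold)
    true_pos = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    recall = true_pos / n_complex if n_complex else 0.0
    if accuracy + recall == 0:
        return 0.0
    return 2 * accuracy * recall / (accuracy + recall)


def predict_complex(freqs: List[int], threshold: int) -> List[int]:
    """Label a word complex (1) when its corpus frequency is below the threshold."""
    return [1 if f < threshold else 0 for f in freqs]


def best_threshold(freqs: List[int], gold: List[int]) -> Tuple[int, float]:
    """Try every observed frequency as a cut-off and keep the best G-score."""
    candidates = sorted(set(freqs)) + [max(freqs) + 1]
    best_t, best_g = candidates[0], -1.0
    for t in candidates:
        score = g_score(gold, predict_complex(freqs, t))
        if score > best_g:  # strict comparison keeps the lowest best threshold
            best_t, best_g = t, score
    return best_t, best_g


# Hypothetical toy data: corpus counts and gold labels (1 = complex).
TOY_FREQS = [3, 8, 20, 150, 900, 4000]
TOY_GOLD = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    print(best_threshold(TOY_FREQS, TOY_GOLD))
```

On real data the threshold would be tuned on the training set and applied unchanged to the test set; the exhaustive search over observed frequencies is only one simple way to do the tuning.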

“…As a consequence, none of the systems that participated in the SemEval task managed to beat the accuracy of the "All Simple" baseline which labeled all words in the test set as simple (0.953). As noted by Paetzold and Specia (2016), the inverse problem is present in the corpus developed by Shardlow (2013b), where the "All Complex" baseline achieved higher accuracy, recall and F-scores than all other tested systems, suggesting that marking all words in a sentence as complex is the most effective approach for CWI.…”

[Footnote 6 in the citing paper: The word2vec training parameters we use are a context window of size 3, learning rate alpha from 0.025 to 0.0001, minimum word count 100, sampling parameter 1e-4, 10 negative samples per target word, and 5 training epochs.]

Section: Complex Word Identification (mentioning)

confidence: 99%

Simplification Using Paraphrases and Context-Based Lexical Substitution

Kriz, Miltsakaki, Apidianaki et al.

2018

Proceedings of the 2018 Conference of the North American Chapter Of the Association for Computational Linguistics: Hu

Lexical simplification involves identifying complex words or phrases that need to be simplified, and recommending simpler meaning-preserving substitutes that can be more easily understood. We propose a complex word identification (CWI) model that exploits both lexical and contextual features, and a simplification mechanism which relies on a word-embedding lexical substitution model to replace the detected complex words with simpler paraphrases. We compare our CWI and lexical simplification models to several baselines, and evaluate the performance of our simplification system against human judgments. The results show that our models are able to detect complex words with higher accuracy than other commonly used methods, and propose good simplification substitutes in context. They also highlight the limited contribution of context features for CWI, which nonetheless improve simplification compared to context-unaware models.
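
The "All Simple" baseline mentioned in the citation statement for this paper shows why accuracy alone is misleading on a heavily imbalanced test set. The sketch below is only an illustration: the label counts are hypothetical, chosen so that the baseline reproduces the reported 0.953 accuracy, and the function name is not from either paper.

```python
from typing import List, Tuple


def all_simple_baseline(gold: List[int]) -> Tuple[float, float]:
    """Predict 'simple' (0) for every word and return (accuracy, recall)."""
    pred = [0] * len(gold)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    n_complex = sum(gold)
    # No word is ever predicted complex, so there are no true positives.
    true_pos = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    recall = true_pos / n_complex if n_complex else 0.0
    return accuracy, recall


# Hypothetical label distribution: 47 complex words out of 1000.
IMBALANCED_GOLD = [1] * 47 + [0] * 953

if __name__ == "__main__":
    acc, rec = all_simple_baseline(IMBALANCED_GOLD)
    print(f"accuracy={acc:.3f} recall={rec:.3f}")
```

The baseline scores high accuracy while finding none of the complex words, which is why a measure that also rewards recall is more informative for this task.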
