Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016) 2016
DOI: 10.18653/v1/s16-1085
SemEval 2016 Task 11: Complex Word Identification

Abstract: We report the findings of the Complex Word Identification task of SemEval 2016. To create a dataset, we conduct a user study with 400 non-native English speakers, and find that complex words tend to be rarer, less ambiguous and shorter. A total of 42 systems were submitted from 21 distinct teams, and nine baselines were provided. The results highlight the effectiveness of Decision Trees and Ensemble methods for the task, but ultimately reveal that word frequencies remain the most reliable predictor of word com…

Cited by 140 publications (231 citation statements)
References 6 publications
“…Previous competitive approaches to complex word identification are many times based on word frequency thresholding as we implement here (see (Wrobel, 2016) who obtained the best F-score in the recent Complex Word Identification task (Paetzold and Specia, 2016))…”
Section: Discussion
confidence: 99%
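The frequency-thresholding approach described above can be sketched in a few lines. This is a minimal illustration, not Wrobel's (2016) actual system: the toy corpus and the threshold value are hypothetical, chosen only to show the mechanism of labeling low-frequency words as complex.

```python
# Minimal sketch of frequency-threshold complex word identification.
# Corpus and threshold are hypothetical, for illustration only.
from collections import Counter

def build_frequency_index(corpus_tokens):
    """Count raw token frequencies in a tokenized, lowercased corpus."""
    return Counter(t.lower() for t in corpus_tokens)

def is_complex(word, freq_index, threshold=5):
    """Label a word complex if its corpus frequency is below the threshold."""
    return freq_index[word.lower()] < threshold

corpus = ("the cat sat on the mat " * 10).split() + ["sesquipedalian"]
freqs = build_frequency_index(corpus)
print(is_complex("the", freqs))             # False: frequent, so "simple"
print(is_complex("sesquipedalian", freqs))  # True: rare, so "complex"
```

In practice the frequency index would come from a large reference corpus (e.g. Wikipedia or a web corpus), and the threshold would be tuned on the task's training data.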
“…We also discuss results here for a system which used Simple English Wikipedia word frequencies, though we did not submit it to the challenge (for consistency, we denote it SimpleBag). The task consisted of a training data set (N = 2,237), which was available during development, and a test data set (N = 88,221) on which the competition was scored and the labels were only released after the competition (Paetzold and Specia, 2016). We use both data sets here to analyze the performance of the classifiers.…”
Section: Methods
confidence: 99%
“…As a consequence, none of the systems that participated in the SemEval task managed to beat the accuracy of the "All Simple" baseline, which labeled all words in the test set as simple (0.953). As noted by Paetzold and Specia (2016), the inverse problem is present in the corpus developed by Shardlow (2013b), where the "All Complex" baseline achieved higher accuracy, recall and F-scores than all other tested systems, suggesting that marking all words in a sentence as complex is the most effective approach for CWI.…”
Footnote 6: The word2vec training parameters we use are a context window of size 3, learning rate alpha from 0.025 to 0.0001, minimum word count 100, sampling parameter 1e−4, 10 negative samples per target word, and 5 training epochs.
Section: Complex Word Identification
confidence: 99%
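The statement above hinges on class imbalance: if roughly 95% of test words are simple, a majority-class baseline scores about 0.95 accuracy without learning anything. The sketch below illustrates this; the label counts are hypothetical, chosen only to mirror the ~95/5 simple/complex split implied by the reported 0.953 accuracy, not the actual SemEval test distribution.

```python
# Sketch: why the "All Simple" baseline is hard to beat on an
# imbalanced test set. Label counts are hypothetical illustrations.
def all_simple_accuracy(labels):
    """Accuracy of predicting 'simple' for every word."""
    return sum(1 for y in labels if y == "simple") / len(labels)

labels = ["simple"] * 953 + ["complex"] * 47
print(all_simple_accuracy(labels))  # 0.953
```

This is also why accuracy alone is a poor metric for CWI; the task's F-score-based rankings penalize systems that never predict the minority (complex) class.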