CAMB at CWI Shared Task 2018: Complex Word Identification with
            Ensemble-Based Voting

Gooding, Sian; Kochmar, Ekaterina

doi:10.18653/v1/w18-0520

Cited by 42 publications

(58 citation statements)

References 23 publications

“…Results: We report the results obtained with the sequence labelling (SEQ) model for the binary task and compare them to the current state of the art in complex word identification, the CAMB system by Gooding and Kochmar (2018), which achieved the best results across all binary and two probabilistic tracks in the CWI 2018 shared task (Yimam et al., 2018). The evaluation metric reported is the macro-averaged F1, as was used in the 2018 CWI shared task (Yimam et al., 2018).…”

Section: Results (mentioning)

confidence: 99%
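
The macro-averaged F1 mentioned in the statement above can be illustrated with a minimal sketch. The function names and the toy labels (1 = complex, 0 = non-complex) are assumptions made for illustration; this is not the shared task's official scorer.

```python
def f1_for_label(gold, pred, label):
    """F1 score for one class, treating `label` as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so each class counts equally."""
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


# Hypothetical toy labels: 1 = complex, 0 = non-complex
gold = [1, 0, 0, 1, 0, 0]
pred = [1, 0, 1, 0, 0, 0]
print(macro_f1(gold, pred))  # F1 is 0.5 for class 1 and 0.75 for class 0
```

Because the per-class scores are averaged without weighting, the less frequent class contributes as much as the more frequent one, which matters when the two classes are imbalanced.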

“…The CAMB system considers words irrespective of their context and relies on 27 features of various types, encoding lexical, syntactic, frequency-based and other types of information about individual words. The system uses Random Forests and AdaBoost for classification, but as Gooding and Kochmar (2018) report, the choice of the features, algorithm and training data depends on the genre. In addition, phrase classification is performed using a 'greedy' approach and simply labelling all phrases as complex.…”

Section: Results (mentioning)

confidence: 99%

“…In this paper, we use the data from the CWI 2018 shared task, which contains annotation for both words and word sequences (called phrases in the task), and represents three different genres of text. We focus on the binary setting (complex vs. non-complex) and compare our results to the winning system by Gooding and Kochmar (2018).…”

Section: Word (mentioning)

confidence: 99%

“…First of all, CWI systems typically address this task on a word-by-word basis, using a large number of features to capture the complexity of a word. For instance, the CWI system by Paetzold and Specia (2016c) uses a total of 69 features, while the one by Gooding and Kochmar (2018) uses 27 features. Secondly, systems performing CWI in a static manner are unable to take the context into account, thus failing to predict word complexity for polysemous words as well as words in various metaphorical or novel contexts.…”

Section: Introduction (mentioning)

confidence: 99%

Complex Word Identification as a Sequence Labelling Task

Gooding, Kochmar

2019

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Self Cite

Complex Word Identification (CWI) is concerned with detection of words in need of simplification and is a crucial first step in a simplification pipeline. It has been shown that reliable CWI systems considerably improve text simplification. However, most CWI systems to date address the task on a word-by-word basis, not taking the context into account. In this paper, we present a novel approach to CWI based on sequence modelling. Our system is capable of performing CWI in context, does not require extensive feature engineering and outperforms state-of-the-art systems on this task.

“…For the Second CWI Shared Task (Yimam et al., 2018), participants built monolingual models using the datasets previously described, and also tested their cross-lingual capabilities on newly collected French data. In the monolingual track, the best systems for English (Gooding and Kochmar, 2018) differed significantly, in terms of feature set size and model complexity, from the best systems for German and Spanish (Kajiwara and Komachi, 2018). The latter used Random Forests with eight features, whilst the former used AdaBoost with 5000 estimators or ensemble voting combining AdaBoost and Random Forest classifiers, with about 20 features.…”

Section: Introduction (mentioning)

confidence: 99%
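
The ensemble voting described in the statement above, combining AdaBoost and Random Forest classifiers, might look roughly like the following scikit-learn sketch. The two features (word length and log frequency), the toy data, the estimator counts and the choice of soft voting are all assumptions made for illustration; they are not the CAMB system's actual features or configuration.

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    RandomForestClassifier,
    VotingClassifier,
)

# Hypothetical toy features per word: [word length, log frequency].
# Labels: 1 = complex, 0 = non-complex.
X_train = [
    [3, 6.1], [4, 5.8], [5, 5.2], [2, 6.5],
    [11, 1.2], [12, 0.9], [10, 1.5], [13, 0.7],
]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

# Soft voting averages the class probabilities of the two models.
# Estimator counts are kept small here; they are illustrative only.
ensemble = VotingClassifier(
    estimators=[
        ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

# One short, frequent word and one long, rare word (hypothetical values)
X_new = [[4, 5.9], [12, 1.0]]
predictions = ensemble.predict(X_new)
print(list(predictions))
```

Passing `voting="hard"` instead would take a majority vote over the predicted labels rather than averaging probabilities; which variant a given system uses is a design choice this sketch does not settle.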

Strong Baselines for Complex Word Identification across Multiple Languages

Finnimore, Fritzsch, King et al.

2019

Proceedings of the 2019 Conference of the North

Complex Word Identification (CWI) is the task of identifying which words or phrases in a sentence are difficult to understand by a target audience. The latest CWI Shared Task released data for two settings: monolingual (i.e. train and test in the same language) and crosslingual (i.e. test in a language not seen during training). The best monolingual models relied on language-dependent features, which do not generalise in the cross-lingual setting, while the best cross-lingual model used neural networks with multi-task learning. In this paper, we present monolingual and cross-lingual CWI models that perform as well as (or better than) most models submitted to the latest CWI Shared Task. We show that carefully selected features and simple learning models can achieve state-of-the-art performance, and result in strong baselines for future development in this area. Finally, we discuss how inconsistencies in the annotation of the data can explain some of the results obtained.

Predicting lexical complexity in English texts: the Complex 2.0 dataset

Shardlow, Evans, Zampieri

2022

Language Resources and Evaluation

Identifying words which may cause difficulty for a reader is an essential step in most lexical text simplification systems prior to lexical substitution and can also be used for assessing the readability of a text. This task is commonly referred to as complex word identification (CWI) and is often modelled as a supervised classification problem. For training such systems, annotated datasets in which words and sometimes multi-word expressions are labelled regarding complexity are required. In this paper we analyze previous work carried out in this task and investigate the properties of CWI datasets for English. We develop a protocol for the annotation of lexical complexity and use this to annotate a new dataset, CompLex 2.0. We present experiments using both new and old datasets to investigate the nature of lexical complexity. We found that a Likert-scale annotation protocol provides an objective setting that is superior for identifying the complexity of words compared to a binary annotation protocol. We release a new dataset using our new protocol to promote the task of Lexical Complexity Prediction.
