2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI)
DOI: 10.1109/ictai50040.2020.00067
Cross-Lingual Transfer Learning for Complex Word Identification

Cited by 5 publications (8 citation statements) · References 27 publications
“…Zaharia et al. [189] experimented with several transformer-based models, such as Multilingual BERT (mBERT) [126] and XLM-RoBERTa [41], for cross-lingual CWI. Both mBERT and XLM-RoBERTa are multilingual masked language models that are pretrained on numerous languages.…”
Section: Lexical Complexity Prediction in Languages Other Than English
confidence: 99%
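For readers unfamiliar with these models, the following minimal sketch shows multilingual masked-language-model inference with XLM-RoBERTa. The use of the Hugging Face transformers library and the example sentences are illustrative assumptions, not details from the cited work.

```python
# Minimal sketch: multilingual masked-LM inference with XLM-RoBERTa.
# The library choice (Hugging Face transformers) is an assumption for
# illustration; <mask> is XLM-R's mask token.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# The same checkpoint handles many languages without per-language models.
for text in ["The doctor prescribed a <mask> for the infection.",
             "El médico recetó un <mask> para la infección."]:
    predictions = fill_mask(text, top_k=3)
    print([p["token_str"] for p in predictions])
```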
“…XLM-RoBERTa is also pretrained on 100 languages, yet with more data [41]. Zaharia et al. [189] tested these models' performance on the WikiNews datasets provided by CWI-2018 [185]. They found that XLM-RoBERTa was the best-performing model.…”
Section: Lexical Complexity Prediction in Languages Other Than English
confidence: 99%
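As a rough illustration of how such a pretrained model could be applied to CWI cast as binary classification, consider the sketch below; the text-pair input framing, the checkpoint, and the labels are assumptions for illustration, not the cited paper's exact setup.

```python
# Hypothetical sketch: complex word identification as binary classification
# on top of XLM-RoBERTa. The input framing (target word paired with its
# sentence) and all settings are assumptions, not the paper's recipe.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)  # 0 = simple, 1 = complex

# Encode the target word together with its sentence as a text pair.
enc = tokenizer("desalination",
                "The plant relies on desalination to supply fresh water.",
                return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits

# Note: the classification head is freshly initialized here, so this
# probability is meaningless until the model is fine-tuned on CWI data.
print("P(complex) =", logits.softmax(-1)[0, 1].item())
```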
“…The Oracle functions best when applied to multiple solutions, jointly using them to obtain a final prediction. At the same time, Zaharia et al. (2020) explored the power of Transformer-based models (Vaswani et al., 2017) in cross-lingual environments by using different training scenarios depending on the scarcity of resources: zero-shot, one-shot, and few-shot learning. Moreover, CWI can also be approached as a probabilistic task.…”
Section: News
confidence: 99%
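The three scenarios differ only in how many target-language examples the model sees after training on the resource-rich source language. A schematic sketch follows; the helper functions and split sizes are hypothetical placeholders, meant only to make the data-budget logic concrete.

```python
# Schematic sketch of the zero-/one-/few-shot cross-lingual scenarios.
# fine_tune and evaluate are hypothetical stand-ins for a real training
# loop; only the target-language data budget k is the point here.

def fine_tune(model, examples):
    print(f"fine-tuning on {len(examples)} examples")

def evaluate(model, examples):
    print(f"evaluating on {len(examples)} examples")

def run_scenario(model, source_train, target_train, target_test, k):
    # Always train on the resource-rich source language (e.g., English).
    fine_tune(model, source_train)
    # zero-shot: k == 0 (no target data); one-shot: k == 1; few-shot: small k.
    if k > 0:
        fine_tune(model, target_train[:k])
    evaluate(model, target_test)

for k in (0, 1, 32):  # zero-, one-, and few-shot budgets (sizes assumed)
    run_scenario(model=None,
                 source_train=list(range(10_000)),
                 target_train=list(range(500)),
                 target_test=list(range(200)),
                 k=k)
```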
“…Their approach, based on the user's native language, identifies complex terms by automatically detecting cognates and false friends, using distributional similarity computed from fastText (Bojanowski 2017: 135-146) word embeddings. Similar types of features are used in (Zaharia 2020). To calculate similarity measures between words, the authors apply a technique presented in (Conneau 2017) that learns a linear mapping projecting two monolingual fastText embedding spaces (e.g., Spanish and German) into the same vector space.…”
Section: Related Work
confidence: 99%
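Under an orthogonality constraint, such a linear mapping has the closed-form Procrustes solution used in the refinement step of (Conneau 2017). A minimal numpy sketch is below; the random matrices stand in for fastText embeddings of seed-dictionary translation pairs, one pair per row.

```python
# Minimal sketch: learning an orthogonal linear map between two monolingual
# embedding spaces via the Procrustes solution (as in the refinement step of
# Conneau 2017). X and Y are random stand-ins for real fastText vectors of
# translation pairs, aligned row by row.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 300))  # e.g., Spanish embeddings of seed words
Y = rng.standard_normal((1000, 300))  # e.g., their German translations

# W = argmin ||XW - Y||_F subject to W orthogonal, solved via SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After mapping, similarity between a source-language word and a
# target-language word is measured directly in the shared space.
print(cosine(X[0] @ W, Y[0]))
```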