Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1072
A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains

Abstract: We perform an interdisciplinary large-scale evaluation for detecting lexical semantic divergences in a diachronic and in a synchronic task: semantic sense changes across time, and semantic sense changes across domains. Our work addresses the superficialness and lack of comparison in assessing models of diachronic lexical change, by bringing together and extending benchmark models on a common state-of-the-art evaluation task. In addition, we demonstrate that the same evaluation task and modelling approaches can…

Cited by 67 publications (74 citation statements)
References 46 publications
“…Applying dimension-wise mean centering has the effect of spreading the vectors across the hyperplane and mitigating the hubness issue, which consequently improves word-level similarity, as emerges from the reported results. Previous work has already validated the importance of mean centering for clustering-based tasks (Suzuki et al. 2013), bilingual lexicon induction with cross-lingual word embeddings (Artetxe, Labaka, and Agirre 2018a), and for modeling lexical semantic change (Schlechtweg et al. 2019). However, to the best of our knowledge, the results summarized in Table 12 are the first evidence that also confirms its importance for semantic similarity in a wide array of languages.…”
Section: Results
confidence: 51%
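The mean centering discussed in the statement above can be sketched in a few lines: subtracting the dimension-wise mean from an embedding matrix recenters the vectors around the origin, which spreads them across the space and reduces hubness. This is a minimal NumPy illustration, not the cited authors' implementation; the function name is ours.

```python
import numpy as np

def mean_center(embeddings: np.ndarray) -> np.ndarray:
    """Subtract the per-dimension mean from every row vector.

    After centering, each embedding dimension has zero mean across
    the vocabulary, which mitigates hubness in similarity search.
    """
    return embeddings - embeddings.mean(axis=0, keepdims=True)

# Toy embedding matrix: 3 words, 2 dimensions.
vecs = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])
centered = mean_center(vecs)
# Every column of `centered` now sums to zero.
```

Cosine similarities are then computed on the centered vectors rather than the raw ones.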
“…Using different base embeddings, SGNS (Bamler and Mandt, 2017), PPMI (Yao et al., 2018), and Bernoulli embeddings (Rudolph and Blei, 2018), the results show that sharing data is beneficial regardless of the method. Temporal Referencing was first applied in the field of term extraction (Ferrari et al., 2017) and has recently been tested for diachronic LSC detection (Schlechtweg et al., 2019).…”
Section: Related Work
confidence: 99%
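The core idea of Temporal Referencing mentioned above is to train a single embedding space over all time periods while giving target words a period-specific token, so that only the targets receive one vector per period and all other words share vectors. A minimal sketch of the token-rewriting step, assuming a simple suffix scheme of our own choosing (the cited papers' exact conventions may differ):

```python
def temporal_reference(tokens, targets, period):
    """Rewrite target-word tokens with a period suffix.

    A corpus preprocessed this way and fed to any standard embedding
    trainer yields one vector per (target, period) pair, while context
    words keep a single shared vector across periods.
    """
    return [f"{tok}_{period}" if tok in targets else tok
            for tok in tokens]

# Toy example: track the target "plane" in an 1850s corpus slice.
sentence = ["the", "plane", "landed"]
rewritten = temporal_reference(sentence, {"plane"}, "1850")
# → ["the", "plane_1850", "landed"]
```

Because everything is trained in one space, no post-hoc alignment between period-specific spaces is needed.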
“…The successful outcome of semantic change detection is relevant to any diachronic textual analysis, including machine translation or normalization of historical texts (Tjong Kim Sang et al., 2017), the detection of cultural semantic shifts (Kutuzov et al., 2017), or applications in digital humanities (Tahmasebi and Risse, 2017a). However, currently the best-performing models (Hamilton et al., 2016b; Kulkarni et al., 2015; Schlechtweg et al., 2019) require a complex alignment procedure and have been shown to suffer from biases (Dubossarsky et al., 2017). This exposes them to various sources of noise influencing their predictions; a fact which has long gone unnoticed because of the lack of standard evaluation procedures in the field.…”
Section: Introduction
confidence: 99%
“…All corpora are lemmatized and POS-tagged with the TreeTagger (Schmid, 1995), and reduced to content words (nouns, verbs and adjectives). We follow the preprocessing steps described in Schlechtweg et al. (2019) that led to the best results in that study. The corpus sizes are shown in Table 1.…”
Section: Data and Gold Standard Creation
confidence: 99%
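The content-word reduction described in the statement above can be sketched as a filter over (lemma, POS) pairs. This is an illustrative stand-in, not the cited pipeline: the tag prefixes below assume a Penn-style tagset, whereas TreeTagger's actual tagset depends on the language model used.

```python
# Assumed Penn-style tag prefixes for nouns, verbs, and adjectives.
CONTENT_POS_PREFIXES = ("NN", "VB", "JJ")

def content_words(tagged):
    """Keep only lemmas whose POS tag marks a content word.

    `tagged` is a list of (lemma, pos_tag) pairs, as produced by a
    POS tagger; function words (determiners, prepositions, ...) are
    dropped before embedding training.
    """
    return [lemma for lemma, pos in tagged
            if pos.startswith(CONTENT_POS_PREFIXES)]

# Toy tagged sentence.
tagged = [("the", "DT"), ("wind", "NN"), ("change", "VB")]
reduced = content_words(tagged)
# → ["wind", "change"]
```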