Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.51
Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora

Abstract: The problem of comparing two bodies of text and searching for words that differ in their usage between them arises often in digital humanities and computational social science. This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large. However, these methods often require extensive filtering of the vocabulary to perform well, and, as we show in this work, result in unstable, and hence less reliable…
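The alignment-based pipeline the abstract describes — train embeddings per corpus, align the two spaces, rank words by cosine distance — can be sketched roughly as follows. The orthogonal-Procrustes alignment and the toy random embeddings are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal matrix W minimizing ||XW - Y||_F (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def usage_change_scores(X, Y):
    """Per-word cosine distance after aligning X's space onto Y's."""
    Xa = X @ procrustes_align(X, Y)
    cos = np.sum(Xa * Y, axis=1) / (
        np.linalg.norm(Xa, axis=1) * np.linalg.norm(Y, axis=1))
    return 1.0 - cos

# Toy setup: corpus B's space is a rotation of corpus A's, except that
# word 0 gets a fresh random vector, i.e. its usage "changed".
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                  # corpus A embeddings
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))  # random orthogonal map
Y = X @ Q                                       # corpus B = rotated copy
Y[0] = rng.normal(size=50)                      # the changed word
scores = usage_change_scores(X, Y)
print(scores.argmax())  # word 0 should rank as most changed
```

This is exactly the baseline whose instability the paper critiques: the ranking depends on the learned spaces and the alignment, both of which vary across random initializations.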

Cited by 37 publications (37 citation statements). References 33 publications.
“…More broadly, prior work by Wendlandt et al. (2018), Antoniak and Mimno (2018), and Gonen et al. (2020), among others, has also shown embedding stability to be a concern in models trained on larger corpora than those used in this work. However, the role of random embedding effects on previous qualitative studies using word embeddings (e.g., Kulkarni et al. (2015), Hamilton et al. (2016)) has not been evaluated.…”
Section: Confidence Estimation in Embedding Analysis (mentioning; confidence: 86%)
“…This method also avoids the conceptual difficulties and low replicability of comparing embedding spaces numerically (e.g. by cosine distances) (Gonen et al., 2020). However, even nearest neighborhoods are often unstable, and vary dramatically across runs of the same embedding algorithm on the same corpus (Wendlandt et al., 2018; Antoniak and Mimno, 2018).…”
Section: Identifying Stable Embeddings for Analysis (mentioning; confidence: 99%)
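The neighborhood-based comparison this excerpt contrasts with numerical space comparison can be sketched as follows: score each word by how much its top-k nearest-neighbor set differs between the two corpora. The toy vectors, the choice of k, and the overlap score are illustrative assumptions rather than the cited paper's exact procedure.

```python
import numpy as np

def topk_neighbors(E, i, k):
    """Indices of word i's k nearest neighbors by cosine similarity."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = En @ En[i]
    sims[i] = -np.inf                  # a word is not its own neighbor
    return set(np.argsort(-sims)[:k])

def neighborhood_change(E1, E2, k=2):
    """Score each word by how many of its top-k neighbors it loses
    between the two spaces; higher means more usage change."""
    return np.array([k - len(topk_neighbors(E1, i, k) & topk_neighbors(E2, i, k))
                     for i in range(len(E1))])

# Toy 2-D vocabulary of five words in two clusters; in corpus B,
# word 0 migrates from the x-axis cluster to the y-axis cluster.
E1 = np.array([[1.0, 0.0], [0.95, 0.05], [0.9, 0.1],
               [0.0, 1.0], [0.05, 0.95]])
E2 = E1.copy()
E2[0] = [0.05, 1.0]
scores = neighborhood_change(E1, E2, k=2)
print(scores.argmax())  # word 0 should score highest
```

Because the score depends only on each word's local neighborhood rather than on a global alignment of the two spaces, it sidesteps the replicability issues of cosine-distance comparisons noted above — though, as the excerpt adds, neighborhoods themselves can also be unstable across training runs.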
“…This approach builds on recent advancements in the study of language change using distributed temporal word embeddings (Bianchi et al., 2020; Di Carlo et al., 2019; Hilpert & Perek, 2015; Kim et al., 2014; Kulkarni et al., 2015; Perek, 2014; Sagi et al., 2011; Yao et al., 2018). Finally, we implemented and evaluated three different ways of capturing language change, which target both item‐ and neighborhood‐level patterns of change in usage patterns (Gonen et al., 2020; Hamilton et al., 2016a), offering a more thorough and insightful characterization of the phenomenon at hand.…”
Section: Discussion (mentioning; confidence: 99%)
“…We use three different methods to quantify language change (Gonen et al., 2020; Hamilton et al., 2016a). In order to ensure comparability across words, we only computed measures of language change for words that appeared at least 25 times in each of the six slices into which the corpus was divided, such that reliable lexical representations could be learned.…”
Section: Methods (mentioning; confidence: 99%)