Simon Hengchen scite author profile

Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change

State-of-the-art models of lexical semantic change detection suffer from noise stemming from vector space alignment. We have empirically tested the Temporal Referencing method for lexical semantic change and show that, by avoiding alignment, it is less affected by this noise. We show that, trained on a diachronic corpus, the skip-gram with negative sampling architecture with temporal referencing outperforms alignment models on a synthetic task as well as a manual test set. We introduce a principled way to simulate lexical semantic change and systematically control for possible biases.

* The order has been randomly determined and all authors contributed equally to this work.


SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection

Schlechtweg, McGillivray, Hengchen et al. 2020

106


Lexical Semantic Change detection, i.e., the task of identifying words that change meaning over time, is a very active research area, with applications in NLP, lexicography, and linguistics. Evaluation is currently the most pressing problem in Lexical Semantic Change detection, as no gold standards are available to the community, which hinders progress. We present the results of the first shared task that addresses this gap by providing researchers with an evaluation framework and manually annotated, high-quality datasets for English, German, Latin, and Swedish. 33 teams submitted 186 systems, which were evaluated on two subtasks.


Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

Hill, Hengchen 2019


This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, by reflecting on the potential for predicting the quality of OCR where no ground-truth exists.


From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction

Hämäläinen, Hengchen 2019


A great deal of historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We present a fully automatic unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.


GASC: Genre-Aware Semantic Change for Ancient Greek

Perrone, Palma, Hengchen et al. 2019


Word meaning changes over time, depending on linguistic and extra-linguistic factors. Associating a word's correct meaning in its historical context is a central challenge in diachronic research, and is relevant to a range of NLP tasks, including information retrieval and semantic search in historical texts. Bayesian models for semantic change have emerged as a powerful tool to address this challenge, providing explicit and interpretable representations of semantic change phenomena. However, while corpora typically come with rich metadata, existing models are limited by their inability to exploit contextual information (such as text genre) beyond the document timestamp. This is particularly critical in the case of ancient languages, where lack of data and long diachronic span make it harder to draw a clear distinction between polysemy (the fact that a word has several senses) and semantic change (the process of acquiring, losing, or changing senses), and current systems perform poorly on these languages. We develop GASC, a dynamic semantic change model that leverages categorical metadata about the texts' genre to boost inference and uncover the evolution of meanings in Ancient Greek corpora. In a new evaluation framework, our model achieves improved predictive performance compared to the state of the art.


Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change