We (Team Skurt) propose a simple method to detect lexical semantic change by clustering contextualized embeddings produced by XLM-R, using K-Means++. The underlying assumption is that contextualized embeddings that encode the same sense lie close to one another in the embedding space. Our approach is simple and generic, yet performs relatively well in both sub-tasks of SemEval-2020 Task 1. We hypothesize that the main shortcoming of our method lies in the simplicity of the clustering method used.
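The clustering step above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the toy Gaussian "sense" vectors stand in for real XLM-R embeddings, the k-means++ seeding and Lloyd iterations are implemented from scratch, and the Jensen–Shannon distance between per-period cluster distributions is one plausible change score among several.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_pp(X, k, rng, n_iter=50):
    """Lloyd's algorithm with k-means++ seeding (new centers sampled
    proportionally to squared distance from the chosen centers)."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min(((X[:, None] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def sense_distribution(labels, k):
    counts = np.bincount(labels, minlength=k).astype(float)
    return counts / counts.sum()

def js_distance(p, q):
    """Jensen-Shannon distance (base-2 logs, so the value lies in [0, 1])."""
    m = (p + q) / 2
    def kl(a, b):
        mask = a > 0
        return (a[mask] * np.log2(a[mask] / b[mask])).sum()
    return np.sqrt((kl(p, m) + kl(q, m)) / 2)

def change_score(old, new, k=2):
    """Cluster pooled occurrences, then compare per-period sense distributions."""
    X = np.vstack([old, new])
    labels = kmeans_pp(X, k, rng)
    p = sense_distribution(labels[:len(old)], k)
    q = sense_distribution(labels[len(old):], k)
    return js_distance(p, q)

# two well-separated toy "sense" regions standing in for XLM-R embeddings
sense_a = rng.normal(0.0, 0.05, size=(100, 8))
sense_b = rng.normal(5.0, 0.05, size=(100, 8))

# stable word: both corpora use both senses equally
stable = change_score(np.vstack([sense_a[:25], sense_b[:25]]),
                      np.vstack([sense_a[25:50], sense_b[25:50]]))
# changed word: the second sense appears only in the later corpus
changed = change_score(sense_a[50:100], sense_b[50:100])
```

With well-separated senses, the changed word scores near 1 and the stable word near 0, which is the signal the sub-tasks ask for.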
A reasonable requirement (among many others) for a lexical or semantic component in an information system is that it can learn incrementally from the linguistic data it is exposed to, that it can distinguish the topical impact of various terms, and that it is aware of the limits of its own knowledge. We work with a specific representation framework, semantic spaces, which accommodates the first requirement well. In this short paper, we investigate the global qualities of semantic spaces with a topological procedure, mapper, which gives an indication of the topical density of the space; we examine the local context of terms of interest with another topologically inspired approach, which gives an indication of the neighbourhood of those terms. Our aim is to be able to establish the qualities of the semantic space under consideration without resorting to inspection of the data used to build it.
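The local, neighbourhood-based inspection can be illustrated with a crude probe. This is a simplified sketch, not the paper's mapper procedure: the term vectors are made up, and mean cosine similarity to the k nearest neighbours is used as one rough indicator of local density around a term of interest.

```python
import numpy as np

# toy term vectors standing in for a semantic space (hypothetical data)
terms = ["cat", "dog", "car", "truck"]
V = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
])

def cosine_neighbours(term, terms, V, k=2):
    """Return the k nearest neighbours of a term by cosine similarity."""
    i = terms.index(term)
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    sims = Vn @ Vn[i]
    order = [j for j in np.argsort(-sims) if j != i][:k]
    return [(terms[j], float(sims[j])) for j in order]

def neighbourhood_density(term, terms, V, k=2):
    """Mean similarity to the k nearest neighbours: a crude local-density probe."""
    return float(np.mean([s for _, s in cosine_neighbours(term, terms, V, k)]))
```

A term sitting in a dense topical region scores high on this probe without any need to inspect the underlying corpus, which is the point of the approach.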
Machine learning offers significant benefits for systems that process and understand natural language: a) lower maintenance and upkeep costs than when using manually-constructed resources, b) easier portability to new domains, tasks, or languages, and c) robust and timely adaptation to situation-specific settings. However, the behaviour of an adaptive system is less predictable than when using an edited, stable resource, which makes quality control a continuous issue. This paper proposes an evaluation benchmark for measuring the quality, coverage, and stability of a natural language system as it learns word meaning. Inspired by existing tests for human vocabulary learning, we outline measures for the quality of semantic word representations, such as when learning word embeddings or other distributed representations. These measures highlight differences between the types of underlying learning processes as systems ingest progressively more data.
This paper investigates data-driven segmentation using RePair and Byte Pair Encoding techniques. In contrast to previous work, which has primarily focused on subword units for machine translation, we are interested in the general properties of such segments above the word level. We call these segments r-grams, and discuss their properties and the effect they have on the token frequency distribution. We evaluate the proposed approach by demonstrating its viability in embedding techniques, in both monolingual and multilingual test settings. We also provide a number of qualitative examples illustrating the method as a language-invariant segmentation procedure.
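The core merge loop of such a segmenter can be sketched as follows. This is a minimal, assumption-laden illustration rather than the paper's implementation: it applies the RePair/BPE idea of repeatedly merging the most frequent adjacent pair, but over word tokens instead of characters, so the resulting units can span word boundaries.

```python
from collections import Counter

def rgram_merges(tokens, n_merges):
    """Repeatedly merge the most frequent adjacent pair, RePair-style,
    operating on word tokens so merged units rise above the word level."""
    tokens = list(tokens)
    merges = []
    for _ in range(n_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:          # RePair stops once no pair repeats
            break
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + "_" + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

text = "new york is big and new york can be far".split()
toks, merges = rgram_merges(text, 2)
```

On this toy input the frequent bigram "new york" is merged into a single segment, shifting mass in the token frequency distribution from two common words to one multi-word unit.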
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.