The frequency with which we use different words changes all the time, and every so often a new lexical item is invented or an existing one falls out of use. Beyond a small sample of lexical items whose properties are well studied, little is known about the dynamics of lexical evolution. How do the lexical inventories of languages, viewed as entire systems, evolve? Is the rate of evolution of the lexicon contingent upon historical factors, or is it driven by regularities, perhaps to do with universals of cognition and social interaction? We address these questions using the Google Books N-Gram Corpus as a source of data and relative entropy as a measure of changes in the frequency distributions of words. It turns out that there are both universals and historical contingencies at work. Across several languages, we observe similar rates of change, but only at timescales of roughly five decades or more. At shorter timescales, the rate of change is highly variable and differs between languages. Major societal transformations as well as catastrophic events such as wars lead to increased change in frequency distributions, whereas stability in society has a dampening effect on lexical evolution.
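As an illustration of the measure named in this abstract, the following minimal sketch computes the relative entropy (Kullback-Leibler divergence) between the word frequency distributions of two time slices. The function name `relative_entropy`, the add-one smoothing, and the toy counts are assumptions made for this example; the abstract does not say how unseen words or corpus size are handled.

```python
import math
from collections import Counter


def relative_entropy(p_counts, q_counts, smoothing=1.0):
    """Return D(P || Q) in bits for two word-count mappings.

    Additive smoothing (an assumption of this sketch) keeps the
    divergence finite when a word occurs in only one time slice.
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + smoothing * len(vocab)
    q_total = sum(q_counts.values()) + smoothing * len(vocab)
    divergence = 0.0
    for word in vocab:
        p = (p_counts.get(word, 0) + smoothing) / p_total
        q = (q_counts.get(word, 0) + smoothing) / q_total
        divergence += p * math.log2(p / q)
    return divergence


# Toy counts standing in for two periods of a corpus (hypothetical data).
earlier = Counter({"telegraph": 40, "letter": 50, "war": 10})
later = Counter({"telegraph": 5, "letter": 45, "war": 30, "radio": 20})

print(relative_entropy(later, earlier))
```

Relative entropy is asymmetric, so the direction of comparison (later period against earlier, or the reverse) is a choice that has to be fixed before rates of change are compared across languages.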
Current research efforts in Named Entity Recognition deal mostly with the English language. Even though interest in multilingual Information Extraction is growing, only a few works report results for the Russian language. This paper introduces quality baselines for the Russian NER task. We propose a corpus that was manually annotated with organization and person names. The main purpose of this corpus is to provide a gold standard for evaluation. We implemented and evaluated two approaches to NER: knowledge-based and statistical. The first comprises several components: dictionary matching, pattern matching, and rule-based search for lexical representations of entity names within a document. We assembled a set of linguistic resources and evaluated their impact on performance. For the data-driven approach we used our own implementation of a linear-chain CRF with a rich set of features. The performance of both systems is promising (62.17% and 75.05% F1 measure), although neither employs morphological or syntactic analysis.
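To make the dictionary-matching component and the F1 measure concrete, here is a minimal sketch. It is not the authors' system: the greedy longest-match strategy, the names `match_dictionary` and `f1_score`, the exact-span scoring, and the toy gazetteer are all assumptions made for illustration.

```python
def match_dictionary(tokens, gazetteer, label):
    """Return (start, end, label) spans found by greedy longest match.

    `gazetteer` is a set of token tuples (a hypothetical resource format).
    """
    max_len = max((len(entry) for entry in gazetteer), default=0)
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first so "Anna Ivanova" beats "Anna".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in gazetteer:
                spans.append((i, i + n, label))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over exactly matching spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


tokens = ["Anna", "Ivanova", "works", "at", "Yandex"]
persons = {("Anna", "Ivanova"), ("Anna",)}
predicted = match_dictionary(tokens, persons, "PER")
gold = [(0, 2, "PER"), (4, 5, "ORG")]
print(predicted, f1_score(predicted, gold))
```

Here the person gazetteer finds one of the two gold entities, so precision is 1.0, recall is 0.5, and F1 is about 0.67.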
In linguistic theory, there is no consensus on whether the verbs in an aspectual pair stand in an inflectional or a derivational relation. At the same time, the prefixal and suffixal methods of forming aspectual pairs are contrasted in this respect. Earlier publications (e.g., Janda and Lyashevskaya 2011) pointed out the need to develop new quantitative, corpus-based approaches to this question. We propose two new approaches that compare the quantitative characteristics of aspectual pairs of both types. The first approach is based on the Google Books Ngram corpus and analyzes the dynamics of the frequency of use of the words in each pair. The aspectual pairs are taken from the databases created by Janda and Lyashevskaya. To assess numerically how close the frequency graphs are, Pearson correlation coefficients were used. The second approach introduces a numerical characteristic of the semantic proximity of the verbs in a pair using modern computational methods. Semantic proximity is calculated as the standard cosine measure between vectors representing the compatibility of the verbs in question in the corpus. Several computational models and text corpora are considered. Neither of the proposed approaches revealed significant numerical differences in semantic proximity between verbs in aspectual pairs formed by prefixation and by suffixation. This is in good agreement with the results of the earlier study by Janda and Lyashevskaya (2011). Together with the results of that study, our research shows that the prefixal and suffixal ways of forming aspectual pairs have an equal status in terms of their classification as inflectional or derivational.
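The two numerical measures named in this abstract can be sketched as follows. This is only an illustration of the standard formulas: the function names, the yearly frequency series, and the three-dimensional vectors are hypothetical and do not come from the study's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have equal length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical relative frequencies per year for the two verbs of one pair.
imperfective = [1.0, 1.2, 1.5, 1.4, 1.8]
perfective = [2.1, 2.3, 3.1, 2.9, 3.5]
print(pearson(imperfective, perfective))

# Hypothetical distributional vectors for the same two verbs.
print(cosine([0.2, 0.7, 0.1], [0.3, 0.6, 0.2]))
```

A correlation close to 1 indicates that the two frequency curves rise and fall together, and a cosine close to 1 indicates that the two verbs occur in similar contexts.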