We propose a quantitative approach to measuring the morphological complexity of a language from text. Several corpus-based methods have focused on measuring the different word forms that a language can produce. We take into account not only the productivity of morphological processes but also their predictability. We use a language model that predicts the probability of sub-word sequences within a word; we calculate the entropy rate of this model and use it as a measure of the predictability of the internal structure of words. Our results show that it is important to integrate these two dimensions when measuring morphological complexity, since a language can be complex under one measure but simpler under the other. We calculated the complexity measures on two different parallel corpora for a typologically diverse set of languages. Our approach is corpus-based and does not require linguistically annotated data.
Entropy 2020, 22, 48
... the statistical language models used in natural language processing (NLP), which are a useful tool for estimating a probability distribution over sequences of words within a language. However, we adapt this notion to the sub-word level. Information theory-based measures (entropy) can be used to estimate the predictiveness of these models.
Previous Work
Despite the different approaches to and definitions of linguistic complexity, some main distinctions recur, such as that between absolute and relative complexity [3]. The former is defined in terms of the number of parts of a linguistic system; the latter (more subjective) is related to the cost and difficulty faced by language users. Another important distinction concerns global complexity, which characterizes entire languages, e.g., as easy or difficult to learn.
In contrast, particular complexity focuses on only one specific language level, e.g., phonological, morphological, or syntactic.
In the case of morphology, the languages of the world have different word production processes. Therefore, the amount of semantic and grammatical information encoded at the word level may vary significantly from language to language. In this sense, it is important to quantify the morphological richness of languages and how it varies depending on their linguistic typology. Ackerman and Malouf [9] highlight two different dimensions that must be taken into account: enumerative complexity (e-complexity), which focuses on delimiting the inventories of language elements (the number of morphosyntactic categories in a language and how they are encoded in a word); and integrative complexity (i-complexity), which focuses on examining the systematic organization underlying the surface patterns of a language (the difficulty of the paradigmatic system).
Cotterell et al. [10] investigate a trade-off between the e-complexity and i-complexity of morphological systems. The authors propose a measure based on the size of a paradigm but also on how hard it is to jointly predict all the word forms in a paradigm from the lemma. They conclude that "a morpholog...
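The two dimensions discussed in this excerpt, productivity and predictability, can be illustrated with a minimal Python sketch. It rests on stated assumptions and is not the authors' implementation: productivity is approximated here by a type-token ratio, and predictability by the per-transition cross-entropy of a character bigram model with add-one smoothing, estimated on the same tokens it was trained on. The function names, the `#` boundary symbol, and the toy word lists are all hypothetical.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Distinct word forms divided by total tokens (a simple productivity proxy)."""
    return len(set(tokens)) / len(tokens)


def char_bigram_entropy(tokens):
    """Average bits per character transition under an add-one-smoothed
    character bigram model, trained and evaluated on the same tokens.
    This is a stand-in for the sub-word language model in the text."""
    boundary = "#"  # hypothetical word-boundary symbol
    pair_counts = Counter()
    context_counts = Counter()
    alphabet = {boundary}
    for word in tokens:
        symbols = [boundary] + list(word) + [boundary]
        alphabet.update(symbols)
        for prev, nxt in zip(symbols, symbols[1:]):
            pair_counts[(prev, nxt)] += 1
            context_counts[prev] += 1

    vocab_size = len(alphabet)
    total_bits = 0.0
    total_transitions = 0
    for (prev, nxt), count in pair_counts.items():
        # Add-one smoothing keeps every probability strictly positive.
        prob = (count + 1) / (context_counts[prev] + vocab_size)
        total_bits -= count * math.log2(prob)
        total_transitions += count
    return total_bits / total_transitions


# Toy token lists, invented only to exercise the two functions.
regular_tokens = ["abab", "abab", "abab", "abab"]
varied_tokens = ["abab", "baba", "aabb", "bbaa"]

for name, toks in [("regular", regular_tokens), ("varied", varied_tokens)]:
    print(name, type_token_ratio(toks), char_bigram_entropy(toks))
```

Under these assumptions, a corpus with many distinct word forms scores high on the first measure, while a corpus whose words have hard-to-predict internal structure scores high on the second; the two need not agree, which is the point the abstract makes about integrating both.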
ABSTRACT. This article addresses verbal inflection using concepts from word-based morphology, in particular the Word and Paradigm model. We propose a methodology for analyzing Spanish verbal inflection, restricted to the first conjugation. The analysis rests on determining the principal parts that allow the verbal paradigms of the first conjugation to be predicted. We show that the complete paradigm of a verb can be predicted from a few principal parts. We also discuss competition between schemas involving a change of stress (as in the forms "hablemos" and "háblemos"), as well as the diphthongizations /o/ → /ue/ and /e/ → /ie/, which respond to a change in the principal part. Finally, we show the advantages that the word-based model offers for the analysis of inflection.
Keywords: inflection; word-based morphology; Word and Paradigm; principal parts.
In this work we focus on the task of automatically extracting a bilingual lexicon for the language pair Spanish-Nahuatl. This is a low-resource setting where only a small parallel corpus is available. Most downstream methods do not work well under low-resource conditions. This is especially true for approaches that use vector representations such as Word2Vec. Our proposal is to construct bilingual word vectors from a graph. This graph is generated using translation pairs obtained from an unsupervised word alignment method. We show that, in a low-resource setting, this type of vector is successful in representing words in a bilingual semantic space. Moreover, when a linear transformation is applied to translate words from one language to another, our graph-based representations considerably outperform the popular setting that uses Word2Vec.
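The translation step mentioned above, applying a linear transformation to map word vectors from one language to the other, can be sketched in Python as follows. This is only an illustration under assumptions: the abstract does not say how the transformation is fitted, so the sketch uses a least-squares fit by gradient descent over seed translation pairs, followed by a cosine nearest-neighbour lookup. The function names and the 2-dimensional toy vectors are hypothetical and do not come from the paper's graph-based representations.

```python
import math


def fit_linear_map(source_vecs, target_vecs, epochs=500, lr=0.1):
    """Fit W (d_tgt x d_src) minimising the mean squared error of W x - y
    over seed pairs, using plain batch gradient descent (an assumption)."""
    d_src, d_tgt = len(source_vecs[0]), len(target_vecs[0])
    n = len(source_vecs)
    W = [[0.0] * d_src for _ in range(d_tgt)]
    for _ in range(epochs):
        grad = [[0.0] * d_src for _ in range(d_tgt)]
        for x, y in zip(source_vecs, target_vecs):
            pred = [sum(W[i][j] * x[j] for j in range(d_src)) for i in range(d_tgt)]
            for i in range(d_tgt):
                err = pred[i] - y[i]
                for j in range(d_src):
                    grad[i][j] += err * x[j]
        for i in range(d_tgt):
            for j in range(d_src):
                W[i][j] -= lr * grad[i][j] / n
    return W


def apply_map(W, x):
    """Project a source-language vector into the target space."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def nearest_target(vec, target_lexicon):
    """Return the target word whose vector has the highest cosine similarity."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm
    return max(target_lexicon, key=lambda word: cosine(vec, target_lexicon[word]))


# Toy 2-d vectors invented for illustration (not real embeddings).
spanish = {"casa": [1.0, 0.0], "agua": [0.0, 1.0]}
nahuatl = {"calli": [0.0, 1.0], "atl": [1.0, 0.0]}
seed_pairs = [("casa", "calli"), ("agua", "atl")]

W = fit_linear_map([spanish[s] for s, _ in seed_pairs],
                   [nahuatl[t] for _, t in seed_pairs])

# A held-out source vector lying close to "casa" in the toy space.
query = [0.9, 0.1]
print(nearest_target(apply_map(W, query), nahuatl))
```

The same lookup applies whatever produced the vectors; in the setting described above, they would come from the alignment graph rather than from Word2Vec.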