In developing technologies for code-switched speech, it would be desirable to be able to predict how much language mixing might be expected in the signal and the regularity with which it might occur. In this work, we offer various metrics that allow for the classification and visualization of multilingual corpora according to the ratio of languages represented, the probability of switching between them, and the time-course of switching. Applying these metrics to corpora of different languages and genres, we find that they display distinct probabilities and periodicities of switching, information useful for speech processing of mixed-language data.
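As a rough illustration of the first and last of these quantities, the sketch below computes the proportion of tokens per language and the lengths of consecutive same-language spans from a sequence of per-token language tags. The tag sequence and the function names `language_ratios` and `span_lengths` are hypothetical, chosen for illustration; they are not the metrics defined in the paper.

```python
from collections import Counter
from itertools import groupby


def language_ratios(tags):
    """Proportion of tokens carrying each language tag."""
    if not tags:
        return {}
    counts = Counter(tags)
    total = len(tags)
    return {lang: n / total for lang, n in counts.items()}


def span_lengths(tags):
    """Consecutive same-language runs as (language, length), in order."""
    return [(lang, sum(1 for _ in run)) for lang, run in groupby(tags)]


# Hypothetical per-token language tags for a short mixed utterance.
tags = ["es", "es", "es", "en", "en", "es", "en", "en", "en", "en"]

print(language_ratios(tags))
print(span_lengths(tags))
```

The list of span lengths gives a simple view of the time-course of switching: many short spans suggest dense mixing, while a few long spans suggest infrequent alternation.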
One of the benefits of language identification that is particularly relevant for code-switching (CS) research is that it permits insight into how the languages are mixed (i.e., the level of integration of the languages). The aim of this paper is to quantify and visualize the nature of the integration of languages in CS documents using simple language-independent metrics that can be adopted by linguists. In our contribution, we (a) make a linguistic case for classifying CS types according to how the languages are integrated; (b) describe our language identification system; (c) introduce an Integration-index (I-index) derived from HMM transition probabilities; (d) employ methods for visualizing integration via a language signature (or switching profile); and (e) illustrate the utility of our simple metrics for linguists as applied to Spanish-English texts of different switching profiles.
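One way to picture an integration measure built on transition probabilities is sketched below. It assumes per-token language tags are already available, estimates language-to-language transition probabilities by counting adjacent pairs, and reports the share of adjacent pairs that cross languages. The names `transition_probabilities` and `switch_rate` are hypothetical, and the paper's exact I-index definition may differ from this simplification.

```python
from collections import defaultdict


def transition_probabilities(tags):
    """Estimate P(next language | previous language) from adjacent tag pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(tags, tags[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for prev, row in counts.items()
    }


def switch_rate(tags):
    """Fraction of adjacent token pairs whose language tags differ."""
    if len(tags) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(tags, tags[1:]) if a != b)
    return switches / (len(tags) - 1)


# Hypothetical per-token language tags (assumption, not data from the paper).
tags = ["es", "es", "es", "en", "en", "es", "en", "en", "en", "en"]

probs = transition_probabilities(tags)
print(probs)
print(switch_rate(tags))
```

Under this reading, a value near 0 indicates a text that is nearly monolingual or switches rarely, and a value near 1 indicates a text that switches at almost every token.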
Aims and objectives: This study aims to redress the paucity of research on the semantics of loanwords by extending and empirically testing Backus’s (2001; The role of semantic specificity in insertional codeswitching: Evidence from Dutch-Turkish. In R. Jacobson (Ed.), Codeswitching Worldwide, Vol. 2, 125–154) Specificity Hypothesis: ‘Embedded language elements in code-switching have a high degree of semantic specificity’ (p. 128). Approach: Adopting a concept-based approach to examine loanwords in a large, reliable corpus, the study pursues the following question: Do loanwords have a high degree of semantic specificity relative to their receiving-language equivalents? Specificity is operationalized as an entropy measure of the target word’s environment, the assumption being that more specific words have less variety in their surrounding context. Data and analysis: To test this hypothesis, Anglicisms in a 24-million-word newspaper corpus of Argentine Spanish were processed in three stages: detecting loanwords, selecting semantic equivalents, and measuring specificity. Findings/conclusions: A Wilcoxon Signed-Rank Test revealed that loanwords receive significantly lower entropy scores, that is, they are more specific than their Spanish equivalents. The results suggest a possible motive for adopting loanwords when terms already exist in the recipient language, namely, to utilize words that provide more nuanced meaning. Originality: Methodologically, this study offers innovative applications of computational methods to loanword research, employing a distributional model to measure entropy. Theoretically, it addresses an underrepresented aspect of loanword adoption, semantics, by extending Backus’s hypothesis to loanwords and increasing its scope to data often viewed as ‘monolingual’.
Significance/implications: The conclusions offer novel perspectives on loanwords with existing semantic equivalents, often viewed as ‘unnecessary’ when compared to loanwords that introduce new concepts into the recipient language (e.g. blog). With the notion of specificity, we may understand these loanwords as disruptors to the semantic system of the recipient language, dividing up the semantic space formerly occupied solely by the native equivalent, thus increasing the level of nuance expressed in the original concept.
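The idea of operationalizing specificity as the entropy of a word's environment can be sketched as follows. This is a minimal count-based stand-in, assuming a tokenized corpus and a fixed context window; the study itself reports using a distributional model, whose details are not given here. The function name `context_entropy`, the `window` parameter, and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def context_entropy(tokens, target, window=2):
    """Shannon entropy (bits) of the words found within `window` tokens of `target`."""
    contexts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Collect neighbours on both sides, excluding the target itself.
        contexts.update(tokens[lo:i] + tokens[i + 1:hi])
    total = sum(contexts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in contexts.values())


# Toy corpora: "x" recurs in the same context in `narrow`, in varied contexts in `broad`.
narrow = "a x b a x b".split()
broad = "a x b c x d".split()

print(context_entropy(narrow, "x", window=1))
print(context_entropy(broad, "x", window=1))
```

A word that keeps appearing among the same neighbours yields a lower entropy than one that appears among varied neighbours, which is the sense in which lower entropy is read as higher specificity.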
Automated methods for loanword detection have traditionally received little attention within the field of language contact. However, as research on loanwords has begun to use corpora with word counts in the millions, such volumes of data pose challenges for traditional methods of linguistic annotation. This paper presents a method for automatically detecting anglicisms in Spanish text, together with a case study that applies the method to explore the social stratification of anglicisms in Argentine media. The findings of the case study suggest that anglicisms may function as prestige markers in Argentina, which may be a logical consequence of the mode of contact: those of upper socio-economic status have greater access to outlets where loanwords seem to emerge, such as the media, the Internet, and second language education.
One language is often assumed to be dominant in code-switching (C-S), but this assumption has not been empirically tested. We operationalize the matrix language (ML) at the level of the sentence, using three common definitions. We test whether these converge and then model this convergence via a set of metrics that together quantify the nature of C-S. We conduct our experiment on four different Spanish-English corpora. Our results demonstrate that our model can separate some corpora according to whether they have a dominant ML or not but that the corpora span a range of mixing types that cannot be sorted neatly into an insertional vs. alternational dichotomy.