Automatic Detection of Intra-Word Code-Switching

Nguyen, Dong-Phuong; Cornips, L.

doi:10.18653/v1/w16-2013

Cited by 13 publications

(13 citation statements)

References 67 publications


“…Their language identifier obtained 98.9% precision when classifying texts of four "screen lines" between 19 languages. Nguyen and Cornips (2016) used odds ratio to identify the language of parts of words when identifying between two languages. Odds ratio for language g when compared with language h for morph f_i is calculated as in Equation 36.…”

Section: Perplexity (mentioning)

confidence: 99%

Automatic Language Identification in Texts: A Survey

Jauhiainen, Lui, Zampieri, et al. 2019

JAIR

102


Language identification ("LI") is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Research in this area has recently been especially active. This article provides a brief history of LI research, and an extensive survey of the features and methods used in the LI literature. We describe the features and methods using a unified notation, to make the relationships between methods clearer. We discuss evaluation methods, applications of LI, as well as off-the-shelf LI systems that do not require training by the end user. Finally, we identify open issues, survey the work to date on each issue, and propose future directions for research in LI.

LI as a task predates computational methods; the earliest interest in the area was motivated by the needs of translators, and simple manual methods were developed to quickly identify documents in specific languages. The earliest known work to describe a functional LI program for text is by Mustonen (1965), a statistician, who used multiple discriminant analysis to teach a computer how to distinguish, at the word level, between English, Swedish and Finnish. Mustonen compiled a list of linguistically-motivated character-based features, and trained his language identifier on 300 words for each of the three target languages. The training procedure created two discriminant functions, which were tested with 100 words for each language. The experiment resulted in 76% of the words being correctly classified; even by current standards this percentage would be seen as acceptable given the small amount of training material, although the composition of training and test data is not clear, making the experiment unreproducible.

In the early 1970s, Nakamura (1971) considered the problem of automatic LI. According to Rau (1974) and the available abstract of Nakamura's article, his language identifier was able to distinguish between 25 languages written with the Latin alphabet. As features, the method used the occurrence rates of characters and words in each language. From the abstract it seems that, in addition to the frequencies, he used some binary presence/absence features of particular characters or words, based on manual LI. Rau (1974) wrote his master's thesis "Language Identification by Statistical Analysis" for the Naval Postgraduate School at Monterey, California. The continued interest and the need to use LI of text in military intelligence settings is evidenced by the recent articles of, for example, Rafidha Rehiman et al. (2013), Rowe et al. (2013), and Voss et al. (2014). As features for LI, Rau (1974) used, e.g., the relative frequencies of characters and character bigrams. With a majority vote classifier ensemble of seven classifiers using Kolmogorov-Smirnov's Test of Goodness of Fit and Yule's characteristic (K), he managed...
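The citation statement from this survey mentions an odds ratio between language g and language h for a morph, but Equation 36 itself is not reproduced on this page. The following is a minimal Python sketch under the textbook definition of an odds ratio, which may differ from the survey's exact formulation. The function names, the clamping constant `eps`, and the example probabilities are all assumptions made for illustration.

```python
def odds(p: float, eps: float = 1e-9) -> float:
    """Convert a probability into odds p / (1 - p).

    The probability is clamped to [eps, 1 - eps] so that neither the
    numerator nor the denominator becomes zero (an assumption of this
    sketch, not something stated in the source).
    """
    p = min(max(p, eps), 1.0 - eps)
    return p / (1.0 - p)


def odds_ratio(p_g: float, p_h: float) -> float:
    """Textbook odds ratio of language g against language h for one morph.

    p_g and p_h are the (hypothetical) probabilities of the morph under
    each language. A value above 1 favours g; below 1 favours h.
    """
    return odds(p_g) / odds(p_h)


def prefer_language(p_g: float, p_h: float) -> str:
    """Pick the language favoured by the odds ratio; ties go to h."""
    return "g" if odds_ratio(p_g, p_h) > 1.0 else "h"


if __name__ == "__main__":
    # Illustrative probabilities only; they are not taken from any paper.
    print(odds_ratio(0.8, 0.2))
    print(prefer_language(0.8, 0.2))
```

An odds ratio of exactly 1 means the morph is equally likely under both languages, which is why this sketch treats it as a non-preference for g.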


“…For Nepali-English, Barman et al (2014) correctly identified some of the mixed words with a combination of linear kernel support vector machines and a k-nearest neighbour approach. The most similar work to ours is Nguyen and Cornips (2016), which focused on detecting intra-word CS for Dutch-Limburgish (Nguyen et al, 2015). The authors utilized Morfessor (Creutz and Lagus, 2002) to segment all words into morphemes and Wikipedia to assign LID probabilities to each morpheme.…”

Section: Related Work (mentioning)

confidence: 96%

Subword-Level Language Identification for Intra-Word Code-Switching

Mager, Çetinoğlu, Kann 2019

Proceedings of the 2019 Conference of the North


Language identification for code-switching (CS), the phenomenon of alternating between two or more languages in conversations, has traditionally been approached under the assumption of a single language per token. However, if at least one language is morphologically rich, a large number of words can be composed of morphemes from more than one language (intra-word CS). In this paper, we extend the language identification task to the subword level, such that it includes splitting mixed words while tagging each part with a language ID. We further propose a model for this task, which is based on a segmental recurrent neural network. In experiments on a new Spanish-Wixarika dataset and on an adapted German-Turkish dataset, our proposed model performs slightly better than or roughly on par with our best baseline, respectively. Considering only mixed words, however, it strongly outperforms all baselines.
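The citation statements on this page describe segmenting words into morphemes, assigning a language to each morpheme, and treating a word as mixed when its subunits are associated with different languages. The sketch below illustrates only that last tagging step in Python. It assumes the segmentation has already been done by some external tool, and the probability table `TOY_PROBS`, the labels `lang_a` and `lang_b`, and both function names are hypothetical placeholders rather than anything from the cited papers.

```python
from typing import Dict, List, Tuple

# Hypothetical per-morpheme language probabilities. In the cited work such
# values are said to come from Wikipedia; these numbers are made up.
TOY_PROBS: Dict[str, Dict[str, float]] = {
    "ab": {"lang_a": 0.9, "lang_b": 0.1},
    "cd": {"lang_a": 0.2, "lang_b": 0.8},
}


def tag_morphemes(
    morphemes: List[str],
    probs: Dict[str, Dict[str, float]],
    default: str = "unknown",
) -> List[Tuple[str, str]]:
    """Label each morpheme with its most probable language.

    Morphemes missing from the table receive the `default` label.
    """
    tagged = []
    for morpheme in morphemes:
        scores = probs.get(morpheme)
        label = max(scores, key=scores.get) if scores else default
        tagged.append((morpheme, label))
    return tagged


def is_mixed_word(
    morphemes: List[str],
    probs: Dict[str, Dict[str, float]],
    default: str = "unknown",
) -> bool:
    """Return True if the morphemes carry at least two distinct known languages."""
    labels = {label for _, label in tag_morphemes(morphemes, probs, default)}
    labels.discard(default)
    return len(labels) > 1


if __name__ == "__main__":
    print(tag_morphemes(["ab", "cd"], TOY_PROBS))
    print(is_mixed_word(["ab", "cd"], TOY_PROBS))
```

Taking the arg-max per morpheme is the simplest possible decision rule; a sequence model such as the segmental recurrent neural network named in the abstract above would instead score segmentations and labels jointly.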


“…Notably, Solorio and Liu (2008) trained classifiers to predict code-switching points in Spanish and English, using different learning algorithms and transcriptions of code-switched discourse, while Nguyen and Dogruöz (2013) focused on word-level language identification (in Dutch-Turkish news commentary). Nguyen and Cornips (2016) describe work on analyzing and detecting intra-word code-mixing by first segmenting words into smaller units and later identifying words composed of sequences of subunits associated with different languages in tweets (posts on the Twitter social-media site).…”

Section: Introduction (mentioning)

confidence: 99%

Language Identification in Code-Switched Text Using Conditional Random Fields and Babelnet

Sikdar, Gambäck 2016

Proceedings of the Second Workshop on Computational Approaches to Code Switching


The paper outlines a supervised approach to language identification in code-switched data, framing this as a sequence labeling task where the label of each token is identified using a classifier based on Conditional Random Fields and trained on a range of different features, extracted both from the training data and by using information from Babelnet and Babelfy. The method was tested on the development dataset provided by organizers of the shared task on language identification in code-switched data, obtaining tweet-level monolingual, code-switched and weighted F1-scores of 94%, 85% and 91%, respectively, with a token-level accuracy of 95.8%. When evaluated on the unseen test data, the system achieved 90%, 85% and 87.4% monolingual, code-switched and weighted tweet-level F1-scores, and a token-level accuracy of 95.7%.
