Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017
DOI: 10.18653/v1/p17-1180
Estimating Code-Switching on Twitter with a Novel Generalized Word-Level Language Detection Technique

Abstract: Word-level language detection is necessary for analyzing code-switched text, where multiple languages could be mixed within a sentence. Existing models are restricted to code-switching between two specific languages and fail in real-world scenarios, as text input rarely has a priori information on the languages used. We present a novel unsupervised word-level language detection technique for code-switched text for an arbitrarily large number of languages, which does not require any manually annotated training da…

Cited by 76 publications (67 citation statements); References 28 publications.
“…Chen and Maison (2003) used Markovian probabilities with Witten-Bell and modified Kneser-Ney smoothing. Giwa (2016), Balažević et al. (2016), and Rijhwani et al. (2017) also recently used modified Kneser-Ney discounting. Barbaresi (2016) used both the original and modified Kneser-Ney smoothing.…”
Section: Good-Turing Discounting
Confidence: 99%
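To make the technique named in this excerpt concrete, here is a minimal sketch of interpolated Kneser-Ney smoothing over character bigrams, as it might be used to score a word under competing language models. The training corpora, discount value, padding symbol, and log-probability floor are illustrative assumptions, not the setup of any of the cited papers.

```python
import math
from collections import Counter

def train_kn_bigram(text, discount=0.75):
    """Interpolated Kneser-Ney over character bigrams (toy sketch).

    Returns a function prob(u, w) giving P_KN(w | u).
    """
    pad = "#" + text                       # "#" marks the start of the text
    bigrams = Counter(zip(pad, pad[1:]))   # c(u, w)
    contexts = Counter(pad[:-1])           # c(u)
    followers, preceders = {}, {}          # N1+(u, .) and N1+(., w)
    for (u, w) in bigrams:
        followers.setdefault(u, set()).add(w)
        preceders.setdefault(w, set()).add(u)
    total_types = len(bigrams)             # N1+(., .)

    def prob(u, w):
        # Continuation probability: how many distinct contexts precede w.
        cont = len(preceders.get(w, ())) / total_types
        cu = contexts.get(u, 0)
        if cu == 0:                        # unseen context: back off fully
            return cont
        main = max(bigrams.get((u, w), 0) - discount, 0) / cu
        lam = discount * len(followers.get(u, ())) / cu  # reserved mass
        return main + lam * cont

    return prob

def log_score(word, prob, floor=-20.0):
    """Sum of log bigram probabilities; `floor` handles zero-probability events."""
    pad = "#" + word
    total = 0.0
    for u, w in zip(pad, pad[1:]):
        p = prob(u, w)
        total += math.log(p) if p > 0 else floor
    return total
```

Training one such model per language and taking the argmax of `log_score` over the models yields a simple character-level language identifier of the kind these discounting methods support.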
“…Ueda and Nakagawa (1990) were the first to apply hidden Markov models (HMMs) to LI. More recently, HMMs have been used by Adouane and Dobnik (2017), Guzmán et al. (2017), and Rijhwani et al. (2017). Binas (2005) generated aggregate Markov models, which gave the best results when distinguishing between six languages, obtaining 74% accuracy with a text length of ten characters.…”
Section: Neural Network ("NN")
Confidence: 99%
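The HMM approach this excerpt refers to can be sketched as a small Viterbi decoder whose hidden states are languages and whose transitions penalize code-switching. The emission model, the switch penalty, and the toy lexicons below are illustrative assumptions, not the cited systems' actual parameters.

```python
def viterbi_lid(words, emit_logprob, languages, switch_penalty=2.0):
    """Word-level language tagging with an HMM-style Viterbi decoder.

    States are languages; staying in the same language is free, switching
    costs `switch_penalty` (a stand-in for log transition probabilities).
    """
    V = [{lang: emit_logprob(lang, words[0]) for lang in languages}]
    back = []
    for w in words[1:]:
        row, bp = {}, {}
        for lang in languages:
            def trans(prev):
                return V[-1][prev] - (0.0 if prev == lang else switch_penalty)
            best_prev = max(languages, key=trans)
            row[lang] = trans(best_prev) + emit_logprob(lang, w)
            bp[lang] = best_prev
        V.append(row)
        back.append(bp)
    # Backtrace from the best final state.
    last = max(languages, key=lambda l: V[-1][l])
    tags = [last]
    for bp in reversed(back):
        tags.append(bp[tags[-1]])
    return list(reversed(tags))

# Toy emission model: in-lexicon words are likely, all others unlikely.
LEX = {"en": {"i", "love"}, "hi": {"khana"}}
def emit(lang, w):
    return 0.0 if w.lower() in LEX[lang] else -5.0
```

On a code-switched input such as `"I love khana"`, the switch penalty makes the decoder prefer contiguous same-language spans rather than flipping languages on every ambiguous word.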
“…speakers of both Hindi and English. As much as 17% of Indian Facebook posts (Bali et al., 2014) and 3.5% of all tweets (Rijhwani et al., 2017) are code-mixed. This paper addresses fine-grained (token-level) language ID, which is needed for many multilingual downstream tasks, including syntactic analysis (Bhat et al., 2018), machine translation, and dialog systems.…”
Section: Introduction
Confidence: 99%
“…Some prior work has focused on identifying larger language spans in longer documents (Lui et al., 2014; Jurgens et al., 2017) or estimating the proportions of multiple languages in a text (Lui et al., 2014; Kocmi and Bojar, 2017). Others have focused on token-level language ID; some work is constrained to predicting word-level labels from a single language pair (Nguyen and Doğruöz, 2013; Solorio et al., 2014; Molina et al., 2016a; Sristy et al., 2017), while other work permits a handful of languages (Das and Gambäck, 2014; Sristy et al., 2017; Rijhwani et al., 2017). In contrast, CMX supports 100 languages.…”
Section: Introduction
Confidence: 99%
“…These methods are language-dependent and require large annotated datasets or comprehensive dictionaries of the target languages. For instance, some recent studies (Barman, Wagner, Vyas, Gella, Sharma, Bali, & Choudhury, 2014; Chrupala & Foster, 2014; Dias Cardoso & Roy, 2016; Gella, Sharma, & Bali, 2013; Lavergne, Adda, Adda-Decker, & Lamel, 2014; Piergallini, Shirvani, Gautam, & Chouikha, 2016; Rijhwani, Sequiera, Choudhury, Bali, & Maddila, 2017; Barman, Das, Wagner, & Foster, 2014) used dictionary-based methods for LID at the word level. Other studies (Banerjee et al., 2014; Chittaranjan, Vyas, Bali, & Choudhury, 2014; Dahiya, 2017; Das & Gambäck, 2014; Jaech, Mulcaire, Hathi, Ostendorf, & Smith, 2016; Jhamtani, Bhogi, & Raychoudhury, 2014; King & Abney, 2013; Mandal, Banerjee, Naskar, Rosso, & Bandyopadhyay, 2015; Nguyen & Doğruöz, 2013; Řehůřek & Kolkus, 2009) used a combination of at least two of the following methods: dictionary-based methods, rule-based methods, character n-gram modelling, and heuristics based on word-level feature modelling.…”
Section: Introduction
Confidence: 99%
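The combination this excerpt describes, dictionary lookup backed off to character n-gram modelling, can be sketched in a few lines. The lexicons, sample training texts, and tie-breaking rule below are invented for illustration and do not reproduce any cited system.

```python
from collections import Counter

def char_ngram_profile(text, n=3):
    """Build a normalized character n-gram frequency profile from sample text."""
    padded = f" {text} "                  # spaces mark word boundaries
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def word_lid(word, lexicons, profiles, n=3):
    """Dictionary-based LID with a character-n-gram fallback.

    If exactly one lexicon contains the word, that language wins;
    otherwise (no hit, or an ambiguous multi-lexicon hit) we back off
    to n-gram profile overlap.
    """
    hits = [lang for lang, lex in lexicons.items() if word.lower() in lex]
    if len(hits) == 1:
        return hits[0]
    padded = f" {word.lower()} "
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return max(profiles, key=lambda l: sum(profiles[l].get(g, 0.0) for g in grams))
```

The fallback matters precisely for the out-of-vocabulary and shared words that make purely dictionary-based LID language-dependent, as the quoted passage notes.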