Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017
DOI: 10.18653/v1/p17-1180
Estimating Code-Switching on Twitter with a Novel Generalized Word-Level Language Detection Technique

Abstract: Word-level language detection is necessary for analyzing code-switched text, where multiple languages could be mixed within a sentence. Existing models are restricted to code-switching between two specific languages and fail in real-world scenarios, as text input rarely has a priori information on the languages used. We present a novel unsupervised word-level language detection technique for code-switched text for an arbitrarily large number of languages, which does not require any manually annotated training da…

Cited by 76 publications (67 citation statements); References 28 publications.
“…Chen and Maison (2003) used Markovian probabilities with Witten-Bell and modified Kneser-Ney smoothing. Giwa (2016), Balažević et al. (2016), and Rijhwani et al. (2017) also recently used modified Kneser-Ney discounting. Barbaresi (2016) used both the original and modified Kneser-Ney smoothing.…”
Section: Good-Turing Discounting
Confidence: 99%
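To make the technique named in this excerpt concrete, here is a minimal sketch of interpolated Kneser-Ney smoothing over character bigrams, as it might be used to score a word under competing language models. The training corpora, discount value, padding symbol, and log-probability floor are illustrative assumptions, not the setup of any of the cited papers.

```python
import math
from collections import Counter

def train_kn_bigram(text, discount=0.75):
    """Interpolated Kneser-Ney over character bigrams (toy sketch).

    Returns a function prob(u, w) giving P_KN(w | u).
    """
    pad = "#" + text                       # "#" marks the start of the text
    bigrams = Counter(zip(pad, pad[1:]))   # c(u, w)
    contexts = Counter(pad[:-1])           # c(u)
    followers, preceders = {}, {}          # N1+(u, .) and N1+(., w)
    for (u, w) in bigrams:
        followers.setdefault(u, set()).add(w)
        preceders.setdefault(w, set()).add(u)
    total_types = len(bigrams)             # N1+(., .)

    def prob(u, w):
        # Continuation probability: how many distinct contexts precede w.
        cont = len(preceders.get(w, ())) / total_types
        cu = contexts.get(u, 0)
        if cu == 0:                        # unseen context: back off fully
            return cont
        main = max(bigrams.get((u, w), 0) - discount, 0) / cu
        lam = discount * len(followers.get(u, ())) / cu  # reserved mass
        return main + lam * cont

    return prob

def log_score(word, prob, floor=-20.0):
    """Sum of log bigram probabilities; `floor` handles zero-probability events."""
    pad = "#" + word
    total = 0.0
    for u, w in zip(pad, pad[1:]):
        p = prob(u, w)
        total += math.log(p) if p > 0 else floor
    return total
```

Training one such model per language and taking the argmax of `log_score` over the models yields a simple character-level language identifier of the kind these discounting methods support.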
“…Ueda and Nakagawa (1990) were the first to apply hidden Markov models (HMMs) to LI. More recently, HMMs have been used by Adouane and Dobnik (2017), Guzmán et al. (2017), and Rijhwani et al. (2017). Binas (2005) generated aggregate Markov models, which gave the best results when distinguishing between six languages, obtaining 74% accuracy with a text length of ten characters.…”
Section: Neural Network ("NN")
Confidence: 99%
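The HMM approach this excerpt refers to can be sketched as a small Viterbi decoder whose hidden states are languages and whose transitions penalize code-switching. The emission model, the switch penalty, and the toy lexicons below are illustrative assumptions, not the cited systems' actual parameters.

```python
def viterbi_lid(words, emit_logprob, languages, switch_penalty=2.0):
    """Word-level language tagging with an HMM-style Viterbi decoder.

    States are languages; staying in the same language is free, switching
    costs `switch_penalty` (a stand-in for log transition probabilities).
    """
    V = [{lang: emit_logprob(lang, words[0]) for lang in languages}]
    back = []
    for w in words[1:]:
        row, bp = {}, {}
        for lang in languages:
            def trans(prev):
                return V[-1][prev] - (0.0 if prev == lang else switch_penalty)
            best_prev = max(languages, key=trans)
            row[lang] = trans(best_prev) + emit_logprob(lang, w)
            bp[lang] = best_prev
        V.append(row)
        back.append(bp)
    # Backtrace from the best final state.
    last = max(languages, key=lambda l: V[-1][l])
    tags = [last]
    for bp in reversed(back):
        tags.append(bp[tags[-1]])
    return list(reversed(tags))

# Toy emission model: in-lexicon words are likely, all others unlikely.
LEX = {"en": {"i", "love"}, "hi": {"khana"}}
def emit(lang, w):
    return 0.0 if w.lower() in LEX[lang] else -5.0
```

On a code-switched input such as `"I love khana"`, the switch penalty makes the decoder prefer contiguous same-language spans rather than flipping languages on every ambiguous word.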
“…speakers of both Hindi and English. As much as 17% of Indian Facebook posts (Bali et al., 2014) and 3.5% of all tweets (Rijhwani et al., 2017) are code-mixed. This paper addresses fine-grained (token-level) language ID, which is needed for many multilingual downstream tasks, including syntactic analysis (Bhat et al., 2018), machine translation, and dialog systems.…”
Section: Introduction
Confidence: 99%
“…Some prior work has focused on identifying larger language spans in longer documents (Lui et al., 2014; Jurgens et al., 2017) or estimating the proportions of multiple languages in a text (Lui et al., 2014; Kocmi and Bojar, 2017). Others have focused on token-level language ID; some work is constrained to predicting word-level labels from a single language pair (Nguyen and Doğruöz, 2013; Solorio et al., 2014; Molina et al., 2016a; Sristy et al., 2017), while other work permits a handful of languages (Das and Gambäck, 2014; Sristy et al., 2017; Rijhwani et al., 2017). In contrast, CMX supports 100 languages.…”
Section: Introduction
Confidence: 99%
“…These methods are language-dependent and require large annotated datasets or comprehensive dictionaries of the target languages. For instance, some recent studies (Barman, Wagner, Vyas, Gella, Sharma, Bali, & Choudhury, 2014; Chrupala & Foster, 2014; Dias Cardoso & Roy, 2016; Gella, Sharma, & Bali, 2013; Lavergne, Adda, Adda-Decker, & Lamel, 2014; Piergallini, Shirvani, Gautam, & Chouikha, 2016; Rijhwani, Sequiera, Choudhury, Bali, & Maddila, 2017; Barman, Das, Wagner, & Foster, 2014) used dictionary-based methods for LID at the word level. Other studies (Banerjee et al., 2014; Chittaranjan, Vyas, Bali, & Choudhury, 2014; Dahiya, 2017; Das & Gambäck, 2014; Jaech, Mulcaire, Hathi, Ostendorf, & Smith, 2016; Jhamtani, Bhogi, & Raychoudhury, 2014; King & Abney, 2013; Mandal, Banerjee, Naskar, Rosso, & Bandyopadhyay, 2015; Nguyen & Doğruöz, 2013; Řehůřek & Kolkus, 2009) used a combination of at least two of the following methods: dictionary-based methods, rule-based methods, character n-gram modelling, and heuristics based on word-level feature modelling.…”
Section: Introduction
Confidence: 99%
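The combination this excerpt describes, dictionary lookup backed off to character n-gram modelling, can be sketched in a few lines. The lexicons, sample training texts, and tie-breaking rule below are invented for illustration and do not reproduce any cited system.

```python
from collections import Counter

def char_ngram_profile(text, n=3):
    """Build a normalized character n-gram frequency profile from sample text."""
    padded = f" {text} "                  # spaces mark word boundaries
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def word_lid(word, lexicons, profiles, n=3):
    """Dictionary-based LID with a character-n-gram fallback.

    If exactly one lexicon contains the word, that language wins;
    otherwise (no hit, or an ambiguous multi-lexicon hit) we back off
    to n-gram profile overlap.
    """
    hits = [lang for lang, lex in lexicons.items() if word.lower() in lex]
    if len(hits) == 1:
        return hits[0]
    padded = f" {word.lower()} "
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return max(profiles, key=lambda l: sum(profiles[l].get(g, 0.0) for g in grams))
```

The fallback matters precisely for the out-of-vocabulary and shared words that make purely dictionary-based LID language-dependent, as the quoted passage notes.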