Code Mixing: A Challenge for Language Identification in the Language of Social Media

Barman, Utsab; Das, Amitava; Wagner, Joachim; Foster, Jennifer

doi:10.3115/v1/w14-3902

Cited by 201 publications

(159 citation statements)

References 20 publications

Supporting

Mentioning

152

Contrasting


“…Barman, Das et al [25] uses social media data for language identification in mixed script and concluded in favor of supervised learning against the dictionary-based approaches. Nagesh and Ravi [26] gave a way to perform language identification using multi class regression classifiers and was able to get nearly 54% accuracy.…”

Section: Balamurali and Joshi (mentioning)

confidence: 99%

Sentiment Analysis of Mixed Code for The Transliterated Hindi and Marathi Texts

Ansari, Govilkar. 2018. IJNLC


The evolution of information technology has led to the collection of large amounts of data, the volume of which has increased to the extent that the data produced in the last two years is greater than all the data ever recorded in human history. This has necessitated the use of machines to understand, interpret and apply data without manual involvement. Many of these texts are available in transliterated code-mixed form, which, owing to its complexity, is very difficult to analyze. Work in this area is progressing at a great pace, and this work hopes to push it further. The designed system classifies transliterated (Romanized) Hindi and Marathi documents automatically using supervised learning methods (KNN, Naïve Bayes and Support Vector Machine (SVM)) and ontology-based classification; the results are compared in order to decide which methodology is better suited to handling these documents. As we will see, the plain machine learning algorithms perform just as well as, and in many cases much better than, the more analytical approach.


“…In other research works, some ambiguity is left with regard to the words that are present in both English and Bengali either by removing them (Das and Gambäck, 2013) or by classifying them as mixed (Depending on suffixes or word-level mixing) (Barman et al, 2014). However, such ambiguity needs to be removed, if we are required to utilize such type of data for further analysis or use them for building models of sentiment and/or predictive analysis, since people generally use mixed or ambiguous words in some single language context as well, which is why they code-mix in the first place.…”

Section: Related Work (mentioning)

confidence: 99%

“…In both of the other research works mentioned, the groups composed their own corpus from a Facebook group and the posts and comments by members (Das and Gambäck, 2013; Barman et al, 2014). Both of the groups also use N-gram pruning and dictionary checks.…”

Section: Related Work (mentioning)

confidence: 99%


Unraveling the English-Bengali Code-Mixing Phenomenon

Chanda, Das, Mazumdar. 2016. Proceedings of the Second Workshop on Computational Approaches to Code Switching


Code-mixing is a prevalent phenomenon in modern day communication. Though several systems enjoy success in identifying a single language, identifying languages of words in code-mixed texts is a herculean task, more so in a social media context. This paper explores the English-Bengali code-mixing phenomenon and presents algorithms capable of identifying the language of every word to a reasonable accuracy in specific cases and the general case. We create and test a predictor-corrector model, develop a new code-mixed corpus from Facebook chat (made available for future research) and test and compare the efficiency of various machine learning algorithms (J48, IBk, Random Forest). The paper also seeks to remove the ambiguities in the token identification process.


“…(Sharma et al, 2016) addressed the problem of shallow parsing of Hindi-English code-mixed social media text and developed a system for Hindi-English code-mixed text that can identify the language of the words, normalize them to their standard forms, assign them their POS tag and segment into chunks. (Barman et al, 2014) addressed the problem of language identification on Bengali-Hindi-English Facebook comments. They annotated a corpus and achieved an accuracy of 95.76% using statistical models with monolingual dictionaries.…”

Section: Introduction (mentioning)

confidence: 99%

Corpus Creation and Emotion Prediction for Hindi-English Code-Mixed Social Media Text

Vijay, Bohra, Singh et al. 2018. Proceedings of the 2018 Conference of the North American Chapter Of the Association for Computational Linguistics: St


Emotion Prediction is a Natural Language Processing (NLP) task dealing with detection and classification of emotions in various monolingual and bilingual texts. While some work has been done on code-mixed social media text and in emotion prediction separately, our work is the first attempt which aims at identifying the emotion associated with Hindi-English code-mixed social media text. In this paper, we analyze the problem of emotion identification in code-mixed content and present a Hindi-English code-mixed corpus extracted from twitter and annotated with the associated emotion. For every tweet in the dataset, we annotate the source language of all the words present, and also the causal language of the expressed emotion. Finally, we propose a supervised classification system which uses various machine learning techniques for detecting the emotion associated with the text using a variety of character level, word level, and lexicon based features.

