Language Identification of Bengali-English Code-Mixed Data using Character & Phonetic based LSTM Models

Das, Sourya Dipta; Mandal, Soumil; Das, Dipankar

doi:10.1145/3368567.3368578

Cited by 3 publications

(7 citation statements)

References 5 publications

“…Another RNN variation technique, LSTM, has shown satisfactory performance in identifying Hindi-English and Bengali-English code-mixed text [27,52,54]. In [52], the LSTM architecture could give a high average F1 score of 93.4% and an average accuracy of 96.1% across the three classes.…”

Section: 1) Machine Learning Approach (mentioning)

confidence: 99%

“…The following we identified some non-standard words encountered from the investigated papers. We categorised the non-standard words into four types, such as non-standard spelling [7,15,56], abbreviated words [3,37,39,45,49,56,64], exaggerated words [3,7,27,39,45,47,49,50,51,64], and mixing characters with numbers or special characters [3,27,39,50]. Table 6 describes some examples of non-standard words found in code-mixed text LID.…”

Section: ) Non-standard Words (mentioning)

confidence: 99%

“…Non-standard spelling [7,56]: Prends or prenzz (friends), plis (please), kalo for 'kalau' (Indonesian language, meaning 'if' in English)
Mixing word and numeric or special characters [3,7,27,39,50]: ri8 (right), 2morrow (tomorrow), ni8t (night), orang2 (Indonesian language, meaning people in English)
Word exaggeration [3,7,27,39,45,47,49,50,51,64]: goood (good), Pleasssseee (please), cooool (cool), helloooo (hello)
Abbreviated words [3,39,45,49,56,64]: bght (brought or bought), tkt (ticket), flm (film), TC (take care)…”

Section: Type Of Non-standard Word Example (mentioning)

confidence: 99%

“…Apart from that, the lexical look-up or dictionary-based approach cannot cope with the presence of borrowed words or code-mixing [20]. Another problem is the failure to get context information due to ambiguity and irregular phonetic typing in the codemixed text [16,27,28].…”

Section: Introduction (mentioning)

confidence: 99%

A Systematic Review on Language Identification of Code-Mixed Text: Techniques, Data Availability, Challenges, and Framework Development

et al. 2022

The mix of native language with other languages (code-mixing) in social media has posed a severe challenge for language identification (LID) systems. It has encouraged research on code-mixed LID solutions. This study investigated the techniques, challenges, and dataset availability with corresponding quality criteria and developed a comprehensive framework for code-mixed LID. This study addressed four research issues to identify gaps and future work opportunities in tackling code-mixed LID challenges. Based on our analysis of reviewed studies, we outlined key points for future research in code-mixed LID. We demonstrated a taxonomy of applied techniques for code-mixed LID and highlighted the different technique variants. In code-mixed LID tasks, we discovered four significant challenges: ambiguity, lexical borrowing, non-standard words, and intra-word code-mixing. This systematic literature review recognised 32 code-mixed datasets available for LID. We proposed five features to describe the quality criteria dataset. The features are the number of instances or sentences, percentage of code-mixed types in the data, number of tokens, number of unique tokens, and average sentence length. Finally, we synthesised the methodologies and proposed a conceptual framework for subsequent studies through our literature analysis.
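The five descriptive features listed in the abstract above can be computed from a word-level annotated corpus. The following is a minimal sketch, not the review's own procedure: the function name `describe_dataset`, the input layout (sentences as lists of `(token, language_tag)` pairs), and the toy sentences are all assumptions made for illustration. In particular, "percentage of code-mixed types" is interpreted here as the share of sentences containing more than one language tag, which may differ from the authors' definition.

```python
def describe_dataset(sentences):
    """Summarise a corpus given as a list of sentences,
    each sentence being a list of (token, language_tag) pairs.
    The input layout is a hypothetical one chosen for this sketch."""
    n_sentences = len(sentences)
    tokens = [tok.lower() for sent in sentences for tok, _ in sent]
    # Assumption: a sentence is code-mixed if it carries more than one language tag.
    mixed = sum(1 for sent in sentences if len({tag for _, tag in sent}) > 1)
    return {
        "sentences": n_sentences,
        "code_mixed_pct": 100.0 * mixed / n_sentences if n_sentences else 0.0,
        "tokens": len(tokens),
        "unique_tokens": len(set(tokens)),
        "avg_sentence_length": len(tokens) / n_sentences if n_sentences else 0.0,
    }


# Toy, made-up corpus: one mixed sentence and one monolingual sentence.
toy = [
    [("ami", "bn"), ("office", "en"), ("jacchi", "bn")],
    [("see", "en"), ("you", "en"), ("soon", "en")],
]
stats = describe_dataset(toy)
print(stats)
```

Tokens are lower-cased before counting unique tokens; whether case folding is appropriate depends on the corpus and is a design choice of this sketch.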

“…Given an annotated corpus, the output of the language models is combined with other features to train a word-level language classifier. In addition to character n-gram information, phonetic information has also been used for word language identification in Das et al (2019).…”

Section: Related Studies (mentioning)

confidence: 99%

SwitchNet: Learning to switch for word-level language identification in code-mixed social media text

2021

Word-level language identification is an essential prerequisite for extracting useful information from code-mixed social media content. Previous studies in word-level language identification show two important observations. First, the local context is an important indicator of the language of a word when a word is valid in multiple languages. Second, considering the word in isolation from its context leads to more effective language classification when a word is borrowed or embedded into sentences of other languages. In this paper, we propose a framework for language identification that makes use of a dynamic switching mechanism for effective language classification of both words that are borrowed or embedded from other languages as well as words that are valid in multiple languages. For a given input, the proposed switching mechanism makes a dynamic decision to bias its prediction either towards the prediction obtained by the contextual information or that obtained by the word in isolation. In contrast to existing studies that rely upon large amounts of annotated data for robust performance in a multilingual environment, the proposed approach uses minimal annotated resources and no external resources, making it easily extendible to newer languages. Evaluation over a corpus of transliterated Facebook comments shows that the proposed approach outperforms its baseline counterparts: classification based on the contextual information, classification based on the word in isolation, as well as an ensemble of the two classifiers.
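One way to picture the switching idea described in the abstract above is as a gate that blends two probability distributions over languages, one from a context-based classifier and one from a word-in-isolation classifier. The sketch below is only an illustration under that assumption and is not the SwitchNet architecture: the function `switch_predict`, the scalar `gate`, and the example probabilities are hypothetical, and in a learned model the gate would be predicted per input rather than supplied by hand.

```python
def switch_predict(p_context, p_word, gate):
    """Blend two per-language probability distributions.

    gate close to 1.0 biases the result towards the contextual prediction,
    gate close to 0.0 towards the word-in-isolation prediction.
    All names here are hypothetical, chosen for this sketch.
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must be in [0, 1]")
    langs = set(p_context) | set(p_word)
    mixed = {
        lang: gate * p_context.get(lang, 0.0) + (1.0 - gate) * p_word.get(lang, 0.0)
        for lang in langs
    }
    # Return the most probable language together with the blended distribution.
    return max(mixed, key=mixed.get), mixed


# Made-up probabilities for a single word.
p_context = {"en": 0.8, "bn": 0.2}
p_word = {"en": 0.3, "bn": 0.7}

label_ctx, dist_ctx = switch_predict(p_context, p_word, gate=0.9)
label_iso, dist_iso = switch_predict(p_context, p_word, gate=0.1)
print(label_ctx, label_iso)
```

With the same two inputs, a high gate follows the contextual classifier and a low gate follows the isolated-word classifier, which is the behaviour the abstract attributes to the switching mechanism.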

Identification of monolingual and code-switch information from English-Kannada code-switch data

Chundi; Hulipalled; Simha. 2023. IJECE

Code-switching is a very common occurrence in social media communication, predominantly found in multilingual countries like India. Using more than one language in communication is known as code-switching or code-mixing. Some of the important applications of code-switching are machine translation (MT), shallow parsing, dialog systems, and semantic parsing. Identifying code-switch and monolingual information is useful for better communication in online networking websites. In this paper, we used a character level n-gram approach to identify monolingual and code-switch information from English-Kannada social media data. We compared various machine learning techniques such as naïve Bayes (NB), support vector classifier (SVC), logistic regression (LR) and neural network (NN) on English-Kannada code-switch (EKCS) data. The character level n-gram approach provides 1.8% to 4.1% of improvement in accuracy and 1.6% to 3.8% of improvement in F1-score. We also observed that the SVC and NN techniques performed best in terms of accuracy (97.9%) and F1-score (98%) with character level n-grams.
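Character level n-gram features of the kind mentioned in the abstract above can be extracted with a few lines of code. This is a minimal sketch rather than the cited paper's pipeline: the function `char_ngrams`, the `<` and `>` boundary markers, and the default n-gram range are assumptions chosen for illustration. The resulting counts could then be passed to a classifier such as NB, SVC, or LR.

```python
from collections import Counter


def char_ngrams(word, n_min=1, n_max=3):
    """Count character n-grams of a word for n in [n_min, n_max].

    The word is lower-cased and wrapped in '<' and '>' so that prefixes
    and suffixes become distinct n-grams (an assumption of this sketch).
    """
    padded = f"<{word.lower()}>"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


# Bigrams only, for a short example word.
bigrams = char_ngrams("ok", n_min=2, n_max=2)
print(dict(bigrams))
```

Boundary markers let a classifier distinguish, for example, a word-final "ng" from the same letters inside a word, which tends to be informative for transliterated text.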
