2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI)
DOI: 10.1109/icacci.2018.8554501
Word Level Language Identification in Code-Mixed Data using Word Embedding Methods for Indian Languages

Cited by 15 publications (10 citation statements)
References 3 publications
“…Generally speaking, the recognition performance is good; compared with previous studies on language recognition (Burget, Matejka, & Cernocky, 2006; Campbell, Richardson, & Reynolds, 2007; Dehak et al., 2009; Mukherjee et al., 2018; Chaitanya et al., 2018), the results obtained here are favourable. A likely reason is that the five languages selected in the experiment differ substantially in pronunciation, which makes them easier to distinguish.…”
Section: Results Analysis
confidence: 85%
“…Veena et al [47] utilised a linear kernel SVM classifier and could achieve an accuracy of 93% for word-level Malayalam-English and 95% for Tamil-English code-mixed LID. Chaitanya et al [48] incorporated several machine learning methods with Word2Vec embedding for Hindi-English. Based on their experiments, the SVM using Skip-gram reached the highest accuracy of 67.34%.…”
Section: 1) Machine Learning Approachmentioning
confidence: 99%
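The pipeline described in the statement above (Skip-gram Word2Vec features classified by a linear SVM for word-level language identification) can be illustrated with a minimal sketch. The toy code-mixed corpus, the "hi"/"en" tags, the gensim and scikit-learn calls, and all hyperparameters below are illustrative assumptions, not the setup reported by Chaitanya et al. [48].

```python
# Minimal sketch: word-level LID with Skip-gram Word2Vec features and a linear SVM.
# Corpus, labels, and hyperparameters are illustrative assumptions only.
from gensim.models import Word2Vec
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Tiny code-mixed corpus: each sentence is a list of tokens.
sentences = [
    ["mujhe", "yeh", "movie", "bahut", "pasand", "hai"],
    ["this", "song", "is", "ekdum", "awesome"],
]
# Word-level language tags aligned to the tokens: "hi" = Hindi, "en" = English.
labels = [
    ["hi", "hi", "en", "hi", "hi", "hi"],
    ["en", "en", "en", "hi", "en"],
]

# Train Skip-gram embeddings (sg=1) on the token sequences.
w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# One (embedding, tag) pair per token.
X = [w2v.wv[tok] for sent in sentences for tok in sent]
y = [tag for sent_tags in labels for tag in sent_tags]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="linear").fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

On a corpus this small the embeddings are essentially random; the sketch only shows how token-level vectors and tags line up before classification.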
“…In recent years, a great deal of research has been done on language identification, which is essentially the first step in NLP systems, although less work has addressed the detection of multiple Indian languages. Inumella Chaitanya et al. describe how common word-embedding models such as Continuous Bag of Words (CBOW) and Skip-gram can be used to generate embeddings that are fed to standard machine learning classifiers such as Support Vector Machines, Logistic Regression, and K-Nearest Neighbors, among other algorithms [8]. Anupam Jamatia et al. describe two models, a Bi-LSTM classifier and a Conditional Random Fields (CRF) classifier, and report that the Bi-LSTM classifier performs better.…”
Section: Related Work
confidence: 99%
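As a rough illustration of the comparison attributed to [8] above (CBOW versus Skip-gram embeddings fed to SVM, Logistic Regression, and K-Nearest Neighbors), the following sketch loops over both embedding variants and the three classifiers. The tiny corpus, tags, cross-validation setup, and hyperparameters are assumptions for demonstration only, not the experimental configuration of the cited work.

```python
# Minimal sketch: CBOW vs. Skip-gram features across classical classifiers
# for word-level LID. All data and settings are illustrative assumptions.
from gensim.models import Word2Vec
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

sentences = [
    ["kal", "movie", "dekhne", "chalein"],
    ["sure", "that", "plan", "sahi", "hai"],
    ["weekend", "pe", "free", "ho", "kya"],
]
labels = [
    ["hi", "en", "hi", "hi"],
    ["en", "en", "en", "hi", "hi"],
    ["en", "hi", "en", "hi", "hi"],
]
y = [tag for sent in labels for tag in sent]

classifiers = {
    "SVM": SVC(kernel="linear"),
    "LogReg": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}

# sg=0 trains CBOW, sg=1 trains Skip-gram.
for sg, name in [(0, "CBOW"), (1, "Skip-gram")]:
    w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=sg, epochs=50)
    X = [w2v.wv[tok] for sent in sentences for tok in sent]
    for clf_name, clf in classifiers.items():
        score = cross_val_score(clf, X, y, cv=3).mean()
        print(f"{name} + {clf_name}: {score:.2f}")
```

The nested loop mirrors the kind of embedding-by-classifier grid comparison the statement describes; real experiments would use a large code-mixed corpus and tuned hyperparameters.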