2018
DOI: 10.48550/arxiv.1803.03859
Preprint

Language Identification of Bengali-English Code-Mixed data using Character & Phonetic based LSTM Models

Abstract: Language identification of social media text still remains a challenging task due to properties like code-mixing and inconsistent phonetic transliterations. In this paper, we present a supervised learning approach for language identification at the word level of low resource Bengali-English code-mixed data taken from social media. We employ two methods of word encoding, namely character based and root phone based to train our deep LSTM models. Utilizing these two models we created two ensemble models using sta…

Cited by 5 publications (10 citation statements)
References 8 publications
“…We have different dictionaries defined for different languages, hence, language identification using the language id is really important so that we can load up the corresponding dictionary and replace the sub-word or character. We have used a character and phonetic based LSTM model [9] to obtain language ids for various tokens present in a sentence.…”
Section: Perturbation Techniques (mentioning)
confidence: 99%
“…Sub-Word Perturbation : We have used a pre-existing dictionary of character groups that can be replaced by phonetically similar characters [9]. Essentially, these groups consists of character uni, bi and trigrams which are phonetically similar and are interchangeably used in social media based on user backgrounds (e.g.…”
Section: Perturbation Techniques (mentioning)
confidence: 99%
“…They also experimented with several curriculum, or order in which the data is presented while training. Mandal et al (2018a) trained character and phonetic embedding models and then combined them to create an ensemble model. To the best of our knowledge, no work has been done where the amount of data used for building the supervised models is low.…”
Section: Related Work (mentioning)
confidence: 99%
“…As a baseline, we decided to use the character embedding based architecture described in Mandal et al (2018a), which uses stacked LSTMs (Hochreiter and Schmidhuber, 1997) We can see that the average accuracy achieved for Bn-En is 80.75% while that for Hi-En is 80.3%.…”
Section: Baseline System (mentioning)
confidence: 99%