Proceedings of ACL 2018, Student Research Workshop 2018
DOI: 10.18653/v1/p18-3008

Language Identification and Named Entity Recognition in Hinglish Code Mixed Tweets

Abstract: While growing code-mixed content on Online Social Networks (OSNs) provides a fertile ground for studying various aspects of code-mixing, the lack of automated text analysis tools renders such studies challenging. To meet this challenge, a family of tools for analyzing code-mixed data, such as language identifiers, part-of-speech (POS) taggers, and chunkers, has been developed. Named Entity Recognition (NER) is an important text analysis task which is not only informative by itself, but is also needed for downstream …

Cited by 40 publications (25 citation statements)
References 12 publications
“…Masked Language Modeling: We fine-tune mBERT on the masked language modeling objective, following Khanuja et al (2020b), on a combination of in-domain code-mixed movie scripts and publicly available datasets by Roy et al (2013) and Bhat et al (2018) to obtain modified mBERT (mod-mBERT) to be fine-tuned on the sentence-pair classification task. Transliteration: We perform token-level language identification and transliterate the detected Romanized Hindi words in CS-NLI to Devanagari script using the approach in Singh et al (2018), to enable mBERT to better understand them. Translation: Due to the difficulty in training code-mixed to monolingual translation models, we follow the approach in Dhar et al (2018) to obtain translations.…”
Section: Addressing Code-mixing (mentioning)
confidence: 99%
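The transliteration step quoted above (tag each token with a language, then rewrite the Romanized Hindi tokens in Devanagari) can be illustrated with a minimal sketch. This is not the approach of Singh et al (2018), which the excerpt does not detail: the toy lexicon, the function names identify_language and transliterate_hindi, and the dictionary-lookup tagging rule are all assumptions made for illustration.

```python
# Hypothetical toy lexicon mapping Romanized Hindi words to Devanagari.
# A real system would use a trained language identifier and a
# transliteration model; this lookup table is an assumption for illustration.
HINDI_ROMAN_TO_DEVANAGARI = {
    "mujhe": "मुझे",
    "yeh": "यह",
    "bahut": "बहुत",
    "pasand": "पसंद",
    "hai": "है",
}


def identify_language(token: str) -> str:
    """Tag one token: 'hi' if in the toy lexicon, 'en' if alphabetic, else 'other'."""
    word = token.lower()
    if word in HINDI_ROMAN_TO_DEVANAGARI:
        return "hi"
    if word.isalpha():
        return "en"
    return "other"


def transliterate_hindi(sentence: str) -> str:
    """Replace tokens tagged 'hi' with Devanagari; leave all other tokens unchanged."""
    output_tokens = []
    for token in sentence.split():
        if identify_language(token) == "hi":
            output_tokens.append(HINDI_ROMAN_TO_DEVANAGARI[token.lower()])
        else:
            output_tokens.append(token)
    return " ".join(output_tokens)


if __name__ == "__main__":
    print(transliterate_hindi("mujhe yeh movie bahut pasand hai"))
```

Tagging at the token level matters because a code-mixed sentence switches language within a single utterance, so a sentence-level label would leave either the English or the Hindi words mishandled.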
“…We use our system to back-transliterate the Hindi-English corpora from the LinCE benchmark. The NER corpus is from Singh et al (2018a) and has 2,079 tweets while the POS tagging corpus is from Singh et al (2018b) and has 1,489 tweets. Some statistics about the datasets are presented in Table 7.…”
Section: Released Datasets (mentioning)
confidence: 99%
“…Bhargava et al (2016) proposed an algorithm which uses a hybrid approach, combining a dictionary with supervised classification, for identifying entities in code-mixed text of Indian languages such as Hindi-English and Tamil-English. Nelakuditi et al (2016) reported work on annotating code-mixed English-Telugu data collected from the social media site Facebook and creating automatic POS taggers for this corpus. Singh et al (2018a) presented an exploration of automatic NER of Hindi-English code-mixed data, and Singh et al (2018b) presented a corpus for NER in Hindi-English code-mixed text along with experiments on their machine learning models. To the best of our knowledge the corpus we created is the first Telugu-English code-mixed corpus with named entity tags.…”
Section: Background and Related Work (mentioning)
confidence: 99%