Named Entity Recognition for Hindi-English Code-Mixed Social Media Text

Singh, Vinay; Vijay, Deepanshu; Akhtar, Syed Sarfaraz; Shrivastava, Manish

doi:10.18653/v1/w18-2405

Cited by 51 publications

(23 citation statements)

References 16 publications


“…They used CRF as classifier for their NER task. Singh et al presented a named entity recognition system for Hindi-English codemixed social media text (twitter) using word, character and lexical features [53]. Sabty et al proposed a NER system for identifying NEs from Arabic-English Code-Mixed Data [54].…”

Section: Related Work (mentioning)

confidence: 99%

Word Embedding and String-Matching Techniques for Automobile Entity Name Identification from Web Reviews

Maity, Das, Majumder, et al. 2021

ICST Transactions on Scalable Information Systems


With the huge popularity of the Internet, information on a wide range of domains circulates across social media platforms. Extracting this information for use in natural language processing applications requires identifying names first. This study identifies automobile names in noisy web reviews using two widely used machine learning algorithms, Conditional Random Field and Support Vector Machine. The accuracy of machine learning classifiers depends heavily on the size and quality of the training data, which was prepared manually from a discussion forum corpus; because that task is time consuming and laborious, word embedding is adopted to mitigate it. Although word embedding improves the system's performance, it cannot spot the noisy names that occur in web reviews. A gazetteer-based string matching technique is therefore proposed; it recognizes a new set of noisy automobile entities, resulting in a considerable improvement in accuracy.
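The gazetteer-based string matching mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the gazetteer entries, the similarity threshold, and the function names are all assumptions made for illustration, and the similarity measure is simply the one provided by Python's standard `difflib` module.

```python
from difflib import SequenceMatcher

# Hypothetical gazetteer of automobile names; the paper's actual list is not
# reproduced on this page.
GAZETTEER = ["swift", "innova", "scorpio", "fortuner"]


def best_gazetteer_match(token, gazetteer=GAZETTEER, threshold=0.8):
    """Return the closest gazetteer entry, or None if nothing is similar enough.

    The threshold of 0.8 is an assumed value, not one taken from the paper.
    """
    token = token.lower()
    best_name, best_score = None, 0.0
    for name in gazetteer:
        # ratio() is 2*M/T: M = matched characters, T = total characters.
        score = SequenceMatcher(None, token, name).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


def tag_review(text):
    """Pair each whitespace-separated token with its gazetteer match (or None)."""
    return [(tok, best_gazetteer_match(tok)) for tok in text.split()]
```

Under these assumptions, a misspelled token such as `swft` is close enough to `swift` to be matched, while an ordinary word such as `the` is not. A real system would also need multi-word names and a tuned threshold.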


“…Gupta et al (2014) introduced the concept of Mixed-Script Information Retrieval and the problems posed by transliterated content such as spelling variations etc. There has been a surge of data set creation for code-mixed data (Bhat et al, 2017; and application based tools such as question classification (Raghavi et al, 2015), named-entity recognition (Singh et al, 2018), sentiment analysis (Prabhu et al, 2016;Ghosh et al, 2017) and so on. We built our corpus on syntactic information obtained from dependency labels.…”

Section: Background and Related Work (mentioning)

confidence: 99%

A Dataset for Semantic Role Labelling of Hindi-English Code-Mixed Tweets

Pal, Sharma 2019

Proceedings of the 13th Linguistic Annotation Workshop


We present a data set of 1460 Hindi-English code-mixed tweets consisting of 20,949 tokens labelled with Proposition Bank labels marking their semantic roles. We created verb frames for complex predicates present in the corpus and formulated mappings from Paninian dependency labels to Proposition Bank labels. With the help of these mappings and the dependency tree, we propose a baseline rule based system for Semantic Role Labelling of Hindi-English code-mixed data. We obtain an accuracy of 96.74% for Argument Identification and are able to further classify 73.93% of the labels correctly. While there is relevant ongoing research on Semantic Role Labelling (SRL) and on building tools for code-mixed social media data, this is the first attempt at labelling semantic roles in Hindi-English codemixed data, to the best of our knowledge.


“…Bhargava et al (2016) proposed an algorithm which uses a hybrid approach of a dictionary cum supervised classification approach for identifying entities in Code Mixed Text of Indian Languages such as Hindi-English and Tamil-English. Nelakuditi et al (2016) reported work on annotating code mixed English-Telugu data collected from social media site Facebook and creating automatic POS Taggers for this corpus, Singh et al (2018a) presented an exploration of automatic NER of Hindi-English code-mixed data, Singh et al (2018b) presented a corpus for NER in Hindi-English Code-Mixed along with experiments on their machine learning models. To the best of our knowledge the corpus we created is the first Telugu-English code-mixed corpus with named entity tags.…”

Section: T1 (mentioning)

confidence: 99%

Corpus Creation and Analysis for Named Entity Recognition in Telugu-English Code-Mixed Social Media Data

Srirangam, Reddy, Singh, et al. 2019

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Self Cite


Named Entity Recognition (NER) is one of the important tasks in Natural Language Processing (NLP) and is also a subtask of Information Extraction. In this paper we present our work on NER in Telugu-English code-mixed social media data. Code-mixing, a progeny of multilingualism, is a way in which multilingual people express themselves on social media by using linguistic units from different languages within a sentence or speech context. Entity extraction from social media data such as tweets (Twitter) is in general difficult due to its informal nature; code-mixed data further complicates the problem because it is informal, unstructured and incomplete. We present a Telugu-English code-mixed corpus with the corresponding named entity tags. The named entities used to tag the data are Person ('Per'), Organization ('Org') and Location ('Loc'). We experimented with the machine learning models Conditional Random Fields (CRFs), Decision Trees and Bidirectional LSTMs on our corpus, which resulted in F1-scores of 0.96, 0.94 and 0.95 respectively.
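For readers unfamiliar with the F1-score reported in this abstract, the following sketch shows one common way to compute it at the token level. It is an illustration only, not the authors' evaluation procedure: the flat tag set with an `Other` label for non-entity tokens, the micro-averaging over entity tokens, and the function name `token_f1` are all assumptions.

```python
def token_f1(gold, pred, outside="Other"):
    """Micro-averaged token-level F1 over entity tags (everything but `outside`).

    `gold` and `pred` are equal-length lists of tags such as 'Per', 'Org',
    'Loc' or 'Other'. The 'Other' label is an assumed convention.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # A true positive is an entity token whose predicted tag equals the gold tag.
    true_pos = sum(1 for g, p in zip(gold, pred) if g == p and g != outside)
    pred_pos = sum(1 for p in pred if p != outside)
    gold_pos = sum(1 for g in gold if g != outside)

    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, with gold tags `["Per", "Other", "Loc", "Org"]` and predictions `["Per", "Other", "Other", "Org"]`, precision is 1.0, recall is 2/3, and the resulting F1 is 0.8. Entity-level (span-based) scoring is stricter and may give different numbers.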
