A Twitter Corpus for Hindi-English Code Mixed POS Tagging

Singh, Kunwarjeet; Sen, Indira; Kumaraguru, Ponnurangam

doi:10.18653/v1/w18-3503

Cited by 43 publications

(36 citation statements)

References 12 publications


“…We use our system to backtransliterate the Hindi English corpora from the LinCE benchmark. The NER corpus is from Singh et al (2018a) and has 2,079 tweets while the POS tagging corpus is from Singh et al (2018b) and has 1,489 tweets. Some statistics about the datasets are presented in Table 7.…”

Section: Released Datasets

mentioning

confidence: 99%

Quasi Bidirectional Encoder Representations from Transformers for Word Sense Disambiguation

Bevilacqua

Navigli

2019

Proceedings - Natural Language Processing in a Deep Learning World


While contextualized embeddings have produced performance breakthroughs in many Natural Language Processing (NLP) tasks, Word Sense Disambiguation (WSD) has not benefited from them yet. In this paper, we introduce QBERT, a Transformer-based architecture for contextualized embeddings which makes use of a coattentive layer to produce more deeply bidirectional representations, better-fitting for the WSD task. As a result, we are able to train a WSD system that beats the state of the art on the concatenation of all evaluation datasets by over 3 points, also outperforming a comparable model using ELMo.


“…Different from the previous approaches, Aguilar and Solorio (2020) use language identification to create a code-switching ELMo from English ELMo (Peters et al, 2018). Later they show the effectiveness of their CS-ELMo by achieving state-of-the-art POS tagging results on a Hindi-English dataset (Singh et al, 2018). They also employ multi-task learning where their auxiliary task is language identification with a simplified LID tag set for LID, POS, and NER tagging.…”

Section: Related Work

mentioning

confidence: 99%

Benchmark Dataset for Propaganda Detection in Czech Newspaper Texts

Horák

Baisa

Herman

2019

Proceedings - Natural Language Processing in a Deep Learning World


Propaganda of various pressure groups ranging from big economies to ideological blocks is often presented in a form of objective newspaper texts. However, the real objectivity is here shaded with the support of imbalanced views and distorted attitudes by means of various manipulative stylistic techniques. In the project of Manipulative Propaganda Techniques in the Age of Internet, a new resource for automatic analysis of stylistic mechanisms for influencing the readers' opinion is developed. In its current version, the resource consists of 7,494 newspaper articles from four selected Czech digital news servers annotated for the presence of specific manipulative techniques. In this paper, we present the current state of the annotations and describe the structure of the dataset in detail. We also offer an evaluation of bag-of-words classification algorithms for the annotated manipulative techniques.


“…We evaluate our models on five downstream tasks in the LinCE Benchmark (Aguilar et al, 2020a). We choose three named entity recognition (NER) tasks, Hindi-English (HIN-ENG), Spanish-English (SPA-ENG) (Aguilar et al, 2018) and Modern Standard Arabic (MSA-EA) (Aguilar et al, 2018), and two part-of-speech (POS) tagging tasks, Hindi-English (HIN-ENG) (Singh et al, 2018b) and Spanish-English (SPA-ENG) (Soto and Hirschberg, 2017). We apply Roman-to-Devanagari transliteration on the Hindi-English datasets since the multilingual models are trained with data using that form.…”

Section: Datasets

mentioning

confidence: 99%

Parallel Sentence Retrieval From Comparable Corpora for Biomedical Text Simplification

Cardon

Grabar

2019

Proceedings - Natural Language Processing in a Deep Learning World


Parallel sentences provide semantically similar information which can vary on a given dimension, such as language or register. Parallel sentences with register variation (like expert and non-expert documents) can be exploited for the automatic text simplification. The aim of automatic text simplification is to better access and understand a given information. In the biomedical field, simplification may permit patients to understand medical and health texts. Yet, there is currently no such available resources. We propose to exploit comparable corpora which are distinguished by their registers (specialized and simplified versions) to detect and align parallel sentences. These corpora are in French and are related to the biomedical area. Manually created reference data show 0.76 inter-annotator agreement. Our purpose is to state whether a given pair of specialized and simplified sentences is parallel and can be aligned or not. We treat this task as binary classification (alignment/nonalignment). We perform experiments with a controlled ratio of imbalance and on the highly unbalanced real data. Our results show that the method we present here can be used to automatically generate a corpus of parallel sentences from our comparable corpus.
