Joining Hands: Exploiting Monolingual Treebanks for Parsing of Code-mixing Data

Bhat, Irshad Ahmad; Bhat, Riyaz Ahmad; Shrivastava, Manish; Sharma, Dipti Misra

doi:10.18653/v1/e17-2052

Cited by 25 publications

(28 citation statements)

References 18 publications

“…We use the Universal Dependencies' Hindi-English codemixed data set (Bhat et al, 2017) to test the model's ability to label code-mixed data. This dataset is based on code-switching tweets of Hindi and English multilingual speakers.…”

Section: Codemixed Input (mentioning)

confidence: 99%

Small and Practical BERT Models for Sequence Labeling

Tsai, Riesa, Johnson, et al. 2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

106

We propose a practical scheme to train a single multilingual sequence labeling model that yields state-of-the-art results and is small and fast enough to run on a single CPU. Starting from a public multilingual BERT checkpoint, our final model is 6x smaller and 27x faster, and has higher accuracy than a state-of-the-art multilingual baseline. We show that our model performs especially well on low-resource languages, and works on codemixed input text without being explicitly trained on codemixed examples. We showcase the effectiveness of our method by reporting on part-of-speech tagging and morphological prediction on 70 treebanks and 48 languages.

“…However, while some classes of dependency structures tolerating certain crossings have a very good empirical coverage [31, 42, 43, 44], these proposals still face counterexamples that fall outside the restrictions [45, 46, 47].…”

Section: A Minimization Of Crossings (mentioning)

confidence: 99%

Scarcity of crossing dependencies: A direct outcome of a specific constraint?

Gómez-Rodríguez, Ferrer-i-Cancho 2017

Phys. Rev. E

The structure of a sentence can be represented as a network where vertices are words and edges indicate syntactic dependencies. Interestingly, crossing syntactic dependencies have been observed to be infrequent in human languages. This leads to the question of whether the scarcity of crossings in languages arises from an independent and specific constraint on crossings. We provide statistical evidence suggesting that this is not the case, as the proportion of dependency crossings of sentences from a wide range of languages can be accurately estimated by a simple predictor based on a null hypothesis on the local probability that two dependencies cross given their lengths. The relative error of this predictor never exceeds 5% on average, whereas the error of a baseline predictor assuming a random ordering of the words of a sentence is at least 6 times greater. Our results suggest that the low frequency of crossings in natural languages originates neither from hidden knowledge of language nor from the undesirability of crossings per se, but is a mere side effect of the principle of dependency length minimization.

“…The Hindi-English Code switching treebank is based on CS tweets of Hindi and English multilingual speakers (mostly Indian) (Bhat et al, 2017). The treebank is manually annotated using UD scheme.…”

Section: Hin-eng (mentioning)

confidence: 99%

Leveraging Pretrained Word Embeddings for Part-of-Speech Tagging of Code Switching Data

AlGhamdi, Diab 2019

Proceedings of the Sixth Workshop On

Linguistic Code Switching (CS) is a phenomenon that occurs when multilingual speakers alternate between two or more languages/dialects within a single conversation. Processing CS data is especially challenging in intra-sentential data given state-of-the-art monolingual NLP technologies, since such technologies are geared toward the processing of one language at a time. In this paper, we address the problem of Part-of-Speech tagging (POS) in the context of linguistic code switching (CS). We explore leveraging multiple neural network architectures to measure the impact of different pre-trained embedding methods on POS tagging CS data. We investigate the landscape in four CS language pairs, Spanish-English, Hindi-English, Modern Standard Arabic-Egyptian Arabic dialect (MSA-EGY), and Modern Standard Arabic-Levantine Arabic dialect (MSA-LEV). Our results show that multilingual embedding (e.g., MSA-EGY and MSA-LEV) helps closely related languages (EGY/LEV) but adds noise to the languages that are distant (SPA/HIN). Finally, we show that our proposed models outperform state-of-the-art CS taggers for the MSA-EGY language pair.
