KMI-Panlingua-IITKGP @SIGTYP2020: Exploring rules and hybrid systems for automatic prediction of typological features

Kumar, Ritesh; Alok, Deepak; Bansal, Akanksha; Lahiri, Basudev; Ojha, Atul Kr.

doi:10.18653/v1/2020.sigtyp-1.2

Cited by 6 publications

(8 citation statements)

References 1 publication

(1 reference statement)


“…Past: Considering the past years of research, typological databases have mainly been used in the context of feature predictions. Methodologically speaking, features are typically predicted in the context of other features, and other languages (Daumé III and Campbell, 2007; Teh et al., 2009; Berzak et al., 2014; Malaviya et al., 2018; Bjerva et al., 2019c,a, 2020, 2019b; Vastl et al., 2020; Jäger, 2020; Choudhary, 2020; Gutkin and Sproat, 2020; Kumar et al., 2020). That is to say, given a language l ∈ L, where L is the set of all languages contained in a specific database, and the features of that language F_l, the setup is typically to attempt to predict some subset of features f ⊂ F_l, based on the remaining features F_l \ f.…”

Section: Model Accuracy (mentioning)

confidence: 99%

The Past, Present, and Future of Typological Databases in NLP

Baylor, Ploeger, Bjerva, 2023

Findings of the Association for Computational Linguistics: EMNLP 2023


Typological information has the potential to be beneficial in the development of NLP models, particularly for low-resource languages. Unfortunately, current large-scale typological databases, notably WALS and Grambank, are inconsistent both with each other and with other sources of typological information, such as linguistic grammars. Some of these inconsistencies stem from coding errors or linguistic variation, but many of the disagreements are due to the discrete categorical nature of these databases. We shed light on this issue by systematically exploring disagreements across typological databases and resources, and their uses in NLP, covering the past and present. We next investigate the future of such work, offering an argument that a continuous view of typological features is clearly beneficial, echoing recommendations from linguistics. We propose that such a view of typology has significant potential in the future, including in language modeling in low-resource scenarios.
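The prediction setup quoted in the citation statement above (predict a subset f of a language's features from the remaining features) can be illustrated with a minimal Python sketch. The function name `split_features`, the feature names, and the values are all hypothetical and chosen only for illustration; they are not taken from WALS, Grambank, or any cited system.

```python
def split_features(features, held_out):
    """Split one language's features into (observed, targets).

    `features` plays the role of F_l, `held_out` names the subset f to
    predict, and `observed` is the remainder (F_l minus f).
    """
    missing = set(held_out) - set(features)
    if missing:
        raise KeyError(f"held-out features not present: {sorted(missing)}")
    targets = {name: features[name] for name in held_out}
    observed = {
        name: value for name, value in features.items() if name not in held_out
    }
    return observed, targets


# Hypothetical feature values for a single language (illustrative only).
F_l = {
    "subject_object_verb_order": "SOV",
    "adposition_order": "postpositions",
    "adjective_noun_order": "adjective-noun",
}

observed, targets = split_features(F_l, {"adposition_order"})
print(observed)  # features a model may condition on
print(targets)   # features the model must predict
```

A predictor in this setup receives `observed` (and possibly the features of other languages) and is scored on how well it recovers `targets`.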


“…Panlingua (Kumar et al, 2020), a team effort across KMI, Panlingua, and IIT KGP, submitted constrained systems from three approaches: two rule-based systems (one statistical, and one frequency-based baseline) and one hybrid system. Their baseline is similar to the organizers' frequency-base baseline, except that it produces the most frequent value for a feature within a genus if available, backing off to language family, and then the overall most-frequent value.…”

Section: Submissions (mentioning)

confidence: 99%

SIGTYP 2020 Shared Task: Prediction of Typological Features

Bjerva, Salesky, Mielke, et al. 2020

Proceedings of the Second Workshop on Computational Research in Linguistic Typology


Typological knowledge bases (KBs) such as WALS (Dryer and Haspelmath, 2013) contain information about linguistic properties of the world's languages. They have been shown to be useful for downstream applications, including cross-lingual transfer learning and linguistic probing. A major drawback hampering broader adoption of typological KBs is that they are sparsely populated, in the sense that most languages only have annotations for some features, and skewed, in that few features have wide coverage. As typological features often correlate with one another, it is possible to predict them and thus automatically populate typological KBs, which is also the focus of this shared task. Overall, the task attracted 8 submissions from 5 teams, out of which the most successful methods make use of such feature correlations. However, our error analysis reveals that even the strongest submitted systems struggle with predicting feature values for languages where few features are known.
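The back-off baseline described in the citation statement above (most frequent value within the genus, then the language family, then overall) can be sketched roughly as follows. This is an illustration under stated assumptions, not the team's actual implementation: the function names `train_backoff` and `predict_backoff`, the row layout, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def train_backoff(rows):
    """Count feature values per genus, per family, and overall.

    `rows` is an iterable of (genus, family, feature, value) tuples;
    this layout is an assumption made for the sketch.
    """
    by_genus = defaultdict(Counter)
    by_family = defaultdict(Counter)
    overall = defaultdict(Counter)
    for genus, family, feature, value in rows:
        by_genus[(genus, feature)][value] += 1
        by_family[(family, feature)][value] += 1
        overall[feature][value] += 1
    return by_genus, by_family, overall


def predict_backoff(model, genus, family, feature):
    """Return the most frequent value, backing off genus -> family -> overall."""
    by_genus, by_family, overall = model
    candidates = (
        by_genus.get((genus, feature)),
        by_family.get((family, feature)),
        overall.get(feature),
    )
    for counts in candidates:
        if counts:  # skip levels with no observations
            return counts.most_common(1)[0][0]
    return None  # feature never observed in training data


# Toy training data with made-up genus and family labels (illustrative only).
rows = [
    ("GenusA", "FamilyX", "word_order", "SOV"),
    ("GenusA", "FamilyX", "word_order", "SOV"),
    ("GenusB", "FamilyX", "word_order", "SVO"),
    ("GenusC", "FamilyY", "word_order", "SVO"),
    ("GenusC", "FamilyY", "word_order", "SVO"),
]
model = train_backoff(rows)

print(predict_backoff(model, "GenusA", "FamilyX", "word_order"))  # genus level
print(predict_backoff(model, "GenusD", "FamilyX", "word_order"))  # family level
print(predict_backoff(model, "GenusE", "FamilyZ", "word_order"))  # overall level
```

Keeping one counter per back-off level makes the fallback order explicit; ties at any level are resolved by whichever value `Counter.most_common` lists first, which a real system would likely want to handle deliberately.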


“…These remarks may or may not be based on an individual's protected status or protected activities such as race, color, religion, sex, national origin, sexual orientation, or gender identity of an individual [8]. By considering abusive language as an umbrella term, that covers different types of online abuse, extensive studies have been done to address hate speech [3, 8-10, 13, 15, 16], offensive language [1,2,12], cyberbullying [33,34], aggression detection [11,29,34,35], and toxicity detection [36].…”

Section: Offensive Language Detection Techniques (mentioning)

confidence: 99%

“…Although many efforts have been dedicated to address the problem of hate speech and offensive language detection in high-resource languages such as English [8,9,50], recently concerns have been raised about other languages as well. Emerging recent shared tasks and academic events such as Kaggle's Toxic Comment Classification Challenge in English, Automatic Misogyny Identification (AMI) at IberEval [17] and EVALITA [4] including Spanish and Italian languages respectively, identification of offensive language at GermEval [2,51] in German language, identification of offensive language at SemEval-2019 [50] for English and SemEval-2020 [1] for Arabic, Danish, English, Greek, and Turkish languages, proceedings of the Workshop on Trolling, Aggression and Cyberbullying Workshops [34,52], and proceedings of the Workshop on Abusive Language Online [5][6][7] shows the raising concerns towards hate speech and offensive language detection in different languages. These events and shared tasks mainly focused on different types of this phenomenon such as hate, offensive, misogyny, aggression, etc.…”

Section: Language-specific Abusive Language Detection (mentioning)

confidence: 99%

“…Recently great efforts have been taken to investigate the issue of hate speech detection and offensive language identification for different languages in social media; including various competitions such as Kaggle's Toxic Comment Classification Challenge (https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/), Jigsaw Multilingual Toxic Comment Classification (https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification), and conferences and workshops such as SemEval [1], GermEval [2], HatEval [3], EVALITA hate-speech detection task [4], the first [5], second [6], and third [7] editions of the Workshop on Abusive Language Online (https://sites.google.com/view/alw3/), the Second Workshop on Trolling, Aggression and Cyberbullying (https://sites.google.com/view/trac2/home [34]), etc. Furthermore, a great interest has been evidenced in providing annotated corpora in different aspects of offensive language such as Racism and Sexism [8], Hate and Offensive [9], Hate and NoHate [10], Non-aggressive, Overtly-aggressive or Covertly-aggressive [11], Misogynous and Non-misogynous [4], and Not Offensive and Offensive [12].…”
