Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-NUT 2021), 2021
DOI: 10.18653/v1/2021.wnut-1.47
Can Character-based Language Models Improve Downstream Task Performances in Low-Resource and Noisy Language Scenarios?

Abstract: Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In th…
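The abstract's premise — that character-based language models suit noisy, non-standardized text like NArabizi — can be illustrated with a small sketch (not from the paper): word-level tokenization treats each spelling variant as a distinct, likely out-of-vocabulary token, while character-level units are largely shared across variants. The example words are hypothetical NArabizi-style spellings chosen for illustration.

```python
# Illustrative sketch: character n-gram overlap between two spelling variants
# of the same word. A word-level vocabulary sees two unrelated tokens, while
# most character trigrams are shared, which is the intuition behind
# character-based models being more robust to noisy orthography.

def char_ngrams(word: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a word, padded with '#'."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two sets."""
    return len(a & b) / len(a | b)

# Two hypothetical NArabizi-style spelling variants ('h' vs the digit '7').
v1, v2 = "mlih", "mli7"

# Word-level view: the strings differ, so a word vocabulary sees no match.
word_level_match = (v1 == v2)

# Character-level view: a sizable fraction of trigrams is shared.
overlap = jaccard(char_ngrams(v1), char_ngrams(v2))

print(word_level_match)      # False
print(round(overlap, 2))     # 0.33 -- 2 shared trigrams out of 6 total
```

The same effect motivates character-aware architectures such as CharacterBERT, discussed in the citation statements below.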

Cited by 4 publications (5 citation statements)
References 20 publications
“…The DziriBERT model exhibits the best performance; however, CharacterBERT delivers competitive results while being trained on a mere 7.5% of the data used for training DziriBERT. This observation is consistent with the conclusions drawn by Riabi et al. (2021).…”
Section: New Results For UD
confidence: 94%
“…In Appendix A, we present the results of all our experiments using the CharacterBERT model trained by Riabi et al. (2021). We observe a heterogeneous improvement in performance, with predominantly better outcomes for our CharacterBERT.…”
Section: Impact Of The Pre-training Corpus
confidence: 96%
“…Attia et al. (2019) find that POS tags provide a strong signal for identifying code-switching. Just as code-switching is a major characteristic of AJA, it also characterizes other varieties of Algerian Arabic, and poses a challenge to Arabic NLP research (Riabi et al., 2021).…”
Section: Code-Switching
confidence: 99%
“…and wordplay-based tasks that require attention to character-level manipulations (Riabi et al., 2021; El Boukkouri, 2020; Clark et al., 2021).…”
Section: Introduction
confidence: 99%