Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL) 2019
DOI: 10.18653/v1/k19-1021

On the Importance of Subword Information for Morphological Tasks in Truly Low-Resource Languages

Abstract: Recent work has validated the importance of subword information for word representation learning. Since subwords increase parameter sharing ability in neural models, their value should be even more pronounced in low-data regimes. In this work, we therefore provide a comprehensive analysis focused on the usefulness of subwords for word representation learning in truly low-resource scenarios and for three representative morphological tasks: fine-grained entity typing, morphological tagging, and named entity recognition…

Cited by 13 publications (6 citation statements) · References 37 publications
“…The majority of the world's languages are synthetic, meaning they have rich morphology. As a result, modeling morphological inflection computationally can have a significant impact on downstream quality, not only in analysis tasks such as named entity recognition and morphological analysis (Zhu et al., 2019), but also for language generation systems for morphologically-rich languages.…”
Section: Introduction
confidence: 99%
“…to construct the final word vector, and conclude that the best performing configuration is highly language and task dependent. A subsequent work (Zhu et al., 2019a) focuses on LRLs and finds the combination of BPE and addition largely robust, although they once again note language-dependent variability. They also find that encoding "affix" information with positional embeddings is beneficial, hinting that the embedding space may distinguish the importance of different kinds of subwords.…”
Section: Subwords in Embedding Spaces
confidence: 95%
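To make the composition strategy described in that statement concrete, here is a minimal sketch (Python with NumPy) of building a word vector by adding BPE subword embeddings, optionally adding a positional embedding per slot so prefix-like and suffix-like subwords are distinguished. The vocabulary, dimensions, and the three-slot positional scheme are illustrative assumptions, not the cited paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50  # embedding dimensionality (assumed)

# Toy subword vocabulary with random vectors standing in for trained embeddings.
subword_emb = {sw: rng.normal(size=DIM) for sw in ["un", "break", "able"]}

# One positional embedding per slot (first, middle, last) -- an assumed scheme
# for marking roughly prefix-, stem-, and suffix-like positions.
pos_emb = {pos: rng.normal(size=DIM) for pos in ["first", "middle", "last"]}

def compose(subwords, use_position=False):
    """Sum subword embeddings; optionally add a positional embedding per slot."""
    vec = np.zeros(DIM)
    for i, sw in enumerate(subwords):
        vec += subword_emb[sw]
        if use_position:
            pos = "first" if i == 0 else "last" if i == len(subwords) - 1 else "middle"
            vec += pos_emb[pos]
    return vec

word_vec = compose(["un", "break", "able"], use_position=True)
print(word_vec.shape)  # (50,)
```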
“…BERT, for example, has been proven sensitive to (non-adversarial) human noise (Sun et al., 2020; Kumar et al., 2020). Examples of models that can be more resilient to noise include typological language models (Gerz et al., 2018; Ponti et al., 2019), sub-word or character-level language models (Kim et al., 2016; Zhu et al., 2019; Ma et al., 2020), byte-pair encoding (Sennrich et al., 2016), and their extension in recent tokenization-free models (Heinzerling and Strube, 2018; Clark et al., 2021; Xue et al., 2021), yet their use as noise-resilient language models remains to be fully assessed.…”
Section: Related Work
confidence: 99%
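For reference, the byte-pair encoding cited above (Sennrich et al., 2016) learns a subword vocabulary by repeatedly merging the most frequent adjacent symbol pair in the training corpus. The following is a minimal sketch of that merge-learning loop; the toy corpus and merge count are assumptions for illustration only.

```python
from collections import Counter

def learn_bpe(words, num_merges=10):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    # Represent each word type as a tuple of symbols plus an end-of-word marker.
    vocab = Counter({tuple(w) + ("</w>",): c for w, c in Counter(words).items()})
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair fused into one symbol.
        merged = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = count
        vocab = merged
    return merges

print(learn_bpe(["low", "lower", "lowest", "low"], num_merges=5))
```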