Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.279

Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words

Abstract: How does the input segmentation of pretrained language models (PLMs) affect their interpretations of complex words? We present the first study investigating this question, taking BERT as the example PLM and focusing on its semantic representations of English derivatives. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the be…
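
To make the segmentation question concrete, the sketch below (an illustration, not the paper's code) contrasts BERT's WordPiece segmentation of a derivative with a morpheme-level split. It assumes the HuggingFace transformers library and the bert-base-uncased checkpoint; the exact subwords depend on that checkpoint's vocabulary.

```python
# Illustrative sketch (not the paper's code): compare BERT's WordPiece
# segmentation of a derivative with a morpheme-level split, assuming the
# HuggingFace `transformers` library and the bert-base-uncased checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

word = "superbizarre"

# WordPiece segments greedily by vocabulary, not by morphology;
# the result is checkpoint-dependent, e.g. something like ['super', '##biz', '##arre'].
print(tokenizer.tokenize(word))

# A derivationally informed segmentation keeps the prefix and the base intact,
# letting the model reuse what it already represents for "bizarre".
print(tokenizer.tokenize("super bizarre"))
```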

Cited by 18 publications (11 citation statements). References 82 publications (72 reference statements).

Citation statements:

“…In particular, character-level models capture complex structure in the space of words, pseudowords, and randomly generated n-grams. These findings are consistent with work suggesting that character-level and morpheme-aware representations are rich in meaning, even compared to word or sub-word models (Al-Rfou et al., 2019; El Boukkouri et al., 2020; Ma et al., 2020; Hofmann et al., 2020, 2021).…”
Section: Discussion (supporting)
confidence: 89%
“…As described above, state-of-the-art language models serve as a tool to study meaning as it emerges through the distributional hypothesis paradigm. Existing work on the analysis of Transformers and BERT-based models has explored themes we are interested in, such as semantics (Ethayarajh, 2019), syntax (Goldberg, 2019), morphology (Hofmann et al., 2020, 2021), and the structure of language (Jawahar et al., 2019). However, all of this work has limited itself to the focus of extant words, largely due to the word- and sub-word-based nature of these models.…”
Section: Character-level Language Models For Information Analysis (mentioning)
confidence: 99%
“…We see that all of the algorithms have a similar number of prefixes in their vocabularies, which suggests the tokenisation algorithm plays an important role, as performance differences on handling prefixes are large (Table 2) despite similar vocabularies. This is supported by work by Hofmann et al. (2021), who find that employing a fixed vocabulary in a morphologically correct way leads to performance improvements. We also see, however, that Unigram has fewer suffixes in its vocabulary than default Unigram, which reflects the performance difference seen in Table 2.…”
Section: Intrinsic Evaluation: Morphological Correctness (mentioning)
confidence: 74%
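
The vocabulary comparison this statement describes can be approximated in a few lines; the sketch below (assuming the HuggingFace transformers library and the bert-base-uncased checkpoint, and not the cited paper's evaluation code) checks which common derivational prefixes exist as word-initial entries in a WordPiece vocabulary.

```python
# Rough sketch (not the cited paper's evaluation code): check which common
# derivational prefixes are available as word-initial units in a WordPiece
# vocabulary, assuming the HuggingFace `transformers` library.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
vocab = tokenizer.get_vocab()  # maps token string -> vocabulary id

# Small, hand-picked list of English derivational prefixes for illustration.
prefixes = ["un", "re", "pre", "anti", "super", "over", "under", "non"]

# Word-initial WordPiece tokens carry no '##' continuation marker, so a prefix
# is only usable at the start of a word if the bare string is in the vocabulary.
present = [p for p in prefixes if p in vocab]
print(f"{len(present)}/{len(prefixes)} prefixes usable word-initially:", present)
```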
“…An important difference to the rest of this survey is that such an approach has the potential to be even stronger, as foregoing purely concatenative segmentation allows one to "segment" for example the word "hoping" as "hope V.PTCP;PRS" or "ate" as "eat PST," allowing sharing of information with other forms in the respective paradigm. The benefit of such an approach is also shown by Hofmann et al. (2021), who observe that undoing derivational processes by splitting words into morphemes before tokenizing can improve sentiment and topicality classification results.…”
Section: Manually Constructed Linguistic Analyzers (mentioning)
confidence: 93%
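
As a rough illustration of "splitting words into morphemes before tokenizing", the sketch below pre-splits derivatives with a toy rule-based helper before handing the text to a standard tokenizer; derivational_split is a hypothetical stand-in, not the method of Hofmann et al. (2021).

```python
# Rough sketch of morpheme-aware preprocessing in the spirit described above:
# undo derivation before tokenization so base words reach the model intact.
# `derivational_split` is a hypothetical toy helper, not the paper's method.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

KNOWN_PREFIXES = ("super", "over", "anti", "un", "re")

def derivational_split(word: str) -> list[str]:
    """Peel off one known prefix if the remainder is long enough to be a base."""
    for prefix in KNOWN_PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            return [prefix, word[len(prefix):]]
    return [word]

text = "a superbizarre but unforgettable plot"
pre_split = " ".join(piece for w in text.split() for piece in derivational_split(w))

# The downstream classifier now sees morphologically meaningful units rather
# than arbitrary WordPiece fragments of the unsplit derivative.
print(tokenizer.tokenize(pre_split))
```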