Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.316
DagoBERT: Generating Derivational Morphology with a Pretrained Language Model

Abstract: Can pretrained language models (PLMs) generate derivationally complex words? We present the first study investigating this question, taking BERT as the example PLM. We examine BERT's derivational capabilities in different settings, ranging from using the unmodified pretrained model to full finetuning. Our best model, DagoBERT (Derivationally and generatively optimized BERT), clearly outperforms the previous state of the art in derivation generation (DG). Furthermore, our experiments show that the input segment…

Cited by 16 publications (20 citation statements)
References 33 publications (24 reference statements)
“…It is divided into smaller communities, so-called subreddits, which have been shown to be a rich source of derivationally complex words (Hofmann et al., 2020c). Hofmann et al. (2020a) have published a dataset of derivatives found on Reddit annotated with the subreddits in which they occur (https://github.com/valentinhofmann/dagobert). Inspired by a content-based subreddit categorization scheme (https://www.reddit.com/r/TheoryOfReddit/comments/1f7hqc/the_200_most_active_subreddits_categorized_by), we define two groups of subreddits, an entertainment set (ent) consisting of the subreddits anime, DestinyTheGame, funny, Games, gaming, leagueoflegends, movies, Music, pics, and videos, as well as a discussion set (dis) consisting of the subreddits askscience, atheism, conspiracy, news, Libertarian, politics, science, technology, TwoXChromosomes, and worldnews, and extract all derivationally complex words occurring in them.…”
Section: Data (mentioning)
confidence: 99%
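To make the grouping step concrete, here is a minimal Python sketch under assumed inputs: the `(word, subreddit)` pair format and the `split_by_group` helper are hypothetical stand-ins for the released dataset's actual schema, not code from the citing paper.

```python
# Hypothetical sketch of the subreddit grouping described in the citation above.
# The (word, subreddit) input format is an assumption, not the dataset's real schema.

ENT_SUBREDDITS = {
    "anime", "DestinyTheGame", "funny", "Games", "gaming",
    "leagueoflegends", "movies", "Music", "pics", "videos",
}
DIS_SUBREDDITS = {
    "askscience", "atheism", "conspiracy", "news", "Libertarian",
    "politics", "science", "technology", "TwoXChromosomes", "worldnews",
}

def split_by_group(derivatives):
    """Split derivationally complex words into the ent and dis groups."""
    ent, dis = set(), set()
    for word, subreddit in derivatives:
        if subreddit in ENT_SUBREDDITS:
            ent.add(word)
        elif subreddit in DIS_SUBREDDITS:
            dis.add(word)
    return ent, dis

# Example usage with toy data:
ent_words, dis_words = split_by_group([("overhyped", "movies"), ("antivaxxer", "science")])
```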
“…The specific BERT variant we use is BERT BASE (uncased) (Devlin et al., 2019). For the derivational segmentation, we follow previous work by Hofmann et al. (2020a) in separating stem and prefixes by a hyphen. We further follow Casanueva et al. (2020) in mean-pooling the output representations for all subwords, excluding BERT's special tokens.…”
Section: Models (mentioning)
confidence: 99%
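As an illustration of the pooling step described above, the following is a hedged sketch using the Hugging Face transformers API; it is not the citing authors' implementation, and the example input is arbitrary.

```python
# Sketch: mean-pool BERT-base (uncased) subword outputs, excluding special tokens.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def mean_pooled_representation(text: str) -> torch.Tensor:
    enc = tokenizer(text, return_tensors="pt", return_special_tokens_mask=True)
    special = enc.pop("special_tokens_mask")             # 1 for [CLS] and [SEP]
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state          # (1, seq_len, 768)
    keep = (special == 0).unsqueeze(-1).float()          # zero out special tokens
    return (hidden * keep).sum(dim=1) / keep.sum(dim=1)  # (1, 768) mean over subwords

# Hyphen-separated prefix and stem, in the spirit of the segmentation described above.
vec = mean_pooled_representation("hyper - active")
```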
“…Bostrom and Durrett (2020) argue that byte-pair encoding expresses English morphology less faithfully than unigram segmentation, and show a performance improvement on downstream tasks with a unigram-segmentation-based BERT model. Hofmann et al. (2020) show that BERT can be fine-tuned with a classification layer to complete a derivational morphology cloze task, finding that imposing morpheme boundaries with hyphenation on the input side ultimately improved BERT's performance at this task. Finally, Edmiston (2020) investigates several monolingual BERT models for representations of morphological information.…”
Section: BERT and Linguistic Competence (mentioning)
confidence: 96%
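For a quick look at what hyphenation on the input side does to BERT's subword segmentation, here is a small sketch assuming the standard bert-base-uncased WordPiece tokenizer; the example word is arbitrary and the exact token splits are printed rather than claimed.

```python
# Sketch: compare WordPiece segmentation of a derivative with and without an
# explicit hyphen at the prefix-stem boundary (outputs are printed, not asserted).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for form in ["supercool", "super - cool"]:
    print(f"{form!r:15} -> {tokenizer.tokenize(form)}")
```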