Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.434

Robust Backed-off Estimation of Out-of-Vocabulary Embeddings

Abstract: Out-of-vocabulary (OOV) words cause serious problems when solving natural language tasks with a neural network. Existing approaches to this problem resort to using subwords, which are shorter and more ambiguous units than words, in order to represent an OOV word as a bag of subwords. In this study, inspired by the processes for creating words from known words, we propose a robust method of estimating OOV word embeddings by referring to pre-trained word embeddings for known words with similar surfaces to target O…
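The abstract describes backing off to known words whose surface forms resemble the OOV word. A minimal sketch of that idea (not the paper's actual algorithm) is to rank known words by character-level similarity and average the embeddings of the closest ones; the word list, toy vectors, and the `SequenceMatcher` similarity are all assumptions for illustration:

```python
from difflib import SequenceMatcher

# Hypothetical pre-trained embeddings for known words (toy 3-d values).
known_embeddings = {
    "playing":  [0.9, 0.1, 0.3],
    "played":   [0.8, 0.2, 0.3],
    "swimming": [0.1, 0.9, 0.5],
}

def surface_similarity(a, b):
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def backoff_embedding(oov_word, embeddings, k=2):
    """Average the embeddings of the k known words whose surface
    forms are most similar to the OOV word."""
    ranked = sorted(embeddings,
                    key=lambda w: surface_similarity(oov_word, w),
                    reverse=True)
    neighbors = ranked[:k]
    dim = len(next(iter(embeddings.values())))
    return [sum(embeddings[w][i] for w in neighbors) / len(neighbors)
            for i in range(dim)]

# For "plays", the surface-closest known words are "played" and
# "playing", so the estimate lands near their average.
vec = backoff_embedding("plays", known_embeddings)
```

The actual paper additionally handles robustness to noisy matches; this sketch only shows the surface-similarity backoff intuition.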

Cited by 8 publications (8 citation statements)
References 35 publications (55 reference statements)
“…2) Since microblog posts are short and noisy, we practically need more than one post for typing. In fact, the accuracy of Twitter NER is very low (29.7%) for out-of-vocabulary entities (Fukuda et al., 2020).…”
Section: Task Settings
confidence: 96%
“…Embedding Generator Our work is also related to studies on generating embeddings for out-of-vocabulary (OOV) words. In this context, researchers use embeddings of characters or subwords to predict those of unseen words (Pinter et al., 2017; Sasaki et al., 2019; Fukuda et al., 2020). For example, one line of work trains an embedding generator by reconstructing the original representation of each word from its bag of subwords.…”
Section: Related Work
confidence: 99%
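The bag-of-subwords scheme mentioned in the statement above can be sketched as follows: an unseen word is decomposed into character n-grams (with boundary markers, as in fastText-style models), and its embedding is the mean of the known n-gram embeddings. The n-gram table and its toy 2-d vectors are assumptions for illustration, not any cited system's parameters:

```python
def char_ngrams(word, n=3):
    """Character n-grams of the word with boundary markers '<' and '>'."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Hypothetical subword embedding table (toy 2-d vectors).
subword_vecs = {
    "<pl": [1.0, 0.0],
    "pla": [0.0, 1.0],
    "lay": [0.5, 0.5],
    "ay>": [0.2, 0.8],
}

def bag_of_subwords_embedding(word, table, dim=2):
    """Mean of the embeddings of the word's known character n-grams;
    zero vector if none of its n-grams are in the table."""
    grams = [g for g in char_ngrams(word) if g in table]
    if not grams:
        return [0.0] * dim
    return [sum(table[g][i] for g in grams) / len(grams)
            for i in range(dim)]
```

A generator in the cited sense would be trained so that this composed vector reconstructs the word's original pre-trained embedding; the sketch shows only the composition step.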
“…Sasaki et al. (2019) progressively improve the generator using an attention mechanism. Fukuda et al. (2020) further leverage similar words to enhance this procedure. Our work significantly differs from the above studies in two aspects.…”
Section: Related Work
confidence: 99%
“…In [23], an iterative mimicking framework that strikes a good balance between word-level and character-level representations of words was proposed to better capture syntactic and semantic similarities. In [24], a method was proposed to estimate OOV embeddings by referring to pre-trained word embeddings of known words with surfaces similar to the target OOVs. In [25], the embeddings of OOVs were determined by their spelling and the contexts in which they appear.…”
Section: Related Work
confidence: 99%