Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019)
DOI: 10.18653/v1/n19-1352

How Large a Vocabulary Does Text Classification Need? A Variational Approach to Vocabulary Selection

Abstract: With the rapid development of deep learning, deep neural networks have been widely adopted in many real-life natural language applications. Under deep neural networks, a predefined vocabulary is required to vectorize text inputs. The canonical approach to selecting the predefined vocabulary is based on word frequency, where a threshold is chosen to cut off the long-tail distribution. However, we observed that such a simple approach can easily lead to an under-sized or over-sized vocabulary. Ther…

Cited by 18 publications (18 citation statements); references 19 publications.
“…A conventional cut-off used in text analysis is to include the first 10,000 types by frequency and prune the rest from the vocabulary. Recent research has specifically investigated this conventional number, finding that a vocabulary size of less than 10,000 (e.g., 5,000) can achieve the same predictive performance in text classification as 10,000, using a formal statistical approach rather than a heuristic (Chen et al., 2019). Therefore, using a fixed, absolute threshold for vocabulary size is likely too arbitrary.…”
Section: Building a Vocabulary
confidence: 99%
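The frequency cut-off discussed in this statement is easy to make concrete. The Python sketch below is illustrative only and not taken from the cited papers; the toy corpus, the <unk> token, and the candidate vocabulary sizes are assumptions chosen for demonstration. It builds a vocabulary from the most frequent types and reports how much of the corpus each cut-off covers.

```python
from collections import Counter

def build_vocab(tokenized_docs, max_vocab_size):
    """Frequency cut-off: keep the max_vocab_size most frequent types."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # Reserve index 0 for unknown / out-of-vocabulary tokens.
    vocab = {"<unk>": 0}
    for tok, _ in counts.most_common(max_vocab_size):
        vocab[tok] = len(vocab)
    return vocab, counts

def coverage(counts, vocab):
    """Fraction of corpus tokens that fall inside the vocabulary."""
    total = sum(counts.values())
    covered = sum(c for tok, c in counts.items() if tok in vocab)
    return covered / total

# Toy usage with a hypothetical pre-tokenized corpus.
docs = [["the", "movie", "was", "great"], ["the", "plot", "was", "thin"]]
for size in (5_000, 10_000):
    vocab, counts = build_vocab(docs, size)
    print(size, f"coverage={coverage(counts, vocab):.3f}")
```

In practice one would sweep `max_vocab_size` on a validation set and pick the smallest size whose classification accuracy matches the full vocabulary, which is the kind of question the cited work formalizes.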
“…Vocabulary selection methods and subword- and character-level embeddings: earlier work examined selecting a vocabulary for an NLP task. Some alternatives drop out words (Chen et al., 2019), whereas character-level methods attempt to represent the input text at the level of individual characters (Kim et al., 2015; Ling et al., 2015), and subword methods attempt to tokenize words into parts of words in a more efficient way (Sennrich et al., 2015; Kudo and Richardson, 2018).…”
Section: Related Work
confidence: 99%
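To make the distinction drawn in this excerpt concrete, the short sketch below contrasts word-level, character-level, and subword tokenization of the same string. It is a minimal illustration: the example string is an assumption, and the subword segmentation is written by hand to mimic the kind of split a trained BPE or unigram model might produce, not the output of an actual tokenizer.

```python
text = "unbelievable performance"

# Word-level: one vocabulary entry per whitespace token; rare or unseen
# words must be mapped to an <unk> symbol.
word_tokens = text.split()

# Character-level (Kim et al., 2015; Ling et al., 2015): the vocabulary is
# just the character set, so nothing is ever out of vocabulary.
char_tokens = list(text)

# Subword-level (Sennrich et al., 2015; Kudo and Richardson, 2018):
# a hand-written segmentation of the kind a trained model might produce;
# real splits depend on the learned merge table and are an assumption here.
subword_tokens = ["un", "believ", "able", "perform", "ance"]

print("words:    ", word_tokens)
print("chars:    ", char_tokens)
print("subwords: ", subword_tokens)
```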
“…This reduction in vocabulary size has many advantages. Models with a reduced vocabulary are more easily interpretable and achieve increased transparency (Adadi and Berrada, 2018; Samek et al., 2019), require less memory, can be used in resource-constrained settings, and are less prone to overfitting (Sennrich et al., 2015; Shi and Knight, 2017; L'Hostis et al., 2016; Chen et al., 2019). However, reducing the vocabulary size with a heuristic such as frequency is often not optimal.…”
Section: Introduction
confidence: 99%
“…size) is far too high. [20] describes the limitations of an unusually large vocabulary, which leads to poor performance.…”
Section: Hateful
confidence: 99%