Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019)
DOI: 10.18653/v1/n19-1352

How Large a Vocabulary Does Text Classification Need? A Variational Approach to Vocabulary Selection

Abstract: With the rapid development of deep learning, deep neural networks have been widely adopted in many real-life natural language applications. Under deep neural networks, a predefined vocabulary is required to vectorize text inputs. The canonical approach to selecting the predefined vocabulary is based on word frequency, where a threshold is chosen to cut off the long-tail distribution. However, we observed that such a simple approach can easily lead to an under-sized or over-sized vocabulary. Ther…

Cited by 18 publications (18 citation statements); references 19 publications.
“…A conventional cut-off used in text analysis is to include the first 10,000 types by frequency and prune the rest from the vocabulary. Recent research has specifically investigated this conventional number, finding that a vocabulary size of less than 10,000 (e.g., 5,000) can achieve the same predictive performance in text classification as 10,000, using a formal statistical approach rather than a heuristic (Chen et al., 2019). Therefore, using a fixed, absolute threshold for vocabulary size is likely too arbitrary.…”
Section: Building a Vocabulary
confidence: 99%
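The frequency cut-off discussed in this statement is easy to make concrete. The Python sketch below is illustrative only and not taken from the cited papers; the toy corpus, the <unk> token, and the candidate vocabulary sizes are assumptions chosen for demonstration. It builds a vocabulary from the most frequent types and reports how much of the corpus each cut-off covers.

```python
from collections import Counter

def build_vocab(tokenized_docs, max_vocab_size):
    """Frequency cut-off: keep the max_vocab_size most frequent types."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # Reserve index 0 for unknown / out-of-vocabulary tokens.
    vocab = {"<unk>": 0}
    for tok, _ in counts.most_common(max_vocab_size):
        vocab[tok] = len(vocab)
    return vocab, counts

def coverage(counts, vocab):
    """Fraction of corpus tokens that fall inside the vocabulary."""
    total = sum(counts.values())
    covered = sum(c for tok, c in counts.items() if tok in vocab)
    return covered / total

# Toy usage with a hypothetical pre-tokenized corpus.
docs = [["the", "movie", "was", "great"], ["the", "plot", "was", "thin"]]
for size in (5_000, 10_000):
    vocab, counts = build_vocab(docs, size)
    print(size, f"coverage={coverage(counts, vocab):.3f}")
```

In practice one would sweep `max_vocab_size` on a validation set and pick the smallest size whose classification accuracy matches the full vocabulary, which is the kind of question the cited work formalizes.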
“…Vocabulary selection methods and subword- and character-level embeddings: earlier work examined selecting a vocabulary for an NLP task. Some alternatives drop out words (Chen et al., 2019), whereas character-level methods attempt to represent the input text at the level of individual characters (Kim et al., 2015; Ling et al., 2015), and subword methods attempt to tokenize words into parts of words in a more efficient way (Sennrich et al., 2015; Kudo and Richardson, 2018).…”
Section: Related Work
confidence: 99%
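To make the distinction drawn in this excerpt concrete, the short sketch below contrasts word-level, character-level, and subword tokenization of the same string. It is a minimal illustration: the example string is an assumption, and the subword segmentation is written by hand to mimic the kind of split a trained BPE or unigram model might produce, not the output of an actual tokenizer.

```python
text = "unbelievable performance"

# Word-level: one vocabulary entry per whitespace token; rare or unseen
# words must be mapped to an <unk> symbol.
word_tokens = text.split()

# Character-level (Kim et al., 2015; Ling et al., 2015): the vocabulary is
# just the character set, so nothing is ever out of vocabulary.
char_tokens = list(text)

# Subword-level (Sennrich et al., 2015; Kudo and Richardson, 2018):
# a hand-written segmentation of the kind a trained model might produce;
# real splits depend on the learned merge table and are an assumption here.
subword_tokens = ["un", "believ", "able", "perform", "ance"]

print("words:    ", word_tokens)
print("chars:    ", char_tokens)
print("subwords: ", subword_tokens)
```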
“…This reduction in vocabulary size has many advantages. Models with a reduced vocabulary are more easily interpretable and achieve increased transparency (Adadi and Berrada, 2018; Samek et al., 2019), require less memory, can be used in resource-constrained settings, and are less prone to overfitting (Sennrich et al., 2015; Shi and Knight, 2017; L'Hostis et al., 2016; Chen et al., 2019). However, reducing the vocabulary size with a heuristic such as frequency is often not optimal.…”
Section: Introduction
confidence: 99%
“…size) is far too high. [20] describes the limitations of an unusually large vocabulary, which leads to poor performance.…”
Section: Hateful
confidence: 99%