Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.103

Scaling Hidden Markov Language Models

Abstract: The hidden Markov model (HMM) is a fundamental tool for sequence modeling that cleanly separates the hidden state from the emission structure. However, this separation makes it difficult to fit HMMs to large datasets in modern NLP, and they have fallen out of use due to very poor performance compared to fully observed models. This work revisits the challenge of scaling HMMs to language modeling datasets, taking ideas from recent approaches to neural modeling. We propose methods for scaling HMMs to massive stat…
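For intuition about what fitting an HMM to a language modeling dataset computes, the sketch below shows the standard dense forward algorithm for a sequence's marginal likelihood. This is generic textbook code, not the paper's implementation; the O(K^2)-per-token update it contains is exactly the cost that becomes prohibitive as the number of hidden states K grows, which is the scaling problem the paper addresses.

```python
import numpy as np
from scipy.special import logsumexp

def hmm_log_likelihood(tokens, log_pi, log_A, log_B):
    """Log marginal likelihood of a token sequence under a dense HMM.

    log_pi : (K,)   log initial state distribution
    log_A  : (K, K) log transitions, log_A[i, j] = log p(z_t = j | z_{t-1} = i)
    log_B  : (K, V) log emissions,   log_B[j, w] = log p(x_t = w | z_t = j)
    """
    # Forward recursion in log space: alpha[j] = log p(x_1..x_t, z_t = j).
    alpha = log_pi + log_B[:, tokens[0]]
    for w in tokens[1:]:
        # Dense O(K^2) update per token: the bottleneck that scaling methods
        # (sparse/blocked structure, compact neural parameterization) target.
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[:, w]
    return logsumexp(alpha)
```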

Cited by 21 publications (26 citation statements) · References 17 publications

“…When evaluating our model with a large number of symbols, we find that only a small fraction of the symbols are predicted in the parse trees (for example, when our model uses 250 nonterminals, only tens of them are found in the predicted parse trees of the test corpus). We expect that our models can benefit from regularization techniques such as state dropout (Chiu and Rush, 2020).…”
Section: Discussion
confidence: 99%
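The state dropout mentioned in the statement above regularizes a large-state model by training each batch on only a sampled subset of hidden states. The sketch below is a minimal, generic rendering of that idea (sample a keep-set, restrict and renormalize the dense parameters); the exact formulation in Chiu and Rush (2020) may differ.

```python
import numpy as np
from scipy.special import logsumexp

def state_dropout_restrict(log_A, log_B, keep_prob, rng):
    """Keep a random subset of hidden states for one training batch.

    log_A : (K, K) log transition matrix, log_B : (K, V) log emission matrix.
    Returns the kept state indices and the restricted, renormalized parameters.
    """
    K = log_A.shape[0]
    keep = rng.random(K) < keep_prob
    keep[rng.integers(K)] = True          # guarantee at least one surviving state
    kept = np.flatnonzero(keep)

    sub_A = log_A[np.ix_(kept, kept)]
    # Renormalize each row so the restricted transitions remain a distribution.
    sub_A = sub_A - logsumexp(sub_A, axis=1, keepdims=True)
    return kept, sub_A, log_B[kept]
```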
“…For example, the best model from Petrov et al. (2006) contains over 1000 nonterminal and preterminal symbols. We are also motivated by the recent work of Buhai et al. (2019), who show that when learning latent variable models, increasing the number of hidden states is often helpful; and by Chiu and Rush (2020), who show that a neural hidden Markov model with up to 2^16 hidden states can achieve surprisingly good performance in language modeling. A major challenge in employing a large number of nonterminal and preterminal symbols is that representing and parsing with a PCFG has computational complexity that is cubic in the number of symbols.…”
Section: Introduction
confidence: 99%
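For reference, the costs behind the observation above are the standard dense dynamic-programming complexities (textbook figures, not numbers taken from the quoted papers):

```latex
% Dense dynamic programs over a length-T sentence:
%   HMM forward algorithm with K hidden states:            O(T K^2)
%   PCFG inside/CKY algorithm with |N| nonterminals (CNF):  O(T^3 |N|^3)
\underbrace{O(T\,K^{2})}_{\text{HMM forward}}
\qquad \text{vs.} \qquad
\underbrace{O(T^{3}\,|\mathcal{N}|^{3})}_{\text{PCFG inside (CKY)}}
```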
“…For example, Dai et al. (2017), among others, incorporate recurrent units into the hidden semi-Markov model (HSMM) to segment and label high-dimensional time series, while other work learns discrete template structures for conditional text generation using a neuralized HSMM. Wessels and Omlin (2000) and Chiu and Rush (2020) factorize HMMs with neural networks to scale them and improve their sequence modeling capacity. The work most related to ours leverages a neural HMM for sequence labeling (Tran et al., 2016), which maximizes the marginal likelihood of the observations.…”
Section: Related Work
confidence: 99%
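One generic way to "factorize an HMM with neural networks," as described in the statement above, is to produce the transition and emission matrices from learned state and word embeddings, so the parameter count grows with the embedding dimension rather than with K·K or K·V. The PyTorch sketch below illustrates that idea only; it is not the specific parameterization used by any of the cited papers.

```python
import torch
import torch.nn as nn

class FactorizedHMMParams(nn.Module):
    """Embedding-based (neural) parameterization of dense HMM parameters."""

    def __init__(self, num_states: int, vocab_size: int, dim: int):
        super().__init__()
        self.prev_state = nn.Embedding(num_states, dim)  # represents z_{t-1}
        self.next_state = nn.Embedding(num_states, dim)  # represents z_t
        self.word = nn.Embedding(vocab_size, dim)        # represents x_t

    def log_transition(self) -> torch.Tensor:
        # (K, K) log p(z_t | z_{t-1}) from dot products of state embeddings.
        scores = self.prev_state.weight @ self.next_state.weight.T
        return torch.log_softmax(scores, dim=-1)

    def log_emission(self) -> torch.Tensor:
        # (K, V) log p(x_t | z_t) from dot products of state and word embeddings.
        scores = self.next_state.weight @ self.word.weight.T
        return torch.log_softmax(scores, dim=-1)
```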
“…The constrained local attention in Transformer-C is adopted at all layers of models such as Longformer (Beltagy et al., 2020) and Big Bird (Zaheer et al., 2020) due to its sparsity. Our work conceptually resembles that of Chiu and Rush (2020), who modernize HMM language models, as well as simple RNN-based language models (Merity et al., 2018). Our linguistic analysis is inspired by experiments from Khandelwal et al. (2018).…”
Section: Related Work
confidence: 99%