Incorporating Named Entity Information into Neural Machine Translation (2020)
DOI: 10.1007/978-3-030-60450-9_31

Cited by 11 publications (17 citation statements)
References 14 publications
“…Bidirectional contextual representations like BERT come at the expense of being "true" language models $P_{\mathrm{LM}}(W)$, as there may appear no way to generate text (sampling) or produce sentence probabilities (density estimation) from these models. This handicapped their use in generative tasks, where they at best served to bootstrap encoder-decoder models (Clinchant et al., 2019; Zhu et al., 2020) or unidirectional LMs.…”
Section: Pseudolikelihood Estimation (mentioning, confidence: 99%)
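As background for the pseudolikelihood estimation referenced above: a masked LM can still assign a sentence score via a pseudo-log-likelihood (PLL) that sums masked-token conditionals in place of a left-to-right factorization. This is a sketch of the general idea, not necessarily the exact formulation used by the citing paper:

$$\mathrm{PLL}(W) = \sum_{t=1}^{|W|} \log P_{\mathrm{MLM}}\left(w_t \mid W_{\setminus t}\right)$$

where $W_{\setminus t}$ denotes the sentence with token $w_t$ replaced by [MASK].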
“…Existing uses of pretrained MLMs in sequence-to-sequence models for automatic speech recognition (ASR) or neural machine translation (NMT) involve integrating their weights (Clinchant et al., 2019) or representations (Zhu et al., 2020) into the encoder and/or decoder during training. In contrast, we train a sequence model independently, then rescore its n-best outputs with an existing MLM.…”
Section: Introduction (mentioning, confidence: 99%)
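The n-best rescoring described in the statement above can be illustrated with the PLL score defined earlier. This is a minimal sketch assuming a HuggingFace BERT checkpoint; the hypothesis fields (`text`, `seq_score`) and the interpolation weight `lam` are hypothetical names, and this is not the citing paper's implementation:

```python
# Minimal sketch: rescore n-best hypotheses from an independently trained
# sequence model with a masked LM's pseudo-log-likelihood (PLL).
# Model name, hypothesis fields, and interpolation weight are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
mlm.eval()

def pll(sentence: str) -> float:
    """Sum of log P(w_t | sentence with w_t masked), one forward pass per token."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    score = 0.0
    with torch.no_grad():
        for t in range(1, ids.size(0) - 1):           # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[t] = tokenizer.mask_token_id
            logits = mlm(masked.unsqueeze(0)).logits[0, t]
            score += torch.log_softmax(logits, dim=-1)[ids[t]].item()
    return score

def rescore(nbest, lam=0.5):
    """Pick the hypothesis maximizing a mix of the sequence model's score and PLL."""
    return max(nbest, key=lambda h: (1 - lam) * h["seq_score"] + lam * pll(h["text"]))
```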
“…The original usage of BERT mainly focused on NLP tasks, spanning token-level and sequence-level classification tasks, including question answering [9,10], document summarization [11,12], information retrieval [13,14], and machine translation [15,16], just to name a few. There have also been attempts to combine BERT with ASR, including rescoring [17,18] or generating soft labels for training [19].…”
Section: BERT (mentioning, confidence: 99%)
“…Unfortunately, we did not observe good performance. For the second strategy, following the practice of [9], we use BERT to extract context-aware embeddings and fuse them into each layer of the transformer encoder and decoder via an attention mechanism.…”
Section: Neural ITN (mentioning, confidence: 99%)
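The fusion strategy quoted above (attending to BERT's context-aware embeddings inside each transformer layer) can be sketched roughly as follows. The dimensions, the projection of BERT states, the averaging of the two attention streams, and the layer-norm placement are all illustrative assumptions rather than the exact architecture of the citing paper or of [9]:

```python
# Rough sketch of a "BERT-fused" encoder layer: each layer runs standard
# self-attention plus an extra attention over frozen BERT embeddings, then
# averages the two streams before the feed-forward block.
import torch
import torch.nn as nn

class BertFusedEncoderLayer(nn.Module):
    def __init__(self, d_model=512, d_bert=768, nhead=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.bert_proj = nn.Linear(d_bert, d_model)    # map BERT states to model width
        self.bert_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, bert_states):
        # x: (batch, src_len, d_model); bert_states: (batch, bert_len, d_bert)
        b = self.bert_proj(bert_states)
        h_self, _ = self.self_attn(x, x, x)            # attend to the layer's own states
        h_bert, _ = self.bert_attn(x, b, b)            # attend to the BERT context
        x = self.norm1(x + 0.5 * (h_self + h_bert))    # fuse by simple averaging (assumption)
        return self.norm2(x + self.ffn(x))

# Example with random tensors:
# layer = BertFusedEncoderLayer()
# out = layer(torch.randn(2, 10, 512), torch.randn(2, 12, 768))
```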
“…While RNNs are powerful for sequence-to-sequence tasks, transformer-based models [8] offer pretraining abilities using vast amounts of data. However, incorporating pretrained models is not trivial and is often specific to the task [9].…”
Section: Introduction (mentioning, confidence: 99%)