2017
DOI: 10.1515/pralin-2017-0035

NMTPY: A Flexible Toolkit for Advanced Neural Machine Translation Systems

Abstract: In this paper, we present nmtpy, a flexible Python toolkit based on Theano for training Neural Machine Translation and other neural sequence-to-sequence architectures. nmtpy decouples the specification of a network from the training and inference utilities to simplify the addition of a new architecture and reduce the amount of boilerplate code to be written. nmtpy has been used for LIUM's top-ranked submissions to the WMT Multimodal Machine Translation and News Translation tasks in 2016 and 2017.
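The central design point of the abstract, keeping the architecture definition separate from the generic training and inference machinery, can be illustrated with a minimal Python sketch. This is not nmtpy's actual API; the names Seq2SeqModel, ToyCopyModel, and train are illustrative assumptions only.

```python
# Illustrative sketch (not nmtpy's real API): how a toolkit can decouple the
# architecture definition from generic training code, so that adding a new
# model only means writing a new subclass.
from abc import ABC, abstractmethod
import random


class Seq2SeqModel(ABC):
    """Minimal model interface the generic trainer relies on."""

    @abstractmethod
    def forward(self, src, trg):
        """Return a scalar loss for one (source, target) pair."""

    @abstractmethod
    def update(self, lr):
        """Apply one parameter update after forward()."""


class ToyCopyModel(Seq2SeqModel):
    """Hypothetical architecture: the 'loss' is just the token-level mismatch."""

    def __init__(self):
        self.bias = 0.0  # stand-in for real parameters

    def forward(self, src, trg):
        mismatches = sum(s != t for s, t in zip(src, trg))
        return mismatches + self.bias

    def update(self, lr):
        self.bias = max(0.0, self.bias - lr)  # trivial 'optimizer step'


def train(model, data, epochs=2, lr=0.1):
    """Generic loop: works with any Seq2SeqModel, no architecture knowledge."""
    for epoch in range(epochs):
        random.shuffle(data)
        total = sum(model.forward(src, trg) for src, trg in data)
        model.update(lr)
        print(f"epoch {epoch}: loss={total / len(data):.3f}")


if __name__ == "__main__":
    pairs = [(["a", "b"], ["a", "b"]), (["a", "c"], ["a", "b"])]
    train(ToyCopyModel(), pairs)
```

Under such a separation, adding a new architecture amounts to writing one new model class; the training loop, data handling, and decoding utilities are reused unchanged.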

Cited by 57 publications (43 citation statements)
References 21 publications
“…We decode hypotheses using a beam size of 10. The experiments are conducted using nmtpytorch [23].…”
Section: Results
mentioning confidence: 99%
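The excerpt above reports decoding with a beam of size 10. As a reminder of what that entails, the following hedged sketch implements generic beam search over a toy scoring function; toy_log_probs is an illustrative stand-in, not code from nmtpy or nmtpytorch.

```python
# Hedged sketch of beam-search decoding as referenced in the excerpt above.
# The toy next-token scorer is an illustrative assumption.
import math
from heapq import nlargest


def toy_log_probs(prefix, vocab):
    """Hypothetical scorer: slightly prefer repeating the last token."""
    scores = {}
    for tok in vocab:
        bonus = 0.5 if prefix and tok == prefix[-1] else 0.0
        scores[tok] = bonus
    norm = math.log(sum(math.exp(s) for s in scores.values()))
    return {tok: s - norm for tok, s in scores.items()}


def beam_search(vocab, eos="</s>", beam_size=10, max_len=5):
    """Keep the beam_size best partial hypotheses at every step."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in toy_log_probs(prefix, vocab).items():
                candidates.append((prefix + [tok], score + lp))
        # Prune to the best beam_size hypotheses.
        beams = nlargest(beam_size, candidates, key=lambda c: c[1])
        finished.extend(b for b in beams if b[0][-1] == eos)
        beams = [b for b in beams if b[0][-1] != eos]
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[1])


if __name__ == "__main__":
    best, logp = beam_search(["a", "b", "</s>"])
    print(" ".join(best), logp)
```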
“…The model hyperparameters are the same ones as in [10]. All models are implemented using the nmtpytorch framework [21]. For each experiment, we train three models with different random seeds and report the average results.…”
Section: Model Implementation
mentioning confidence: 99%
“…By ensembling 3 networks with different configs and rescoring using a model trained with reversed target sentences, we managed to reach a 26.96 BLEU score on the development set, which yields a 2.8-point improvement over the baseline model. Details about the effect of each technique are described in Pham et al. (2017). 3.3 LIMSI: LIMSI's input to this system combination consists of two NMT systems, both trained with the NMTPY framework (Caglayan et al., 2017) on bitext, then on synthetic parallel data. All of them were rescored with a Nematus system (Sennrich et al., 2017b).…”
Section: KIT
mentioning confidence: 99%