Brian Yan scite author profile

End-to-end approaches for sequence tasks are becoming increasingly popular. Yet for complex sequence tasks, like speech translation, systems that cascade several models trained on sub-tasks have been shown to be superior, suggesting that the compositionality of cascaded systems simplifies learning and enables sophisticated search capabilities. In this work, we present an end-to-end framework that exploits compositionality to learn searchable hidden representations at intermediate stages of a sequence model using decomposed sub-tasks. These hidden intermediates can be improved using beam search to enhance the overall performance and can also incorporate external models at intermediate stages of the network to re-score or adapt towards out-of-domain data. One instance of the proposed framework is a Multi-Decoder model for speech translation that extracts the searchable hidden intermediates from a speech recognition sub-task. The model demonstrates the aforementioned benefits and outperforms the previous state-of-the-art by around +6 and +3 BLEU on the two test sets of Fisher-CallHome and by around +3 and +4 BLEU on the English-German and English-French test sets of

Joint Modeling of Code-Switched and Monolingual ASR via Conditional Factorization

Yan, Zhang, Yu et al. 2022

ESPnet-ST IWSLT 2021 Offline Speech Translation System

Inaguma, Yan, Dalmia et al. 2021

This paper describes the ESPnet-ST group's IWSLT 2021 submission in the offline speech translation track. This year we made various efforts on training data, architecture, and audio segmentation. On the data side, we investigated sequence-level knowledge distillation (SeqKD) for end-to-end (E2E) speech translation. Specifically, we used multi-referenced SeqKD from multiple teachers trained on different amounts of bitext. On the architecture side, we adopted the Conformer encoder and the Multi-Decoder architecture, which equips dedicated decoders for speech recognition and translation tasks in a unified encoder-decoder model and enables search in both source and target language spaces during inference. We also significantly improved audio segmentation by using the pyannote.audio toolkit and merging multiple short segments for long context modeling. Experimental evaluations showed that each of them contributed to large improvements in translation performance. Our best E2E system combined all the above techniques with model ensembling and achieved 31.4 BLEU on the 2-ref of tst2021 and 21.2 BLEU and 19.3 BLEU on the two single references of tst2021.
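The abstract mentions merging multiple short audio segments for long-context modeling but does not spell out the procedure. The following is only a minimal sketch of one plausible greedy merge, not the authors' implementation. The function name `merge_short_segments`, the `(start, end)` tuple representation in seconds, and the thresholds `max_len` and `max_gap` are all assumptions made for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); representation is an assumption


def merge_short_segments(
    segments: List[Segment],
    max_len: float = 20.0,  # hypothetical cap on merged segment duration (seconds)
    max_gap: float = 1.0,   # hypothetical largest silence allowed inside a merge (seconds)
) -> List[Segment]:
    """Greedily merge time-adjacent segments into longer chunks.

    A segment is appended to the previous chunk when the silence between
    them is at most max_gap and the merged chunk stays within max_len.
    """
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end
            if gap <= max_gap and end - prev_start <= max_len:
                # Extend the previous chunk instead of starting a new one.
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged


if __name__ == "__main__":
    vad_output = [(0.0, 2.0), (2.5, 5.0), (9.0, 12.0), (12.2, 30.0)]
    print(merge_short_segments(vad_output))
```

In this sketch the input could come from any voice activity detector; a segment that would push a chunk past `max_len` simply starts a new chunk, so no audio is dropped.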

Differentiable Allophone Graphs for Language-Universal Speech Recognition

Yan, Dalmia, Mortensen et al. 2021


ESPnet-SLU: Advancing Spoken Language Understanding Through ESPnet

Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks
