Unified Multimodal Punctuation Restoration Framework for Mixed-Modality Corpus

Zhu, Yaoming; Wu, Liwei; Cheng, Shanbo; Wang, Mingxuan

doi:10.1109/icassp43922.2022.9747131

Cited by 6 publications

(3 citation statements)

References 18 publications

“…Our results reported in Table 3 includes a comparison with current state of the art (SOTA) and best-performing models, MuSe [20] and UniPunc [21]. We also divide the reporting of EfficientPunct's results into three categories:…”

Section: Results (mentioning)

confidence: 99%

“…Current state of the art models begin in separate branches: one to tokenize and process text and the other to process raw audio waveforms. They then use the attention mechanism [19] to fuse text and acoustic embeddings [20,21].…”

Section: Related Work (mentioning)

confidence: 99%

“…Our primary dataset is the publicly available MuST-C version 1 [27], the same as that used by UniPunc [21] for sake of fair comparison. This dataset was compiled using TED talks.…”

Section: Data (mentioning)

confidence: 99%

Efficient Ensemble Architecture for Multimodal Acoustic and Textual Embeddings in Punctuation Restoration using Time-Delay Neural Networks

Liu; Beigi

2023

Punctuation restoration plays an essential role in the postprocessing procedure of automatic speech recognition, but model efficiency is a key requirement for this task. To that end, we present EfficientPunct, an ensemble method with a multimodal time-delay neural network that outperforms the current best model by 1.0 F1 points, using less than a tenth of its parameters to process embeddings. We streamline a speech recognizer to efficiently output hidden layer latent vectors as acoustic embeddings for punctuation restoration, as well as BERT to extract meaningful text embeddings. By using forced alignment and temporal convolutions, we eliminate the need for multihead attention-based fusion, greatly increasing computational efficiency but also raising performance. EfficientPunct sets a new state of the art, in terms of both performance and efficiency, with an ensemble that weights BERT's purely language-based predictions slightly more than the multimodal network's predictions.
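
The ensembling step described at the end of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The label set, the function names, and the weight of 0.6 are assumptions chosen for illustration; the abstract says only that the text-only predictions are weighted "slightly more" than the multimodal ones.

```python
# Hypothetical label set; the abstract does not list the punctuation classes.
LABELS = ["none", "comma", "period", "question"]


def ensemble_probs(text_probs, multimodal_probs, text_weight=0.6):
    """Blend two per-token class distributions with a convex combination.

    text_weight=0.6 is an assumed value standing in for "slightly more"
    weight on the text-only (BERT) predictions.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    if len(text_probs) != len(multimodal_probs):
        raise ValueError("both distributions must cover the same classes")
    return [
        text_weight * t + (1.0 - text_weight) * m
        for t, m in zip(text_probs, multimodal_probs)
    ]


def predict_label(text_probs, multimodal_probs, text_weight=0.6):
    """Return the label with the highest blended probability."""
    blended = ensemble_probs(text_probs, multimodal_probs, text_weight)
    best_index = max(range(len(blended)), key=blended.__getitem__)
    return LABELS[best_index]


# Made-up distributions for one token: the text model prefers a comma,
# the multimodal model prefers a period.
text_only = [0.1, 0.6, 0.2, 0.1]
multimodal = [0.1, 0.2, 0.6, 0.1]
print(predict_label(text_only, multimodal))
```

Because the weights sum to one, the blended vector remains a probability distribution. When the two models disagree with equal confidence, as in the example inputs, the more heavily weighted model decides the label.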
