Audio-Visual Multi-person Keyword Spotting via Hybrid Fusion

Su, Yuxin; Miao, Ziling; Liu, Hong

doi:10.1007/978-3-031-20500-2_27

Cited by 1 publication

(10 citation statements)

References 16 publications


“…TABLE 2 Accuracy (%) in different noise cases with the method proposed by [27] on the PKU-KWS dataset.…”

Section: Results on the LRS2-KWS Dataset

mentioning

confidence: 99%

“…As shown in Table 2, with the advantage of the proposed audio-visual transformer architecture, our AVKT model achieves much higher accuracy than Ref. [27]. The accuracy of the audio-only model of Ref.…”

Section: Comparison With the State-of-the-art

mentioning

confidence: 92%

“…We compare the AVKT model with two state-of-the-art methods proposed in Refs. [25,27]. For the MCNN-based model [25], its relatively simple structure is difficult to converge on the sentence-level dataset.…”

Section: Comparison With the State-of-the-art

mentioning

confidence: 99%

“…Note that the accuracy at −10 dB is not reported in the experimental results of Ref. [27]. The experiments are performed on the PKU-KWS dataset.…”

Section: Comparison With the State-of-the-art

mentioning

confidence: 99%

“…Note: Bold text indicates the best results. Abbreviations: AO, audio-only; VO, visual-only; AV, audio-visual; Su et al.: [27]. TABLE 3 Accuracy (%) of the proposed AVKT on the LRS2-KWS dataset under SNRs vary from 10 dB to −10 dB.…”

mentioning

confidence: 99%


Audio–visual keyword transformer for unconstrained sentence‐level keyword spotting

Jia-le, Wang, et al. 2023

CAAI Trans on Intel Tech

Self Cite


As one of the most effective methods to improve the accuracy and robustness of speech tasks, the audio–visual fusion approach has recently been introduced into the field of Keyword Spotting (KWS). However, existing audio–visual keyword spotting models are limited to detecting isolated words, while keyword spotting for unconstrained speech is still a challenging problem. To this end, an Audio–Visual Keyword Transformer (AVKT) network is proposed to spot keywords in unconstrained video clips. The authors present a transformer classifier with learnable CLS tokens to extract distinctive keyword features from the variable‐length audio and visual inputs. The outputs of audio and visual branches are combined in a decision fusion module. As humans can easily notice whether a keyword appears in a sentence or not, our AVKT network can detect whether a video clip with a spoken sentence contains a pre‐specified keyword. Moreover, the position of the keyword is localised in the attention map without additional position labels. Experimental results on the LRS2‐KWS dataset and our newly collected PKU‐KWS dataset show that the accuracy of AVKT exceeded 99% in clean scenes and 85% in extremely noisy conditions. The code is available at https://github.com/jialeren/AVKT.
