Dual Attention in Time and Frequency Domain for Voice Activity Detection

Lee, Joohyung; Jung, Yongju; Kim, Hoirin

doi:10.21437/interspeech.2020-0997

Cited by 7 publications

(4 citation statements)

References 27 publications

(33 reference statements)

“…A voice has frequency and amplitude. A frequency is the number of occurrence of repeating waveform per unit of time [16]. An amplitude is the maximum distance or displacement from the center of vibration when repeating vibrations occur.…”

Section: Trends Of Voice Frequency Analysis Technology (mentioning)

confidence: 99%

Voice Frequency Synthesis using VAW-GAN based Amplitude Scaling for Emotion Transformation

2022

KSII TIIS

Mostly, artificial intelligence does not show any definite change in emotions. For this reason, it is hard to demonstrate empathy in communication with humans. If frequency modification is applied to neutral emotions, or if a different emotional frequency is added to them, it is possible to develop artificial intelligence with emotions. This study proposes the emotion conversion using the Generative Adversarial Network (GAN) based voice frequency synthesis. The proposed method extracts a frequency from speech data of twenty-four actors and actresses. In other words, it extracts voice features of their different emotions, preserves linguistic features, and converts emotions only. After that, it generates a frequency in variational auto-encoding Wasserstein generative adversarial network (VAW-GAN) in order to make prosody and preserve linguistic information. That makes it possible to learn speech features in parallel. Finally, it corrects a frequency by employing Amplitude Scaling. With the use of the spectral conversion of logarithmic scale, it is converted into a frequency in consideration of human hearing features. Accordingly, the proposed technique provides the emotion conversion of speeches in order to express emotions in line with artificially generated voices or speeches.

“…Although E2E ASR focusing on feature extraction in the frequency direction has been proposed, there are few examples of research on ASR models that apply an attention mechanism in the frequency direction. However, simultaneous temporal and frequency-directional attention mechanisms have been proposed in voice activity detection (VAD) [10] and speech enhancement processing [11].…”

Section: Introduction (mentioning)

confidence: 99%

Frequency-Directional Attention Model for Multilingual Automatic Speech Recognition

Dobashi, Leow, Nishizaki

2022

Preprint

This paper proposes a model for transforming speech features using the frequency-directional attention model for End-to-End (E2E) automatic speech recognition. The idea is based on the hypothesis that in the phoneme system of each language, the characteristics of the frequency bands of speech when uttering them are different. By transforming the input Mel filter bank features with an attention model that characterizes the frequency direction, a feature transformation suitable for ASR in each language can be expected. This paper introduces a Transformer-encoder as a frequency-directional attention model. We evaluated the proposed method on a multilingual E2E ASR system for six different languages and found that the proposed method could achieve, on average, 5.3 points higher accuracy than the ASR model for each language by introducing the frequency-directional attention mechanism. Furthermore, visualization of the attention weights based on the proposed method suggested that it is possible to transform acoustic features considering the frequency characteristics of each language.
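
To make the idea of attention along the frequency direction concrete, the following is a minimal sketch, not the authors' model. It assumes a single-head scaled dot-product self-attention in which each Mel bin is treated as one token whose feature vector is that bin's trajectory over time. The function name `frequency_attention`, the projection size `d_k`, and the random (untrained) projection matrices are all hypothetical; the paper itself uses a Transformer encoder with learned weights.

```python
import numpy as np


def frequency_attention(feats, d_k=16, seed=0):
    """Self-attention across frequency bins (illustrative only).

    feats: array of shape (T, F), e.g. T frames of F Mel filter bank values.
    Returns (transformed features of shape (T, F), attention weights (F, F)).
    """
    rng = np.random.default_rng(seed)
    T, F = feats.shape

    # One token per frequency bin; its features are the values over time.
    tokens = feats.T  # (F, T)

    # Hypothetical random projections standing in for learned weights.
    W_q = rng.standard_normal((T, d_k)) / np.sqrt(T)
    W_k = rng.standard_normal((T, d_k)) / np.sqrt(T)
    W_v = rng.standard_normal((T, T)) / np.sqrt(T)

    Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v

    # Scaled dot-product scores between every pair of frequency bins.
    scores = Q @ K.T / np.sqrt(d_k)  # (F, F)

    # Numerically stable softmax over the key (frequency) axis.
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)

    # Each output bin is a weighted mix of all bins' value vectors.
    out = weights @ V  # (F, T)
    return out.T, weights
```

Row `i` of the returned weight matrix shows how strongly bin `i` draws on every other bin, which is the kind of quantity the abstract refers to when it mentions visualizing attention weights.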

“…Voice activity detection (VAD) is a technique to classify an acoustic segment into speech or non-speech, which is an important frontend step in a wide range of tasks such as speaker verification [1,2], emotion estimation [3], and automatic speech recognition [4]. Although many strategies have been proposed for VAD such as time-domain-energy-based methods and likelihood-ratio-based methods [5][6][7][8], fully neural network based methods have shown promising performance even under low signal-to-noise ratio (SNR) environments [9][10][11][12][13][14][15][16].…”

Section: Introduction (mentioning)

confidence: 99%

Enrollment-Less Training for Personalized Voice Activity Detection

Makishima, Ihori, Tanaka, et al.

2021

Interspeech 2021

We present a novel personalized voice activity detection (PVAD) learning method that does not require enrollment data during training. PVAD is a task to detect the speech segments of a specific target speaker at the frame level using enrollment speech of the target speaker. Since PVAD must learn speakers' speech variations to clarify the boundary between speakers, studies on PVAD used large-scale datasets that contain many utterances for each speaker. However, the datasets to train a PVAD model are often limited because substantial cost is needed to prepare such a dataset. In addition, we cannot utilize the datasets used to train the standard VAD because they often lack speaker labels. To solve these problems, our key idea is to use one utterance as both a kind of enrollment speech and an input to the PVAD during training, which enables PVAD training without enrollment speech. In our proposed method, called enrollment-less training, we augment one utterance so as to create variability between the input and the enrollment speech while keeping the speaker identity, which avoids the mismatch between training and inference. Our experimental results demonstrate the efficacy of the method.
