ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747421
Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemic Analysis

Cited by 18 publications (14 citation statements)
References 18 publications
“…In this work, we select ResNet-34 [47] with Squeeze-and-Excitation blocks [48], which is state-of-the-art network in sound event recognition tasks [21,49,50] and speaker recognition tasks [20,[51][52][53], in order to focus only on the custom sound events and customization methods. The detailed structure is described in Table II.…”
Section: Sound Event Recognition Network Architecture
confidence: 99%
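The excerpt above builds on ResNet-34 with Squeeze-and-Excitation blocks. As a minimal PyTorch sketch of an SE block, the channel count and reduction ratio below are illustrative assumptions, not values from the cited work:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight channels using globally pooled context."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pool
        self.fc = nn.Sequential(                        # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                    # per-channel rescaling

# usage on a (batch, channel, freq, time) feature map
feat = torch.randn(4, 64, 40, 100)
out = SEBlock(64)(feat)
```

In a ResNet-34 variant, such a block is typically inserted after the second convolution of each residual block, before the skip connection is added.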
“…CRNN with transformer [10,11] and conformer [12] widely used in automatic speech recognition achieved state-of-the-art performance in SED [13][14][15][16][17]. CRNN with frequency dynamic convolution, which is the content-adaptive model [18][19][20], improved SED performance by considering frequency dependencies as well as temporal dependencies [21]. In addition, data augmentation methods [22][23][24] improved not only performance but also robustness of SED model.…”
Section: Introduction
confidence: 99%
“…In the audio domain, recent developments of dynamic convolutions involve temporal dynamic convolutions (TDY) [44] and frequency dynamic convolutions (FDY) [45]. TDY dynamically adapts the filters along the time axis to consider time-varying characteristics of speech; FDY has been shown to improve sound event detection by dynamically adapting the filters along the frequency axis, addressing the fact that the frequency dimension is not shift-invariant.…”
Section: B. Dynamic CNN Components
confidence: 99%
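Since this excerpt summarizes how TDY adapts kernels along the time axis, a rough sketch of a temporal dynamic convolution layer is given below. It follows the generic dynamic-convolution recipe (K basis kernels mixed by per-frame attention weights); the layer sizes, frequency-average pooling for the attention branch, and all names are assumptions for illustration, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDynamicConv2d(nn.Module):
    """2D convolution whose effective kernel varies per time frame:
    K basis kernels are combined with attention weights computed for each frame."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, n_basis: int = 4):
        super().__init__()
        self.basis = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
            for _ in range(n_basis)
        )
        # per-frame attention over the basis kernels, computed from
        # frequency-averaged input features
        self.attn = nn.Conv1d(in_ch, n_basis, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, ch, freq, time)
        ctx = x.mean(dim=2)                                # (batch, ch, time)
        w = F.softmax(self.attn(ctx), dim=1)               # (batch, K, time)
        w = w.unsqueeze(2).unsqueeze(3)                    # (batch, K, 1, 1, time)
        outs = torch.stack([conv(x) for conv in self.basis], dim=1)
        return (w * outs).sum(dim=1)                       # time-varying kernel mixture

feat = torch.randn(2, 32, 40, 100)                         # (batch, ch, mel, frames)
y = TemporalDynamicConv2d(32, 64)(feat)                    # -> (2, 64, 40, 100)
```

Because convolution is linear in the kernel weights, mixing the K convolution outputs per frame is equivalent to convolving with a per-frame mixture kernel; a frequency dynamic variant would instead compute the attention weights per frequency bin.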
“…These features mainly included the X-vector learned by a Time-Delay Neural Network (TDNN) [19]- [23] or an Emphasized Channel Attention, Propagation and Aggregation in TDNN (ECAPA-TDNN) [24]; the R-vector learned by a Residual Network with 34 layers (ResNet34) [25]; the S-vector learned by a Transformer [26]. In addition, other kinds of neural networks were adopted to learn deep embeddings [27]- [35], such as temporal dynamic convolutional neural network [31], Attentive Multi-scale Convolutional Recurrent Network (AMCRN) [33], Siamese neural network [34], and long short-term memory network [35].…”
Section: Related Work
confidence: 99%
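The last excerpt surveys embedding extractors such as the TDNN-based x-vector. A condensed sketch of that style of extractor (dilated 1D convolutions over frames followed by statistics pooling) is shown below; the layer widths and embedding size are illustrative assumptions, not the cited systems' configurations:

```python
import torch
import torch.nn as nn

class XVectorStyleExtractor(nn.Module):
    """TDNN-style frame encoder + statistics pooling -> fixed-length utterance embedding."""
    def __init__(self, feat_dim: int = 40, emb_dim: int = 192):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.embedding = nn.Linear(2 * 512, emb_dim)       # mean and std are concatenated

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, feat_dim, frames)
        h = self.frame_layers(x)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # statistics pooling
        return self.embedding(stats)                        # speaker embedding

emb = XVectorStyleExtractor()(torch.randn(3, 40, 200))      # -> (3, 192)
```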