Speaker conditioning of acoustic models using affine transformation for multi-speaker speech recognition

Yousefi, Midia

doi:10.48550/arxiv.2111.00320

2021

Preprint

Midia Yousefi¹

Abstract: This study addresses single-channel automatic speech recognition (ASR) of a target speaker in overlapped speech. In the proposed method, the hidden representations of the acoustic model are modulated by auxiliary speaker information so that only the desired speaker is recognized. Affine transformation layers are inserted into the acoustic model network to integrate the speaker information with the acoustic features. The speaker conditioning process allows the acoustic model to perform computation in t…
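The affine modulation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the layer's exact form, so the following assumes a common formulation in which a scale vector (gamma) and a shift vector (beta) are each produced by a linear projection of a speaker embedding and applied element-wise to one hidden vector. The function names, the weights, and the two-dimensional speaker embedding are all hypothetical and chosen only for illustration; this is not the paper's implementation.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute y = W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def affine_condition(hidden: Vector, speaker_emb: Vector,
                     w_gamma: Matrix, b_gamma: Vector,
                     w_beta: Matrix, b_beta: Vector) -> Vector:
    """Modulate a hidden vector with a speaker-dependent scale and shift.

    Assumed form (not confirmed by the abstract):
        h' = gamma(s) * h + beta(s), applied element-wise.
    """
    gamma = linear(speaker_emb, w_gamma, b_gamma)  # speaker-dependent scale
    beta = linear(speaker_emb, w_beta, b_beta)     # speaker-dependent shift
    return [g * h + b for g, h, b in zip(gamma, hidden, beta)]


# Toy values: a 3-dimensional hidden vector and a 2-dimensional speaker embedding.
hidden = [1.0, -2.0, 0.5]
speaker_emb = [1.0, 0.0]
w_gamma = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_gamma = [0.0, 1.0, 0.0]
w_beta = [[0.5, 0.0], [0.0, 0.0], [-1.0, 0.0]]
b_beta = [0.0, 0.0, 0.0]

conditioned = affine_condition(hidden, speaker_emb,
                               w_gamma, b_gamma, w_beta, b_beta)
print(conditioned)
```

In this sketch the same hidden vector yields different outputs for different speaker embeddings, which is the sense in which the speaker information steers the acoustic model toward one talker.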

Cited by 1 publication

(1 citation statement)

References 21 publications

(24 reference statements)

“…While extensive research has explored speaker recognition by machines [5], the current task requires expanded knowledge and capabilities. However, even for humans with normal hearing abilities, the capacity of the human auditory system to extract and separate simultaneous sources out of a mixture is severely compromised [5], [6], [7]. As reported in [8], humans are capable of detecting up to three simultaneous active speakers without using spatial information of the input mixture.…”

mentioning

confidence: 99%

Single-channel speech separation using Soft-minimum Permutation Invariant Training

Yousefi, Hansen

2021

Preprint

Self Cite

The goal of speech separation is to extract multiple speech sources from a single microphone recording. With the advancement of deep learning and the availability of large datasets, speech separation has been formulated as a supervised learning problem. These approaches aim to learn discriminative patterns of speech, speakers, and background noise using a supervised learning algorithm, typically a deep neural network. A long-standing problem in supervised speech separation is finding the correct label for each separated speech signal, referred to as label permutation ambiguity: the problem of determining the output-label assignment between the separated sources and the available single-speaker speech labels. The best output-label assignment is needed to compute the separation error, which is then used to update the model parameters. Permutation Invariant Training (PIT) has been shown to be a promising solution to the label ambiguity problem. However, PIT's overconfident choice of the output-label assignment results in a sub-optimal trained model. In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment. The proposed method, trainable Soft-minimum PIT, is applied to the same Long Short-Term Memory (LSTM) architecture used in the PIT speech separation method. Experimental results show that the proposed method significantly outperforms conventional PIT speech separation (p-value < 0.01), by +1 dB in Signal to Distortion Ratio (SDR) and +1.5 dB in Signal to Interference Ratio (SIR).
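The difference between a hard and a soft choice of output-label assignment can be sketched as follows. The abstract does not state the exact soft-minimum formulation, so this example assumes a generic log-sum-exp smooth minimum over all permutation losses with a fixed temperature; the function names, the mean-squared-error criterion, and the temperature value are illustrative assumptions, not the authors' method.

```python
import math
from itertools import permutations
from typing import List, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def permutation_losses(outputs: List[List[float]],
                       targets: List[List[float]]) -> List[float]:
    """Average error of every possible output-label assignment."""
    n = len(outputs)
    return [sum(mse(outputs[i], targets[p[i]]) for i in range(n)) / n
            for p in permutations(range(n))]


def pit_loss(outputs: List[List[float]], targets: List[List[float]]) -> float:
    """Conventional PIT: keep only the single best assignment (hard minimum)."""
    return min(permutation_losses(outputs, targets))


def soft_min_loss(outputs: List[List[float]], targets: List[List[float]],
                  temperature: float = 1.0) -> float:
    """Assumed smooth minimum: -T * log(sum(exp(-loss / T))).

    Every assignment contributes, weighted by how small its loss is.
    The minimum is subtracted before exponentiating for numerical stability.
    """
    losses = permutation_losses(outputs, targets)
    m = min(losses)
    total = sum(math.exp(-(loss - m) / temperature) for loss in losses)
    return m - temperature * math.log(total)


# Two separated outputs that match the targets in swapped order.
outputs = [[1.0, 0.0], [0.0, 1.0]]
targets = [[0.0, 1.0], [1.0, 0.0]]

hard = pit_loss(outputs, targets)
soft = soft_min_loss(outputs, targets, temperature=0.1)
print(hard, soft)
```

With a small temperature the smooth minimum stays close to the hard PIT loss, while a larger temperature lets the competing assignments influence the value more strongly.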
