2021
DOI: 10.48550/arxiv.2111.00320
Preprint

Speaker conditioning of acoustic models using affine transformation for multi-speaker speech recognition

Abstract: This study addresses the problem of single-channel Automatic Speech Recognition of a target speaker within an overlap speech scenario. In the proposed method, the hidden representations in the acoustic model are modulated by speaker auxiliary information to recognize only the desired speaker. Affine transformation layers are inserted into the acoustic model network to integrate speaker information with the acoustic features. The speaker conditioning process allows the acoustic model to perform computation in t…
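
The abstract does not spell out the form of the affine transformation layers, so the following is only an illustrative sketch, assuming a per-channel scale and shift predicted from a speaker embedding. The function name `affine_condition`, the linear projections, and all shapes are hypothetical and are not taken from the paper.

```python
import numpy as np

def affine_condition(hidden, speaker_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift hidden features using a speaker embedding (assumed form).

    hidden:      (frames, hidden_dim) acoustic-model activations
    speaker_emb: (emb_dim,) auxiliary vector for the target speaker
    """
    gamma = speaker_emb @ w_gamma + b_gamma  # (hidden_dim,) per-channel scale
    beta = speaker_emb @ w_beta + b_beta     # (hidden_dim,) per-channel shift
    return hidden * gamma + beta             # broadcast over the frame axis

# Toy dimensions chosen only for illustration.
rng = np.random.default_rng(0)
frames, hidden_dim, emb_dim = 4, 8, 3
hidden = rng.standard_normal((frames, hidden_dim))
speaker_emb = rng.standard_normal(emb_dim)

w_gamma = 0.1 * rng.standard_normal((emb_dim, hidden_dim))
b_gamma = np.ones(hidden_dim)    # scale starts near 1
w_beta = 0.1 * rng.standard_normal((emb_dim, hidden_dim))
b_beta = np.zeros(hidden_dim)    # shift starts near 0

conditioned = affine_condition(hidden, speaker_emb, w_gamma, b_gamma, w_beta, b_beta)
```

With zero projection weights, unit scale bias, and zero shift bias, the layer reduces to the identity, which is one reason such a layer can be inserted into an existing network without disturbing it at initialization.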

Citations: cited by 1 publication (1 citation statement)
References: 21 publications (24 reference statements)
“…While extensive research has explored speaker recognition by machines [5], the current task requires expanded knowledge and capabilities. However, even for humans with normal hearing abilities, the capacity of the human auditory system to extract and separate simultaneous sources out of a mixture is severely compromised [5], [6], [7]. As reported in [8], humans are capable of detecting up to three simultaneous active speakers without using spatial information of the input mixture.…”
mentioning
confidence: 99%