2020
DOI: 10.48550/arxiv.2004.00526
Preprint

Improved RawNet with Feature Map Scaling for Text-independent Speaker Verification using Raw Waveforms

Abstract: Recent advances in deep learning have facilitated the design of speaker verification systems that directly input raw waveforms. For example, RawNet extracts speaker embeddings from raw waveforms, which simplifies the process pipeline and demonstrates competitive performance. In this study, we improve RawNet by rescaling feature maps using various methods. The proposed mechanism utilizes a filter-wise rescale map that adopts a sigmoid non-linear function. It refers to a vector with dimensionality equal to the n…
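The filter-wise rescaling described in the abstract can be illustrated with a minimal sketch. It assumes the scale vector holds one sigmoid-bounded value per filter, obtained by averaging each filter's output over time and passing the result through a single linear layer, and that the scale is applied multiplicatively. The function names, the pooling step, the linear layer, and the multiplicative form are illustrative assumptions, not details confirmed by the abstract, which only says that "various methods" of rescaling are studied.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function; output lies in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def feature_map_scaling(feature_map, weights, bias):
    """Hypothetical filter-wise rescaling of a feature map.

    feature_map: list of F rows (one per filter), each a list of T values.
    weights:     F x F matrix of an assumed linear layer.
    bias:        list of F bias terms for that layer.
    Returns (scaled_feature_map, scales).
    """
    # Assumption: summarise each filter by its mean over the time axis.
    pooled = [sum(row) / len(row) for row in feature_map]

    # Linear layer followed by a sigmoid gives one scale per filter.
    scales = [
        sigmoid(sum(w * p for w, p in zip(w_row, pooled)) + b)
        for w_row, b in zip(weights, bias)
    ]

    # Assumption: apply the scale multiplicatively, broadcast over time.
    scaled = [[s * v for v in row] for s, row in zip(scales, feature_map)]
    return scaled, scales


if __name__ == "__main__":
    fmap = [[1.0, 3.0], [2.0, 2.0]]      # 2 filters, 2 time steps
    w = [[0.0, 0.0], [0.0, 0.0]]         # zero weights -> every scale is 0.5
    b = [0.0, 0.0]
    scaled, scales = feature_map_scaling(fmap, w, b)
    print(scales)
    print(scaled)
```

In a trained network the weights and bias would be learned, so each filter's scale would depend on the input utterance rather than being fixed.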

Cited by 2 publications (2 citation statements)
References 31 publications
“…In this work, Conv-TasNet and RawNet2 [31] (a pretrained waveform speaker verification model) are used as the basic model and SI model, respectively. Note the choices are not unique - for example, speaker separation models such as Dualpath methods [32]-[34] could also be used as the basic model.…”
Section: B Proposed Speaker-conditioned Model
Citation type: mentioning
confidence: 99%
“…X-vectors are also embeddings extracted with Time-Delay Neural Networks (TDNN) [17][18][19] at the segment level, and they became the standard method for speaker recognition applications. Some other deep architectures employed in speaker recognition are RawNet [20,21] and ResNet [22][23][24]. These types of embeddings are extracted in an end-to-end manner, meaning that the network is in charge of both finding adequate representations as well as determining the final decision related to the speaker-related task.…”
Section: Related Work
Citation type: mentioning
confidence: 99%