Triplet Based Embedding Distance and Similarity Learning for Text-independent Speaker Verification

Ren, Zongze; Chen, Zhiyong; Xu, Shugong

doi:10.1109/apsipaasc47483.2019.9023253

Cited by 22 publications

(5 citation statements)

References 18 publications

(21 reference statements)


“…Unlike multi-class model that only takes class feature into account, the GE2E part can help getting similar embedding closer and separating different embedding apart. [9,12]. So we can expect that this…”

Section: Joint Multi-class and Similarity (mentioning)

confidence: 84%


Transformer-based Environmental Sound Classification Modeling by Jointing Multi-class Classification and Similarity Clustering

Zhang, Zhao

2022

Preprint


As environmental sound signal is not as regular as speech, it has varying temporal structures and is more difficult to distinguish. Previous research on Environmental Sound Classification (ESC) has designed sophisticated methods to extract feature from raw waveforms directly, but this may not be generalized well across different ESC tasks. We proposed an end-to-end audio scene classification network which is only based on the log-mel feature. First, we used Transformer network to encode signals that can capture crucial temporal information from self-attention. Then we combined Multi-class classifier with similarity clustering so as to maximize the distance between different classes. At last, we visualized the Transformer’s ability to locate important temporal information. The performance on ESC10 and ESC50 showed that our architecture reached an average accuracy of 95.3% and 84.2%, respectively. That was an achievement of new state-of-the-art performance with only log-mel input. Meanwhile, that is nearly equivalent to the best performance of the model based on raw waveform or combined feature method.



“…Recently, Transformer-based models have demonstrated promising results in a variety of ASR and NLP tasks and are comparable to recurrent neural networks, as they can compute the attention weights in the whole input frame parallelly [6][7][8][9][10]. That ability would contribute to learning necessary feature from signal itself.…”

Section: Introduction (mentioning)

confidence: 99%

Transformer-based Environmental Sound Classification Modeling by Jointing Multi-class Classification and Similarity Clustering

Zhang, Zhao

2022

Preprint


“…In our approach we estimate the location likelihood of a platform using a modified version of VGG16 presented by [Kim, 2017]. The network is trained with a triplet margin loss [Veit et al, 2017] [Ren, 2019] [Hermans et al, 2017] based on a cosine distance between an anchor, positive and negative triple as shown in Eq. (1) and Eq.…”

Section: Approach and Experimental Results (mentioning)

confidence: 99%
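
The citation statement above refers to a triplet margin loss computed on a cosine distance between an anchor, a positive and a negative sample. The equations it points to are not reproduced on this page, so the following is only a minimal sketch of the commonly used hinge form, max(0, d(a, p) - d(a, n) + margin), and not the citing authors' exact formulation. The function names, the toy vectors and the margin value of 0.2 are assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The margin default is an illustrative assumption, not a value
    taken from any of the papers listed on this page.
    """
    d_ap = cosine_distance(anchor, positive)
    d_an = cosine_distance(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


# Toy 2-D embeddings (hypothetical values).
anchor = [1.0, 0.0]
positive = [1.0, 0.0]   # same direction as the anchor
negative = [0.0, 1.0]   # orthogonal to the anchor

# Well-separated triplet: the loss is clipped at zero.
easy = triplet_margin_loss(anchor, positive, negative)
# Roles swapped: the "positive" is far and the "negative" is close.
hard = triplet_margin_loss(anchor, negative, positive)
```

In this sketch a triplet contributes no loss once the negative is farther from the anchor than the positive by at least the margin, which is why `easy` evaluates to zero while `hard` stays positive.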

Detection and Isolation of 3D Objects in Unstructured Environments

Couto, Butterfield, Murphy et al. 2022

24th Irish Machine Vision and Image Processing Conference


3D machine vision is a growing trend in the field of automation for Object Of Interest (OOI) interactions. This is most notable in sectors such as unorganised bin picking for manufacturing and the integration of Autonomous Guided Vehicles (AGVs) in logistics. In the literature, there is a key focus on advancing this area of research through methods of OOI recognition and isolation to simplify more established OOI analysis operations. The main constraint in current OOI isolation methods is the loss of important data and a long process duration which extends the overall run-time of 3D machine vision operations. In this paper we propose a new method of OOI isolation that utilises a combination of classical image processing techniques to reduce OOI data loss and improve run-time efficiency. Results show a high level of data retention with comparable faster run-times to previous research. This paper also hopes to present a series of run-time data points to set a standard for future process run-time comparisons.


“…This vector is obtained from the output of a speaker verification model trained to minimize a triplet loss. This speaker verification model is pre-trained using pairs of utterances, similar to [18]. For every speaker, the vectors corresponding to all their utterances are pre-computed and then averaged to form the speaker vectors.…”

Section: Methods (mentioning)

confidence: 99%
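
The citation statement above describes pre-computing a vector for every utterance of a speaker and averaging those vectors to form the speaker vector. A minimal sketch of that averaging step follows. It assumes the per-utterance embeddings have already been produced by some speaker verification model; the function name, the speaker IDs and the 2-D toy values are hypothetical.

```python
def speaker_vector(utterance_embeddings):
    """Average a speaker's per-utterance embeddings element-wise."""
    if not utterance_embeddings:
        raise ValueError("need at least one utterance embedding")
    count = len(utterance_embeddings)
    dim = len(utterance_embeddings[0])
    return [
        sum(embedding[i] for embedding in utterance_embeddings) / count
        for i in range(dim)
    ]


# Hypothetical pre-computed embeddings, keyed by speaker ID.
embeddings_by_speaker = {
    "spk_a": [[1.0, 2.0], [3.0, 4.0]],
    "spk_b": [[0.0, 1.0]],
}

# One averaged vector per speaker.
speaker_vectors = {
    speaker: speaker_vector(embeddings)
    for speaker, embeddings in embeddings_by_speaker.items()
}
```

Whether the averaged vector is re-normalized afterwards is not stated in the excerpt, so the sketch leaves it unnormalized.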

Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows

Vallés-Pérez, Roth, Beringer et al. 2021

Preprint


Text-to-speech systems recently achieved almost indistinguishable quality from human speech. However, the prosody of those systems is generally flatter than natural speech, producing samples with low expressiveness. Disentanglement of speaker id and prosody is crucial in text-to-speech systems to improve on naturalness and produce more variable syntheses. This paper proposes a new neural text-to-speech model that approaches the disentanglement problem by conditioning a Tacotron2-like architecture on flow-normalized speaker embeddings, and by substituting the reference encoder with a new learned latent distribution responsible for modeling the intra-sentence variability due to the prosody. By removing the reference encoder dependency, the speaker-leakage problem typically happening in this kind of systems disappears, producing more distinctive syntheses at inference time. The new model achieves significantly higher prosody variance than the baseline in a set of quantitative prosody features, as well as higher speaker distinctiveness, without decreasing the speaker intelligibility. Finally, we observe that the normalized speaker embeddings enable much richer speaker interpolations, substantially improving the distinctiveness of the new interpolated speakers.

