Triplet Network with Attention for Speaker Diarization
Preprint, 2018
DOI: 10.48550/arxiv.1808.01535

Cited by 5 publications (5 citation statements)
References 14 publications

“…To further improve the model accuracy, this study uses triplet attention to improve the CSPDarknet53 feature extraction network in YOLOv4. The triplet attention module (Song et al., 2018) is an inexpensive and effective attention mechanism with few parameters and does not involve dimensionality reduction. It is an additional neural network, as shown in Figure 1.…”
Section: Methods (mentioning)
confidence: 99%
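The excerpt above describes a triplet attention block used inside the YOLOv4 backbone, not this preprint's diarization network. Purely as an illustration, a common formulation of such a block (three parallel gates, each pooling over one tensor dimension and convolving over the remaining two, with no channel reduction) can be sketched as follows; the class names, kernel size, and branch averaging are assumptions, not the cited implementation.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Stack max- and mean-pooling along dim 1 into a 2-channel map."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """Z-pool followed by a single conv and a sigmoid gate (no dimensionality reduction)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):
        return x * torch.sigmoid(self.bn(self.conv(self.pool(x))))

class TripletAttention(nn.Module):
    """Three gates, each attending over a different pair of tensor dimensions."""
    def __init__(self):
        super().__init__()
        self.gate_hc = AttentionGate()  # height interacts with channels
        self.gate_wc = AttentionGate()  # width interacts with channels
        self.gate_hw = AttentionGate()  # plain spatial attention

    def forward(self, x):  # x: (N, C, H, W)
        out1 = self.gate_hc(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        out2 = self.gate_wc(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        out3 = self.gate_hw(x)
        return (out1 + out2 + out3) / 3.0
```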
“…Therefore, the YOLO algorithm can predict the category and location of multiple objects in real time in a single pass, unlike traditional object detection algorithms, which use the sliding-window method, and the Faster R-CNN algorithm (Song et al., 2018).…”
Section: Methodology: Principle of YOLO (mentioning)
confidence: 99%
“…In particular, Huang J. et al [24], Ren et al [25], Kumar et al [26], and Harvill et al [27] use triplet loss with varied neural network architectures for the task of speech emotion recognition. Bredin [28] and Song et al [29] use triplet-loss-based learning approaches for speaker diarization, and Zhang and Koshida [30] and Li et al [31] for the related task of speaker verification.…”
Section: B. Previous Work on the Use of Triplet Loss for the Metric Em... (mentioning)
confidence: 99%
“…For instance, siamese networks (Bromley, Guyon, LeCun, Säckinger and Shah, 1994) and triplet networks (Hoffer and Ailon, 2015) are neural networks suitable for direct representation learning by minimizing the contrastive loss or triplet loss calculated in the latent embedding space. These techniques have shown promising results in face verification and identification (Schroff, Kalenichenko and Philbin, 2015) as well as in speech tasks such as speaker diarization and verification (Jati and Georgiou, 2019; Song, Willi, Thiagarajan, Berisha and Spanias, 2018).…”
Section: Related Work and Motivation (mentioning)
confidence: 99%
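The passage above describes triplet networks that learn an embedding space by minimizing a triplet loss over anchor, positive, and negative examples. As an illustration only, a minimal PyTorch sketch of that idea for speaker embeddings might look as follows; the encoder shape, feature dimension, and margin are assumptions, not the configuration of the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEmbeddingNet(nn.Module):
    """Hypothetical encoder: maps a fixed-size acoustic feature vector
    to an L2-normalised embedding (dimensions are illustrative)."""
    def __init__(self, in_dim=40, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

model = SpeakerEmbeddingNet()
criterion = nn.TripletMarginLoss(margin=0.2)  # margin value is an assumption

# anchor/positive come from the same speaker, negative from a different speaker
anchor, positive, negative = (torch.randn(32, 40) for _ in range(3))
loss = criterion(model(anchor), model(positive), model(negative))
loss.backward()
```

Training with such a loss pulls same-speaker embeddings together and pushes different-speaker embeddings apart, which is what makes the resulting space usable for clustering in diarization or for thresholded scoring in verification.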