Abstract: When performing speaker diarization on meeting recordings, multiple microphones of different qualities are usually available, distributed around the meeting room. Although several approaches have been proposed in recent years to take advantage of multiple microphones, they are either too computationally expensive and not easily scalable, or they cannot outperform the simpler case of using the best single microphone. In this work, the use of classic acoustic beamforming techniques is proposed, together with several novel algorithms, to create a complete frontend for speaker diarization in the meeting-room domain. New techniques presented include blind reference-channel selection, a two-step Time Delay of Arrival (TDOA) Viterbi postprocessing, and a dynamic output signal weighting algorithm, together with the use of the TDOA values in the diarization itself to complement the acoustic information. Tests on speaker diarization show a 25% relative improvement on the test set compared to using the single most centrally located microphone. Additional experimental results show improvements using these techniques in a speech recognition task.

Index Terms: acoustic beamforming, speaker diarization, speaker segmentation and clustering, meetings processing.
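TDOA estimation between a reference channel and each other microphone is the core operation in this kind of beamforming frontend, and is commonly done with the Generalized Cross-Correlation with Phase Transform (GCC-PHAT). The sketch below is a minimal illustration of that standard technique, not the paper's exact implementation; the function name and parameters are our own.

```python
import numpy as np

def gcc_phat(sig, ref, fs=16000, max_tau=None):
    """Estimate the delay (seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)                      # cross-power spectrum
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)  # PHAT-weighted correlation
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # reorder so the array covers lags -max_shift .. +max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs
```

In a multi-microphone frontend, the delay estimated this way for each channel is used to align the signals before they are summed (delay-and-sum beamforming).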
The use of Deep Belief Networks (DBNs) is proposed in this paper to discriminatively model target and impostor i-vectors in a speaker verification task. The authors propose to adapt the network parameters of each speaker from a background model, referred to as a Universal DBN (UDBN). It is also suggested to backpropagate class errors up to only one layer for a few iterations before training the full network. Additionally, an impostor selection method is introduced that helps the DBN outperform the cosine distance classifier. The evaluation is performed on the core test condition of the NIST SRE 2006 corpus, where relative improvements of 10% in EER and 8% in minDCF are achieved, respectively.
Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work at the short-utterance level. Given the speech signal, these algorithms extract a sequence of speaker embeddings from short segments, which are averaged to obtain an utterance-level speaker representation. In this work we propose the use of an attention mechanism to obtain a discriminative speaker embedding from speech utterances of variable length. Our system is based on a Convolutional Neural Network (CNN) that encodes short-term speaker features from the spectrogram, and a self multi-head attention model that maps these representations into a long-term speaker embedding. The attention model that we propose produces multiple alignments from different subsegments of the CNN-encoded states over the sequence. This mechanism thus works as a pooling layer that selects the most discriminative features over the sequence to obtain an utterance-level representation. We have tested this approach on the verification task of the VoxCeleb1 dataset. The results show that self multi-head attention outperforms both temporal and statistical pooling methods, with an 18% relative improvement in EER. Obtained results show a 58% relative improvement in EER compared to i-vector+PLDA.
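The pooling mechanism described above can be sketched with plain numpy: each head attends over its own slice of the encoded features, and the per-head weighted averages are concatenated into one utterance-level vector. This is a simplified illustration of multi-head attentive pooling under our own assumptions (single learned attention vector per head, dot-product scoring), not the paper's exact architecture.

```python
import numpy as np

def multi_head_attentive_pooling(H, W, n_heads):
    """Pool a sequence of encoder states into one utterance embedding.

    H: T x D array of CNN-encoded frame representations.
    W: D-dim attention parameter vector, split across heads.
    Returns a D-dim utterance-level embedding.
    """
    T, D = H.shape
    d = D // n_heads                      # per-head feature size
    heads = []
    for h in range(n_heads):
        Hh = H[:, h * d:(h + 1) * d]      # T x d slice for this head
        wh = W[h * d:(h + 1) * d]         # d-dim attention vector
        scores = Hh @ wh / np.sqrt(d)     # T alignment scores
        a = np.exp(scores - scores.max())
        a /= a.sum()                      # softmax over time
        heads.append(a @ Hh)              # attention-weighted average (d-dim)
    return np.concatenate(heads)          # D-dim utterance embedding
```

Because the output dimension does not depend on T, the same network can embed utterances of any length.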
The automatic estimation of age from face images is increasingly gaining attention, as it facilitates applications including advanced video surveillance, demographic statistics collection, customer profiling, and search optimization in large databases. Nevertheless, estimating age becomes challenging in uncontrolled environments, with insufficient and incomplete training data, strong person-specificity, and high within-range variance. These difficulties have recently been addressed with complex and strongly hand-crafted descriptors that are difficult to replicate and compare. This paper presents two novel approaches: first, a simple yet effective fusion of descriptors based on texture and local appearance; and second, a deep learning scheme for accurate age estimation. These methods have been evaluated under a diversity of settings, and the extensive experiments carried out on two large databases (MORPH and FRGC) demonstrate state-of-the-art results over previous work.
In speech recognition, a discriminative quefrency weighting can be achieved by somewhat decorrelating the frequency sequence of log mel-scaled filter-bank energies with a computationally inexpensive filter. In this paper, we show how the spectral parameters that result from this kind of frequency filtering, both alone and combined with filtering of their time trajectories, are competitive with respect to the conventional cepstral representations of speech signals.
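Frequency filtering of this kind is typically realised with a very short FIR filter applied along the frequency axis of each frame, a common choice being H(z) = z - z^{-1}, i.e. y[k] = x[k+1] - x[k-1]. The sketch below assumes that filter and zero padding at the band edges; both are illustrative choices, not necessarily the paper's exact configuration.

```python
import numpy as np

def frequency_filter(log_fbe):
    """Filter log mel filter-bank energies along the frequency axis.

    Applies H(z) = z - z^{-1} across mel bands: y[k] = x[k+1] - x[k-1],
    with zero padding at the band edges.

    log_fbe: T x K array (frames x mel bands).
    Returns a T x K array of frequency-filtered features.
    """
    x = np.pad(log_fbe, ((0, 0), (1, 1)))   # zero-pad the frequency axis
    return x[:, 2:] - x[:, :-2]
```

The filter is computationally trivial (one subtraction per band) yet partially decorrelates neighbouring bands, which is the property the abstract refers to.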
Jitter and shimmer are measures of the cycle-to-cycle variations of the fundamental frequency and amplitude, respectively. Both features have been widely used for the description of pathological voices, and since they characterise aspects of individual voices, they are expected to have a certain degree of speaker specificity. In the current work, jitter and shimmer are successfully used in a speaker verification experiment. Moreover, both measures are combined with spectral and prosodic features using several types of normalisation and fusion techniques in order to obtain better verification results. The overall speaker verification system is further improved by applying histogram equalisation as a normalisation technique prior to fusing the features with an SVM.
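In their basic ("local") relative form, jitter is the mean absolute difference between consecutive pitch periods normalised by the mean period, and shimmer is the analogous quantity for per-cycle peak amplitudes. A minimal sketch, assuming the pitch periods and amplitudes have already been extracted per glottal cycle:

```python
import numpy as np

def jitter_shimmer(periods, amplitudes):
    """Relative local jitter and shimmer from per-cycle measurements.

    periods:    pitch-period durations T_i (seconds), one per glottal cycle.
    amplitudes: peak amplitudes A_i, one per cycle.
    Returns (jitter, shimmer): mean absolute consecutive-cycle difference,
    normalised by the mean value of the series.
    """
    T = np.asarray(periods, dtype=float)
    A = np.asarray(amplitudes, dtype=float)
    jitter = np.mean(np.abs(np.diff(T))) / np.mean(T)
    shimmer = np.mean(np.abs(np.diff(A))) / np.mean(A)
    return jitter, shimmer
```

A perfectly periodic voice with constant amplitude would score zero on both measures; pathological or highly irregular voices score higher.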
The promising performance of Deep Learning (DL) in speech recognition has motivated the use of DL in other speech technology applications such as speaker recognition. Given i-vectors as inputs, the authors proposed an impostor selection algorithm and a universal model adaptation process in a hybrid system based on Deep Belief Networks (DBNs) and Deep Neural Networks (DNNs) to discriminatively model each target speaker. In order to gain more insight into the behavior of DL techniques in both single- and multi-session speaker enrollment tasks, experiments are carried out in this paper in both scenarios. Additionally, the parameters of the global model, referred to as the universal DBN (UDBN), are normalized before adaptation. UDBN normalization facilitates training DNNs, specifically those with more than one hidden layer. Experiments are performed on the NIST SRE 2006 corpus. It is shown that the proposed impostor selection algorithm and UDBN adaptation process enhance the performance of conventional DNNs by 8-20% and 16-20% in terms of EER for the single- and multi-session tasks, respectively. In both scenarios, the proposed architectures outperform the baseline systems, obtaining up to a 17% reduction in EER.