Monaural multi-talker speech recognition using factorial speech processing models

Khademian, Mahdi; Homayounpour, Mohammad Mehdi

doi:10.1016/j.specom.2018.01.007

Cited by 13 publications

(15 citation statements)

References 17 publications

“…In the second scenario, two or more speaker voices are mixed together to produce a multi-talker speech utterance in which underlying source processes are the speakers' voices. In this case, previous achievements are surprising [6,7], even better than the results achieved manually by human listening (Fig. 8-left).…”

Section: Introduction (mentioning)

confidence: 53%

“…In fact, we assume that the initial solution to the system of equations with approximate joint-posteriors can be improved iteratively during the discriminative phase using marginal posteriors. Based on this assumption, we propose the following three steps for training a deep neural network for extracting joint-state posteriors: the generative phase, initializing joint-state layer weights, and fine-tuning the network. Fig.…”

Section: Joint-state Posterior Estimation Using Deep Neural Network (mentioning)

confidence: 99%

“…The joint-decoder objective is to find the most probable word sequences of the two speakers given features of the mixed-audio signal; we use all allowed information during the decoding in which the target speaker uses the word "white" in his command and the masker does not. The decoder solves the following optimization problem: Decoding of this task is done using a joint decoder implemented by the joint token passing algorithm proposed by [7]. The decoder reads HTK HParse generated wordnets generated based on the task grammar.…”

Section: Joint-decoding (mentioning)

confidence: 99%

Feature joint-state posterior estimation in factorial speech processing models using deep neural networks

Khademian

Homayounpour

2017

Computers & Electrical Engineering

Self Cite

This paper proposes a new method for calculating joint-state posteriors of mixed-audio features using deep neural networks to be used in factorial speech processing models. The joint-state posterior information is required in factorial models to perform joint-decoding. The novelty of this work is its architecture which enables the network to infer joint-state posteriors from the pairs of state posteriors of stereo features. This paper defines an objective function to solve an underdetermined system of equations, which is used by the network for extracting joint-state posteriors. It develops the required expressions for fine-tuning the network in a unified way. The experiments compare the proposed network decoding results to those of the vector Taylor series method and show 2.3% absolute performance improvement in the monaural speech separation and recognition challenge. This achievement is substantial when we consider the simplicity of joint-state posterior extraction provided by deep neural networks.

“…VidTIMIT database covers 40 speaker's (22 guys and 18 females) as well as subset of this database having 30 speaker's (15 guys and 15 female's speaker's) was utilized in work depicted in this article. Every speaker expresses eight distinct sentences before a camera fixated on substance of speaker, and the sentences in database are on the whole instances of persistent discourse booked from the standard VidTIMIT database as well as comprise an aggregate of the 210 expressions and the terms of 920 words, and the sound is recorded at the test rate of 64 KHz and 32 bits profundity; video is recorded at the rate of 24 outlines for each second [10].…”

Section: Database Techniques (mentioning)

confidence: 99%

An Assessment of the Visual Features Extractions for the Audio-Visual Speech Recognition

Mohmand

Perbandaran

2019

IJATCSE

Utilization of the visual data from the speakers mouth region has appeared to develop presentation of the Automatic Speech-Recognition ASR frameworks. This is the particularly valuable in nearness of the clamor, which uniform in the moderate structure seriously debases discourse acknowledgment execution of frameworks utilizing just sound data. Different arrangements of highlights separated from speakers mouth area have been utilized to improve the showing of an ASR framework. In such testing situations and have met various triumphs, and to the best of creators information, the impact of utilizing these methods on the acknowledgment execution based on the phonemes have not been examined at this point. This paper presents examination of phoneme acknowledgement execution utilising visual highlights removed from mouth area of-enthusiasm utilising discrete cosine transform and discrete wavelet transform. Therefore, new discrete cosine transform and discrete wavelet transform feature have likewise been extricated and contrasted and the recently utilized one. These highlights were utilized alongside sound highlights dependent on the Mel-Frequency Cepstral Coefficients MFCCs. This recent research will help in the choosing appropriate feature for various application as well as distinguish the restrictions of these techniques in the acknowledgment of the individual-phonemes.

“…Speech recognition technology has applications in different systems, such as automatic translation telephones, question and answer machines, and intelligence decisions support systems [1][2][3][4]. The mechanism of speech recognition lies in the separation of the words and the matching of patterns between the words in the speech and the words in a dictionary [5][6].…”

Section: Introduction (mentioning)

confidence: 99%

Reliability Modeling of Speech Recognition Tasks

Qiu

2018

IJPE

Speech recognition is becoming the key technology of man-machine interfaces in information technology. The application of voice technology has become a competitive and new high-tech industry. However, due to the big volume of vocabulary, continuous voice, and personalized accents, it is hard to make speech recognition completely accurate. In this paper, a reliability model is proposed to measure the performance of speech recognition. In particular, two types of task failures are suggested and an iterative approach is adopted. Numerical examples are proposed for illustrative purposes.
