2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2015.7178847
Speech acoustic modeling from raw multichannel waveforms

Abstract: Standard deep neural network-based acoustic models for automatic speech recognition (ASR) rely on hand-engineered input features, typically log-mel filterbank magnitudes. In this paper, we describe a convolutional neural network - deep neural network (CNN-DNN) acoustic model which takes raw multichannel waveforms as input, i.e. without any preceding feature extraction, and learns a similar feature representation through supervised training. By operating directly in the time domain, the network is able to take ad…
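The abstract describes replacing hand-engineered log-mel features with a representation learned directly from raw multichannel waveforms. A minimal NumPy sketch of such a time-domain front end is shown below; the function name, shapes, framing scheme, and log-magnitude compression are illustrative assumptions, not the paper's exact architecture (where the filters would be learned by backpropagation rather than fixed):

```python
import numpy as np

def raw_waveform_features(waveform, filters, hop=160, eps=1e-6):
    """Illustrative time-domain front end (hypothetical shapes).

    waveform: (channels, samples) raw multichannel audio
    filters:  (n_filters, channels, taps) time-domain kernels; in the
              actual model these would be learned by supervised training
    Returns (frames, n_filters) log-compressed frame-level features,
    loosely analogous to log-mel filterbank magnitudes.
    """
    n_filters, n_ch, taps = filters.shape
    assert waveform.shape[0] == n_ch
    samples = waveform.shape[1]
    feats = []
    for start in range(0, samples - taps + 1, hop):
        frame = waveform[:, start:start + taps]        # (channels, taps)
        # Each filter spans all channels: sum over channels and taps.
        act = np.einsum('fct,ct->f', filters, frame)
        feats.append(np.log(np.abs(act) + eps))        # magnitude + log compression
    return np.array(feats)

# Example: 1 s of 2-channel 16 kHz audio, 40 filters of 25 ms (400 taps)
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16000))
w = rng.standard_normal((40, 2, 400)) * 0.01
F = raw_waveform_features(x, w)   # shape (98, 40) with a 10 ms (160-sample) hop
```

Because the filters see all channels at once, spectral feature extraction and spatial combination of the microphones happen in a single learned operation.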

Cited by 190 publications (152 citation statements)
References 12 publications
“…In addition, our research also differs from [28] because we use single-label data for training and estimate multi-label data, while they used multi-label data from the training phase. Moreover, they focused on an end-to-end approach, which is promising in that using raw audio signals makes the system rely less on domain knowledge and preprocessing, but it usually shows slightly lower performance than spectral input such as the mel-spectrogram in recent papers [29], [30].…”
Section: Proliferation Of Deep Neural Network In
Citation type: mentioning
confidence: 99%
“…To perform the spatial feature learning jointly with the rest of the network, we propose to learn spatial features directly from multi-channel waveforms with an integrated architecture. The main idea is to learn time-domain filters spanning all signal channels to perform adaptive spatial filtering [13][14][15]. These filter parameters are jointly optimized with the encoder using Eq.…”
Section: Spatial Feature Learning
Citation type: mentioning
confidence: 99%
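The quoted idea of time-domain filters spanning all signal channels amounts to learned filter-and-sum spatial filtering. A minimal sketch, assuming a hypothetical `filter_and_sum` helper with fixed taps (in the cited work the per-channel taps are learned jointly with the rest of the network):

```python
import numpy as np

def filter_and_sum(x, h):
    """Time-domain filter-and-sum: one FIR filter per channel, outputs summed.

    x: (channels, samples) multichannel waveform
    h: (channels, taps) per-channel FIR taps; here fixed for illustration,
       in the cited approach learned jointly with the encoder.
    Returns the single-channel spatially filtered signal.
    """
    y = np.zeros(x.shape[1] + h.shape[1] - 1)
    for xc, hc in zip(x, h):
        y += np.convolve(xc, hc)  # per-channel FIR, then sum across channels
    return y
```

With a single unit tap per channel placed at channel-specific lags, this reduces to classical delay-and-sum beamforming: aligning the taps to a source's inter-channel delay sums its copies coherently while averaging down uncorrelated noise.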
“…When A^(k) = 0 and c^(k) = log π_{j,z} for all k, the new update (16) simplifies to the original variational update (9). Although the synchronous mean-field updates break the variational bound, we expect discriminative training to compensate for such approximations.…”
Section: MRF Extension Of The MCGMM
Citation type: mentioning
confidence: 99%
“…Swietojanski et al. [15] proposed a convolutional neural network (CNN) architecture for ASR using multichannel audio, where different microphone channels were pooled together. Hoshen et al. [16] used a CNN-DNN for acoustic modeling on raw time-domain multichannel audio. Nugraha et al. [17] achieved improved source separation for two-channel music recordings using alternating ReLU layers and channel estimation.…”
Section: Introduction And Relation To Prior Work
Citation type: mentioning
confidence: 99%