2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)
DOI: 10.1109/icsda.2017.8384470
Multiresolution CNN for reverberant speech recognition

Cited by 19 publications (8 citation statements)
References 7 publications
“…Inferring and synthesizing high-resolution images from observed low-resolution images is a typical ill-posed inverse problem. Existing algorithms can be divided into two categories according to their technical means: reconstruction-based methods and learning-based methods [23]. Reconstruction-based SR methods usually require sub-pixel alignment of the LR image sequence to obtain the motion offsets between the HR images, thereby constructing the spatial motion parameters of the observation model, and then apply different constraints to solve for the HR image.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
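For context, the observation model referred to in this excerpt is commonly written as follows; the notation below is standard in the super-resolution literature and is not taken from the cited work.

```latex
% Standard reconstruction-based SR observation model (illustrative notation):
% each LR frame y_k is a warped, blurred, downsampled and noisy view of the HR image x.
\[
  \mathbf{y}_k = \mathbf{D}\,\mathbf{B}\,\mathbf{M}_k\,\mathbf{x} + \mathbf{n}_k ,
  \qquad k = 1,\dots,K ,
\]
% where M_k holds the sub-pixel motion parameters of frame k, B the blur kernel,
% D the downsampling operator and n_k the noise; the HR image x is estimated by
% minimizing a data-fidelity term over all K frames plus a regularization
% (constraint) term.
```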
“…Then, using the preprocessed version of the audio signal, spectral or temporal features can be extracted via Mel-Frequency Cepstral Coefficients (MFCCs) [126–128] or the Discrete Wavelet Transform (DWT) [129–131]. The extracted features are passed through a prediction module that employs Hidden Markov Models (HMMs) [132,133], SVMs [134–136], RNNs [137–139], or CNNs [140–143], among others, to obtain the corresponding text in the desired language, constrained by a predefined vocabulary and grammar rules. More details about ASR can be found in the pertinent survey papers [124,144].…”
Section: Automatic Speech Recognition
Citation type: mentioning
confidence: 99%
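As a concrete illustration of the pipeline sketched in this excerpt (feature extraction followed by a neural prediction module), the snippet below computes MFCCs with librosa and feeds them to a toy CNN. The layer sizes, phone count, and file name are illustrative assumptions, not taken from the cited papers.

```python
# Minimal sketch of the ASR front end / back end split described above:
# MFCC features extracted with librosa, then a small CNN acoustic model.
import librosa
import torch
import torch.nn as nn

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Load audio and compute an (n_mfcc, frames) MFCC matrix."""
    y, sr = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

class TinyAcousticCNN(nn.Module):
    """Toy CNN mapping an MFCC 'image' to per-frame phone posteriors."""
    def __init__(self, n_mfcc=13, n_phones=40):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.out = nn.Linear(32 * n_mfcc, n_phones)

    def forward(self, mfcc):                      # mfcc: (batch, n_mfcc, frames)
        x = self.conv(mfcc.unsqueeze(1))          # (batch, 32, n_mfcc, frames)
        x = x.permute(0, 3, 1, 2).flatten(2)      # (batch, frames, 32 * n_mfcc)
        return self.out(x)                        # (batch, frames, n_phones)

# Usage (assumes a local file "utt.wav"):
# feats = torch.from_numpy(extract_mfcc("utt.wav")).float().unsqueeze(0)
# posteriors = TinyAcousticCNN()(feats)
```

In a full recognizer these per-frame scores would be decoded against the predefined vocabulary and grammar mentioned in the excerpt; that step is omitted here.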
“…In a picture, multiple resolutions can help recognize objects at different scales [21], [22], but the desired benefit of using more than one resolution in audio applications is to exploit different details of the feature maps at each resolution. For instance, the use of two different resolutions has been proposed to improve automatic speech recognition in reverberant scenarios [23], in which a wide-context window gives information about the acoustic environment and reverberation, whereas a narrow-context window provides finer detail about the content of the speech signal. This is possible because of the trade-off between time resolution and frequency resolution in the extraction of Fast Fourier Transform-based audio features [24] such as the mel-spectrogram, which is also the basis for the analysis proposed in this work.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
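The time-frequency trade-off mentioned in this excerpt can be made concrete by computing two mel-spectrograms of the same signal with different analysis windows. The sketch below is one plausible reading of such a dual-resolution front end; the window lengths and other parameters are chosen arbitrarily, not taken from [23].

```python
# Sketch of a two-resolution mel-spectrogram front end: a long analysis window
# (finer frequency detail, wider temporal context) and a short one (finer time
# detail). Window lengths are illustrative only.
import librosa
import numpy as np

def dual_resolution_mels(y, sr=16000, n_mels=40):
    """Return two log-mel spectrograms of the same signal at different resolutions."""
    wide = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=160, n_mels=n_mels)   # ~64 ms window
    narrow = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=256, hop_length=160, n_mels=n_mels)    # ~16 ms window
    # With the same hop length (and librosa's default centre padding) both have
    # the same number of frames, so they can be stacked as parallel input
    # channels or fed to parallel branches of a CNN.
    return np.log(wide + 1e-10), np.log(narrow + 1e-10)

# Usage:
# y, sr = librosa.load("utt.wav", sr=16000)
# mel_wide, mel_narrow = dual_resolution_mels(y, sr)
```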