Shifted-Delta MLP Features for Spoken Language Recognition

Wang, Haipeng; Leung, Cheung-Chi; Lee, Tan; Ma, Bin; Li, Haizhou

doi:10.1109/lsp.2012.2227312

Cited by 40 publications

(27 citation statements)

References 11 publications


“…Instead of speech based signals [10], [15], we propose text based comments as a new signal for audio LID of the videos. LID for text data has a wide array of applications ranging across Machine Translation for online resources [11] and building linguistic resources from the web [1].…”

Section: Related Research (mentioning)

confidence: 99%

Text based user comments as a signal for automatic language identification of online videos

Doğruöz, Ponomareva, Girgin, et al. 2017

Proceedings of the 19th ACM International Conference on Multimodal Interaction


Identifying the audio language of online videos is crucial for industrial multi-media applications. Automatic speech recognition systems can potentially detect the language of the audio. However, such systems are not available for all languages. Moreover, background noise, music and multi-party conversations make audio language identification hard. Instead, we utilize text based user comments as a new signal to identify audio language of YouTube videos. First, we detect the language of the text based comments. Augmenting this information with video meta-data features, we predict the language of the videos with an accuracy of 97% on a set of publicly available videos. The subject matter discussed in this research is patent pending. CCS Concepts: Information systems → Multilingual and cross-lingual retrieval.


“…Following the results reported in [14] and [17], where the accuracy of a LID system was improved thanks to the dimensionality reduction of the PLLR features using PCA, for our experiments we also tested different dimensionality reduction techniques such as HLDA [18]. In this case, the dimensionality reduction was applied for the baseline PLLR features as well as for the state-based PLLR features.…”

Section: Dimensionality Reduction Techniques (mentioning)

confidence: 99%

“…We apply the windowing concepts from SDC to the PLLR features, obtaining what we call Shifted Delta PLLR Coefficients (SDPC) and then we apply a PCA projection as in [17] because in this case dimensionality reduction is a must with the high dimensionality vectors that we have to manage (for instance, 177 states in the Hungarian recognizer with a SDC 1_5_3 will result in a vector of dimension 708). We compared using first the PCA reduction and then stacking the SDPC or first stacking the SDPC and then applying PCA.…”

Section: Modification Using SDPC Parameters (mentioning)

confidence: 99%

Extended phone log-likelihood ratio features and acoustic-based i-vectors for language recognition

D’Haro, Córdoba, Salamea, et al. 2014

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


This paper presents new techniques that improve on the primary system our group submitted to the Albayzin 2012 LRE competition, where the use of any additional corpora for training or optimizing the models was forbidden. We add a phonotactic subsystem based on phone log-likelihood ratio (PLLR) features extracted from different phonotactic recognizers, which improves the accuracy of the system by 21.4% in terms of C_avg (we also present results for F_act, the official metric during the evaluation). We show that using these features at the phone state level provides significant improvements when combined with dimensionality reduction techniques, especially PCA. We also experimented with alternative SDC-like configurations on these PLLR features, which gave additional improvements. Finally, we describe some modifications to the MFCC-based acoustic i-vector system that contributed further improvements. The final fused system outperformed the baseline by 27.4% in C_avg.
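
The SDC-style windowing that the citation statement above applies to PLLR features can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `shifted_delta`, the reading of "1_5_3" as delta spread d=1, block shift P=5 and k=3 blocks, the choice to keep the static frame, and the edge clamping are all assumptions made for illustration. Under those assumptions a 177-dimensional frame grows to 177 × (3 + 1) = 708 values, which matches the dimension quoted in the statement.

```python
from typing import List

Frame = List[float]


def shifted_delta(frames: List[Frame], d: int = 1, P: int = 5, k: int = 3,
                  keep_static: bool = True) -> List[Frame]:
    """Stack k delta blocks per frame (hypothetical helper).

    Block i at time t is x[t + i*P + d] - x[t + i*P - d].
    """
    T = len(frames)

    def at(t: int) -> Frame:
        # Assumption: out-of-range indices reuse the first or last frame.
        return frames[min(max(t, 0), T - 1)]

    out = []
    for t in range(T):
        # Assumption: the static frame is kept in front of the delta blocks.
        vec = list(frames[t]) if keep_static else []
        for i in range(k):
            ahead, behind = at(t + i * P + d), at(t + i * P - d)
            vec.extend(a - b for a, b in zip(ahead, behind))
        out.append(vec)
    return out


# Toy input: 20 frames of 177 values, the state count quoted for the
# Hungarian recognizer. The values themselves are synthetic.
frames = [[float(t + j) for j in range(177)] for t in range(20)]
sdpc = shifted_delta(frames, d=1, P=5, k=3)
print(len(sdpc), len(sdpc[0]))  # 20 708
```

The statement also compares applying PCA before or after this stacking; the sketch covers only the stacking step.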


“…Then the posterior features were transformed by taking logarithm, PCA transformation, and MVN. Quantitative analysis in [6] has shown that the Log-MLP features are more robust than spectral features, and are suitable for Gaussian modeling.…”

Section: Tokenizer Implementation (mentioning)

confidence: 99%

“…This framework utilizes a tokenizer to convert both the query examples and the test utterances into posteriorgrams, and matches the query posteriorgrams with the test posteriorgrams using dynamic time warping (DTW), which has been widely used in template-based speech recognition. Posteriorgram representation is believed to be more robust and more informative than spectral features [5,6].…”

Section: Introduction (mentioning)

confidence: 99%

Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection

Wang, Lee, Leung, et al. 2013

2013 IEEE International Conference on Acoustics, Speech and Signal Processing

Self Cite


Recently the posteriorgram-based template matching framework has been successfully applied to query-by-example spoken term detection tasks for low-resource languages. This framework employs a tokenizer to derive posteriorgrams, and applies dynamic time warping (DTW) to the posteriorgrams to locate the possible occurrences of a query term. Based on this framework, we propose to improve the detection performance by using multiple tokenizers with DTW distance matrix combination. The proposed approach uses multiple tokenizers in parallel as the front-end to generate different posteriorgram representations, and combines the distance matrices of the different posteriorgrams into a single matrix. DTW detection is then applied to the combined distance matrix. Lastly score post-processing techniques including pseudo-relevance feedback and score normalization are used for further improvement. Experiments were conducted on the spoken web search datasets of MediaEval 2011 and MediaEval 2012. Experimental results show that combining multiple tokenizers significantly outperforms the best single tokenizer, and that the DTW matrix combination method consistently outperforms the score combination method when more than three tokenizers are involved. Score post-processing techniques show further gains on top of using multiple tokenizers.
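
The idea of combining per-tokenizer distance matrices before running DTW, as the abstract above describes, can be sketched roughly as below. Several choices here are assumptions rather than details taken from the paper: the frame distance (negative log of the inner product of two posterior vectors), the combination rule (a plain element-wise average), and the use of full-sequence DTW instead of the subsequence search that spoken term detection would need. All function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def neg_log_dot(query: Matrix, test: Matrix) -> Matrix:
    """Frame-pair distance -log(q . t) between two posteriorgrams."""
    eps = 1e-10  # floor to avoid log(0)
    return [[-math.log(max(sum(a * b for a, b in zip(q, t)), eps))
             for t in test] for q in query]


def combine(mats: List[Matrix]) -> Matrix:
    """Average same-shaped distance matrices element by element (assumed rule)."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)]
            for i in range(rows)]


def dtw_cost(dist: Matrix) -> float:
    """Accumulated cost of the best full-alignment path through dist."""
    rows, cols = len(dist), len(dist[0])
    acc = [[math.inf] * (cols + 1) for _ in range(rows + 1)]
    acc[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            acc[i][j] = dist[i - 1][j - 1] + min(
                acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[rows][cols]


# Two toy tokenizers with different posterior dimensions; the query has
# 2 frames and the test utterance 3 frames under both tokenizers.
query_a = [[0.9, 0.1], [0.1, 0.9]]
test_a = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
query_b = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
test_b = [[0.8, 0.1, 0.1], [0.4, 0.2, 0.4], [0.1, 0.1, 0.8]]

combined = combine([neg_log_dot(query_a, test_a),
                    neg_log_dot(query_b, test_b)])
cost = dtw_cost(combined)
print(cost)
```

Because the distance matrices share the query-by-test frame grid regardless of each tokenizer's posterior dimension, they can be merged before alignment, whereas score combination would run DTW once per tokenizer and merge the resulting scores.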
