Interspeech 2016
DOI: 10.21437/interspeech.2016-1312

Multimodal Fusion of Multirate Acoustic, Prosodic, and Lexical Speaker Characteristics for Native Language Identification

Abstract: Native language identification from the acoustic signals of L2 speakers can be useful in a range of applications, such as informing automatic speech recognition (ASR), speaker recognition, and speech biometrics. In this paper we follow a multistream, multi-rate approach to native language identification in feature extraction, classification, and fusion. On the feature front we employ acoustic features such as MFCCs and PLP features at different time scales and under different transformations; we evaluate speaker no…
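The abstract describes fusing multiple feature streams (e.g. MFCC- and PLP-based classifiers) for native language identification. A minimal sketch of one common fusion strategy, score-level (late) fusion by weighted sum of per-stream log-posteriors, is shown below; the stream names, weights, and scores are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of score-level (late) fusion across feature streams.
# Stream names, weights, and scores are illustrative, not the authors' values.
import math

def fuse_streams(stream_log_probs, weights):
    """Weighted sum of per-stream log-posteriors for each candidate L1 class."""
    classes = stream_log_probs[next(iter(stream_log_probs))].keys()
    return {c: sum(weights[s] * stream_log_probs[s][c] for s in stream_log_probs)
            for c in classes}

def predict(stream_log_probs, weights):
    """Return the class with the highest fused score."""
    fused = fuse_streams(stream_log_probs, weights)
    return max(fused, key=fused.get)

# Toy example: two streams (an MFCC-based and a PLP-based classifier),
# three candidate native languages.
scores = {
    "mfcc": {"HIN": math.log(0.5), "TEL": math.log(0.3), "ARA": math.log(0.2)},
    "plp":  {"HIN": math.log(0.4), "TEL": math.log(0.4), "ARA": math.log(0.2)},
}
weights = {"mfcc": 0.6, "plp": 0.4}
print(predict(scores, weights))  # "HIN"
```

Stream weights in such schemes are typically tuned on a development set; the paper additionally fuses streams operating at different time scales, which this sketch does not model.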

Cited by 11 publications (9 citation statements)
References 19 publications (32 reference statements)
“…The most confusable classifications are between Hindi and Telugu, which are both languages used in India. Similar observations were found in the systems reported in [11,12], although with a lower count.…”
Section: Results and Analysis (supporting)
confidence: 89%
“…It indicates that a 600-dim i-vector extracted from the posterior supervector of a GMM with 1,024 Gaussian components achieves the best performance. These results are similar to those obtained by the Challenge systems [11,12], i.e., approximately 76% for both UAR and Acc using only the NNSE corpus provided by the ComParE organizers. Thereafter, the dimension of the i-vector is fixed to 600.…”
Section: Results and Analysis (supporting)
confidence: 86%
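The citation above refers to a 600-dimensional i-vector derived from a GMM posterior supervector (1,024 components). A scaled-down sketch of the dimensionality reduction behind this, projecting a stacked supervector into a low-dimensional total-variability subspace, is below; the sizes are toy values and the matrix T is random rather than trained, so this only illustrates the dimension change, not real i-vector estimation.

```python
# Scaled-down sketch of i-vector dimensionality reduction (x ~ m + T w).
# Real systems: C = 1024 Gaussian components, F ~ 39 acoustic dims,
# supervector dim = C * F (tens of thousands), i-vector dim = 600.
# Toy sizes here; T is random, not a trained total-variability matrix.
import random

C, F = 8, 4           # toy GMM component count and feature dimension
SUPERVEC_DIM = C * F  # stacked per-component statistics
IVEC_DIM = 3          # toy total-variability subspace size

random.seed(0)
T = [[random.gauss(0, 1) for _ in range((SUPERVEC_DIM))]
     for _ in range(IVEC_DIM)]  # each row projects supervector -> one i-vector dim
supervector = [random.gauss(0, 1) for _ in range(SUPERVEC_DIM)]

# In practice w is the posterior mean of a latent factor given the utterance's
# Baum-Welch statistics; here we apply a plain projection to show the shapes.
ivector = [sum(t * s for t, s in zip(row, supervector)) for row in T]

print(len(supervector), "->", len(ivector))  # 32 -> 3
```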
“…For example, Grèzes et al. calculated the ratio of speaker overlap to aid conflict intensity estimation [13]; Montacié and Caraty detected temporal events (e.g. speech onset latency, event start time-codes, pause and phone segments) to detect cognitive load [14]; several authors extracted phone posterior-based attributes to determine the degree of nativeness or the native language of the speaker [15,16,17]; while Huckvale and Beke developed specific spectral-based attributes to detect whether the speaker has a cold [18]. Of course, some kind of fusion of the general and the task-specific attributes might also prove to be beneficial.…”
Section: Introduction (mentioning)
confidence: 99%