Neural network-based non-intrusive speech quality assessment using attention pooling function

Liu, Miao; Wang, Jing; Yi, Weiming; Liu, Fang

doi:10.1186/s13636-021-00209-4

Cited by 3 publications

(3 citation statements)

References 23 publications


“…Liu et al [25] has presented non-intrusive speech quality assessment depend on DNN for speech communication. Here, describes DL-depend strategy uses large-scale intrusive simulated data to increase accurateness, generalizability of non-intrusive techniques.…”

Section: Literature Review (mentioning)

confidence: 99%

English Speaking Assessment Algorithm Based on Deep Learning

Xiaoli Hu

2024

jes


The study concerns how English as a Foreign Language (EFL) students perform when speaking in public. In an increasingly globalized world, effective public speaking is critical, but EFL students struggle with it, and despite the importance of qualities such as eye contact and speech pauses, there is presently no objective examination of such elements. Summative assessment has historically been the predominant form of evaluation in college English speaking assessments, and exam-centric teaching has a considerable negative effect on foreign language training. In this research work, an English Speaking Assessment Algorithm Based on Deep Learning (ESA-NEGCN-NBOA) is proposed. Initially, input video data are gathered from the multiple video dataset (MVD). The input video data are then pre-processed using Deep Attentional Guided Image Filtering (DAGIF) to remove signal-dependent noise, compensate for missing pixels in the regions, and enhance the video data. The pre-processed data are passed to feature extraction using the New General Double Integral Transform (NGDIT), which extracts significant attributes such as mel-frequency cepstral coefficients, energy, speech rate, and pitch. NEGCN is then proposed to improve students' spoken English performance by assessing the English speakers. In general, NEGCN does not include any optimization approach for determining the optimal parameters needed for exact assessment. Therefore, NBOA is proposed to optimize the weight parameters of NEGCN so that it assesses English speaking precisely. Performance measures such as accuracy, assessment error, evaluation time, pretest, and posttest are examined when the proposed ESA-NEGCN-NBOA method is put into practice. The proposed ESA-NEGCN-NBOA method attains 21.36%, 23.42% and 19.29% higher accuracy, 23.36%, 18.42% and 28.27% lower evaluation error, and 20.36%, 27.42% and 28.17% less evaluation time compared with existing techniques, namely an innovative strategy towards oral English assessment utilizing machine learning, data mining, and blockchain methods (IST-OEA-ML), a machine learning assessment system for spoken English depending on linear predictive coding (AS-SE-LPC-ML), and multimodal transfer learning for oral presentation assessment (MM-TL-OPA), respectively.


“…The work in [18]-[57] describes ML-based approaches to NR quality estimation. Some of these NR tools produce estimates of subjective test scores that report speech or sound quality mean opinion score (MOS) [18]-[20], [25]-[28], [31], [36], [40], [42], [43], [45], [49], [50], [57], naturalness [29], [35], [37], listening effort [24], noise intrusiveness [50], and speech intelligibility [21], [33]. The non-intrusive speech quality assessment model called NISQA [53] uses log-mel-spectrograms to produce estimates of subjective speech quality as well as four constituent dimensions: noisiness, coloration, discontinuity, and loudness.…”

Section: A Existing Machine Learning Approaches (mentioning)

confidence: 99%

Wideband Audio Waveform Evaluation Networks: Efficient, Accurate Estimation of Speech Qualities

Catellier, Voran

2023

IEEE Access


Speech quality and speech intelligibility can vary dramatically across the wide range of currently available telecommunications systems, devices, and operating environments. This creates a strong demand for efficient real-time measurements of quality and intelligibility. Wideband Audio Waveform Evaluation Networks (WAWEnets) are convolutional neural networks that operate directly on wideband audio waveforms in order to produce evaluations of those waveforms. In the present work these evaluations give qualities of telecommunications speech (e.g., noisiness, intelligibility, overall speech quality). WAWEnets are no-reference networks because they do not require “reference” (original or undistorted) versions of the waveforms they evaluate. Our initial WAWEnet publication introduced four WAWEnets and each emulated the output of an established full-reference speech quality or intelligibility estimation algorithm. We have updated the WAWEnet architecture to be more efficient and effective. Here we present a single WAWEnet that closely tracks seven different quality and intelligibility values with per-segment correlations in the range of 0.92 to 0.96. We create a second network that additionally tracks four subjective speech quality dimensions. We offer a third network that focuses on just subjective quality scores and achieves a per-segment correlation of 0.97. The performance of our WAWEnet architecture compares favorably to models with orders-of-magnitude more parameters and computational complexity. This work has leveraged 334 hours of speech in 13 languages, over two million full-reference target values and over 93,000 subjective mean opinion scores. We also interpret the operation of WAWEnets and identify the key to their operation using the language of signal processing: ReLUs strategically move spectral information from non-DC components into the DC component. The DC values of 96 output signals define a vector in a 96-D latent space and this vector is then mapped to a quality or intelligibility value for the input waveform.
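The abstract's remark that ReLUs move spectral information into the DC component can be illustrated with a small numerical sketch. This is not the WAWEnet code: the test tone, the sample rate, and the helper names `dc_component` and `relu` are assumptions chosen purely for illustration. A zero-mean sinusoid has essentially no DC content, but after a ReLU its mean is roughly the amplitude divided by π, so the DC value now carries information about the level of the non-DC component.

```python
import numpy as np


def dc_component(signal):
    """Return the DC (zero-frequency) component, i.e. the mean of the signal."""
    return float(np.mean(signal))


def relu(signal):
    """Element-wise rectified linear unit."""
    return np.maximum(signal, 0.0)


# Hypothetical test signal: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
amplitude = 0.5
t = np.arange(sample_rate) / sample_rate
tone = amplitude * np.sin(2 * np.pi * 440 * t)  # whole number of periods

dc_before = dc_component(tone)       # close to zero: no DC content
dc_after = dc_component(relu(tone))  # close to amplitude / pi

print(f"DC before ReLU: {dc_before:.6f}")
print(f"DC after ReLU:  {dc_after:.6f}")
```

In this toy case the DC value after the ReLU scales with the tone's amplitude, which is one simple way to read the claim that a network can summarize spectral content through the DC values of its output signals.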


“…As machine learning (ML) has become more powerful and accessible, numerous research groups have sought to apply ML to develop NR tools [17]-[50]. Some of these NR tools produce estimates of subjective test scores that report speech or sound quality mean opinion score (MOS) [17]-[19], [24]-[27], [30], [35], [38], [40], [41], [46], [47], naturalness [28], [34], [36], listening effort [23], noise intrusiveness [47], and speech intelligibility [20], [32]. The non-intrusive speech quality assessment model called NISQA [50] produces estimates of subjective speech quality as well as four constituent dimensions: noisiness, coloration, discontinuity, and loudness.…”

Section: A Existing Machine Learning Approaches (mentioning)

confidence: 99%

Wideband Audio Waveform Evaluation Networks: Efficient, Accurate Estimation of Speech Qualities

Catellier, Voran

2022

Preprint


Wideband Audio Waveform Evaluation Networks (WAWEnets) are convolutional neural networks that operate directly on wideband audio waveforms in order to produce evaluations of those waveforms. In the present work these evaluations give qualities of telecommunications speech (e.g., noisiness, intelligibility, overall speech quality). WAWEnets are no-reference networks because they do not require “reference” (original or undistorted) versions of the waveforms they evaluate. Our initial WAWEnet publication introduced four WAWEnets and each emulated the output of an established full-reference speech quality or intelligibility estimation algorithm. We have updated the WAWEnet architecture to be more efficient and effective. Here we present a single WAWEnet that closely tracks seven different quality and intelligibility values. We create a second network that additionally tracks four subjective speech quality dimensions. We offer a third network that focuses on just subjective quality scores and achieves very high levels of agreement. This work has leveraged 334 hours of speech in 13 languages, over two million full-reference target values and over 93,000 subjective mean opinion scores. We also interpret the operation of WAWEnets and identify the key to their operation using the language of signal processing: ReLUs strategically move spectral information from non-DC components into the DC component. The DC values of 96 output signals define a vector in a 96-D latent space and this vector is then mapped to a quality or intelligibility value for the input waveform.
