Phonetic segmentation plays a key role in developing various speech applications. In this work, we propose to use various features for the automatic phonetic segmentation task via forced Viterbi alignment and compare their effectiveness. We propose novel multiscale fractal dimension-based features concatenated with Mel-Frequency Cepstral Coefficients (MFCC). The novel features are expected to capture additional nonlinearities in speech production, which should improve the performance of the segmentation task. However, evaluating the effectiveness of these segmentation algorithms requires accurate manually labeled phoneme-level data, which is not available for low-resource languages such as Gujarati (one of the official languages of India). In order to measure the effectiveness of the various segmentation algorithms, an HMM-based speech synthesis system (HTS) for Gujarati has been built. From the subjective and objective evaluations, it is observed that the FD-based features work moderately better for segmentation than other state-of-the-art features such as MFCC, Perceptual Linear Prediction Cepstral Coefficients (PLP-CC), Cochlear Filter Cepstral Coefficients (CFCC), and RelAtive SpecTrAl (RASTA)-based PLP-CC. The Mean Opinion Score (MOS) and the Degraded-MOS, which are measures of naturalness, indicate an improvement of 9.69% with the proposed features over the MFCC-based features (which are found to be the best among the other features).
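The abstract does not specify which fractal dimension estimator is used, so as a purely illustrative sketch, the following computes Higuchi's fractal dimension of a signal frame; a multiscale FD feature vector could be built by applying such an estimator over windows at several scales and concatenating the result with the MFCC vector. The function name and parameters are assumptions, not the paper's implementation.

```python
import numpy as np

def higuchi_fd(x, k_max=8):
    """Estimate the fractal dimension of a 1-D signal via Higuchi's method.

    Hypothetical illustration of an FD-based feature; the paper's exact
    multiscale estimator may differ.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    ks = np.arange(1, k_max + 1)
    lengths = []
    for k in ks:
        lm = []
        for m in range(k):
            idx = np.arange(m, n, k)
            if len(idx) < 2:
                continue
            # curve length of the subsampled series, normalized for the
            # number of retained samples and the step size k
            s = np.abs(np.diff(x[idx])).sum()
            lm.append(s * (n - 1) / ((len(idx) - 1) * k * k))
        lengths.append(np.mean(lm))
    # L(k) ~ k^(-D): the slope of log L(k) against log(1/k) estimates D
    return np.polyfit(np.log(1.0 / ks), np.log(lengths), 1)[0]
```

A smooth ramp yields an FD near 1, while white noise yields an FD near 2, which is the extra "roughness" information such features add on top of spectral-envelope descriptors like MFCC.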
In recent times, BERT-based transformer models have become an inseparable part of the 'tech stack' of text processing models. Similar progress is being observed in the speech domain, with a multitude of models achieving state-of-the-art results by using audio transformer models to encode speech. This begs the question: what are these audio transformer models learning? Moreover, although the standard methodology is to choose the last-layer embedding for any downstream task, is it the optimal choice? We try to answer these questions for two recent audio transformer models, Mockingjay and wav2vec 2.0. We compare them on a comprehensive set of language delivery and structure features, including audio, fluency, and pronunciation features. Additionally, we probe the audio models' understanding of textual surface, syntax, and semantic features and compare them to BERT. We do this over exhaustive settings for native, non-native, synthetic, read, and spontaneous speech datasets.
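The layer-choice question above is typically investigated with linear probes: fit a simple classifier on each layer's embeddings and see which layer predicts the property best. A minimal numpy sketch, assuming per-layer embedding matrices are already extracted (the probe design here is illustrative, not the papers' exact setup):

```python
import numpy as np

def probe_layers(layer_embs, labels, train_frac=0.8):
    """Fit a linear probe on each layer's embeddings and return held-out
    accuracy per layer, so the most informative layer (not necessarily
    the last) can be selected for a downstream task.

    layer_embs: list of (n_samples, dim) arrays, one per transformer layer.
    labels: (n_samples,) array of 0/1 labels for the probed property.
    """
    n = len(labels)
    split = int(train_frac * n)
    y = np.where(labels[:split] == 1, 1.0, -1.0)
    accs = []
    for X in layer_embs:
        Xtr, Xte = X[:split], X[split:]
        # ridge-regularized least-squares probe (closed form)
        lam = 1e-2
        w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(X.shape[1]), Xtr.T @ y)
        pred = (Xte @ w > 0).astype(int)
        accs.append(float((pred == labels[split:]).mean()))
    return accs
```

Running `np.argmax(probe_layers(...))` then identifies the layer that encodes the probed feature most linearly, which is how one can find that an intermediate layer beats the last one.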
Bioassay data classification is an important task in drug discovery. However, the data used in classification is highly imbalanced, leading to inaccurate classification of the minority class. We propose a novel approach for classification in which we train separate models using different features derived by training stacked autoencoders (SAE). Experiments are performed on 7 bioassay datasets, in which each data file consists of feature descriptors for every compound, along with a class label indicating whether the compound is active or inactive. We first perform data cleaning using the borderline synthetic minority oversampling technique (SMOTE) followed by removal of Tomek links, and then learn different features hierarchically based on the cleaned data or feature vectors. We then train separate cost-sensitive feed-forward neural network (FNN) classifiers using the hierarchical features in order to obtain the final classification. To increase the True Positive Rate (TPR), a test sample is labeled as active if at least one classifier predicts it as active. In this paper, we demonstrate that data cleaning and learning separate classifiers improve the TPR and F1 score compared to other machine learning approaches. To the best of our knowledge, this is the first attempt to use SAE and FNN for classifying bioassay data.
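The OR-rule aggregation described above (active if at least one classifier says active) is simple to state in code. A minimal sketch of just the aggregation and the TPR metric; the classifiers themselves (cost-sensitive FNNs on SAE features) are not reproduced, and the function names are illustrative:

```python
import numpy as np

def or_rule_label(predictions):
    """Aggregate binary predictions from several classifiers: a sample is
    labeled active (1) if at least one classifier predicts active.

    predictions: (n_classifiers, n_samples) array of 0/1 labels.
    """
    return (np.asarray(predictions).max(axis=0) > 0).astype(int)

def true_positive_rate(y_true, y_pred):
    """Fraction of truly active samples that are predicted active."""
    pos = np.asarray(y_true) == 1
    return float((np.asarray(y_pred)[pos] == 1).mean()) if pos.any() else 0.0
```

By construction the OR rule can only add positive predictions, so its TPR is at least that of the best individual classifier; the trade-off is a possible rise in false positives, which the cost-sensitive training is meant to keep in check.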
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.