Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset

Zhang, Yang; Chen, Yifan; Li, Luo; Yang, Runyan; Lingxuan, Ye; Cheng, Gaofeng; Xu, Jianzhong; Jin, Yaohui; Zhang, Qingqing; Zhang, Pengyuan; Xie, Lihua; Yan, Yonghong

doi:10.48550/arxiv.2203.16844

Cited by 2 publications

(3 citation statements)

References 31 publications

(34 reference statements)

“…• R05: MAGICDATA Mandarin Chinese Conversational Speech Corpus [22]. This dataset contains 180 hours of conversational speech from 633 speakers, recorded by mobile phone in a quiet environment.…”

Section: Design Policy (mentioning)

confidence: 99%

FAD: A Chinese Dataset for Fake Audio Detection

Ma, Yi, Wang et al. 2022

Preprint

Fake audio detection is a growing concern, and several relevant datasets have been designed for research. However, there is no standard public Chinese dataset under additive noise conditions. In this paper, we aim to fill this gap and design a Chinese fake audio detection dataset (FAD) for studying more generalized detection methods. Twelve mainstream speech generation techniques are used to generate fake audio. To simulate real-life scenarios, three noise datasets are selected for adding noise at five different signal-to-noise ratios. The FAD dataset can be used not only for fake audio detection but also for identifying the algorithms behind fake utterances for audio forensics. Baseline results are presented with analysis. The results show that fake audio detection methods with generalization remain challenging. The FAD dataset is publicly available (footnote 1).

Recently, fake audio detection is no longer limited to ASV systems but has also started to focus on real-life scenarios. The first audio deep synthesis detection challenge [12] (ADD 2022) focuses on challenging situations, including low-quality fake audio and partially fake audio detection. More datasets are constructed with deep-learning speech techniques, such as FoR [13], WaveFake [14], and HAD [15].

These datasets facilitate progress in fake audio detection research. However, in practical applications, audio on social media comes in many languages with noisy backgrounds, and the type of fake audio may be unknown to the model. These factors greatly influence the performance of detection models. The generalization of detection models is still an urgent problem to address. Specifically, generalization includes generalization to unknown types and robustness to noise and other factors. Most datasets focus on the evaluation of the former aspect,

Footnote 1: https://zenodo.org/record/6635521#.Ysjq4nZBw2x
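The abstract above mentions adding noise at several signal-to-noise ratios. As a rough illustration only, the following sketch shows one common way to mix a noise signal into speech at a target SNR by scaling the noise power. It is not the FAD authors' pipeline: the function names `mix_at_snr` and `measured_snr_db`, the synthetic signals, and the five SNR values in the loop are all assumptions made for this example.

```python
import math
import random


def _power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the target SNR in dB.

    Hypothetical helper for illustration; not taken from the FAD paper.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = _power(speech)
    p_noise = _power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both have non-zero power")
    # Solve 10 * log10(p_speech / (gain**2 * p_noise)) = snr_db for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixed):
    """Recover the SNR of a mixture, given the clean speech it was built from."""
    residual = [m - s for s, m in zip(speech, mixed)]
    return 10 * math.log10(_power(speech) / _power(residual))


# Synthetic stand-ins for real audio: a 440 Hz tone and Gaussian noise at 16 kHz.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# Assumed SNR grid; the abstract says five levels but does not list them.
for target in (0, 5, 10, 15, 20):
    mixed = mix_at_snr(speech, noise, target)
    print(target, round(measured_snr_db(speech, mixed), 3))
```

Scaling the noise rather than the speech keeps the speech level fixed across conditions, which makes the mixtures easier to compare; a real pipeline would also need to handle resampling, clipping, and noise segments shorter than the utterance.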


“…We evaluate our proposed method on two Chinese conversation datasets: MagicData-RAMC [44] and HKUST [45]. The HKUST dataset comprises telephone recordings of conversations, while the MagicData-RAMC dataset consists of microphone recordings of conversations captured in a quiet environment.…”

Section: A Dataset (mentioning)

confidence: 99%

“…1) MagicData-RAMC: The MagicData-RAMC dataset [44] comprises 180 hours of Chinese conversational speech data, distributed as 150 hours for the training set, 20 hours for the development set, and 10 hours for the test set. The dataset features conversations from 663 distinct speakers.…”

Section: A Dataset (mentioning)

confidence: 99%

Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR

Zhang, Sining, Xie et al. 2022

Interspeech 2022

Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversation-level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss, achieving relative accuracy improvements of 8.8% and 23% on the Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.
