Toward a Web-based Speech Corpus for Algerian Dialectal Arabic Varieties

Bougrine, Soumia; Chorana, Aicha; Lakhdari, Abdallah; Cherroun, Hadda

doi:10.18653/v1/w17-1317

Cited by 14 publications

(14 citation statements)

References 11 publications


“…Bougrine et al [9] introduced a preliminary version of KALAM'DZ; the corpus is limited to Web-based corpus of 8 Algerian dialects crawled from some Algerian TV and YouTube channels. The corpus encompasses eight major Algerian Arabic sub-dialects with 4881 speakers and more than 104.4 hours segmented to utterances of at least 6 sec.…”

Section: Related Work (mentioning)

confidence: 99%

“…With the recent development of data-driven and deep learning-oriented approaches, one of the most important aspects to consider is to have access to a substantial volume of representative data. Indeed, the notion of "More data is better data" was born with the success of automatic recognition [9] where important amounts of training data are required [18]. The performance of systems depends mainly on their training corpus characteristics, which makes them an integral part of recognition systems [27].…”

Section: Introduction (mentioning)

confidence: 99%


The voice as a material clue: a new forensic Algerian Corpus

Zergat, Selouani, Amrouche et al. 2023. Multimed Tools Appl


Dialects have received bigger interest in recent years as they are increasingly used on the web and social media. Because Algerian Arabic dialects suffer from a lack of appropriate speech corpora for speech recognition, a rich dialect corpus is needed to approach Algerian Accent recognition. The latter remains a key feature in the field of Forensic Voice Comparison (FVC) systems. This paper presents a new large-scale forensic Algerian speech corpus called Sawt El-Djazaïr. An important criterion in dealing with forensic corpora is the presence of session variability. For this purpose, we collected celebrity recordings in various regions of Algeria, from different social networks, in various scenarios, and at different times. In addition, we also recorded 87 participants using cellular calls and voice over IP (VoIP) applications including Viber, WhatsApp, and Google Meet. The corpus of approximately 50 hours covers various speech topics and is spoken in twelve Algerian sub-dialects. The design guidelines of the proposed corpus are described along with the grouping of dialects across different geographical locations. Sawt El-Djazaïr is available to the research community upon request.


“…The system achieved a total accuracy of 62.75% compared to 60.2% that was achieved by a similar system in [6]. For the Algerian Arabic dialect, a deep neural network based approach was introduced in [7] to evaluate a web based corpus for the dialects of Algeria KALAM'DZ [8]. The results showed that the DNN based approach and the support vector based approach performed similarly.…”

Section: Related Work (mentioning)

confidence: 99%

Automatic Dialect identification of Spoken Arabic Speech using Deep Neural Networks

Abdelazim, Hussein, Badr 2022. IJICIS


Dialect identification is considered a subtask of the language identification problem and is thought to be a more complex case due to the linguistic similarity between different dialects of the same language. In this paper, a novel approach is introduced for identifying three of the most used Arabic dialects: Egyptian, Levantine, and Gulf. Four experiments were conducted using classification approaches that range from simple classifiers, such as Gaussian Naïve Bayes and Support Vector Machines, to more complex classifiers using Deep Neural Networks (DNN). A feature vector of 13 Mel-frequency cepstral coefficients (MFCCs) of the audio signals was used to train the classifiers on a multi-dialect parallel corpus. The experimental results showed that the proposed convolutional neural network-based classifier outperformed the other classifiers in all three dialects. It achieved an average improvement of 0.16, 0.19, and 0.19 in the Egyptian dialect, of 0.07, 0.13, and 0.1 in the Gulf dialect, and of 0.52, 0.35, and 0.49 in the Levantine dialect for the precision, recall, and F1-score metrics respectively.


“…Compared with Twitter and Facebook, YouTube has been less examined by researchers; thus, previous research has not developed a best-practice scraping procedure for YouTube. The programme we used, youtube-dl (see https://rg3.github.io/youtube-dl/), has been used by a number of other studies (Botta et al 2016; Bougrine et al 2017; Tomàs-Buliart et al 2010; Schwemmer and Ziewiecki 2018). After entering a predetermined list of channel names and fields of information (e.g.…”

Section: For the Future (mentioning)

confidence: 99%

Trade Unions on YouTube

Jansson, Uba 2019

