Interspeech 2017
DOI: 10.21437/interspeech.2017-1305

CALYOU: A Comparable Spoken Algerian Corpus Harvested from YouTube

Abstract: This paper addresses the comparability of comments extracted from YouTube. The comments are in spoken Algerian, which may be local Arabic, Modern Standard Arabic, or French. This diversity of expression raises many data-processing problems. In this article, several alignment methods are proposed and tested. The method that aligns best is a Word2Vec-based approach, which is applied iteratively. This repeated call of Word2Vec allows us to improve sig…

Cited by 14 publications (18 citation statements)
References 8 publications
“…The resulting corpus contains 5.6 million sentences Suwaileh et al. (Suwaileh et al., 2016). Saad and Alijla (2017) propose the construction of a comparable Wikipedia corpus between the Arabic and Egyptian dialects. This corpus contains 10,197 aligned documents. Another comparable corpus (CALYOU) was constructed by Abidi et al. (2017). This corpus is dedicated to spoken Algerian. To align messages, the authors used different approaches, such as a dictionary-based approach, indexing words by their sounds, and finally an approach based on the similarity proposed by word2vec (Mikolov et al., 2013).…”
Section: Building Resources (mentioning)
confidence: 99%
“…In the context of MT, Meftouh et al. (2018) used a phrase-based MT system, GIZA++ (Och and Ney, 2003) for alignment and SRILM toolkit (Stolcke, 2002). The best results that these authors obtained were between the Algiers dialect and the dialect of Annaba (with a BLEU score up to 67.31), which is perfectly understandable since both dialects are spoken in the same country (Algeria). To the best of our knowledge, Saad and Alijla (2017), Abidi et al. (2017), Bouamor et al. (2018), and Kumar et al. (2014) have not proposed any system to validate their comparable and parallel corpora.…”
Section: Semantic-level Analysis (mentioning)
confidence: 99%
“…- The comparable corpus CALYOU: CALYOU [1] is an Algerian dialect comparable corpus of YouTube comments. It was collected by querying YouTube with keywords related to current Algerian events.…”
Section: Data Description (mentioning)
confidence: 99%
“…They are written in Arabic and Latin script. They are also sometimes written with a mixture of letters and numbers. Arab people exploit the similarity between some Arabic letters and numbers to write the dialect, for example the similarity between 3 and ع, 7 and ح, and 9 and ق.…”
Section: Introduction (mentioning)
confidence: 99%
“…That is why our first objective is to provide a code-switched dataset. Code-switched corpora are either generated artificially [13,12] or collected from social media and/or online texts [1]. Once the data are collected, code-switching is processed either by adapting existing models and tools or by proposing new ones.…”
Section: Introduction (mentioning)
confidence: 99%