2022
DOI: 10.1145/3511806

Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus

Abstract: Machine translation has been a major motivation of development in natural language processing. Despite the burgeoning achievements in creating more efficient machine translation systems, thanks to deep learning methods, parallel corpora have remained indispensable for progress in the field. In an attempt to create parallel corpora for the Kurdish language, in this article, we describe our approach in retrieving potentially alignable news articles from multi-language websites and manually align them across dial…

Cited by 6 publications
(2 citation statements)
References 17 publications
“…For Jordanian Arabic, Talafha et al (2021) created a dataset in Arabic and a non-standard romanization known as Arabizi. Ahmadi et al (2022) compiled a corpus of Kurdish news articles written in the Sorani (Arabic-based) and Kurmanji (Latin-based) orthographies. More recently, Gow-Smith et al (2022) reconstructed part of a 16th-century Scottish Gaelic manuscript in modern orthography.…”
Section: Related Work (mentioning)
confidence: 99%
“…But the system has been tested the tokenization methods for developing a neural machine translation. The authors in [26] described an approach for retrieving potentially-alignable articles of news from websites dependent on lexical similarity and transliteration of scripts. This corpus included 12,327 translation pairs for dialects Sorani, and Kurmanji as well as including 1,797 and 650 translation pairs in English-Sorani and English-Kurmanji.…”
Section: Literature Review (mentioning)
confidence: 99%