“…A few parallel corpora are available for Bangla-English MT. In this study, the SUPara (Al Mumin et al., 2012) dataset is used, as a number of recent studies have used this corpus (Al Mumin et al., 2019a, 2019b; Hasan et al., 2019a, 2019b). The dataset contains 70,861, 500, and 500 parallel sentences for the training, validation, and test sets, respectively.…”
Section: Benchmark Data and Preprocessing
Machine Translation (MT) refers to translating texts or documents from a source language into a target language without human intervention. Any MT model is language-dependent, and its development requires grammar, phrase rules, vocabulary, or relevant data for the particular language pair. Hitherto, little research on MT for Bangla-English has been reported in the literature, although Bangla is a major language. This study presents a deep learning-based MT system covering both translation directions of the Bangla-English language pair. The attention-based multi-headed transformer model is considered in this study because of its capacity for parallelism in input processing. A transformer model consisting of encoders and decoders is adapted by tuning different parameters (especially the number of heads) to identify the best-performing model for Bangla to English and vice versa. The proposed model is tested on the SUPara benchmark Bangla-English corpus and evaluated using the Bilingual Evaluation Understudy (BLEU) score, currently the most popular evaluation metric in the MT field. The proposed method proves to be a promising Bangla-English MT system, achieving BLEU scores of 21.42 and 25.44 for Bangla-to-English and English-to-Bangla translation, respectively.
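The abstract reports results in BLEU. As an illustration of what that metric measures, here is a minimal sentence-level BLEU sketch in Python (modified n-gram precision with a brevity penalty, single reference, no smoothing); the function names are ours, not from the paper, and published scores are normally computed with a standard toolkit rather than code like this:

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference: geometric mean of
    modified n-gram precisions (n = 1..max_n) times a brevity penalty.
    No smoothing, so any zero n-gram precision makes the score 0."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())       # clipped n-gram matches
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else exp(1 - r / max(c, 1))  # brevity penalty
    return bp * exp(sum(log(p) for p in precisions) / max_n)

# A perfect match scores 1.0; partial overlap scores between 0 and 1.
hyp = "the cat sat on the mat".split()
print(bleu(hyp, hyp))  # 1.0
```

Corpus-level BLEU (as reported in papers) aggregates n-gram counts over all sentences before taking the geometric mean, rather than averaging per-sentence scores.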
“…In our experiment, we used the Shahjalal University parallel (SUPara) corpus (Mumin et al., 2012; 2018b) and the GlobalVoices corpus (Tiedemann, 2012) from OPUS (Tiedemann, 2012) as the training dataset. SUPara (Mumin et al., 2012; 2018b) is a balanced corpus consisting of texts from different genres, such as literature, journalistic texts, instructive texts, administrative texts, and texts treating external communication, collected from various printed and online media. The GlobalVoices (Tiedemann, 2012) corpus consists of news texts only, collected from the GlobalVoices website.…”
Section: Dataset
“…These two datasets were developed with a vision of using them as a benchmark in English-Bangla MT research. The texts of these two datasets were well-chosen from the balanced SUPara (Mumin et al., 2012; 2018b) corpus, so these two datasets are also balanced in genre. In addition, to make the datasets representative in length, we selected texts from 10 subsets of different lengths: 1 to 5 words, 6 to 10, and so forth up to 41 to 45, and finally longer than 45 words.…”
Section: Dataset
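The length-based sampling described in the snippet above (nine 5-word ranges plus an open-ended final one, giving 10 subsets) can be sketched as a simple bucketing function; `length_bucket` is a hypothetical helper of ours, not code from the cited work:

```python
def length_bucket(tokens, width=5, max_len=45):
    """Map a tokenized sentence (at least one token) to one of 10 length
    buckets: 1-5, 6-10, ..., 41-45, and 46+ for anything longer."""
    n = len(tokens)
    if n > max_len:
        return f"{max_len + 1}+"
    lo = ((n - 1) // width) * width + 1  # lower edge of the bucket
    return f"{lo}-{lo + width - 1}"

# Group a corpus by bucket, then sample from each group to build a
# length-balanced test set.
print(length_bucket("this is a short sentence".split()))  # 1-5
```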
“…Our core PBMT system is implemented using the Moses (Koehn et al., 2007) SMT toolkit. We trained our system on an English-Bangla parallel training dataset that combines the SUPara (Mumin et al., 2012; 2018b) and GlobalVoices (Tiedemann, 2012) corpora. We extracted symmetrized word alignments from this training dataset using GIZA++ (Och and Ney, 2003) and the grow-diag-final-and heuristic.…”
Section: PBMT System Configuration
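The grow-diag-final-and heuristic mentioned above symmetrizes the two directional GIZA++ alignments: it starts from the intersection of the forward and backward link sets and grows toward their union. A simplified sketch of the core grow-diag step (omitting the final-and pass) might look like this; the function name and the representation of alignments as sets of (src, tgt) index pairs are our assumptions, not Moses internals:

```python
def grow_diag(fwd, bwd):
    """Simplified grow-diag step: begin with the intersection of the forward
    (src->tgt) and backward (tgt->src) alignments, then repeatedly add union
    links adjacent (incl. diagonally) to an existing link, provided the new
    link's source or target word is still unaligned."""
    alignment = set(fwd & bwd)          # high-precision starting point
    union = fwd | bwd                   # high-recall upper bound
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]
    changed = True
    while changed:
        changed = False
        for s, t in sorted(alignment):
            for ds, dt in neighbours:
                cand = (s + ds, t + dt)
                if cand in union and cand not in alignment:
                    src_free = all(a[0] != cand[0] for a in alignment)
                    tgt_free = all(a[1] != cand[1] for a in alignment)
                    if src_free or tgt_free:
                        alignment.add(cand)
                        changed = True
    return alignment
```

The real heuristic finishes with a "final-and" pass that adds remaining union links whose source and target words are both unaligned; intersection favors precision, union favors recall, and grow-diag-final-and sits between the two.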
“…En→Bn. For the En→Bn translation task, we trained our system on an English-Bangla parallel training dataset combining the SUPara (Mumin et al., 2012; 2018b) and GlobalVoices (Tiedemann, 2012) corpora, and on a Bangla monolingual dataset, SUMono (Mumin et al., 2014). We then tuned our system using Minimum Error Rate Training (MERT) (Och, 2003) on the Bangla side of the development dataset, SUParadev2018 (Mumin et al., 2018a), so as to maximize the BLEU (Papineni et al., 2002) score.…”