“…A few parallel corpora are available for Bangla-English MT. In this study, the SUPara (Al Mumin et al., 2012) dataset is used, as a number of recent studies have used this corpus (Al Mumin et al., 2019a, 2019b; Hasan et al., 2019a, 2019b). The dataset contains 70,861, 500, and 500 parallel sentences for the training, validation, and test sets, respectively.…”
Section: Benchmark Data and Preprocessing
Machine Translation (MT) refers to translating texts or documents from a source language into a target language without human intervention. Any MT model is language-dependent, and its development requires grammar, phrase rules, vocabulary, or relevant data for the particular language pair. Hitherto, little research on MT for Bangla-English has been reported in the literature, although Bangla is a major language. This study presents a deep learning-based MT system covering both translation directions of the Bangla-English language pair. The attention-based multi-headed transformer model is considered in this study because of its capacity for parallelism in input processing. A transformer model consisting of encoders and decoders is adapted by tuning different parameters (especially the number of heads) to identify the best-performing model for Bangla to English and vice versa. The proposed model is tested on the SUPara benchmark Bangla-English corpus and evaluated using the Bilingual Evaluation Understudy (BLEU) score, currently the most popular evaluation metric in the MT field. The proposed method proves to be a promising Bangla-English MT system, achieving BLEU scores of 21.42 and 25.44 for Bangla-to-English and English-to-Bangla translation, respectively.
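The abstract reports results in BLEU. As an illustration of what that metric measures, here is a minimal sentence-level BLEU sketch in Python (modified n-gram precision with a brevity penalty, single reference, no smoothing); the function names are ours, not from the paper, and published scores are normally computed with a standard toolkit rather than code like this:

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference: geometric mean of
    modified n-gram precisions (n = 1..max_n) times a brevity penalty.
    No smoothing, so any zero n-gram precision makes the score 0."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())       # clipped n-gram matches
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else exp(1 - r / max(c, 1))  # brevity penalty
    return bp * exp(sum(log(p) for p in precisions) / max_n)

# A perfect match scores 1.0; partial overlap scores between 0 and 1.
hyp = "the cat sat on the mat".split()
print(bleu(hyp, hyp))  # 1.0
```

Corpus-level BLEU (as reported in papers) aggregates n-gram counts over all sentences before taking the geometric mean, rather than averaging per-sentence scores.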
“…In our experiment, we used the Shahjalal University parallel (SUPara) corpus (Mumin et al., 2012; 2018b) and the GlobalVoices corpus (Tiedemann, 2012) from OPUS (Tiedemann, 2012) as the training dataset. SUPara (Mumin et al., 2012; 2018b) is a balanced corpus consisting of texts from different genres, such as literature, journalistic texts, instructive texts, administrative texts, and texts treating external communication, collected from various printed and online media. The GlobalVoices (Tiedemann, 2012) corpus consists of news texts only, collected from the GlobalVoices website.…”
Section: Dataset
“…These two datasets were developed with a vision of using them as a benchmark in English-Bangla MT research. The texts of these two datasets were well-chosen from the balanced SUPara (Mumin et al., 2012; 2018b) corpus, so these two datasets are also balanced in genre. In addition, to make the datasets representative in length, we selected texts from 10 subsets of different lengths: 1 to 5 words, 6 to 10, and so forth up to 41 to 45, and finally longer than 45 words.…”
Section: Dataset
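The length-based sampling described in the snippet above (nine 5-word ranges plus an open-ended final one, giving 10 subsets) can be sketched as a simple bucketing function; `length_bucket` is a hypothetical helper of ours, not code from the cited work:

```python
def length_bucket(tokens, width=5, max_len=45):
    """Map a tokenized sentence (at least one token) to one of 10 length
    buckets: 1-5, 6-10, ..., 41-45, and 46+ for anything longer."""
    n = len(tokens)
    if n > max_len:
        return f"{max_len + 1}+"
    lo = ((n - 1) // width) * width + 1  # lower edge of the bucket
    return f"{lo}-{lo + width - 1}"

# Group a corpus by bucket, then sample from each group to build a
# length-balanced test set.
print(length_bucket("this is a short sentence".split()))  # 1-5
```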
“…Our core PBMT system is implemented using the Moses (Koehn et al., 2007) SMT toolkit. We trained our system on an English-Bangla parallel training dataset that combines the SUPara (Mumin et al., 2012; 2018b) and GlobalVoices (Tiedemann, 2012) corpora. We extracted symmetrized word alignments from this training dataset using GIZA++ (Och and Ney, 2003) and the grow-diag-final-and heuristic.…”
Section: PBMT System Configuration
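The grow-diag-final-and heuristic mentioned above symmetrizes the two directional GIZA++ alignments: it starts from the intersection of the forward and backward link sets and grows toward their union. A simplified sketch of the core grow-diag step (omitting the final-and pass) might look like this; the function name and the representation of alignments as sets of (src, tgt) index pairs are our assumptions, not Moses internals:

```python
def grow_diag(fwd, bwd):
    """Simplified grow-diag step: begin with the intersection of the forward
    (src->tgt) and backward (tgt->src) alignments, then repeatedly add union
    links adjacent (incl. diagonally) to an existing link, provided the new
    link's source or target word is still unaligned."""
    alignment = set(fwd & bwd)          # high-precision starting point
    union = fwd | bwd                   # high-recall upper bound
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]
    changed = True
    while changed:
        changed = False
        for s, t in sorted(alignment):
            for ds, dt in neighbours:
                cand = (s + ds, t + dt)
                if cand in union and cand not in alignment:
                    src_free = all(a[0] != cand[0] for a in alignment)
                    tgt_free = all(a[1] != cand[1] for a in alignment)
                    if src_free or tgt_free:
                        alignment.add(cand)
                        changed = True
    return alignment
```

The real heuristic finishes with a "final-and" pass that adds remaining union links whose source and target words are both unaligned; intersection favors precision, union favors recall, and grow-diag-final-and sits between the two.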
“…En→Bn. For the En→Bn translation task, we trained our system on an English-Bangla parallel training dataset combining the SUPara (Mumin et al., 2012; 2018b) and GlobalVoices (Tiedemann, 2012) corpora, and on a Bangla monolingual dataset, SUMono (Mumin et al., 2014). We then tuned our system using Minimum Error Rate Training (MERT) (Och, 2003) on the Bangla side of the development dataset, SUParadev2018 (Mumin et al., 2018a), so as to maximize the BLEU (Papineni et al., 2002) score.…”