2017 International Conference on Asian Language Processing (IALP)
DOI: 10.1109/ialp.2017.8300596
On the use of machine translation-based approaches for Vietnamese diacritic restoration

Abstract: This paper presents an empirical study of two machine translation-based approaches to the Vietnamese diacritic restoration problem: phrase-based and neural machine translation models. This is the first work that applies a neural machine translation method to this problem and gives a thorough comparison with the phrase-based machine translation method, the current state-of-the-art method for this task. On a large dataset, the phrase-based approach has an accuracy of 97.32% while that o…

Cited by 8 publications (11 citation statements). References 11 publications.
“…While these deep models achieve state-of-the-art performance, they mainly rely on the use of recurrent architectures such as BiLSTM, which are relatively inefficient. Pham et al (2017) view the task of diacritization for Vietnamese as a machine transduction problem from undiacritized to diacritized text at the word level. Orife (2018) addresses the problem on Yoruba in a similar way and compares soft- and self-attention sequence-to-sequence performance at the word level, empirically showing that self-attention significantly outperforms BiLSTM.…”
Section: Related Work
confidence: 99%
“…Feature engineering and classical machine learning algorithms such as Hidden Markov Models, Maximum Entropy Models, and Finite State Transducers were the dominant approaches (Nelken and Shieber, 2005; Zitouni et al, 2006; Elshafei et al, 2006). However, recent studies show significant improvement using deep neural networks (Belinkov and Glass, 2015; Pham et al, 2017; Orife, 2018). While these deep models achieve state-of-the-art performance, they mainly rely on the use of recurrent architectures such as BiLSTM, which are relatively inefficient.…”
Section: Related Work
confidence: 99%
“…In SE research, the problem of Type Inference using MT shows that the SMT model provided by [37] has significantly higher accuracy than the original NMT approach in [18]. Similarly, for natural language diacritic restoration, [35] shows that SMT outperforms NMT. The parallel corpora in [18, 30, 37] share the same characteristics: the lengths of the source and target pairs are equal, and the order of the source and target words is consistent.…”
Section: Introduction
confidence: 99%
“…2 PREFIX RESOLUTION [18, 37] provide a translation approach that treats the source-side language as partial class names (PCN) and the target language as Fully Qualified Names (FQN) of APIs. [35] treats the source language as words without diacritic information and the target language as words with diacritic information. In other words, both of these research works build a parallel corpus in which the source and target sequences of each pair have the same length.…”
Section: Introduction
confidence: 99%
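The equal-length, order-preserving corpus property described above follows from how such training pairs are typically built: each diacritized sentence is paired with its mechanically stripped form, so the word sequences align one-to-one. A minimal sketch of that stripping step for Vietnamese (assuming Unicode NFD decomposition; note that đ/Đ carry no combining mark and must be mapped separately):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose each character, drop combining marks, then map đ/Đ,
    # which do not decompose under NFD, to d/D.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.replace("đ", "d").replace("Đ", "D")

# A diacritized target sentence and its undiacritized source form:
# the word sequences align one-to-one, the property the corpora rely on.
target = "tôi yêu tiếng Việt"
source = strip_diacritics(target)
print(source)  # toi yeu tieng Viet
assert len(source.split()) == len(target.split())
```

Restoration is then the inverse, harder direction: translating each undiacritized word back to its diacritized form, which is what the SMT and NMT models are trained to do.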