2020
DOI: 10.1109/access.2020.3015778
Data Augmentation Methods for Low-Resource Orthographic Syllabification

Abstract: An n-gram syllabification model generally produces a high error rate for a low-resource language, such as Indonesian, because of the high rate of out-of-vocabulary (OOV) n-grams. In this paper, a combination of three methods of data augmentation is proposed to solve the problem, namely swapping consonant-graphemes, flipping onsets, and transposing nuclei. An investigation on 50k Indonesian words shows that the combination of three data augmentation methods drastically increases the amount of both unigrams and…
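The three augmentation methods named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the syllable representation (onset = leading consonants, nucleus = vowels, coda = trailing consonants), the function names, and the swap table are all assumptions made for illustration.

```python
# Hypothetical sketch of the three data augmentation methods named in the
# abstract. Syllables are plain lowercase strings; the onset/nucleus/coda
# split below is a simplification assumed for illustration.

VOWELS = set("aeiou")

def split_syllable(syl):
    """Split a syllable into onset (leading consonants), nucleus (vowels),
    and coda (trailing consonants)."""
    i = 0
    while i < len(syl) and syl[i] not in VOWELS:
        i += 1
    j = i
    while j < len(syl) and syl[j] in VOWELS:
        j += 1
    return syl[:i], syl[i:j], syl[j:]

def flip_onsets(syllables):
    """Exchange the onsets of the first two syllables,
    e.g. ['bu', 'ku'] -> ['ku', 'bu']."""
    if len(syllables) < 2:
        return list(syllables)
    o1, n1, c1 = split_syllable(syllables[0])
    o2, n2, c2 = split_syllable(syllables[1])
    return [o2 + n1 + c1, o1 + n2 + c2] + list(syllables[2:])

def transpose_nuclei(syllables):
    """Exchange the nuclei of the first two syllables,
    e.g. ['bu', 'ka'] -> ['ba', 'ku']."""
    if len(syllables) < 2:
        return list(syllables)
    o1, n1, c1 = split_syllable(syllables[0])
    o2, n2, c2 = split_syllable(syllables[1])
    return [o1 + n2 + c1, o2 + n1 + c2] + list(syllables[2:])

def swap_consonant_graphemes(syllables, table):
    """Substitute consonant graphemes per a swap table (an assumed example
    would be {'b': 'p'}), keeping the syllable pattern intact."""
    return ["".join(table.get(ch, ch) for ch in syl) for syl in syllables]
```

Each transformation produces a new (possibly pseudo-) word whose syllable boundaries are known by construction, which is what lets the augmented n-grams enter the training counts and reduce OOV rates.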

Cited by 4 publications (1 citation statement)
References 36 publications
“…In some simple languages, such as Indonesian, several data augmentation methods can be applied to solve this problem. For instance, a model named combination of flipping-onsets with standard-trigram and augmented-bigram syllabification (CFTABS), which incorporates three augmentation techniques of flipping onsets, transposing nuclei, and swapping consonant-graphemes, was developed in [23]. CFTABS produces a much lower syllable error rate (SER) than the original n-gram model with no augmentation.…”
Section: Introduction
confidence: 99%