Tibetan word segmentation and POS tagging are the primary tasks of Tibetan natural language processing. Most existing methods for Tibetan word segmentation and POS tagging are based on rules and statistics, which require manual feature construction. In addition, joint models have shown stronger capabilities for word segmentation and POS tagging and have received great interest. In this paper, we propose Bi-LSTM+IDCNN+CRF, a simple yet effective end-to-end neural network model, for joint Tibetan word segmentation and POS tagging. We conduct step-by-step and joint experiments on Tibetan datasets. The results demonstrate that the Bi-LSTM+IDCNN+CRF model performs best in both the step-by-step and the joint mode. We obtain state-of-the-art performance in the joint tagging mode: the F1 score reaches 92.31% on the word segmentation task and 81.26% on the POS tagging task.
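As an illustration of what joint tagging can look like at the label level, the sketch below folds a word-boundary tag and a POS tag into one label per syllable, so that a single sequence labeller predicts both at once. The B/M/E/S boundary scheme, the function names `to_joint_tags` and `from_joint_tags`, and the Latin placeholder syllables are assumptions made for this example; the abstract does not state which label scheme the authors use.

```python
from typing import List, Tuple

Word = Tuple[List[str], str]  # (syllables of the word, POS tag)


def to_joint_tags(words: List[Word]) -> List[Tuple[str, str]]:
    """Turn segmented, POS-tagged words into per-syllable joint labels.

    Assumed label form: '<boundary>-<pos>' with boundary in B/M/E/S.
    """
    tagged = []
    for syllables, pos in words:
        n = len(syllables)
        for i, syl in enumerate(syllables):
            if n == 1:
                boundary = "S"      # single-syllable word
            elif i == 0:
                boundary = "B"      # first syllable of a word
            elif i == n - 1:
                boundary = "E"      # last syllable of a word
            else:
                boundary = "M"      # word-internal syllable
            tagged.append((syl, f"{boundary}-{pos}"))
    return tagged


def from_joint_tags(tagged: List[Tuple[str, str]]) -> List[Word]:
    """Recover words and POS tags from well-formed joint labels."""
    words, current = [], []
    for syl, tag in tagged:
        boundary, pos = tag.split("-", 1)
        current.append(syl)
        if boundary in ("S", "E"):  # a word ends here
            words.append((current, pos))
            current = []
    return words


# Placeholder syllables and POS tags, not real corpus data.
sentence = [(["bod", "yig"], "n"), (["ni"], "p")]
joint = to_joint_tags(sentence)
```

With labels of this kind, segmentation and POS tags are decoded from the same predicted sequence, whereas a step-by-step pipeline would first predict boundaries and then tag the resulting words.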
Tibetan is a low-resource language with few existing electronic reference materials. The goal of Tibetan sentence boundary disambiguation (SBD) is to segment long text into sentences, and it is the foundation for building corpora for downstream tasks. This study implements Tibetan SBD at the syllable level so that word segmentation (WS) errors do not affect the accuracy of SBD. Specifically, an attention mechanism is added to a recurrent neural network (RNN) to study Tibetan SBD. The primary objective is to determine, using a trained model, whether a shad in Tibetan text marks the end of a sentence. We conduct experiments with syllable embedding and component embedding to measure the model's performance. The highest accuracy for Tibetan syllable embedding and component embedding is 96.23% and 95.40%, respectively, and the F1 score reaches 96.23% and 95.37%, respectively. The experimental results demonstrate that the proposed method achieves better results than the established rule-based and statistical methods without considering various syntactic and part-of-speech (POS) tagging rules. The method is also validated on German and English data from the Europarl corpus and Thai data from the IWSLT2015 corpus to demonstrate the models' reliability and generalizability. The results demonstrate that this method is effective not only for low-resource languages but also for high-resource languages. More importantly, the experimental results of this study can be applied directly to research on downstream tasks, such as machine translation and automatic summarization.
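To make the syllable-level framing concrete, the following sketch shows one way to turn raw text into classification instances: each shad (U+0F0D) becomes a candidate, described by the syllables to its left and right. It covers only the data preparation step, not the attention-based RNN itself. The function names, the window size, and the Latin placeholder syllables are assumptions for illustration, and splitting on the tsheg (U+0F0B) is a simplification of real Tibetan tokenization.

```python
import re
from typing import List, Tuple

TSHEG = "\u0f0b"  # syllable delimiter
SHAD = "\u0f0d"   # candidate sentence-final mark


def tokenize(text: str) -> List[str]:
    """Split text into syllables and shad marks; tsheg and spaces are dropped."""
    tokens = []
    for chunk in re.split(f"[{TSHEG}\\s]+", text):
        # A shad usually follows the last syllable directly, so split it off.
        for piece in re.split(f"({SHAD})", chunk):
            if piece:
                tokens.append(piece)
    return tokens


def shad_candidates(
    tokens: List[str], window: int = 3
) -> List[Tuple[int, List[str], List[str]]]:
    """Return (position, left context, right context) for every shad."""
    candidates = []
    for i, tok in enumerate(tokens):
        if tok == SHAD:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            candidates.append((i, left, right))
    return candidates


# Latin placeholders stand in for Tibetan syllables here.
text = "ka" + TSHEG + "kha" + SHAD + " " + "ga" + TSHEG + "nga" + SHAD
tokens = tokenize(text)
candidates = shad_candidates(tokens, window=2)
```

Each candidate would then be embedded (by syllable or by component) and passed to a binary classifier that decides whether that shad ends a sentence.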
To make the analysis practical to design and implement on a computer, the letter combinations in a syllable, together with its prefix, root, superscript, subscript, vowel, suffix, and further suffix, are transliterated into Latin script after the attributes of Tibetan words have been carefully identified for statistical purposes. These components obey the rules that govern how letters combine, and thus theoretically produce normative syllables consisting of one, two, three, or four letters. There are 445, 4985, 7212, and 2250 normative syllables made up of one, two, three, and four characters, respectively, for a total of 14,892. In a language database of 50 megabytes, the normative syllables actually observed number 415 single-character, 2475 double-character, 2423 triple-character, and 524 quadruple-character syllables, 5837 in total. The experiments indicate that syllables occur inconsistently across language databases: the most frequent syllables are stably distributed, the infrequent ones vary in presence with the size of the database, and the frequency of the medium-frequency syllables changes only slightly. The statistical results also diverge from the number of theoretically normative Tibetan syllables: syllables in actual use account for only 39.2% of the theoretically normative syllables, and syllables present in more than 90% of the texts account for only 12% of the syllables in actual use.
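The totals and the coverage figure can be checked directly from the per-length counts given in the abstract. The short sketch below does that arithmetic; the variable names are chosen for this example only. The four theoretical counts sum to 14,892, which is the total that the reported 39.2% is consistent with.

```python
# Counts quoted in the abstract, keyed by syllable length in characters.
theoretical = {1: 445, 2: 4985, 3: 7212, 4: 2250}
observed = {1: 415, 2: 2475, 3: 2423, 4: 524}

total_theoretical = sum(theoretical.values())  # 14892
total_observed = sum(observed.values())        # 5837

# Share of theoretically normative syllables that occur in the 50 MB database.
coverage = total_observed / total_theoretical

# Coverage by syllable length, to see where the unused syllables concentrate.
coverage_by_length = {
    length: observed[length] / theoretical[length] for length in theoretical
}

print(f"overall coverage: {coverage:.1%}")
for length, share in coverage_by_length.items():
    print(f"{length}-character syllables: {share:.1%}")
```

The per-length breakdown is not reported in the abstract; it follows from the same counts and shows that single-character syllables are almost all attested, while longer ones are used far more sparsely.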