2018
DOI: 10.1587/transinf.2018edp7016

Improving Thai Word and Sentence Segmentation Using Linguistic Knowledge

Abstract: Word boundary ambiguity in word segmentation has long been a fundamental challenge within Thai language processing. The Conditional Random Fields (CRF) model is among the best-known methods to have achieved remarkably accurate segmentation. Nevertheless, current advancements appear to have left the problem of compound words unaccounted for. Compound words lose their meaning or context once segmented. Hence, we introduce a dictionary-based word-merging algorithm, which merges all kinds of compound words. Our ev…
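
The abstract names a dictionary-based word-merging step but the excerpt does not describe how it works. The sketch below is only an illustration of the general idea, not the authors' algorithm: it assumes the input is an already-segmented token list, that the dictionary is a plain set of compound strings, and that merging is greedy with the longest match first. The function name `merge_compounds` and the `max_span` parameter are hypothetical.

```python
from typing import List, Set


def merge_compounds(tokens: List[str], dictionary: Set[str], max_span: int = 4) -> List[str]:
    """Greedily merge adjacent tokens whose concatenation is a dictionary entry.

    Assumption: longest-match-first over at most `max_span` tokens. The paper's
    actual merging rules may differ.
    """
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first so longer compounds win.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in dictionary:
                merged.append(candidate)
                i += span
                break
        else:
            # No compound starts at this position; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Illustrative input: "แม่" + "น้ำ" form the compound "แม่น้ำ" (river).
tokens = ["แม่", "น้ำ", "ไหล"]
compounds = {"แม่น้ำ"}
print(merge_compounds(tokens, compounds))
```

Trying longer spans first keeps a short compound from blocking a longer one that starts at the same token; a production system would likely also need rules for overlapping candidates.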

Cited by 11 publications
(13 citation statements)
References 9 publications
“…In the early stage of Thai word segmentation, dictionary-based learning techniques had been used along with machine-learning techniques, for instance, Markov models (Kawtrakul and Thumkanon, 1997), decision trees (Theeramunkong and Usanavasin, 2001), and CRFs (Haruechaiyasak et al, 2008). CRFs have been shown to be particularly suitable for Thai sequence-labeling tasks (Kruengkrai et al, 2006; Haruechaiyasak and Kongyoung, 2009; Kruengkrai et al, 2009; Nararatwong et al, 2018).…”
Section: Thai Word Segmentation Revisited (mentioning)
confidence: 99%
“…In parallel with CRFs, neural-network models, e.g., CNNs (Kittinaradorn et al, 2019; Chormai et al, 2019), LSTM (Treeratpituk, 2017), and BiLSTM (Jousimo et al, 2017), have been applied and performed excellently for character-based Thai word segmentation. Using additional knowledge, such as CC (Lapjaturapit et al, 2018; Nararatwong et al, 2018), transfer learning (Seeha et al, 2020), and stacking ensemble (Limkonchotiwat et al, 2020), along with neural-network models could improve performance.…”
Section: Thai Word Segmentation Revisited (mentioning)
confidence: 99%
“…The CRF-based model [15], which is extracted from n-grams around the considered word, achieves an F1 score of 91.9%, which is approximately 10% higher than the F1 scores achieved by other models [11,12,13] on the Orchid dataset, as mentioned in [14]. Nararatwong R. et al [45] extend this model using a POS-based word-splitting algorithm to increase identifiable POS tags, resulting in better model accuracy. Because the focus of this work is adjusting the POS as a postprocessing method, which is an input of the model instead of proposing a new sentence segmentation model, this work will not be considered in this paper.…”
Section: Thai Sentence Segmentation (mentioning)
confidence: 99%
“…However, unlike Chinese and Japanese, Thai WS did not receive much research attention. There are only six notable publications (Chormai et al, 2019; Nararatwong et al, 2018; Noyunsan et al; Thanadechteemapat and Fung; Tongtep and Theeramunkong) on Thai WS for the past ten years. On the other hand, there are at least eight papers from well-established conferences on Chinese and Japanese WS (Li et al, 2019; Aguirre and Aguiar, 2019; Ma et al, 2018; Gong et al, 2017; Chen et al, 2017; Zhou et al, 2017; Cai et al, 2017) within only the last two years.…”
Section: Introduction (mentioning)
confidence: 99%