A Rule-Based Method for Thai Elementary Discourse Unit Segmentation (TED-Seg)

Ketui, Nongnuch; Theeramunkong, Thanaruk; Onsuwan, Chutamanee

doi:10.1109/kicss.2012.33

Cited by 5 publications

(2 citation statements)

References 6 publications

“…Several research has addressed automatic segmentation in several languages, such as: French [1], English [14], Portuguese [8], Spanish [3,9] and Thai [6]. All converge to the idea of using an explicit list of marks in order to segment texts.…”

Section: State-of-the-art (mentioning)

confidence: 99%

Automatic Discourse Segmentation: an evaluation in French

Saksik, Molina-Villegas, Linhares et al. 2020

Preprint

In this article, we describe some discursive segmentation methods as well as a first evaluation of the segmentation quality. Although our experiments were carried out on documents in French, we have developed three discursive segmentation models based solely on resources simultaneously available in several languages: marker lists and statistical POS labeling. We have also carried out automatic evaluations of these systems against the ANNODIS corpus, which is a manually annotated reference. The results obtained are very encouraging.
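
As a rough illustration of what segmentation from a marker list can look like, the sketch below splits a sentence in front of each marker it finds. It is not the method of the cited paper: the marker list, the function name `segment_by_markers`, and the use of English examples are all assumptions made for the example, and no POS labeling is involved.

```python
import re

# Hypothetical marker list, for illustration only (not taken from the paper).
MARKERS = ["because", "although", "but", "however"]


def segment_by_markers(text, markers=MARKERS):
    """Split text into segments, opening a new segment at each marker."""
    pattern = r"\b(?:" + "|".join(re.escape(m) for m in markers) + r")\b"
    segments = []
    start = 0
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        if match.start() > start:
            # Close the segment that ends just before this marker.
            piece = text[start:match.start()].strip(" ,;")
            if piece:
                segments.append(piece)
            start = match.start()
    tail = text[start:].strip(" ,;")
    if tail:
        segments.append(tail)
    return segments


print(segment_by_markers("He stayed home because it rained, but she went out."))
```

Each marker stays attached to the segment it introduces, which is one common convention; a real system would need a language-specific marker list and rules for markers that do not signal a boundary.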

“…Here, "UNK" is a special type, assigned to tokens which cannot be classified into any of 24 existing types. In the past, a number of research works applied a similar tagset, such as those in [13], [21], [22]. Figure 4 illustrates an example of the tagging process in three stages.…”

Section: Tagging (mentioning)

confidence: 99%

Multi-Stage Automatic NE and PoS Annotation Using Pattern-Based and Statistical-Based Techniques for Thai Corpus Construction

Tongtep, Theeramunkong 2013

IEICE Trans. Inf. & Syst.

Self Cite

Nattapong TONGTEP, Student Member, and Thanaruk THEERAMUNKONG, Member. SUMMARY: Automated or semi-automated annotation is a practical solution for large-scale corpus construction. However, the special characteristics of Thai language, such as lack of word-boundary and sentence-boundary markers, trigger several issues in automatic corpus annotation. This paper presents a multi-stage annotation framework, containing two stages of chunking and three stages of tagging. The two chunking stages are pattern matching-based named entity (NE) extraction and dictionary-based word segmentation while the three succeeding tagging stages are dictionary-, pattern- and statistical-based tagging. Applying heuristics of ambiguity priority, NE extraction is performed first on an original text using a set of patterns, in the order of pattern ambiguity. Next, the remaining text is segmented into words with a dictionary. The obtained chunks are then tagged with types of named entities or parts-of-speech (PoS) using dictionaries, patterns and statistics. Focusing on the reduction of human intervention in corpus construction, our experimental results show that the dictionary-based tagging process can assign unique tags to 64.92% of the words, with the remaining of 24.14% unknown words and 10.94% ambiguously tagged words. Later, the pattern-based tagging can reduce unknown words to only 13.34% while the statistical-based tagging can solve the ambiguously tagged words to only 3.01%.
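
The three tagging stages described in the summary can be pictured with a toy pipeline such as the one below. This is only a sketch under stated assumptions: the dictionary, the single regular-expression pattern, and the tag-frequency table standing in for a statistical model are all hypothetical, the tokens are English for readability, and the chunking stages (NE extraction and word segmentation) are assumed to have already produced the token list.

```python
import re
from collections import Counter

# Hypothetical toy resources, not taken from the paper.
DICTIONARY = {
    "the": {"DET"},
    "river": {"NOUN"},
    "bank": {"NOUN", "VERB"},  # ambiguous entry
}
PATTERNS = [(re.compile(r"^\d+$"), "NUM")]
# Stand-in for a trained statistical model: global tag frequencies.
TAG_FREQ = Counter({"NOUN": 10, "VERB": 3})


def dictionary_stage(tokens):
    """Stage 1: look each token up; tokens not in the dictionary become UNK."""
    return [(tok, set(DICTIONARY.get(tok, {"UNK"}))) for tok in tokens]


def pattern_stage(tagged):
    """Stage 2: try to resolve UNK tokens with surface patterns."""
    result = []
    for tok, tags in tagged:
        if tags == {"UNK"}:
            for regex, tag in PATTERNS:
                if regex.match(tok):
                    tags = {tag}
                    break
        result.append((tok, tags))
    return result


def statistical_stage(tagged):
    """Stage 3: pick the most frequent tag for ambiguously tagged tokens."""
    result = []
    for tok, tags in tagged:
        if len(tags) > 1:
            tags = {max(sorted(tags), key=lambda t: TAG_FREQ[t])}
        result.append((tok, tags))
    return result


def tag_tokens(tokens):
    return statistical_stage(pattern_stage(dictionary_stage(tokens)))


print(tag_tokens(["the", "bank", "42", "xyz"]))
```

Ordering the stages this way mirrors the summary: the dictionary handles unambiguous words first, patterns reduce the unknown words, and the statistical step only has to resolve what remains ambiguous.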
