2019
DOI: 10.5281/zenodo.3457707

DeepCut: A Thai word tokenization library using Deep Neural Network.


Cited by 8 publications (3 citation statements). References 0 publications.
“…The table gives the total number of words (#w) and words per sentence (#w/s) for each language. Thai was tokenized with Deepcut (Kittinaradorn et al., 2019).…”
mentioning
confidence: 99%
“…Deepcut [10] and Attacut [11] are two notable tokenizers in this thesis. Deepcut is a state-of-the-art technique that uses character embeddings with a 1D-convolutional network to predict the first character of each word in a sentence, while Attacut proposes using syllable boundaries instead of word boundaries.…”
Section: Tokenization Technique
mentioning
confidence: 94%
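
As a rough illustration of the approach this statement describes (character embeddings feeding a 1D-convolutional network that predicts, for each character, whether it begins a word), here is a minimal Keras sketch. It is not DeepCut's published architecture: the vocabulary size, embedding width, filter count, and kernel size below are illustrative placeholders.

import tensorflow as tf
from tensorflow.keras import layers, models

# Illustrative hyperparameters; DeepCut's actual values differ.
VOCAB_SIZE = 180  # distinct Thai characters plus padding/unknown symbols
EMBED_DIM = 32

model = models.Sequential([
    layers.Input(shape=(None,)),              # sequence of character ids
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),  # character embedding
    layers.Conv1D(100, 5, padding="same",
                  activation="relu"),         # 1D convolution over characters
    layers.Dense(1, activation="sigmoid"),    # per-character P(starts a word)
])
model.compile(optimizer="adam", loss="binary_crossentropy")

Thresholding the per-character probabilities (e.g. at 0.5) yields word-boundary positions, from which the token strings can be sliced out of the input sentence.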
“…We curated two corpora with 27M words/145M letters from Thai Wikipedia and 69M words/330M letters from Pantip (a Thai Q&A forum). For each corpus, we performed word tokenization using DeepCut (Kittinaradorn et al., 2019) and trained word-based n-gram models using KenLM (Heafield, 2011). The final LM is obtained by n-gram interpolation.…”
Section: Experimental Setups
mentioning
confidence: 99%
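
The tokenize-then-train pipeline in this statement can be sketched as follows, assuming the deepcut Python package and a local KenLM build whose lmplz binary is on the PATH. The file names and the n-gram order are illustrative, not taken from the cited paper, and the final interpolation step is toolchain-dependent and omitted.

import subprocess
import deepcut

def tokenize_corpus(src_path, dst_path):
    # Write one space-delimited, DeepCut-tokenized sentence per line,
    # the whitespace format KenLM's lmplz expects.
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(" ".join(deepcut.tokenize(line.strip())) + "\n")

tokenize_corpus("wikipedia_th.txt", "wikipedia_th.tok")

# Train a word-based n-gram model with KenLM (order 5 is arbitrary here).
with open("wikipedia_th.tok") as inp, open("wikipedia_th.arpa", "w") as out:
    subprocess.run(["lmplz", "-o", "5"], stdin=inp, stdout=out, check=True)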