2018
DOI: 10.1145/3276773

Abstract: A feasible and flexible annotation system is designed for joint tokenization and part-of-speech (POS) tagging to annotate those languages without natural definitions of words. This design was motivated by the fact that word separators are not used in many highly analytic East and Southeast Asian languages. Although several of the languages are well-studied, e.g., Chinese and Japanese, many are understudied with low resources, e.g., Burmese (Myanmar) and Khmer. In the first part of the article, the proposed ann…

citations
Cited by 20 publications
(1 citation statement)
references
References 4 publications
“…Most pairs are from previous WMT (Gu, Kk, Tr, Ro, Et, Lt, Fi, Lv, Cs, Es, Zh, De, Ru, Fr ↔ En) and IWSLT (Vi, Ja, Ko, Nl, Ar, It ↔ En) competitions. We also use FLoRes pairs (En-Ne and En-Si), En-Hi from IITB (Kunchukuttan et al., 2017), and En-My from WAT19 (Ding et al., 2018, 2019). We divide the datasets into three categories: low resource (<1M sentence pairs), medium resource (>1M and <10M), and high resource (>10M).…”
Section: Experimental Settings
Citation type: mentioning
confidence: 99%