Diacritics restoration is the process of recovering the original script from diacritic-free text by correctly inserting diacritics. In this paper, the problem is cast as a sequential tagging task in which each term is tagged with its own accents. We carried out careful evaluations on three domains of Vietnamese (written language, spoken language, and literature) using two methods, conditional random fields (CRFs) and support vector machines (SVMs), and achieved promising results. We also investigated two lexical levels: learning from letters and learning from syllables. Although the former performs worse than the latter, it gives stable results across all three language domains. The letter-level approach is therefore more useful when dealing with unknown words, or when words in a sentence are reordered and repeated for stylistic and artistic effect.
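To make the tagging formulation concrete, the following is a minimal sketch of how (input, tag) pairs might be derived from accented Vietnamese text at the two lexical levels. It is an illustration under stated assumptions, not the feature pipeline evaluated in the paper: the function names `strip_diacritics`, `syllable_examples`, and `letter_examples` are hypothetical, syllables are assumed to be whitespace-separated, and the tag is simply taken to be the accented form itself. The resulting pairs could then be fed to any sequence tagger such as a CRF or SVM.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with Vietnamese diacritics removed (hypothetical helper)."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The letter d-with-stroke has no canonical decomposition, so map it explicitly.
    return base.replace("\u0111", "d").replace("\u0110", "D")


def syllable_examples(sentence: str) -> list[tuple[str, str]]:
    """Syllable level: pair each diacritic-free syllable with its accented form."""
    text = unicodedata.normalize("NFC", sentence)
    return [(strip_diacritics(syl), syl) for syl in text.split()]


def letter_examples(sentence: str) -> list[tuple[str, str]]:
    """Letter level: pair each diacritic-free letter with its accented form."""
    # NFC keeps each accented letter as a single code point.
    text = unicodedata.normalize("NFC", sentence)
    return [(strip_diacritics(ch), ch) for ch in text if not ch.isspace()]


if __name__ == "__main__":
    for observed, tag in syllable_examples("tiếng Việt"):
        print(observed, "->", tag)
    for observed, tag in letter_examples("đi về"):
        print(observed, "->", tag)
```

At the syllable level the tagger sees whole syllables and must have encountered them in training, whereas at the letter level the tag set is small and closed, which is consistent with the more stable behaviour on unknown words noted above.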