Abstract. This paper addresses two topics: (i) recognizing Vietnamese words and segmenting sentences into words by using probabilistic models; (ii) constructing the optimum probabilistic model through an unsupervised learning process. For each probabilistic model, new words are recognized and their syllables are linked together. The syllable-linking process improves the accuracy of the statistical functions, which in turn improves the recognition of new words. Hence, the probabilistic model converges to the optimum one. Our experimental corpus is generated from about 250,000 online news articles, which consist of about 19,000,000 sentences. The accuracy of the segmentation algorithm is over 90%. Our Vietnamese word and phrase dictionary contains more than 150,000 entries.
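The iterative syllable-linking loop described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name its statistical functions, so pointwise mutual information (PMI) over adjacent tokens stands in for them, and the names `link_pass` and `segment`, the threshold, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def link_pass(sentences, threshold, min_count=2):
    """One pass: join adjacent tokens whose PMI exceeds `threshold`.

    PMI is an assumed stand-in for the paper's statistical functions.
    Returns the re-tokenized sentences and whether anything was linked.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    def pmi(a, b):
        count = bigrams[(a, b)]
        if count < min_count:  # ignore pairs that are too rare to trust
            return float("-inf")
        p_ab = count / total_bi
        p_a = unigrams[a] / total_uni
        p_b = unigrams[b] / total_uni
        return math.log2(p_ab / (p_a * p_b))

    linked, changed = [], False
    for tokens in sentences:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and pmi(tokens[i], tokens[i + 1]) > threshold:
                out.append(tokens[i] + "_" + tokens[i + 1])  # link the syllables
                i += 2
                changed = True
            else:
                out.append(tokens[i])
                i += 1
        linked.append(out)
    return linked, changed


def segment(sentences, threshold=1.0, max_iters=5):
    """Repeat linking passes until no new link is made (or max_iters is hit)."""
    for _ in range(max_iters):
        sentences, changed = link_pass(sentences, threshold)
        if not changed:
            break
    return sentences


# Toy corpus: each sentence is a list of syllables.
corpus = [
    ["học", "sinh", "đi", "học"],
    ["học", "sinh", "giỏi"],
    ["sinh", "viên", "đi", "làm"],
]
result = segment(corpus)
```

Because each pass recomputes the counts over the already-linked tokens, the statistics change after every round of linking, which is the feedback effect the abstract describes. On this toy corpus only the pair "học sinh" occurs often enough to be linked, and the second pass finds nothing further to join.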
Abstract. In Vietnamese sentences, function words and word order patterns (WOPs) determine the semantic meaning and the grammatical word classes. We study the most popular WOPs and identify candidates for new Vietnamese words (NVWs) based on the phrase and word segmentation algorithm [7]. The best WOPs, which are used for recognizing and tagging NVWs, are chosen based on the support and confidence concepts. These concepts are also used to examine whether a word belongs to a word class. Our experiments were run on a large corpus of more than 50 million sentences. Four sets of WOPs are studied for recognizing and tagging nouns, verbs, adjectives and pronouns. There are 6,385 NVWs in our new dictionary, including 2,791 new noun taggings, 1,436 new verb taggings, 682 new adjective taggings, and 1,476 new pronoun taggings.
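To make the support and confidence idea concrete, here is a minimal sketch for a single pattern of the form "marker X". The definitions are assumptions borrowed from association-rule mining, not taken from the paper: support is the number of pattern matches whose slot word is already known to belong to the class, and confidence is that number divided by all matches. The function name `pattern_stats`, the toy sentences, and the known-noun set are hypothetical.

```python
from collections import Counter


def pattern_stats(sentences, marker, known_class_words):
    """Support and confidence of the pattern '<marker> X' for one word class.

    Assumed definitions (not from the paper):
      support    = matches whose slot word X is a known member of the class
      confidence = support / all matches of the pattern
    Also returns the per-word match counts for the slot X.
    """
    matches = Counter()
    for tokens in sentences:
        for left, word in zip(tokens, tokens[1:]):
            if left == marker:
                matches[word] += 1
    total = sum(matches.values())
    support = sum(c for w, c in matches.items() if w in known_class_words)
    confidence = support / total if total else 0.0
    return support, confidence, matches


# Toy data: "những" is a plural marker that typically precedes nouns.
sentences = [
    ["những", "sinh_viên", "đang", "học"],
    ["những", "giáo_viên", "đang", "dạy"],
    ["những", "smartphone", "rất", "đẹp"],
]
known_nouns = {"sinh_viên", "giáo_viên"}

support, confidence, matches = pattern_stats(sentences, "những", known_nouns)
# Slot words not yet in the dictionary become candidate new nouns.
candidates = sorted(w for w in matches if w not in known_nouns)
```

A pattern with high support and confidence is a reliable indicator of its class, so unknown words that fill its slot (here, "smartphone") are reasonable candidates for that tag. The same two measures could be computed per word across several patterns to decide whether a word belongs to a class.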