Part-of-speech (POS) tagging is the process of assigning a morpho-syntactic category to each morpheme, including punctuation marks, in a text document according to its context. In this paper we present a rule-based POS tagger for Manipuri that applies a set of handwritten linguistic rules for the language. Classifying the lexical categories of Manipuri, an agglutinating Tibeto-Burman language of Northeast India, is very difficult, so the tagger uses an affix-stripping technique to segment affixes from the root. Because Manipuri has only a limited POS-tagged corpus, the tagged output of this tagger will be helpful for analyzing Manipuri parts of speech with statistical models.
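The affix-stripping step mentioned above can be sketched as follows. This is only an illustration of the general technique, not the tagger's actual rule set: the suffix list, the example word, and the `min_root` threshold are invented placeholders and are not a Manipuri affix inventory.

```python
# Minimal suffix-stripping sketch. SUFFIXES is a hypothetical placeholder
# list, NOT a real Manipuri affix inventory.
SUFFIXES = ["ka", "ri", "su"]


def strip_suffixes(word, suffixes=SUFFIXES, min_root=2):
    """Repeatedly peel known suffixes off the end of `word`.

    Returns (root, stripped), where `stripped` lists the removed
    suffixes in their original left-to-right order.
    """
    stripped = []
    # Try longer suffixes first so a long suffix is not split by a short one.
    ordered = sorted(suffixes, key=len, reverse=True)
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            # Keep at least `min_root` characters as the root.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_root:
                word = word[: -len(suffix)]
                stripped.insert(0, suffix)
                changed = True
                break
    return word, stripped


root, affixes = strip_suffixes("tamokari")
print(root, affixes)
```

A rule-based tagger would then consult the root and the stripped affixes when choosing a tag; prefix handling and a root lexicon check are omitted here.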
A word may have multiple senses. Word sense disambiguation (WSD) resolves this ambiguity by finding which sense of a word is appropriate in a given context. WSD is of critical importance in areas such as machine translation, information retrieval, and speech processing. In this paper we present several approaches to WSD in Nepali using the Nepali WordNet: an overlap-based approach, and a conceptual-distance and semantic-graph-based approach, both of which fall under the knowledge-based approach. Conceptual distance and semantic graph distance are used as measures to score our WSD algorithm.
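An overlap-based approach of the kind named above can be illustrated with a simplified Lesk-style sketch: each candidate sense is scored by how many words its gloss shares with the context. The sense labels and English glosses below are toy assumptions standing in for Nepali WordNet entries, and the function name is hypothetical.

```python
def overlap_wsd(word, context, sense_glosses, stopwords=frozenset()):
    """Pick the sense whose gloss shares the most words with the context.

    `sense_glosses` maps a sense label to its gloss text. Ties are
    resolved in favor of the sense listed first.
    """
    target = word.lower()
    context_set = {
        w.lower() for w in context
        if w.lower() not in stopwords and w.lower() != target
    }
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        gloss_set = set(gloss.lower().split()) - stopwords
        overlap = len(context_set & gloss_set)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap


# Toy sense inventory (assumption; a real system would read WordNet glosses).
GLOSSES = {
    "bank.finance": "institution that accepts deposits and lends money",
    "bank.river": "sloping land beside a river or stream",
}

print(overlap_wsd("bank", "he sat on the bank of the river".split(), GLOSSES))
```

The conceptual-distance and semantic-graph variants would replace the gloss-overlap count with a distance measured over WordNet relations; that part is not sketched here.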
Computational intelligence and soft computing include many promising technologies, such as text mining. Document classification using soft computing techniques such as fuzzy logic helps to find more practical solutions, given the ambiguity and uncertainty present in text data. Uncertainty and information are part and parcel of any industrial or engineering problem. Information refers to the facts required to solve the problem, and uncertainty refers to the non-random lack of certainty ('non-random uncertainty'), ambiguity, and haziness in the system. It is therefore important to consider the nature of the uncertainty involved in a problem. Lotfi Zadeh (1965), the father of fuzzy logic, suggested that decision-making using set membership is key when dealing with uncertainty. Fuzzy clustering helps to identify patterns that are difficult to discover with crisp clustering. Natural languages contain non-random uncertainty, and fuzzy logic can be used to deal with it, that is, with different degrees of truth or partial truth. This work focuses on fuzzy-logic-based approaches for identifying coherent patterns. Empirical analyses are conducted to evaluate the effect of the proposed methodology.
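To make the contrast with crisp clustering concrete, the sketch below computes graded cluster memberships for a single point. It uses the standard fuzzy c-means membership formula as an assumption; the abstract does not say which fuzzy clustering method the authors use, and the function name and fuzzifier default `m=2.0` are illustrative choices.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Fuzzy c-means style memberships of `point` in each cluster.

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), where d_i is the
    Euclidean distance from the point to center i and m > 1 is the
    fuzzifier. The memberships sum to 1.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point sitting exactly on one or more centers belongs fully to them.
    if any(d == 0.0 for d in dists):
        hits = [d == 0.0 for d in dists]
        return [1.0 / sum(hits) if h else 0.0 for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]


# A point closer to the first center gets a higher, but not exclusive, degree.
print(fcm_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)]))
```

A crisp method would assign the point wholly to the nearest cluster; here it keeps a partial degree of membership in both.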
The volume of digitized text documents on the web has been increasing rapidly. Because of this huge collection of data, there is a need to group (cluster) documents for speedy information retrieval. Document clustering organizes documents into groups such that the documents within each group are similar to each other and dissimilar to documents in other groups. The quality of a clustering result depends greatly on the representation of the text and on the clustering algorithm. This paper presents a comparative analysis of three algorithms, namely K-means, Particle Swarm Optimization (PSO), and a hybrid PSO+K-means algorithm, for clustering text documents using WordNet. The common way of representing a text document is as a bag of terms, which is often unsatisfactory because it does not exploit semantics. In this paper, texts are represented in terms of the synsets corresponding to each word, so the bag-of-terms representation is enriched with synonyms from WordNet. The three algorithms are applied to the clustering of text in the Nepali language. Experimental evaluation is performed using intra-cluster similarity and inter-cluster similarity.
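The effect of a synset-based representation can be sketched by comparing cosine similarity before and after terms are mapped to synset identifiers. The mapping `TOY_SYNSETS` and the English example words are assumptions for illustration only; a real system would look terms up in the Nepali WordNet, and mapping each term to a single synset ID is just one possible reading of the enrichment described above.

```python
import math
from collections import Counter

# Hypothetical term-to-synset map standing in for WordNet lookups.
TOY_SYNSETS = {
    "car": "syn_vehicle",
    "automobile": "syn_vehicle",
    "fast": "syn_quick",
    "quick": "syn_quick",
}


def to_synset_bag(tokens, synsets=TOY_SYNSETS):
    """Count synset IDs instead of surface terms; unknown terms stay as-is."""
    return Counter(synsets.get(t, t) for t in tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc1 = ["fast", "car"]
doc2 = ["quick", "automobile"]

# The plain bags of terms share no entries; the synset bags coincide.
print(cosine(Counter(doc1), Counter(doc2)))
print(cosine(to_synset_bag(doc1), to_synset_bag(doc2)))
```

Averaging such pairwise similarities within a cluster and across clusters gives one simple form of the intra-cluster and inter-cluster measures used for evaluation.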
A corpus is a large collection of homogeneous and authentic written texts (or speech) of a particular natural language that exists in machine-readable form. Corpora have wide scope in Computational Linguistics and Natural Language Processing (NLP). A parallel corpus is a very useful resource for most NLP applications, especially Statistical Machine Translation (SMT). SMT is currently the most popular approach to Machine Translation (MT), and it can produce high-quality translation results given a large amount of aligned parallel text in both the source and target languages. Although Bodo is a recognized language of India and a co-official language of Assam, very little machine-readable information is available in it. Therefore, to expand the computerized resources of the language, an English-to-Bodo SMT system has been developed. This paper focuses mainly on building the English-Bodo parallel text corpora needed to implement that system using the phrase-based SMT approach. We have designed an E-BPTC (English-Bodo Parallel Text Corpus) creator tool and have constructed English-Bodo parallel text corpora for the General and Newspaper domains. Finally, the quality of the constructed parallel text corpora has been tested in the SMT system using two evaluation techniques.