We present an initial study on representing the Manipuri language in the tree-adjoining grammar formalism for parsing. Because Manipuri is a low-resource language that has received little computational study, building a natural language parser for it is difficult. Treebanks, the main requirement for inducing data-driven parsers, are not available for Manipuri. In this paper, we present an extensive analysis of Manipuri language structure and formulate a lexicalized tree-adjoining grammar. We present a generalized structure of Manipuri phrases and clauses, as well as the structure of basic and derived sentences. Our analysis covers simple, compound and complex sentences. The tree-adjoining grammar we have formulated can be used to implement a Manipuri parser, whose output can in turn help in creating a Treebank for Manipuri.
Parsing, i.e., identifying the underlying hierarchical structure of natural language expressions, is important for several natural language processing applications. In recent times, Machine Learning (ML) approaches to parsing have been developed for many languages. Most of the effective techniques require an annotated corpus of the language for training and validation. For the Manipuri language of the Tibeto-Burman family, neither such a corpus nor a grammar framework for automatically analysing and representing the structure of sentences exists yet. This study proposes a Context-Free Grammar (CFG) that provides a framework for representing the structure of Manipuri sentences. This paves the way for parsing Manipuri sentences with CFG-based parsers in various applications, and for conveniently building a Treebank for developing ML-based parsers for Manipuri. The rules of the proposed CFG are handcrafted after extensive analysis of the structure of Manipuri sentences. The grammar covers simple, compound, complex and compound-complex sentences. For evaluation, we instantiate an Earley parser with the proposed CFG and test it on a collection of sentences that covers the possible varieties of structure. A recognition rate of 83.20% achieved in these experiments indicates the effectiveness of the proposed grammar.
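The evaluation described above, running an Earley parser over a handcrafted CFG and reporting the share of sentences it recognizes, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy grammar `TOY_GRAMMAR` is a hypothetical verb-final fragment over part-of-speech tags and is not the grammar proposed in the study, and the names `earley_recognize` and `recognition_rate` are invented here. The recognizer assumes a grammar without empty productions.

```python
from typing import Dict, List, Sequence, Tuple

Grammar = Dict[str, List[Tuple[str, ...]]]

# Hypothetical toy grammar over POS tags with verb-final order.
# It is NOT the grammar proposed in the study.
TOY_GRAMMAR: Grammar = {
    "S": [("NP", "VP")],
    "NP": [("N",), ("ADJ", "N")],
    "VP": [("NP", "V"), ("V",)],
}


def earley_recognize(grammar: Grammar, start: str, tokens: Sequence[str]) -> bool:
    """Return True if `tokens` can be derived from `start`.

    Assumes the grammar has no empty (epsilon) productions.
    A chart item is (lhs, rhs, dot position, origin index).
    """
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))

    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                symbol = rhs[dot]
                if symbol in grammar:
                    # Predict: expand the non-terminal after the dot.
                    for production in grammar[symbol]:
                        item = (symbol, production, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            agenda.append(item)
                elif i < n and tokens[i] == symbol:
                    # Scan: the terminal matches the next input token.
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # Complete: advance every item that was waiting for `lhs`.
                for p_lhs, p_rhs, p_dot, p_origin in list(chart[origin]):
                    if p_dot < len(p_rhs) and p_rhs[p_dot] == lhs:
                        item = (p_lhs, p_rhs, p_dot + 1, p_origin)
                        if item not in chart[i]:
                            chart[i].add(item)
                            agenda.append(item)

    return any(
        lhs == start and dot == len(rhs) and origin == 0
        for lhs, rhs, dot, origin in chart[n]
    )


def recognition_rate(grammar: Grammar, start: str,
                     sentences: Sequence[Sequence[str]]) -> float:
    """Fraction of the given sentences accepted by the grammar."""
    accepted = sum(earley_recognize(grammar, start, s) for s in sentences)
    return accepted / len(sentences)


if __name__ == "__main__":
    # Made-up tag sequences, used only to exercise the sketch.
    test_set = [["N", "V"], ["N", "ADJ", "N", "V"], ["V", "N"]]
    print(f"{100 * recognition_rate(TOY_GRAMMAR, 'S', test_set):.2f}%")
```

In this sketch a sentence counts as recognized when the final chart column holds a completed start-symbol item spanning the whole input; the recognition rate is then the accepted count divided by the test-set size.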