Amrith Krishna scite author profile

The configurational information in sentences of a free word order language such as Sanskrit is of limited use. Thus, the context of the entire sentence will be desirable even for basic processing tasks such as word segmentation. We propose a structured prediction framework that jointly solves the word segmentation and morphological tagging tasks in Sanskrit. We build an energy based model where we adopt approaches generally employed in graph based parsing techniques (McDonald et al., 2005a;Carreras, 2007). Our model outperforms the state of the art with an F-Score of 96.92 (percentage improvement of 7.06%) while using less than one tenth of the task-specific training data. We find that the use of a graph based approach instead of a traditional lattice-based sequential labelling approach leads to a percentage gain of 12.6% in F-Score for the segmentation task. 1

show abstract

A Dataset for Sanskrit Word Segmentation

Krishna¹,

Satuluri²,

Goyal³

2017

View full text Add to dashboard Cite

The last decade saw a surge in digitisation efforts for ancient manuscripts in Sanskrit. Due to various linguistic peculiarities inherent to the language, even the preliminary tasks such as word segmentation are non-trivial in Sanskrit. Elegant models for Word Segmentation in Sanskrit are indispensable for further syntactic and semantic processing of the manuscripts. Current works in word segmentation for Sanskrit, though commendable in their novelty, often have variations in their objective and evaluation criteria. In this work, we set the record straight. We formally define the objectives and the requirements for the word segmentation task. In order to encourage research in the field and to alleviate the time and effort required in pre-processing, we release a dataset of 115,000 sentences for word segmentation. For each sentence in the dataset we include the input character sequence, ground truth segmentation, and additionally lexical and morphological information about all the phonetically possible segments for the given sentence. In this work, we also discuss the linguistic considerations made while generating the candidate space of the possible segments.

show abstract

Evaluating Neural Morphological Taggers for Sanskrit

Gupta¹,

Krishna²,

Goyal³

et al. 2020

View full text Add to dashboard Cite

Neural sequence labelling approaches have achieved state of the art results in morphological tagging. We evaluate the efficacy of four standard sequence labelling models on Sanskrit, a morphologically rich, fusional Indian language. As its label space can theoretically contain more than 40,000 labels, systems that explicitly model the internal structure of a label are more suited for the task, because of their ability to generalise to labels not seen during training. We find that although some neural models perform better than others, one of the common causes for error for all of these models is mispredictions due to syncretism. 1

show abstract

FeRoSA: A Faceted Recommendation System for Scientific Articles

Chakraborty

Krishna

Singh

et al. 2016

View full text Add to dashboard Cite

A Little Pretraining Goes a Long Way: A Case Study on Dependency Parsing Task for Low-resource Morphologically Rich Languages

Sandhan

Krishna

Gupta

et al. 2021

View full text Add to dashboard Cite

Neural dependency parsing has achieved remarkable performance for many domains and languages. The bottleneck of massive labeled data limits the effectiveness of these approaches for low resource languages. In this work, we focus on dependency parsing for morphological rich languages (MRLs) in a low-resource setting. Although morphological information is essential for the dependency parsing task, the morphological disambiguation and lack of powerful analyzers pose challenges to get this information for MRLs. To address these challenges, we propose simple auxiliary tasks for pretraining. We perform experiments on 10 MRLs in low-resource settings to measure the efficacy of our proposed pretraining method and observe an average absolute gain of 2 points (UAS) and 3.6 points (LAS). 1

show abstract

Poetry to Prose Conversion in Sanskrit as a Linearisation Task: A Case for Low-Resource Languages

Krishna

Sharma

Santra

et al. 2019

View full text Add to dashboard Cite

The word ordering in a Sanskrit verse is often not aligned with its corresponding prose order. Conversion of the verse to its corresponding prose helps in better comprehension of the construction. Owing to the resource constraints, we formulate this task as a word ordering (linearisation) task. In doing so, we completely ignore the word arrangement at the verse side. kāvya guru, the approach we propose, essentially consists of a pipeline of two pretraining steps followed by a seq2seq model. The first pretraining step learns task specific token embeddings from pretrained embeddings. In the next step, we generate multiple hypotheses for possible word arrangements of the input (Wang et al., 2018). We then use them as inputs to a neural seq2seq model for the final prediction. We empirically show that the hypotheses generated by our pretraining step result in predictions that consistently outperform predictions based on the original order in the verse. Overall, kāvya guru outperforms current state of the art models in linearisation for the poetry to prose conversion task in Sanskrit.

show abstract

Keep it Surprisingly Simple: A Simple First Order Graph Based Parsing Model for Joint Morphosyntactic Parsing in Sanskrit

Krishna¹,

Gupta²,

Garasangi³

et al. 2020

View full text Add to dashboard Cite

Morphologically rich languages seem to benefit from joint processing of morphology and syntax, as compared to pipeline architectures. We propose a graph-based model for joint morphological parsing and dependency parsing in Sanskrit. Here, we extend the Energy based model framework (Krishna et al., 2020), proposed for several structured prediction tasks in Sanskrit, in 2 simple yet significant ways. First, the framework's default input graph generation method is modified to generate a multigraph, which enables the use of an exact search inference. Second, we prune the input search space using a linguistically motivated approach, rooted in the traditional grammatical analysis of Sanskrit. Our experiments show that the morphological parsing from our joint model outperforms standalone morphological parsers. We report state of the art results in morphological parsing, and in dependency parsing, both in standalone (with gold morphological tags) and joint morphosyntactic parsing setting.

show abstract

A Graph-Based Framework for Structured Prediction Tasks in Sanskrit

Krishna

Santra

Gupta

et al. 2021

Computational Linguistics

View full text Add to dashboard Cite

We propose a framework using Energy Based Models for multiple structured prediction tasks in Sanskrit. Ours is an arc-factored model, similar to the graph based parsing approaches, and we consider the tasks of word-segmentation, morphological parsing, dependency parsing, syntactic linearisation and prosodification, a prosody level task we introduce in this work. Ours is a search based structured prediction framework, which expects a graph as input, where relevant linguistic information is encoded in the nodes, and the edges are then used to indicate the association between these nodes. Typically the state of the art models for morphosyntactic tasks in morphologically rich languages still rely on hand-crafted features for their performance. But here, we automate the learning of the feature function. The feature function so learnt along with the search space we construct, encode relevant linguistic information for the tasks we consider. This enables us to substantially reduce the training data requirements to as low as 10 % as compared to the data requirements for the neural state of the art models. Our experiments in Czech and Sanskrit show the language agnostic nature of the framework, where we train highly competitive models for both the languages. Moreover, our framework enables to incorporate language specific constraints to prune the search space and to filter the candidates during inference. We obtain significant improvements in morphosyntactic tasks for Sanskrit by incorporating language specific constraints into the model. In all the tasks we discuss for Sanskrit, we either achieve state of the art results or ours is the only data driven solution for those tasks.

show abstract

12 3 4

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Amrith Krishna

Free as in Free Word Order: An Energy Based Model for Word Segmentation and Morphological Tagging in Sanskrit

A Dataset for Sanskrit Word Segmentation

Evaluating Neural Morphological Taggers for Sanskrit

FeRoSA: A Faceted Recommendation System for Scientific Articles

A Little Pretraining Goes a Long Way: A Case Study on Dependency Parsing Task for Low-resource Morphologically Rich Languages

Poetry to Prose Conversion in Sanskrit as a Linearisation Task: A Case for Low-Resource Languages

Keep it Surprisingly Simple: A Simple First Order Graph Based Parsing Model for Joint Morphosyntactic Parsing in Sanskrit

A Graph-Based Framework for Structured Prediction Tasks in Sanskrit

Contact Info

Product

Resources

About