SanskritTagger: A Stochastic Lexical and POS Tagger for Sanskrit

Hellwig, Oliver

doi:10.1007/978-3-642-00155-0_11

Cited by 21 publications

(5 citation statements)

References 2 publications


“…most deep learning methods, we decided to release a new dataset along with this paper. Each sentence contained in the DCS is re-analyzed using the SanskritTagger software (Hellwig, 2009). Our dataset contains the surface forms of sentences in the DCS and the split points and Sandhi rules that the tagger proposes for their morpho-lexical gold analyses stored in the DCS.…”

Section: Data (mentioning)

confidence: 99%

Sanskrit Word Segmentation Using Character-level Recurrent and Convolutional Neural Networks

Hellwig¹,

Nehrdich²

2018

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Self Cite


The paper introduces end-to-end neural network models that tokenize Sanskrit by jointly splitting compounds and resolving phonetic merges (Sandhi). Tokenization of Sanskrit depends on local phonetic and distant semantic features that are incorporated using convolutional and recurrent elements. Contrary to most previous systems, our models do not require feature engineering or external linguistic resources, but operate solely on parallel versions of raw and segmented text. The models discussed in this paper clearly improve over previous approaches to Sanskrit word segmentation. As they are language agnostic, we will demonstrate that they also outperform the state of the art for the related task of German compound splitting.


“…Experiments with automatic POS-tagging of less-resourced languages have already been conducted in recent years. This subsection briefly describes the techniques used and the outcome of two projects: an automatic tagger for Urdu, developed by Hardie (2005), and Sanskrittagger (Hellwig 2008).…”

Section: Similar Experiments (mentioning)

confidence: 99%

“…Sanskrit tagger, described in Hellwig (2008), is an automatic tokenizer and tagger for Sanskrit. Like Hardie's Urdu tagger, it uses HMM to perform the tagging.…”

Section: Similar Experiments (mentioning)

confidence: 99%

Diachrony and typology of non-finites in Indo-Aryan

Stroński¹,

Tokaj²,

Jaworski³

2020


“…But the amount of annotated data available for Sanskrit is very small compared to the size of the texts available in it from ancient times. An effort towards having such an annotated data was initiated and resulted into the Digital Corpus of Sanskrit (DCS) (Hellwig, 2010; Hellwig, 2019). This data, being of reasonable size, can be used for both statistical analyses and use of machine learning algorithms. This paper focuses on how DCS's data can be used along with the Heritage Engine's analysis so that we get a proper morphologically tagged and segmented corpus.…”

Section: Introduction (mentioning)

confidence: 99%

Validation and Normalization of DCS corpus using Sanskrit Heritage tools to build a tagged Gold Corpus

Krishnan,

Kulkarni,

Huet

2020

Preprint


The Digital Corpus of Sanskrit records around 650,000 sentences along with their morphological and lexical tagging. However, inconsistencies in the morphological analysis, and in the provision of crucial information such as the segmented word, call for standardization and validation of this corpus. Automating the validation process requires efficient analyzers which also provide the missing information. The Sanskrit Heritage Engine's Reader produces all possible segmentations with morphological and lexical analyses. Aligning these systems would help us in recording the linguistic differences, which can be used to update these systems to produce standardized results and will also provide a Gold corpus tagged with complete morphological and lexical information along with the segmented words. Krishna et al. (2017) aligned 115,000 sentences, considering some of the linguistic differences. As both these systems have evolved significantly, the alignment is done again considering all the remaining linguistic differences between these systems. This paper describes the modified alignment process in detail and records the additional linguistic differences observed.
