2009
DOI: 10.1007/978-3-642-00155-0_11

SanskritTagger: A Stochastic Lexical and POS Tagger for Sanskrit

Abstract: SanskritTagger is a stochastic tagger for unpreprocessed Sanskrit text. The tagger tokenises text with a Markov model and performs part-of-speech tagging with a Hidden Markov model. Parameters for these processes are estimated from a manually annotated corpus that currently contains about 1,500,000 words. The article sketches the tagging process, reports the results of tagging a few short passages of Sanskrit text and describes further improvements of the program. The article describes design and function of SanskritTagg…
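The Hidden Markov model tagging step mentioned in the abstract can be illustrated with a generic Viterbi decoder. This is only a minimal sketch, not SanskritTagger's implementation: the function name `viterbi`, the two-tag tagset, the simplified transliterations and every probability below are made-up assumptions for illustration, and the input is assumed to be already tokenised (the abstract's separate Markov-model tokenisation step is not covered).

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `observations` under an HMM.

    Probabilities are combined in log space; unseen events get a small
    floor probability (an illustrative smoothing choice, not from the paper).
    """
    if not observations:
        return []

    def lp(p):
        return math.log(max(p, floor))

    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: lp(start_p.get(s, 0.0)) + lp(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + lp(trans_p[p].get(s, 0.0))) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + lp(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: tags, tokens and probabilities are invented.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.8, "VERB": 0.2}
trans_p = {
    "NOUN": {"NOUN": 0.5, "VERB": 0.5},
    "VERB": {"NOUN": 0.6, "VERB": 0.4},
}
emit_p = {
    "NOUN": {"ramah": 0.5, "vanam": 0.5},
    "VERB": {"gacchati": 1.0},
}
tokens = ["ramah", "vanam", "gacchati"]  # simplified transliteration
tags = viterbi(tokens, states, start_p, trans_p, emit_p)
print(list(zip(tokens, tags)))
```

In a real tagger the start, transition and emission tables would be estimated from an annotated corpus, as the abstract describes, rather than written by hand.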

Cited by 21 publications (5 citation statements)
References 2 publications
“…most deep learning methods, we decided to release a new dataset along with this paper. Each sentence contained in the DCS is re-analyzed using the SanskritTagger software (Hellwig, 2009). Our dataset contains the surface forms of sentences in the DCS and the split points and Sandhi rules that the tagger proposes for their morpho-lexical gold analyses stored in the DCS.…”
Section: Data (mentioning)
confidence: 99%
“…Experiments with automatic POS-tagging of less-resourced languages have already been conducted in recent years. This subsection briefly describes the techniques used and the outcome of two projects: an automatic tagger for Urdu, developed by Hardie (2005), and Sanskrittagger (Hellwig 2008).…”
Section: Similar Experiments (mentioning)
confidence: 99%
“…Sanskrit tagger, described in Hellwig (2008), is an automatic tokenizer and tagger for Sanskrit. Like Hardie's Urdu tagger, it uses HMM to perform the tagging.…”
Section: Similar Experiments (mentioning)
confidence: 99%
“…But the amount of annotated data available for Sanskrit is very small compared to the size of the texts available in it from ancient times. An effort towards having such an annotated data was initiated and resulted into the Digital Corpus of Sanskrit (DCS) (Hellwig, 2010–2019). This data, being of reasonable size, can be used for both statistical analyses and use of machine learning algorithms. This paper focuses on how DCS's data can be used along with the Heritage Engine's analysis so that we get a proper morphologically tagged and segmented corpus.…”
Section: Introduction (mentioning)
confidence: 99%