Developing the Old Tibetan Treebank

Faggionato, Christian; Meelen, Marieke

doi:10.26615/978-954-452-056-4_035

Cited by 4 publications

(12 citation statements)

References 8 publications


“…Philologists generally consider the Annals, that record historical events in the 7-8th centuries, to be older than the more extensive Chronicle, although exact dates of origin are still a matter of ongoing debate (cf. Faggionato and Meelen (2019)). Tibetan texts written between the 11th and mid-20th centuries are generally referred to as 'Classical Tibetan', without further chronological subclassification.…”

Section: Composition Of the Annotated Corpus (mentioning)

confidence: 99%

“…The linguistic annotation of PACTib consists of tokenisation, sentence segmentation, part-of-speech tags and syntactic phrase structure labels building for a constituency treebank on recent work by and Faggionato and Meelen (2019). We optimised their methods after an error analysis and for the purposes of this paper, focused mainly on creating meaningful sentence segmentation.…”

Section: Linguistic Annotation (mentioning)

confidence: 99%

“…erroneously tagged དང་ dang 'and, (together) with' > case.ass 'associative case marker', since in the context directly following nouns, it can never be anything else). Syntactic phrase-structural information was added using the rule-based regular expression parser developed by Faggionato and Meelen (2019) that combines Tibetan POS tags into phrases using an extended form of the NLTK's regular expression chunkparser. This form of constituency parsing was chosen to facilitate comparative historical syntactic research on phrase structure in the UPenn historical treebank tradition.…”

Section: Pos Tagging and Parsing (mentioning)

confidence: 99%
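
The citation statement above describes combining POS tags into phrases with regular expressions. The following is a minimal sketch of that general idea using only Python's standard `re` module. It is not the parser of Faggionato and Meelen (2019) and it does not use NLTK. The tags `n.count` and `case.ass` are taken from the quoted statements; the `adj` tag, the `NP` rule, the function name `chunk`, and the placeholder tokens `w1` to `w3` are assumptions made purely for illustration.

```python
import re
from typing import List, Tuple, Union

Token = Tuple[str, str]          # (word, POS tag)
Chunk = Tuple[str, List[Token]]  # (phrase label, tokens inside the phrase)

# Hypothetical rule: a count noun plus any following adjectives forms an NP.
# "adj" and the rule itself are illustrative assumptions, not the real grammar.
NP_PATTERN = re.compile(r"<n\.count>(?:<adj>)*")


def chunk(tagged: List[Token]) -> List[Union[Token, Chunk]]:
    """Group tokens whose tag sequence matches NP_PATTERN into NP chunks."""
    # Encode the tag sequence as one string, e.g. "<n.count><adj><case.ass>",
    # remembering the character offset at which each token's tag starts.
    starts = {}
    parts = []
    pos = 0
    for i, (_, tag) in enumerate(tagged):
        starts[pos] = i
        piece = f"<{tag}>"
        parts.append(piece)
        pos += len(piece)
    starts[pos] = len(tagged)  # offset just past the final tag
    tag_string = "".join(parts)

    result: List[Union[Token, Chunk]] = []
    cursor = 0
    for match in NP_PATTERN.finditer(tag_string):
        first, last = starts[match.start()], starts[match.end()]
        result.extend(tagged[cursor:first])        # tokens outside any chunk
        result.append(("NP", tagged[first:last]))  # the matched phrase
        cursor = last
    result.extend(tagged[cursor:])
    return result


# Placeholder tokens; only "dang" with case.ass comes from the quoted text.
sample = [("w1", "n.count"), ("w2", "adj"), ("dang", "case.ass"), ("w3", "n.count")]
print(chunk(sample))
```

Matching over an encoded tag string rather than over the words keeps the rules independent of vocabulary, which is the same design that tag-pattern chunkers such as NLTK's rely on.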

“…Other new vocabulary, mainly from after the industrial and technological revolutions, mostly consists of nouns. Since count nouns (tagged n.count) are by far the most frequently-occurring tags, the memory-based tagger (and the neural tagger developed by Faggionato and Meelen (2019)) mainly assign this n.count tag to unknown words in the right context, these new vocabulary items pose no significant problem in Present-Day Spoken Tibetan texts.…”

Section: Pos Tagging and Parsing (mentioning)

confidence: 99%

“…However, the resulting publications rarely make data or code available, effectively making it impossible to test, verify or use the results in any way. Instead, for the development of PACTib, we build on recent work on segmenting and POS tagging Tibetan by Garrett et al (2014), and Faggionato and Meelen (2019) (see Section 3). In Section 2 we discuss the composition of the corpus and a proposal to allow for distinguishing easily between prose and verse.…”

Section: Introduction (mentioning)

confidence: 99%


Meta-dating the PArsed Corpus of Tibetan (PACTib)

Meelen, Roux

2020

Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories

Self Cite


This paper presents PACTib, the PArsed Corpus of Tibetan. This new resource is unique in bringing together a large number of Tibetan texts (>5000) from the 11th century until the present day. The texts in this diachronic corpus are provided with metadata containing information on dates and patron-/authorship and linguistic annotation in the form of tokenisation, sentence segmentation, part-of-speech tags and syntactic phrase structure. With over 166 million tokens across 11 centuries and a variety of genres, PACTib will open up a wide range of research opportunities for historical and comparative linguistics and scholars in Tibetan Studies, which we illustrate with two short case studies.



Optimisation of the Largest Annotated Tibetan Corpus Combining Rule-based, Memory-based, and Deep-learning Methods

Meelen, Roux, Hill

2021

ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Self Cite


This article presents a pipeline that converts collections of Tibetan documents in plain text or XML into a fully segmented and POS-tagged corpus. We apply the pipeline to the large extant collection of the Buddhist Digital Resource Center. The semi-supervised methods presented here not only result in a new and improved version of the largest annotated Tibetan corpus to date; the integration of rule-based, memory-based, and neural-network methods also serves as a good example of how to overcome challenges of under-researched languages. The end-to-end accuracy of our entire automatic pipeline of 91.99% is high enough to make the resulting corpus a useful resource for both linguists and scholars of Tibetan studies.


Neural Dependency Parser for Tibetan Sentences

Long

2021

ACM Trans. Asian Low-Resour. Lang. Inf. Process.


The research of Tibetan dependency analysis is mainly limited to two challenges: lack of a dataset and reliance on expert knowledge. To resolve the preceding challenges, we first introduce a new Tibetan dependency analysis dataset, and then propose a neural-based framework that resolves the reliance on the expert knowledge issue by automatically extracting feature vectors of words and predicts their head words and type of dependency arcs. Specifically, we convert the words in the sentence into distributional vectors and employ a sequence to vector network to extract feature words. Furthermore, we introduce a head classifier and type classifier to predict the head word and type of dependency arc, respectively. Experiments demonstrate that our model achieves promising performance on the Tibetan dependency analysis task.
