Katja Zupan scite author profile

Katja Zupan

3 Publications

4 Citation Statements Received

30 Citation Statements Given


Affiliations

Jožef Stefan International Postgraduate School, Jožef Stefan Institute

Publications


How to tag non-standard language: Normalisation versus domain adaptation for Slovene historical and user-generated texts

2019


Part-of-speech (PoS) tagging of non-standard language with models developed for standard language is known to suffer from a significant decrease in accuracy. Two methods are typically used to improve it: word normalisation, which decreases the out-of-vocabulary rate of the PoS tagger, and domain adaptation where the tagger is made aware of the non-standard language variation, either through supervision via non-standard data being added to the tagger’s training set, or via distributional information calculated from raw texts. This paper investigates the two approaches, normalisation and domain adaptation, on carefully constructed data sets encompassing historical and user-generated Slovene texts, in particular focusing on the amount of labour necessary to produce the manually annotated data sets for each approach and comparing the resulting PoS accuracy. We give quantitative as well as qualitative analyses of the tagger performance in various settings, showing that on our data set closed and open class words exhibit significantly different behaviours, and that even small inconsistencies in the PoS tags in the data have an impact on the accuracy. We also show that to improve tagging accuracy, it is best to concentrate on obtaining manually annotated normalisation training data for short annotation campaigns, while manually producing in-domain training sets for PoS tagging is better when a more substantial annotation campaign can be undertaken. Finally, unsupervised adaptation via Brown clustering is similarly useful regardless of the size of the training data available, but improvements tend to be bigger when adaptation is performed via in-domain tagging data.


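Report on the European Language Resources Coordination (ELRC) project workshop in Ljubljana (8 December 2015) [Poročilo z delavnice projekta European Language Resources Coordination (ELRC) v Ljubljani (8. 12. 2015)]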

Zupan

2016

SLO2.0


The workshop of the European Language Resources Coordination (ELRC) project took place on 8 December 2015 at the Jožef Stefan Institute (IJS) in Ljubljana. It was organised by the Centre for Knowledge Transfer in Information Technologies and the Artificial Intelligence Laboratory of IJS, together with the European Commission Representation in Slovenia. The national coordinator of the event was Simon Krek of IJS, the ELRC representative in Slovenia, while the ELRC consortium was represented by Stelios Piperidis. The workshop was attended by 38 participants, mostly representatives of ministries and other public services, as well as computer specialists and freelance translators. A video recording of the workshop and the individual presentations can be viewed on the Videolectures portal.

