Proceedings of the 10th SIGHUM Workshop on Language Technology For Cultural Heritage, Social Sciences, and Humanities 2016
DOI: 10.18653/v1/w16-2105

Code-Switching Ubique Est - Language Identification and Part-of-Speech Tagging for Historical Mixed Text

Abstract: In this paper, we describe the development of a language identification system and a part-of-speech tagger for Latin-Middle English mixed text. To this end, we annotate data with language IDs and Universal POS tags (Petrov et al., 2012). As a classifier, we train a conditional random field classifier for both sub-tasks, including features generated by the TreeTagger models of both languages. The focus lies on both a general and a task-specific evaluation. Moreover, we describe our effort concerning beyond proo…

Cited by 11 publications
(8 citation statements)
References 10 publications
“…Besides this descriptive task, we are interested in practical tasks for predicting code-switching. There has been previous work formalizing code-switching detection in historical texts as a language ID task (Schulz and Keller, 2016; Sprugnoli et al., 2017), and models such as Conditional Random Fields (CRF) have been deployed to classify words as in one language or another. However, such approaches fail to work in the following scenario: when large collections of page images are transcribed with optical character recognition (OCR) or when large audio collections are transcribed by speech recognition, we do not always know a priori which languages will be included.…”
Section: Greek (mentioning)
confidence: 99%
“…Research into C-S in spontaneously-produced and elicited spoken speech has offered insights into the social, cognitive, and structural dimensions of this multilingual phenomenon (Bullock and Toribio, 2009). The analysis of C-S in written discourse has garnered substantially less attention and, with some exceptions reviewed below (Montes-Alcalá, 2001; Callahan, 2004, 2002), it has centered largely on C-S in historical texts as a genre (Latin macaronic poetry, medieval Castilian Spanish-Hebrew taqqanots 'ordinances', personal letters) (Demo, 2018; Schulz and Keller, 2016; Miller, 2001; Gardner-Chloros and Weston, 2015; Swain et al., 2002; Nurmi and Pahta, 2004).…”
Section: Related Work (mentioning)
confidence: 99%
“…Research into C-S in spontaneously-produced and elicited spoken speech has offered insights into the social, cognitive, and structural dimensions of this multilingual phenomenon (Bullock and Toribio, 2009). The analysis of C-S in written discourse has garnered substantially less attention and, with some exceptions reviewed below (Montes-Alcalá, 2001; Callahan, 2004, 2002), it has centered largely on C-S in historical texts as a genre (Latin macaronic poetry, medieval Castilian Spanish-Hebrew taqqanots 'ordinances', personal letters) (Demo, 2018; Schulz and Keller, 2016; Miller, 2001; Gardner-Chloros and Weston, 2015; Swain et al., 2002; Nurmi and Pahta, 2004).…”
Section: A Appendix (mentioning)
confidence: 99%