Universal Dependencies (UD) is a framework for morphosyntactic annotation of human language, which to date has been used to create treebanks for more than 100 languages. In this article, we outline the linguistic theory of the UD framework, which draws on a long tradition of typologically oriented grammatical theories. Grammatical relations between words are central to explaining how predicate–argument structures are encoded morphosyntactically in different languages, while morphological features and part-of-speech classes describe the properties of words. We argue that this theory is a good basis for cross-linguistically consistent annotation of typologically diverse languages in a way that supports computational natural language understanding as well as broader linguistic studies.
Submitted to the journal Langages. Author's version of 30 March 2017. Keywords: textual genres, corpus linguistics, medieval French, direct speech, oralized genres. Title: Relationships between represented oral speech in medieval French texts and textual genres. Abstract: Relationships between speech and writing in medieval French are analysed through a corpus of 137 texts (4 million tokens). Text chunks representing speech (quotes, speech turns, etc.) are contrasted with the remaining parts of each text, and every chunk is contextualized by the genre of its text (from a 32-genre typology). A correspondence analysis is performed on part-of-speech tags (34 tags). It reveals an orality axis as the first dimension of variation; every genre, divided into reported speech and the rest, automatically gets a coordinate on this axis. Among the results, we observe that if a text belongs to the literary domain or is intended for oral performance (such as a song, a play or a recital), its orality features are emphasized; that the ways of expressing non-orality are more diverse and heterogeneous than those of orality; that the statistics clearly set apart from orality a genre such as the didactic dialogue, in which speech turns are used as a conventional and artificial layout; and that psalms, which might be expected to be very close to poems and therefore to orality, lie on the opposite side of the orality axis and display the main features of non-orality.
Abstract: Two approaches to building corpora of medieval texts have been taking shape for about ten years now: (1) digitization of modern critical editions; (2) creation of precise diplomatic transcriptions of manuscripts, possibly accompanied by images of the originals. These approaches are in fact complementary rather than opposed, since they allow researchers to choose between the quantity (representativeness) and the quality (reliability and richness) of the data, depending on the research being carried out. For both types of corpus, we analyse what is at stake in using a standardized representation of the text and of its descriptive 'anchoring' (the XML standard and the TEI representation conventions). The methodological problems that arise when creating and exploiting corpora of ancient texts, and their solutions, are also valid for other types of linguistic corpora.
This paper presents an experience of specifying and implementing an XML format for text-to-image alignment at the word and character level within the TEI framework. The format in question is a supplementary markup layer applied to heterogeneous transcriptions of medieval Latin and French manuscripts encoded using different "flavors" of the TEI (normalized for critical editions, diplomatic or palaeographic transcriptions). One of the problems that had to be solved was identifying "non-alignable" spans in the various kinds of transcriptions. Originally designed in the framework of a research project on the ontology of letter-forms in medieval Latin and vernacular (mostly French) manuscripts and inscriptions, this format can be of use for all kinds of projects that involve fine-grained alignment of transcriptions with zones on digital images.