This paper introduces features inspired by information density and machine translation quality estimation to automatically detect and classify human-translated texts. We investigate two settings: discriminating between translations and comparable originally authored texts, and distinguishing two levels of translation professionalism. Our framework is based on delexicalised sentence-level dense feature vector representations combined with a supervised machine learning approach. The results show state-of-the-art performance for mixed-domain translationese detection with features based on information density and quality estimation, while results on translation expertise classification are mixed.
The present study deals with variation in discourse relations in different registers of English and German. Our previous analyses, cf. Kunz & Steiner (2013 a/b) and Kunz & Lapshinova (to appear), have been concerned with the systemic contrasts between English and German and have addressed some cross-linguistic differences with regard to textual realizations of selected subtypes of cohesion. In our current work, our focus is on the empirical analysis of cross-linguistic variation between registers. In order to obtain a more comprehensive picture, we investigate three main types of cohesion in combination: co-reference, substitution and conjunction, together with their subtypes, cf. Halliday & Hasan (1976). We extract instantiations of cohesive devices from an English-German corpus of spoken and written registers. The data is analyzed with statistical procedures which show that subcorpora can be grouped along particular combinations of cohesive devices.
We analyze the linguistic evolution of selected scientific disciplines over a 30-year time span (1970s to 2000s). Our focus is on four highly specialized disciplines at the boundaries of computer science that emerged during that time: computational linguistics, bioinformatics, digital construction, and microelectronics. Our analysis is driven by the question of whether these disciplines develop a distinctive language use, both individually and collectively, over the given time period. The data set is the English Scientific Text Corpus (SCITEX), which includes texts from the 1970s/1980s and early 2000s. Our theoretical basis is register theory. In terms of methods, we combine corpus-based methods of feature extraction (various aggregated features [part-of-speech based], n-grams, lexico-grammatical patterns) and automatic text classification. The results of our research are directly relevant to the study of linguistic variation and languages for specific purposes (LSP) and have implications for various natural language processing (NLP) tasks, for example, authorship attribution, text mining, or training NLP tools.
We evaluate the output of 16 English-to-German MT systems with respect to the translation of pronouns in the context of the WMT 2018 competition. We work with a test suite specifically designed to assess system quality in various fine-grained categories known to be problematic. The main evaluation scores come from a semi-automatic process, combining automatic reference matching with extensive manual annotation of uncertain cases. We find that current NMT systems are good at translating pronouns with intra-sentential reference, but the inter-sentential cases remain difficult. NMT systems are also good at the translation of event pronouns, unlike systems from the phrase-based SMT paradigm. No single system performs best at translating all types of anaphoric pronouns, suggesting unexplained random effects influencing the translation of pronouns with NMT.
This paper focuses on the interaction of chains of coreference identity with other types of relations, comparing English and German data sets in terms of language, mode (written vs. spoken) and register. We first describe the types of coreference and the chain features analysed as indicators of textual coherence and topic continuity. After sketching the feature categories under analysis and the methods used for statistical evaluation, we present the findings from our analysis and interpret them in terms of the contrasts mentioned above. We will also show that for some registers, coreference types other than identity are of great importance.