Most languages have no established writing system and minimal written records. However, textual data is essential for natural language processing, and particularly important for training language models to support speech recognition. Even in cases where text data is missing, there are some languages for which bilingual lexicons are available, since creating lexicons is a fundamental task of documentary linguistics. We investigate the use of such lexicons to improve language models when textual training data is limited to as few as a thousand sentences. The method involves learning cross-lingual word embeddings as a preliminary step in training monolingual language models. Results across a number of languages show that language models are improved by this pre-training. Application to Yongning Na, a threatened language, highlights challenges in deploying the approach in real low-resource environments.
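To make the pre-training step concrete, the sketch below shows one plausible way to warm-start a monolingual language model with cross-lingual word embeddings: the embedding table of a small LSTM language model is initialised from externally learned vectors before training on the limited monolingual corpus. The module names, dimensions, and the use of random stand-in vectors are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: initialising a monolingual LSTM language model with
# pre-trained cross-lingual word embeddings (assumed setup, not the paper's code).
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, pretrained=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        if pretrained is not None:
            # Warm-start the embedding table with cross-lingual vectors
            # (in practice these would be learned with the help of a
            # bilingual lexicon), then fine-tune on the small corpus.
            self.embedding.weight.data.copy_(pretrained)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.lstm(self.embedding(tokens))
        return self.out(hidden)  # next-token logits

# Stand-in for embeddings produced by a cross-lingual training step.
vocab_size, embed_dim = 5000, 128
crosslingual_vectors = torch.randn(vocab_size, embed_dim)

model = LSTMLanguageModel(vocab_size, embed_dim, hidden_dim=256,
                          pretrained=crosslingual_vectors)
logits = model(torch.randint(0, vocab_size, (2, 10)))  # (batch, seq, vocab)
```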
Termination of RNA polymerase II (Pol II) transcription is an important step in the transcription cycle, involving dislodgement of the polymerase from DNA and release of a functional transcript. Recent studies have identified the key players required for this process and showed that a common feature of these proteins is a conserved domain that interacts with the phosphorylated C-terminus of Pol II (CTD-interacting domain, CID). However, the mechanism by which transcription termination is achieved is not understood. Using genome-wide methods, here we show that the fission yeast CID protein Seb1 is essential for termination of protein-coding and non-coding genes through interaction with S2-phosphorylated Pol II and nascent RNA. Furthermore, we present the crystal structures of the Seb1 CTD- and RNA-binding modules. Unexpectedly, the latter reveals an intertwined two-domain arrangement of a canonical RRM and a second domain. These results provide important insights into the mechanism underlying eukaryotic transcription termination.
We report on adaptation of multilingual end-to-end speech recognition models trained on as many as 100 languages. Our findings shed light on the relative importance of similarity between the target and pretraining languages along the dimensions of phonetics, phonology, language family, geographical location, and orthography. In this context, experiments demonstrate the effectiveness of two additional pretraining objectives in encouraging language-independent encoder representations: a context-independent phoneme objective paired with a language-adversarial classification objective.
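As a rough illustration of how these two auxiliary objectives could be combined during pretraining, the sketch below pairs a per-frame phoneme objective (realised here as CTC) with a language classifier trained through a gradient-reversal layer, a common way to implement an adversarial objective. The architecture, layer sizes, and equal loss weighting are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the two auxiliary pretraining objectives: a
# context-independent phoneme objective (here CTC) and a language-adversarial
# classifier trained via gradient reversal. All names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()
    @staticmethod
    def backward(ctx, grad):
        # Reverse gradients so the encoder is pushed towards
        # language-independent representations.
        return -ctx.lam * grad, None

class MultiTaskEncoder(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, n_phones=100, n_langs=100):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.phone_head = nn.Linear(hidden, n_phones + 1)  # +1 for CTC blank
        self.lang_head = nn.Linear(hidden, n_langs)

    def forward(self, feats, lam=1.0):
        enc, _ = self.encoder(feats)                        # (B, T, H)
        phone_logits = self.phone_head(enc)                 # per-frame phoneme scores
        lang_logits = self.lang_head(GradReverse.apply(enc.mean(dim=1), lam))
        return phone_logits, lang_logits

B, T = 2, 50
feats = torch.randn(B, T, 80)                               # dummy acoustic features
phone_logits, lang_logits = MultiTaskEncoder()(feats)
log_probs = phone_logits.log_softmax(-1).transpose(0, 1)    # (T, B, C) for CTC
phone_loss = F.ctc_loss(log_probs,
                        torch.randint(1, 100, (B, 8)),      # dummy phoneme targets
                        torch.full((B,), T, dtype=torch.long),
                        torch.full((B,), 8, dtype=torch.long))
lang_loss = F.cross_entropy(lang_logits, torch.randint(0, 100, (B,)))
loss = phone_loss + lang_loss  # the ASR objective itself would be added here
```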
Highlights: a cryo-EM structure of M. tuberculosis MmpL3 determined at 3.0 Å resolution; an LMNG molecule within the periplasmic cavity suggests the TMM export pathway; comprehensive structural mapping of resistance-conferring MmpL3 variants; genome-mined MmpL3 mutations indicate minimal preexisting resistance.
Proliferating smartphones and mobile software offer linguists a scalable, networked recording device. This paper describes Aikuma, a mobile app designed to put the key language documentation tasks of recording, respeaking, and translating in the hands of a speech community. After motivating the approach, we describe the system and briefly report on its use in field tests.