This paper presents a new challenging information extraction task in the domain of materials science. We develop an annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications, such as involved materials and measurement conditions. With this paper, we publish our annotation guidelines, as well as our SOFC-Exp corpus consisting of 45 openaccess scholarly articles annotated by domain experts. A corpus and an inter-annotator agreement study demonstrate the complexity of the suggested named entity recognition and slot filling tasks as well as high annotation quality. We also present strong neural-network based models for a variety of tasks that can be addressed on the basis of our new data set. On all tasks, using BERT embeddings leads to large performance gains, but with increasing task complexity, adding a recurrent neural network on top seems beneficial. Our models will serve as competitive baselines in future work, and analysis of their performance highlights difficult cases when modeling the data and suggests promising research directions.
Motivation: Predicting gene expression from DNA is an open field of research. As in many areas, labeled data is dwarfed by unlabelled data, i.e. species with a sequenced genome but no gene expression assay data. Pretraining on unlabelled data using masked language modeling has proven highly successful in overcoming data constraints in natural language and proteomics. However, in genomics, this approach has so far been applied only to single genomes, neither leveraging conservation of regulatory sequences across species nor the vast amount of available genomes. Results: Here we train a masked language model on more than 800 species spanning over 500 million years of evolution. We show that explicitly modeling species is instrumental in capturing conserved yet evolving regulatory elements and in controlling for oligomer biases. We extract embeddings for 3' untranslated regions of Saccharomyces cerevisiae and Schizosaccharomyces pombe and use them to achieve prediction of mRNA half-life that is better or on-par with the state-of-the-art, demonstrating the utility of the approach for regulatory genomics. Moreover, we show that the per-base reconstruction probability of our model significantly predicts RNA-binding protein bound sites directly. Altogether, our work establishes a self-supervised framework to leverage large genome collections of evolutionary distant species for regulatory genomics and contributes to alignment-free comparative genomics. Availability and implementation: The source code and trained models are available at: https://github.com/DennisGankin/species-aware-DNA-LM .
Unlike for DNA and RNA, accurate and high-throughput sequencing methods for proteins are lacking, hindering the utility of proteomics in applications where the sequences are unknown including variant calling, neoepitope identification, and metaproteomics. We introduce Spectralis, a new de novo peptide sequencing method for tandem mass spectrometry. Spectralis leverages several innovations including a new convolutional neural network layer connecting peaks in spectra spaced by amino acid masses, proposing fragment ion series classification as a pivotal task for de novo peptide sequencing, and a new peptide-spectrum confidence score. On spectra for which database search provided a ground truth, Spectralis surpassed 40% sensitivity at 90% precision, nearly doubling state-of-the-art sensitivity. Application to unidentified spectra confirmed its superiority and showcased its applicability to variant calling. Altogether, these algorithmic innovations and the substantial sensitivity increase in the high-precision range constitute an important step toward broadly applicable peptide sequencing.
This paper presents a new challenging information extraction task in the domain of materials science. We develop an annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications, such as involved materials and measurement conditions. With this paper, we publish our annotation guidelines, as well as our SOFC-Exp corpus consisting of 45 openaccess scholarly articles annotated by domain experts. A corpus and an inter-annotator agreement study demonstrate the complexity of the suggested named entity recognition and slot filling tasks as well as high annotation quality. We also present strong neural-network based models for a variety of tasks that can be addressed on the basis of our new data set. On all tasks, using BERT embeddings leads to large performance gains, but with increasing task complexity, adding a recurrent neural network on top seems beneficial. Our models will serve as competitive baselines in future work, and analysis of their performance highlights difficult cases when modeling the data and suggests promising research directions.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.