If you already have a Python installation with a different version (e.g., 2.7) that you must keep, consider installing Python 3.8 through Anaconda ("Anaconda Software Distribution," 2020): https://docs.anaconda.com/anaconda/install. Download required files. Through your browser, navigate to http://data.bioembeddings.com/disprot and download the files: sequences.fasta, config.yml, and disprot_annotations.csv. Note that you might need to right-click and select "Save Link As" to download the files.
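As an alternative to saving the files through the browser, the download step could be scripted. The following is a minimal sketch using only the Python standard library. It assumes that each file is served directly under the directory URL given above (that is, at the base URL plus the file name) and that the annotation file is named disprot_annotations.csv; neither assumption is confirmed by the text. The names `file_urls` and `download_all` are hypothetical helpers introduced here for illustration.

```python
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import urlretrieve

# Assumption: the three files sit directly under this directory URL.
BASE_URL = "http://data.bioembeddings.com/disprot/"
# Assumption: the annotation file is spelled without a hyphen.
FILES = ["sequences.fasta", "config.yml", "disprot_annotations.csv"]


def file_urls(base_url=BASE_URL, names=FILES):
    """Map each file name to the URL it is assumed to be served from."""
    return {name: urljoin(base_url, name) for name in names}


def download_all(target_dir="disprot"):
    """Download every file into target_dir, creating the directory if needed."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for name, url in file_urls().items():
        urlretrieve(url, str(target / name))


# To fetch the files, call:
# download_all()
```

If a direct URL does not resolve, fall back to the browser route described above.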
In this work, we examine the extent to which embeddings may encode marginalized populations differently, and how this may lead to a perpetuation of biases and worsened performance on clinical tasks. We pretrain deep embedding models (BERT) on medical notes from the MIMIC-III hospital dataset, and quantify potential disparities using two approaches. First, we identify dangerous latent relationships that are captured by the contextual word embeddings using a fill-in-the-blank method with text from real clinical notes and a log probability bias score quantification. Second, we evaluate performance gaps across different definitions of fairness on over 50 downstream clinical prediction tasks that include detection of acute and chronic conditions. We find that classifiers trained from BERT representations exhibit statistically significant differences in performance, often favoring the majority group with regards to gender, language, ethnicity, and insurance status. Finally, we explore shortcomings of using adversarial debiasing to obfuscate subgroup information in contextual word embeddings, and recommend best practices for such deep embedding models in clinical settings.
A major challenge to the characterization of intrinsically disordered regions (IDRs), which are widespread in the proteome, but relatively poorly understood, is the identification of molecular features that mediate functions of these regions, such as short motifs, amino acid repeats and physicochemical properties. Here, we introduce a proteome-scale feature discovery approach for IDRs. Our approach, which we call “reverse homology”, exploits the principle that important functional features are conserved over evolution. We use this as a contrastive learning signal for deep learning: given a set of homologous IDRs, the neural network has to correctly choose a held-out homolog from another set of IDRs sampled randomly from the proteome. We pair reverse homology with a simple architecture and standard interpretation techniques, and show that the network learns conserved features of IDRs that can be interpreted as motifs, repeats, or bulk features like charge or amino acid propensities. We also show that our model can be used to produce visualizations of what residues and regions are most important to IDR function, generating hypotheses for uncharacterized IDRs. Our results suggest that feature discovery using unsupervised neural networks is a promising avenue to gain systematic insight into poorly understood protein sequences.
A major challenge to the characterization of intrinsically disordered regions (IDRs), which are widespread in the proteome, but relatively poorly understood, is the identification of molecular features, such as short motifs, amino acid repeats and physicochemical properties that mediate the functions of these regions. Here, we introduce a proteome-scale feature discovery method for IDRs. Our method, which we call "reverse homology", exploits the principle that important functional features are conserved over evolution as a contrastive learning signal for deep learning: given a set of homologous IDRs, the neural network has to correctly choose a randomly held-out homologue from another set of IDRs sampled randomly from the proteome. We pair reverse homology with a simple architecture and interpretation techniques, and show that the network learns conserved features of IDRs that can be interpreted as motifs, repeats, and other features. We also show that our model can be used to produce specific predictions of what residues and regions are most important to the function, providing a computational strategy for designing mutagenesis experiments in uncharacterized IDRs. Our results suggest that feature discovery using neural networks is a promising avenue to gain systematic insight into poorly understood protein sequences.
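The contrastive task described in the two abstracts above (pick the held-out homolog out of a pool that also contains IDRs sampled at random from the proteome) can be illustrated with a small loss function. This is a sketch under stated assumptions, not the authors' implementation: the abstracts do not say how the homolog set is summarized or how candidates are scored, so averaging the query embeddings and scoring by dot product are choices made here for illustration. The function name `reverse_homology_loss` is likewise hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def reverse_homology_loss(query_embs, target_emb, decoy_embs):
    """Cross-entropy loss for choosing the held-out homolog among decoys.

    query_embs: embeddings of the homologous IDRs shown to the network.
    target_emb: embedding of the held-out homolog (the correct choice).
    decoy_embs: embeddings of randomly sampled, non-homologous IDRs.
    Assumption: the query set is pooled by averaging and each candidate
    is scored by its dot product with the pooled query.
    """
    query = mean_vector(query_embs)
    scores = [dot(query, target_emb)] + [dot(query, d) for d in decoy_embs]
    # Numerically stable log-sum-exp over all candidate scores.
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    # Negative log-probability assigned to the true homolog (index 0).
    return log_z - scores[0]
```

The loss is small when the held-out homolog scores higher than the random IDRs and grows as decoys outscore it, which is the training signal the abstracts describe in words.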
Objective: To characterise the early diffusion of indirect comparison meta-analytic methods to study drugs. Design: Systematic literature synthesis. Data sources: Cochrane Database of Systematic Reviews, EMBASE, MEDLINE, Scopus and Web of Science. Study selection: English language papers that used indirect comparison meta-analytic methods to study the efficacy or safety of three or more interventions, where at least one was a drug. Data extraction: The number of publications and authors was plotted by year and type: methodological contribution, review or empirical application. Author and methodological details were summarised for empirical applications, and animated coauthorship networks were created to visualise contributors by country and affiliation type (academia, industry, government or other) over time. Results: We identified 477 papers (74 methodological contributions, 42 reviews and 361 empirical applications) by 1689 distinct authors from 1997 to 2013. Prior to 2002, only three applications were published, with contributions from the USA (n=2) and Canada (n=1). The number of applications gradually increased annually with rapid uptake between 2011 and 2013 (n=254, 71%). Early diffusion occurred primarily in Europe with the first application credited to the UK in 2003. Application spread to other European countries in 2005, and may have been supported by regulatory requirements for drug approval. By the end of 2013, contributions included 49% credited to Europe (22% UK, 27% other), 37% credited to North America (11% Canada, 26% USA) and 14% from other regions. Conclusion: Indirect comparison meta-analytic methods are an important innovation for health research. Although Canada and the USA were the first to apply these methods, Europe led their diffusion. The increase in uptake of these methods may have been facilitated by acceptance by regulatory agencies, which are calling for more comparative drug effect data to assist in drug accessibility and reimbursement decisions.
Advances in gene delivery technologies are enabling rapid progress in molecular medicine, but require precise expression of genetic cargo in desired cell types, which is predominantly achieved via a regulatory DNA sequence called a promoter; however, only a handful of cell type-specific promoters are known. Efficiently designing compact promoter sequences with a high density of regulatory information by leveraging machine learning models would therefore be broadly impactful for fundamental research and direct therapeutic applications. However, models of expression from such compact promoter sequences are lacking, despite the recent success of deep learning in modelling expression from endogenous regulatory sequences. Despite the lack of large datasets measuring promoter-driven expression in many cell types, data from a few well-studied cell types or from endogenous gene expression may provide relevant information for transfer learning, which has not yet been explored in this setting. Here, we evaluate a variety of pretraining tasks and transfer strategies for modelling cell type-specific expression from compact promoters and demonstrate the effectiveness of pretraining on existing promoter-driven expression datasets from other cell types. Our approach is broadly applicable for modelling promoter-driven expression in any data-limited cell type of interest, and will enable the use of model-based optimization techniques for promoter design for gene delivery applications. Our code and data are available at https://github.com/anikethjr/promoter_models.