Shang Gao scite author profile

Background: We examine the problem of clustering biomolecular simulations using deep learning techniques. Since biomolecular simulation datasets are inherently high dimensional, it is often necessary to build low-dimensional representations that can be used to extract quantitative insights into the atomistic mechanisms that underlie complex biological processes.

Results: We use a convolutional variational autoencoder (CVAE) to learn low-dimensional, biophysically relevant latent features from long time-scale protein folding simulations in an unsupervised manner. We demonstrate our approach on three model protein folding systems, namely Fs-peptide (14 μs aggregate sampling), villin headpiece (single trajectory of 125 μs) and β-β-α (BBA) protein (223 + 102 μs sampling across two independent trajectories). In these systems, we show that the CVAE latent features learned correspond to distinct conformational substates along the protein folding pathways. The CVAE model predicts, on average, nearly 89% of all contacts within the folding trajectories correctly, while being able to extract folded, unfolded and potentially misfolded states in an unsupervised manner. Further, the CVAE model can be used to learn latent features of protein folding that can be applied to other independent trajectories, making it particularly attractive for identifying intrinsic features that correspond to conformational substates that share similar structural features.

Conclusions: Together, we show that the CVAE model can quantitatively describe complex biophysical processes such as protein folding.


Hierarchical attention networks for information extraction from cancer pathology reports

Gao

Young

Qiu

et al. 2017

111



Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks

Alawad

Gao

Qiu

et al. 2019


Objective: We implement 2 different multitask learning (MTL) techniques, hard parameter sharing and cross-stitch, to train a word-level convolutional neural network (CNN) specifically designed for automatic extraction of cancer data from unstructured text in pathology reports. We show the importance of learning related information extraction (IE) tasks leveraging shared representations across the tasks to achieve state-of-the-art performance in classification accuracy and computational efficiency.

Materials and Methods: Multitask CNN (MTCNN) attempts to tackle document information extraction by learning to extract multiple key cancer characteristics simultaneously. We trained our MTCNN to perform 5 information extraction tasks: (1) primary cancer site (65 classes), (2) laterality (4 classes), (3) behavior (3 classes), (4) histological type (63 classes), and (5) histological grade (5 classes). We evaluated the performance on a corpus of 95 231 pathology documents (71 223 unique tumors) obtained from the Louisiana Tumor Registry. We compared the performance of the MTCNN models against single-task CNN models and 2 traditional machine learning approaches, namely support vector machine (SVM) and random forest classifier (RFC).

Results: MTCNNs offered superior performance across all 5 tasks in terms of classification accuracy as compared with the other machine learning models. Based on retrospective evaluation, the hard parameter sharing and cross-stitch MTCNN models correctly classified 59.04% and 57.93% of the pathology reports, respectively, across all 5 tasks. The baseline models achieved 53.68% (CNN), 46.37% (RFC), and 36.75% (SVM). Based on prospective evaluation, the percentages of correctly classified cases across the 5 tasks were 60.11% (hard parameter sharing), 58.13% (cross-stitch), 51.30% (single-task CNN), 42.07% (RFC), and 35.16% (SVM). Moreover, hard parameter sharing MTCNNs outperformed the other models in computational efficiency by using about the same number of trainable parameters as a single-task CNN.

Conclusions: The hard parameter sharing MTCNN offers superior classification accuracy for automated coding support of pathology documents across a wide range of cancers and multiple information extraction tasks while maintaining similar training and inference time as those of a single task–specific model.
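The abstract reports the share of reports classified correctly "across all 5 tasks". One plausible reading, assumed here rather than confirmed by the abstract, is that a report counts as correct only when every one of its five task labels matches the reference. The sketch below illustrates that reading alongside ordinary per-task accuracy. The task names, the helper functions, and the four toy reports are all hypothetical and are not taken from the paper or the Louisiana Tumor Registry data.

```python
from typing import Dict, Sequence

# Hypothetical short names for the 5 tasks listed in the abstract.
TASKS = ("site", "laterality", "behavior", "histology", "grade")
Labels = Dict[str, str]


def make(*values: str) -> Labels:
    """Build one report's label dict from values given in TASKS order."""
    return dict(zip(TASKS, values))


def per_task_accuracy(gold: Sequence[Labels], pred: Sequence[Labels]) -> Dict[str, float]:
    """Fraction of reports with the correct label, computed separately per task."""
    n = len(gold)
    return {t: sum(g[t] == p[t] for g, p in zip(gold, pred)) / n for t in TASKS}


def all_tasks_accuracy(gold: Sequence[Labels], pred: Sequence[Labels]) -> float:
    """Fraction of reports whose labels are correct on every task at once
    (assumed interpretation of 'correctly classified across all 5 tasks')."""
    hits = sum(all(g[t] == p[t] for t in TASKS) for g, p in zip(gold, pred))
    return hits / len(gold)


# Toy, made-up labels for four reports (not real registry data).
gold = [
    make("C50", "left", "malignant", "8500", "2"),
    make("C34", "right", "malignant", "8140", "3"),
    make("C18", "none", "malignant", "8140", "2"),
    make("C61", "none", "malignant", "8140", "1"),
]
pred = [
    make("C50", "left", "malignant", "8500", "2"),   # all five correct
    make("C34", "right", "malignant", "8140", "2"),  # grade wrong
    make("C18", "none", "malignant", "8140", "2"),   # all five correct
    make("C50", "left", "malignant", "8140", "1"),   # site and laterality wrong
]

print(per_task_accuracy(gold, pred))
print(all_tasks_accuracy(gold, pred))
```

Under this reading the all-task figure can never exceed the lowest per-task accuracy, which would be consistent with the reported all-task percentages sitting well below what a single-task accuracy might suggest.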


Limitations of Transformers on Clinical Text Classification

Gao

Alawad

Young

et al. 2021

IEEE J. Biomed. Health Inform.


Classifying cancer pathology reports with hierarchical self-attention networks

Gao

Qiu

Alawad

et al. 2019

Artificial Intelligence in Medicine


Calculation of thermodynamic properties of hydrated borates by group contribution method

Gao

2000

Physics and Chemistry of Minerals


Thermochemistry of hydrated magnesium borates

Li

Gao

Shuping

et al. 1997

The Journal of Chemical Thermodynamics



Shang Gao

FT-IR and Raman spectroscopic study of hydrated borates

Deep clustering of protein folding simulations

