Learning efficient representations of concepts has proven to be an important basis for many applications such as machine translation and document classification. Proper representations of medical concepts such as diagnoses, medications, procedure codes, and visits would have broad applications in healthcare analytics. However, in Electronic Health Records (EHR), the visit sequences of patients include multiple concepts (diagnosis, procedure, and medication codes) per visit. This structure provides two types of relational information, namely the sequential order of visits and the co-occurrence of codes within each visit. In this work, we propose Med2Vec, which not only learns distributed representations for both medical codes and visits from a large EHR dataset with over 3 million visits, but also yields interpretable learned representations whose clinical meaning was positively confirmed by clinical experts. In the experiments, Med2Vec shows significant improvement in key medical applications compared to popular baselines such as Skip-gram, GloVe, and stacked autoencoders, while providing clinically meaningful interpretations.
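The abstract above gives no implementation details; the following is a minimal, hypothetical PyTorch sketch of the two-level idea it describes (code-level representations aggregated into visit-level representations that are trained to predict the codes of neighbouring visits). The class name, layer sizes, and the exact loss are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical sketch of a Med2Vec-style two-level (code -> visit) objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Med2VecSketch(nn.Module):
    def __init__(self, n_codes, code_dim=128, visit_dim=128):
        super().__init__()
        self.code_emb = nn.Linear(n_codes, code_dim)     # code-level representation
        self.visit_mlp = nn.Linear(code_dim, visit_dim)  # visit-level representation
        self.out = nn.Linear(visit_dim, n_codes)         # predicts codes of nearby visits

    def forward(self, x):
        # x: (n_visits, n_codes) multi-hot matrix, one row per visit
        code_level = F.relu(self.code_emb(x))
        visit_level = F.relu(self.visit_mlp(code_level))
        return self.out(visit_level)

def neighbour_loss(model, visits, window=1):
    """Cross-entropy between a visit's representation and the codes of nearby visits."""
    logits = model(visits)
    loss = 0.0
    for offset in range(1, window + 1):
        # predict the codes appearing in visits `offset` steps before and after
        loss = loss + F.binary_cross_entropy_with_logits(logits[:-offset], visits[offset:])
        loss = loss + F.binary_cross_entropy_with_logits(logits[offset:], visits[:-offset])
    return loss

# toy usage: 5 visits over a vocabulary of 10 codes
visits = (torch.rand(5, 10) > 0.7).float()
model = Med2VecSketch(n_codes=10)
print(neighbour_loss(model, visits, window=1).item())
```

The sketch keeps only the visit-level neighbour-prediction part; the abstract also mentions intra-visit code co-occurrence, which would be learned jointly in the full method.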
Learning temporal causal structures between time series is one of the key tools for analyzing time series data. In many real-world applications, we are confronted with irregular time series, whose observations are not sampled at equally spaced time stamps. The irregularity in sampling intervals violates the basic assumptions behind many models for structure learning. In this paper, we propose a nonparametric generalization of Granger graphical models called Generalized Lasso Granger (GLG) to uncover temporal dependencies from irregular time series. Via theoretical analysis and extensive experiments, we verify the effectiveness of our model. Furthermore, we apply GLG to a dataset of δ18O oxygen isotope records in Asia and achieve promising results in discovering moisture transportation patterns over an 800-year period.
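As a rough illustration of how a Lasso-Granger-style analysis can be adapted to irregularly sampled series, here is a hypothetical sketch that replaces exact lagged values with kernel-weighted estimates at the desired time stamps. The Gaussian kernel, bandwidth, and function names are assumptions for illustration and do not reproduce the paper's GLG formulation.

```python
# Hypothetical sketch: Lasso-Granger-style dependency estimation with
# kernel-weighted lags to accommodate irregular sampling.
import numpy as np
from sklearn.linear_model import Lasso

def kernel_lagged_feature(times, values, query_times, lag, bandwidth=0.5):
    """Kernel-weighted estimate of a series at (query_times - lag)."""
    target = query_times[:, None] - lag                  # desired (irregular) time stamps
    w = np.exp(-0.5 * ((target - times[None, :]) / bandwidth) ** 2)
    w /= w.sum(axis=1, keepdims=True) + 1e-12
    return w @ values

def lasso_granger(series, target_idx, max_lag=3, alpha=0.05):
    """series: list of (times, values); returns a (n_series, max_lag) coefficient matrix."""
    t_y, y = series[target_idx]
    X = np.column_stack([
        kernel_lagged_feature(t_x, x, t_y, lag)
        for (t_x, x) in series
        for lag in range(1, max_lag + 1)
    ])
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    return coef.reshape(len(series), max_lag)

# toy usage: x drives y with a delay; both are observed at irregular time stamps
rng = np.random.default_rng(0)
t_x = np.sort(rng.uniform(0, 50, 200)); x = np.sin(t_x)
t_y = np.sort(rng.uniform(0, 50, 200)); y = np.sin(t_y - 1.0) + 0.1 * rng.normal(size=200)
print(lasso_granger([(t_x, x), (t_y, y)], target_idx=1))
```

Nonzero coefficients in a row suggest a temporal dependency from that series onto the target, which is the usual reading in Lasso-Granger-style methods.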
Learning temporal causal structures among multiple time series is one of the major tasks in mining time series data. Granger causality is one of the most popular techniques for uncovering temporal dependencies among time series; however, it faces two main challenges: (i) the spurious effect of unobserved time series and (ii) the computational challenges in high-dimensional settings. In this paper, we utilize confounder path delays to find a subset of time series such that conditioning on them cancels out the spurious confounder effects. After studying the consistency of different Granger causality techniques, we propose Copula-Granger and show that while it is consistent in high dimensions, it can efficiently capture non-linearity in the data. Extensive experiments on a synthetic and a social networking dataset confirm our theoretical results.

Introduction

In the era of data deluge, we are confronted with large-scale time series data, i.e., sequences of observations of the variables of concern over a period of time. For example, terabytes of neural activity time series data are produced to record the collective response of neurons to different stimuli; petabytes of climate and meteorological data, such as temperature, solar radiation, and precipitation, are collected over the years; and exabytes of social media content are generated over time on the Internet. A major data mining task for time series data is to uncover the temporal causal relationships among the time series. For example, in climatology, we want to identify the factors that impact the climate patterns of certain regions. In social networks, we are interested in identifying the patterns of influence among users and how topics activate or suppress each other. Developing effective and scalable data mining algorithms to uncover temporal dependency structures between time series and reveal insights from data has become a key problem in machine learning and data mining. There are two major challenges in discovering temporal causal relationships in large-scale data: (i) not all influential confounders are observed in the datasets, and (ii) structure learning becomes computationally challenging in high-dimensional settings.
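Returning to the Copula-Granger idea sketched in the abstract above: one common way to combine a copula view with a linear Granger fit is to map each series through its empirical CDF and a Gaussian quantile function (a nonparanormal transform) before running a lagged Lasso regression. The sketch below is hypothetical and illustrative; the paper's estimator, consistency conditions, and confounder-conditioning step are not reproduced here.

```python
# Hypothetical sketch: copula-style (nonparanormal) preprocessing followed by
# a lagged Lasso fit, capturing monotone non-linearities in the marginals.
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.linear_model import Lasso

def gaussian_copula_transform(x):
    """Map a 1-D series to approximately standard-normal marginals via its ranks."""
    u = rankdata(x) / (len(x) + 1.0)          # empirical CDF values in (0, 1)
    return norm.ppf(u)

def copula_granger(series, target_idx, max_lag=3, alpha=0.05):
    """series: (n_series, T) array of regularly sampled series; returns (n_series, max_lag) coefs."""
    z = np.apply_along_axis(gaussian_copula_transform, 1, series)
    T = z.shape[1]
    y = z[target_idx, max_lag:]
    X = np.column_stack([
        z[i, max_lag - lag:T - lag]
        for i in range(z.shape[0])
        for lag in range(1, max_lag + 1)
    ])
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    return coef.reshape(z.shape[0], max_lag)

# toy usage: a non-linearly (exponentially) transformed driver still shows up after the transform
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = np.exp(0.8 * np.roll(x, 1)) + 0.1 * rng.normal(size=500)
print(copula_granger(np.vstack([x, y]), target_idx=1))
```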
Transductive transfer learning is a special type of transfer learning problem in which abundant labeled examples are available in the source domain and only unlabeled examples are available in the target domain. It readily finds applications in spam filtering, microblog mining, and so on. In this paper, we propose a general framework to solve the problem by mapping the input features in both the source domain and the target domain into a shared latent space while simultaneously minimizing the feature reconstruction loss and the prediction loss. We develop one specific instance of the framework, the latent large-margin transductive transfer learning (LATTL) algorithm, and analyze its theoretical bound on classification loss via Rademacher complexity. We also provide a unified view of several popular transfer learning algorithms under our framework. Experimental results on one synthetic dataset and three application datasets demonstrate the advantages of the proposed algorithm over other state-of-the-art methods.
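As a toy illustration of the framework described above (a shared latent space trained with a reconstruction loss on both domains plus a prediction loss on the labeled source domain), here is a minimal hypothetical sketch. The encoder/decoder shapes, the squared-error reconstruction term, and the softmax classifier are assumptions; LATTL's large-margin formulation and its Rademacher-complexity analysis are not reflected here.

```python
# Hypothetical sketch of the shared-latent-space idea: one encoder reconstructs
# inputs from both domains while a classifier is trained only on labeled source data.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedLatentModel(nn.Module):
    def __init__(self, n_features, latent_dim=32, n_classes=2):
        super().__init__()
        self.encoder = nn.Linear(n_features, latent_dim)
        self.decoder = nn.Linear(latent_dim, n_features)
        self.classifier = nn.Linear(latent_dim, n_classes)

    def forward(self, x):
        z = torch.tanh(self.encoder(x))
        return self.decoder(z), self.classifier(z)

def transfer_loss(model, x_src, y_src, x_tgt, recon_weight=1.0):
    recon_src, logits_src = model(x_src)
    recon_tgt, _ = model(x_tgt)                   # target examples are unlabeled
    recon = F.mse_loss(recon_src, x_src) + F.mse_loss(recon_tgt, x_tgt)
    pred = F.cross_entropy(logits_src, y_src)     # prediction loss on source only
    return recon_weight * recon + pred

# toy usage with a shifted target domain
x_src, y_src = torch.randn(64, 20), torch.randint(0, 2, (64,))
x_tgt = torch.randn(64, 20) + 0.5
model = SharedLatentModel(n_features=20)
print(transfer_loss(model, x_src, y_src, x_tgt).item())
```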
While protein sequence data is an emerging application domain for machine learning methods, small modifications to protein sequences can result in difficult-to-predict changes to the protein's function. Consequently, protein machine learning models typically do not use randomized data augmentation procedures analogous to those used in computer vision or natural language, e.g., cropping or synonym substitution. In this paper, we empirically explore a set of simple string manipulations, which we use to augment protein sequence data when fine-tuning semi-supervised protein models. We provide 276 different comparisons to the Tasks Assessing Protein Embeddings (TAPE) baseline models, with Transformer-based models and training datasets that vary from the baseline methods only in the data augmentations and representation learning procedure. For each TAPE validation task, we demonstrate improvements to the baseline scores when the learned protein representation is fixed between tasks. We also show that contrastive learning fine-tuning methods typically outperform masked-token prediction in these models, with increasing amounts of data augmentation generally improving performance for contrastive learning protein methods. We find the most consistent results across TAPE tasks when using domain-motivated transformations, such as amino acid replacement, as well as restricting the Transformer attention to randomly sampled sub-regions of the protein sequence. In rarer cases, we even find that information-destroying augmentations, such as randomly shuffling entire protein sequences, can improve downstream performance.
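To make the kinds of augmentations discussed above concrete, here is a hypothetical sketch of three simple string manipulations on protein sequences: random amino-acid replacement, cropping to a random sub-region (a stand-in for restricting attention to a sampled sub-region), and the information-destroying full-sequence shuffle. Replacement rates and helper names are illustrative assumptions, not the paper's exact procedures.

```python
# Hypothetical sketch of simple string augmentations for protein sequences.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def replace_residues(seq, rate=0.05, rng=random):
    """Replace a small fraction of residues with uniformly random amino acids."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq)

def random_subregion(seq, min_frac=0.5, rng=random):
    """Keep a contiguous, randomly placed sub-region of the sequence."""
    length = max(1, int(len(seq) * rng.uniform(min_frac, 1.0)))
    start = rng.randrange(0, len(seq) - length + 1)
    return seq[start:start + length]

def shuffle_sequence(seq, rng=random):
    """Information-destroying baseline: randomly permute all residues."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

# toy usage
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(replace_residues(seq))
print(random_subregion(seq))
print(shuffle_sequence(seq))
```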