Objective: HDL in plasma is a heterogeneous group of lipoproteins typically containing apoA-I as the principal protein. Most HDLs contain additional proteins from a palette of nearly 100 HDL-associated polypeptides. We hypothesized that some of these proteins define distinct and stable apoA-I HDL subspecies with unique proteomes that drive function and associations with disease. Approach and Results: We produced 17 plasma pools from 80 normolipidemic human participants (32 male, 48 female; aged 21 to 66 years). Using immunoaffinity isolation techniques, we isolated apoA-I-containing species from plasma and then used antibodies to 16 additional HDL protein components to isolate compositional subspecies. We characterized previously described HDL subspecies containing apoA-II, apoC-III and apoE; and 13 novel HDL subspecies defined by presence of apoA-IV, apoC-I, apoC-II, apoJ, alpha-1-antitrypsin, alpha-2-macroglobulin, plasminogen, fibrinogen, ceruloplasmin, haptoglobin, paraoxonase-1, apoL-I, or complement C3. The novel species ranged in abundance from 1–18% of total plasma apoA-I. Their concentrations were stable over time as demonstrated by intra-class correlations in repeated sampling from the same participants over 3–24 months (0.33–0.86; mean 0.62). Some proteomes of the subspecies relative to total HDL were strongly correlated, often among subspecies defined by similar functions: lipid metabolism, hemostasis, anti-oxidant, or anti-inflammatory. Permutation analysis showed that the proteomes of 12 of the 16 subspecies differed significantly from that of total HDL. Conclusions: Taken together, correlation and permutation analyses support speciation of HDL. Functional studies of these novel subspecies and determination of their relation to diseases may provide new avenues to understand the HDL system of lipoproteins.
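The stability claim above rests on intra-class correlations from repeated sampling. The abstract does not say which ICC form was used, so the following is only a minimal sketch assuming a one-way random-effects ICC(1,1) and an equal number of repeated measurements per participant. The function name `icc_oneway` and the input layout are illustrative, not taken from the study.

```python
from statistics import mean


def icc_oneway(measurements):
    """One-way random-effects ICC(1,1) (assumed form; not stated in the abstract).

    measurements: list of per-participant lists, each holding the same
    number k of repeated measurements of one subspecies concentration.
    """
    n = len(measurements)
    k = len(measurements[0])
    grand = mean(x for row in measurements for x in row)
    row_means = [mean(row) for row in measurements]
    # Mean square between participants
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    # Mean square within participants (across repeated samples)
    msw = sum(
        (x - m) ** 2 for row, m in zip(measurements, row_means) for x in row
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

An ICC near 1 means most of the variance lies between participants rather than between repeated samples of the same participant, which is the sense of "stable over time" used here.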
Objective: HDL (high-density lipoprotein) contains functional proteins that define single subspecies, each comprising 1% to 12% of the total HDL. We studied the differential association with coronary heart disease (CHD) of 15 such subspecies. Approach and Results: We measured plasma apoA1 (apolipoprotein A1) concentrations of 15 protein-defined HDL subspecies in 4 US-based prospective studies. Among participants without CVD at baseline, 932 developed CHD during 10 to 25 years. They were matched 1:1 to controls who did not experience CHD. In each cohort, hazard ratios for each subspecies were computed by conditional logistic regression and combined by meta-analysis. Higher levels of HDL subspecies containing alpha-2 macroglobulin, CoC3 (complement C3), HP (haptoglobin), or PLMG (plasminogen) were associated with higher relative risk compared with the HDL counterpart lacking the defining protein (hazard ratio range, 0.96–1.11 per 1 SD increase versus 0.73–0.81, respectively; P for heterogeneity <0.05). In contrast, HDL containing apoC1 or apoE were associated with lower relative risk compared with the counterpart (hazard ratio, 0.74; P =0.002 and 0.77, P =0.001, respectively). Conclusions: Several subspecies of HDL defined by single proteins that are involved in thrombosis, inflammation, immunity, and lipid metabolism are found in small fractions of total HDL and are associated with higher relative risk of CHD compared with HDL that lacks the defining protein. In contrast, HDL containing apoC1 or apoE are robustly associated with lower risk. The balance between beneficial and harmful subspecies in a person’s HDL sample may determine the risk of CHD pertaining to HDL and paths to treatment.
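The abstract above states that per-cohort hazard ratios were combined by meta-analysis but does not name the pooling model. As a hedged illustration only, the sketch below assumes fixed-effect inverse-variance pooling on the log scale, with standard errors recovered from 95% confidence intervals. The function name `pool_fixed_effect` and the tuple layout are hypothetical.

```python
import math

Z_95 = 1.959964  # normal quantile for a two-sided 95% interval


def pool_fixed_effect(estimates):
    """Pool per-cohort hazard ratios by inverse-variance weighting.

    estimates: list of (hazard_ratio, ci_low, ci_high) tuples with 95% CIs.
    Returns (pooled_hr, pooled_ci_low, pooled_ci_high).
    Fixed-effect pooling is an assumption; the source does not specify it.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for hr, ci_low, ci_high in estimates:
        # Standard error of log(HR), recovered from the CI width
        se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
        weight = 1.0 / se**2
        weighted_sum += weight * math.log(hr)
        total_weight += weight
    pooled_log = weighted_sum / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - Z_95 * pooled_se),
        math.exp(pooled_log + Z_95 * pooled_se),
    )
```

Pooling on the log scale keeps the intervals symmetric around the estimate, and cohorts with narrower intervals receive more weight.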
Objective: A major bottleneck hindering utilization of electronic health record data for translational research is the lack of precise phenotype labels. Chart review as well as rule-based and supervised phenotyping approaches require laborious expert input, hampering applicability to studies that require many phenotypes to be defined and labeled de novo. Though International Classification of Diseases codes are often used as surrogates for true labels in this setting, these sometimes suffer from poor specificity. We propose a fully automated topic modeling algorithm to simultaneously annotate multiple phenotypes. Materials and Methods: Surrogate-guided ensemble latent Dirichlet allocation (sureLDA) is a label-free multidimensional phenotyping method. It first uses the PheNorm algorithm to initialize probabilities based on 2 surrogate features for each target phenotype, and then leverages these probabilities to constrain the LDA topic model to generate phenotype-specific topics. Finally, it combines phenotype-feature counts with surrogates via clustering ensemble to yield final phenotype probabilities. Results: sureLDA achieves reliably high accuracy and precision across a range of simulated and real-world phenotypes. Its performance is robust to phenotype prevalence and relative informativeness of surrogate vs nonsurrogate features. It also exhibits powerful feature selection properties. Discussion: sureLDA combines attractive properties of PheNorm and LDA to achieve high accuracy and precision robust to diverse phenotype characteristics. It offers particular improvement for phenotypes insufficiently captured by a few surrogate features. Moreover, sureLDA’s feature selection ability enables it to handle high feature dimensions and produce interpretable computational phenotypes. Conclusions: sureLDA is well suited to large-scale electronic health record phenotyping for highly multiphenotype applications such as phenome-wide association studies.
The increasing availability of electronic health record (EHR) systems has created enormous potential for translational research. However, it is difficult to know all the relevant codes related to a phenotype due to the large number of codes available. Traditional data mining approaches often require the use of patient-level data, which hinders the ability to share data across institutions. In this project, we demonstrate that multi-center large-scale code embeddings can be used to efficiently identify relevant features related to a disease of interest. We constructed large-scale code embeddings for a wide range of codified concepts from EHRs from two large medical centers. We developed knowledge extraction via sparse embedding regression (KESER) for feature selection and integrative network analysis. We evaluated the quality of the code embeddings and assessed the performance of KESER in feature selection for eight diseases. We also developed an integrated clinical knowledge map combining embedding data from both institutions. The features selected by KESER were comprehensive compared to lists of codified data generated by domain experts. Features identified via KESER resulted in comparable performance to those built upon features selected manually or with patient-level data. The knowledge map created using an integrative analysis identified disease-disease and disease-drug pairs more accurately compared to those identified using single institution data. Analysis of code embeddings via KESER can effectively reveal clinical knowledge and infer relatedness among codified concepts. KESER bypasses the need for patient-level data in individual analyses, providing a significant advance in enabling multi-center studies using EHR data.
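To make "relatedness among codified concepts" concrete, the sketch below ranks codes by the cosine similarity of their embedding vectors. This is an assumption for illustration: the abstract does not state the similarity measure, and the sketch does not implement KESER's sparse embedding regression. The names `cosine` and `rank_related`, and the toy codes and vectors in the usage note, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_related(target, embeddings):
    """Rank all other codes by similarity to the target code.

    embeddings: dict mapping a code identifier to its embedding vector.
    Returns a list of (code, similarity) pairs, most similar first.
    """
    scores = [
        (code, cosine(embeddings[target], vec))
        for code, vec in embeddings.items()
        if code != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

With toy vectors such as `{"code_A": [1.0, 0.0], "code_B": [0.9, 0.1], "code_C": [0.0, 1.0]}`, calling `rank_related("code_A", ...)` places `code_B` ahead of `code_C`. Only the embedding vectors are needed, not patient-level records, which is the property the abstract emphasizes for cross-institution sharing.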
Objective: Identifying pseudogout in large data sets is difficult due to its episodic nature and a lack of billing codes specific to this acute subtype of calcium pyrophosphate (CPP) deposition disease. The objective of this study was to evaluate a novel machine learning approach for classifying pseudogout using electronic health record (EHR) data. Methods: We created an EHR data mart of patients with ≥1 relevant billing code or ≥2 natural language processing (NLP) mentions of pseudogout or chondrocalcinosis, 1991–2017. We selected 900 subjects for gold standard chart review for definite pseudogout (synovitis + synovial fluid CPP crystals), probable pseudogout (synovitis + chondrocalcinosis), or not pseudogout. We applied a topic modeling approach to identify definite/probable pseudogout. A combined algorithm included topic modeling plus manually reviewed CPP crystal results. We compared algorithm performance and cohorts identified by billing codes, the presence of CPP crystals, topic modeling, and a combined algorithm. Results: Among 900 subjects, 123 (13.7%) had pseudogout by chart review (68 definite, 55 probable). Billing codes had a sensitivity of 65% and a positive predictive value (PPV) of 22% for pseudogout. The presence of CPP crystals had a sensitivity of 29% and a PPV of 92%. Without using CPP crystal results, topic modeling had a sensitivity of 29% and a PPV of 79%. The combined algorithm yielded a sensitivity of 42% and a PPV of 81%. The combined algorithm identified 50% more patients than the presence of CPP crystals; the latter captured a portion of definite pseudogout and missed probable pseudogout. Conclusion: For pseudogout, an episodic disease with no specific billing code, combining NLP, machine learning methods, and synovial fluid laboratory results yielded an algorithm that significantly boosted the PPV compared to billing codes.
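The sensitivity and PPV figures reported above follow the standard definitions against the chart-review gold standard. The sketch below shows those two calculations. The function names are illustrative and the counts in the usage note are hypothetical, since the abstract reports percentages rather than the underlying confusion-matrix counts.

```python
def sensitivity(true_pos, false_neg):
    """Share of gold-standard cases that the algorithm flags."""
    return true_pos / (true_pos + false_neg)


def ppv(true_pos, false_pos):
    """Share of algorithm-flagged subjects that are gold-standard cases."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts for illustration only (not from the study)
example_sensitivity = sensitivity(true_pos=30, false_neg=70)
example_ppv = ppv(true_pos=30, false_pos=10)
```

The trade-off in the results is visible in these definitions: a broad criterion such as billing codes flags many true cases (higher sensitivity) along with many false positives (lower PPV), while a strict criterion such as CPP crystals does the reverse.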