Genetic variation can predispose to disease both through (i) monogenic risk variants that disrupt a physiologic pathway with large effect on disease and (ii) polygenic risk that involves many variants of small effect in different pathways. Few studies have explored the interplay between monogenic and polygenic risk. Here, we study 80,928 individuals to examine whether polygenic background can modify penetrance of disease in tier 1 genomic conditions: familial hypercholesterolemia, hereditary breast and ovarian cancer, and Lynch syndrome. Among carriers of a monogenic risk variant, we estimate substantial gradients in disease risk based on polygenic background: the probability of disease by age 75 years ranged from 17% to 78% for coronary artery disease, 13% to 76% for breast cancer, and 11% to 80% for colon cancer. We propose that accounting for polygenic background is likely to increase accuracy of risk estimation for individuals who inherit a monogenic risk variant.
Objective: Lipoprotein(a) concentrations are associated with atherosclerotic cardiovascular disease (ASCVD), and new therapies that enable potent and specific reduction are in development. In the largest study conducted to date, we address 3 areas of uncertainty: (1) the magnitude and shape of ASCVD risk conferred across the distribution of lipoprotein(a) concentrations; (2) variation of risk across racial and clinical subgroups; (3) clinical importance of a high lipoprotein(a) threshold to guide therapy. Approach and Results: The relationship of lipoprotein(a) to incident ASCVD was studied in 460 506 middle-aged UK Biobank participants. Over a median follow-up of 11.2 years, incident ASCVD occurred in 22 401 (4.9%) participants. Median lipoprotein(a) concentration was 19.6 nmol/L (25th–75th percentile 7.6–74.8). The relationship between lipoprotein(a) and ASCVD appeared linear across the distribution, with a hazard ratio of 1.11 (95% CI, 1.10–1.12) per 50 nmol/L increment. Substantial differences in concentrations were noted according to race: median values for white, South Asian, black, and Chinese individuals were 19, 31, 75, and 16 nmol/L, respectively. However, risk per 50 nmol/L appeared similar, with hazard ratios of 1.11, 1.10, and 1.07 for white, South Asian, and black individuals, respectively. A high lipoprotein(a) concentration defined as ≥150 nmol/L was present in 12.2% of those without and 20.3% of those with preexisting ASCVD and was associated with hazard ratios of 1.50 (95% CI, 1.44–1.56) and 1.16 (95% CI, 1.05–1.27), respectively. Conclusions: Lipoprotein(a) concentrations predict incident ASCVD among middle-aged adults within primary and secondary prevention contexts, with a linear risk gradient across the distribution. Concentrations are variable across racial subgroups, but the associated risk appears similar.
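To illustrate what a hazard ratio "per 50 nmol/L increment" implies, the sketch below scales the reported estimate of 1.11 to other concentration differences. It assumes the gradient is linear on the log-hazard scale, which is the usual reading of a per-increment hazard ratio but is not stated explicitly in the abstract. The function name and defaults are hypothetical.

```python
import math

# Hazard ratio per 50 nmol/L increment, taken from the abstract.
HR_PER_50 = 1.11


def hazard_ratio_for_difference(delta_nmol_per_l, hr_per_step=HR_PER_50, step=50.0):
    """Hazard ratio implied for a concentration difference of delta_nmol_per_l.

    Assumption: the log hazard rises linearly with concentration, so the
    per-step hazard ratio compounds multiplicatively.
    """
    log_hr_per_unit = math.log(hr_per_step) / step
    return math.exp(log_hr_per_unit * delta_nmol_per_l)


if __name__ == "__main__":
    for delta in (50, 100, 150):
        print(delta, round(hazard_ratio_for_difference(delta), 3))
```

These scaled values compare two individuals who differ by a fixed amount. They are not expected to match the hazard ratios reported for the ≥150 nmol/L threshold, which compare whole groups above and below the cutoff.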
This thesis explores approaches to the problem of spoken document retrieval (SDR), which is the task of automatically indexing and then retrieving relevant items from a large collection of recorded speech messages in response to a user specified natural language text query. We investigate the use of subword unit representations for SDR as an alternative to words generated by either keyword spotting or continuous speech recognition. Our investigation is motivated by the observation that word-based retrieval approaches face the problem of either having to know the keywords to search for a priori, or requiring a very large recognition vocabulary in order to cover the contents of growing and diverse message collections. The use of subword units in the recognizer constrains the size of the vocabulary needed to cover the language; and the use of subword units as indexing terms allows for the detection of new user-specified query terms during retrieval. Four research issues are addressed. First, what are suitable subword units and how well can they perform? Second, how can these units be reliably extracted from the speech signal? Third, what is the behavior of the subword units when there are speech recognition errors and how well do they perform? And fourth, how can the indexing and retrieval methods be modified to take into account the fact that the speech recognition output will be errorful? We first explore a range of subword units of varying complexity derived from error-free phonetic transcriptions and measure their ability to effectively index and retrieve speech messages. We find that many subword units capture enough information to perform effective retrieval and that it is possible to achieve performance comparable to that of text-based word units. Next, we develop a phonetic speech recognizer and process the spoken document collection to generate phonetic transcriptions.
We then measure the ability of subword units derived from these transcriptions to perform spoken document retrieval and examine the effects of recognition errors on retrieval performance. Retrieval performance degrades for all subword units (to 60% of the clean reference), but remains reasonable for some subword units even without the use of any error compensation techniques. We then investigate a number of robust methods that take into account the characteristics of the recognition errors and try to compensate for them in an effort to improve spoken document retrieval performance when there are speech recognition errors. We study the methods individually and explore the effects of combining them. Using these robust methods improves retrieval performance by 23%. We also propose a novel approach to SDR where the speech recognition and information retrieval components are more tightly integrated. This is accomplished by developing new recognizer and retrieval models where the interface between the two components is better matched and the goals of the two components are consistent with each other and with the overall goal of the combine...
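As a rough illustration of using subword units as indexing terms, the sketch below indexes messages by overlapping phone n-grams and scores them against a query by counting matches. The abstract does not say which subword units or retrieval model the thesis uses, so the n-gram choice, the count-based scoring, the phone labels, and all function names here are assumptions made for the example.

```python
from collections import Counter, defaultdict


def phone_ngrams(phones, n):
    """Return the overlapping n-grams (as tuples) of a phone label sequence."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def build_index(documents, n):
    """Map each phone n-gram to a Counter of {doc_id: occurrences}."""
    index = defaultdict(Counter)
    for doc_id, phones in documents.items():
        for gram in phone_ngrams(phones, n):
            index[gram][doc_id] += 1
    return index


def score(query_phones, index, n):
    """Rank documents by the number of matching query n-gram occurrences."""
    scores = Counter()
    for gram in phone_ngrams(query_phones, n):
        # .get avoids creating empty entries in the defaultdict
        for doc_id, count in index.get(gram, {}).items():
            scores[doc_id] += count
    return scores.most_common()


if __name__ == "__main__":
    # Hypothetical phonetic transcriptions of two short messages.
    docs = {
        "msg1": ["w", "eh", "dh", "er", "r", "ih", "p", "ao", "r", "t"],
        "msg2": ["s", "p", "ao", "r", "t", "s", "n", "uw", "z"],
    }
    index = build_index(docs, n=3)
    query = ["r", "ih", "p", "ao", "r", "t"]  # query term absent from any word list
    print(score(query, index, n=3))
```

Because matching happens on phone sequences instead of whole words, a query term never seen in a recognition vocabulary can still be matched, and a partial match survives when a recognition error corrupts only some of a word's n-grams.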
Clustering, the process of grouping together similar items into distinct partitions, is a common type of unsupervised machine learning that can be useful for summarizing and aggregating complex multi-dimensional data. However, data can be clustered in many ways, and there exists a large body of algorithms designed to reveal different patterns. While having access to a wide variety of algorithms is helpful, in practice, it is quite difficult for data scientists to choose and parameterize algorithms to get the clustering results relevant for their dataset and analytical tasks. To alleviate this problem, we built Clustervision, a visual analytics tool that helps ensure data scientists find the right clustering among the large number of techniques and parameters available. Our system clusters data using a variety of clustering techniques and parameters and then ranks clustering results utilizing five quality metrics. In addition, users can guide the system to produce more relevant results by providing task-relevant constraints on the data. Our visual user interface allows users to find high quality clustering results, explore the clusters using several coordinated visualization techniques, and select the cluster result that best suits their task. We demonstrate this novel approach using a case study with a team of researchers in the medical domain and showcase that our system empowers users to choose an effective representation of their complex data.
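The idea of ranking candidate clusterings by a quality metric can be sketched as follows. The abstract does not name the five metrics Clustervision uses. This example uses a single metric, the mean silhouette coefficient, purely as an assumption, and all function names are hypothetical and not part of the tool.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points, labels):
    """Mean silhouette coefficient; higher means tighter, better-separated clusters."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        return 0.0  # convention chosen here: undefined for a single cluster
    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def rank_clusterings(points, candidates):
    """Return (score, name) pairs sorted from best to worst mean silhouette."""
    scored = [(mean_silhouette(points, labels), name)
              for name, labels in candidates.items()]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (10, 10), (10, 11)]
    candidates = {
        "split_by_position": [0, 0, 1, 1],
        "split_alternating": [0, 1, 0, 1],
    }
    for value, name in rank_clusterings(points, candidates):
        print(name, round(value, 3))
```

A tool combining several metrics would need some way to reconcile rankings that disagree. This sketch sidesteps that by using one metric only.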
IMPORTANCE Pathogenic DNA variants associated with familial hypercholesterolemia, hereditary breast and ovarian cancer syndrome, and Lynch syndrome are widely recognized as clinically important and actionable when identified, leading some clinicians to recommend population-wide genomic screening. OBJECTIVES To assess the prevalence and clinical importance of pathogenic or likely pathogenic variants associated with each of 3 genomic conditions (familial hypercholesterolemia, hereditary breast and ovarian cancer syndrome, and Lynch syndrome) within the context of contemporary clinical care. DESIGN, SETTING, AND PARTICIPANTS This cohort study used gene-sequencing data from 49 738 participants in the UK Biobank who were recruited from 22 sites across the UK between March 21,
Background: Individuals of South Asian ancestry represent 23% of the global population, corresponding to 1.8 billion people, and have substantially higher risk of atherosclerotic cardiovascular disease compared with most other ethnicities. US practice guidelines now recognize South Asian ancestry as an important risk-enhancing factor. The magnitude of enhanced risk within the context of contemporary clinical care, the extent to which it is captured by existing risk estimators, and its potential mechanisms warrant additional study. Methods: Within the UK Biobank prospective cohort study, 8124 middle-aged participants of South Asian ancestry and 449 349 participants of European ancestry who were free of atherosclerotic cardiovascular disease at the time of enrollment were examined. The relationship of ancestry to risk of incident atherosclerotic cardiovascular disease—defined as myocardial infarction, coronary revascularization, or ischemic stroke—was assessed with Cox proportional hazards regression, along with examination of a broad range of clinical, anthropometric, and lifestyle mediators. Results: The mean age at study enrollment was 57 years, and 202 405 (44%) were male. Over a median follow-up of 11 years, 554 of 8124 (6.8%) individuals of South Asian ancestry experienced an atherosclerotic cardiovascular disease event compared with 19 756 of 449 349 (4.4%) individuals of European ancestry, corresponding to an adjusted hazard ratio of 2.03 (95% CI, 1.86–2.22; P <0.001). This higher relative risk was largely consistent across a range of age, sex, and clinical subgroups. Despite the >2-fold higher observed risk, the predicted 10-year risk of cardiovascular disease according to the American Heart Association/American College of Cardiology Pooled Cohort equations and QRISK3 equations was nearly identical for individuals of South Asian and European ancestry. 
Adjustment for a broad range of clinical, anthropometric, and lifestyle risk factors led to only modest attenuation of the observed hazard ratio to 1.45 (95% CI, 1.28–1.65, P <0.001). Assessment of variance explained by 18 candidate risk factors suggested greater importance of hypertension, diabetes, and central adiposity in South Asian individuals. Conclusions: Within a large prospective study, South Asian individuals had substantially higher risk of atherosclerotic cardiovascular disease compared with individuals of European ancestry, and this risk was not captured by the Pooled Cohort Equations.
Objective: To determine the relationship of a genome-wide polygenic score for coronary artery disease (GPS_CAD) with lifetime trajectories of CAD risk, directly compare its predictive capacity to traditional risk factors, and assess its interplay with the Pooled Cohort Equations (PCE) clinical risk estimator. Approach and Results: We studied GPS_CAD in 28 556 middle-aged participants of the Malmö Diet and Cancer Study, of whom 4122 (14.4%) developed CAD over a median follow-up of 21.3 years. A pronounced gradient in lifetime risk of CAD was observed: 16% for those in the lowest GPS_CAD decile to 48% in the highest. We evaluated the discriminative capacity of the GPS_CAD, as assessed by change in the C-statistic from a baseline model including age and sex, among 5685 individuals with PCE risk estimates available. The increment for the GPS_CAD (+0.045, P<0.001) was higher than for any of 11 traditional risk factors (range +0.007 to +0.032). Minimal correlation was observed between GPS_CAD and 10-year risk defined by the PCE (r=0.03), and addition of GPS_CAD improved the C-statistic of the PCE model by 0.026. A significant gradient in lifetime risk was observed for the GPS_CAD, even among individuals within a given PCE clinical risk stratum. We replicated key findings, noting strikingly consistent results, in 325 003 participants of the UK Biobank. Conclusions: GPS_CAD, a risk estimator available from birth, stratifies individuals into varying trajectories of clinical risk for CAD. Implementation of GPS_CAD may enable identification of high-risk individuals early in life, decades in advance of manifest risk factors or disease.
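The C-statistic increments reported above measure how much better a model ranks people who develop disease ahead of those who do not. The sketch below computes a C-statistic for a binary outcome as the proportion of concordant case and non-case pairs. This is a simplified illustration: the study followed participants over time and may have used a survival-time version of the statistic, which this code does not implement. The function name is hypothetical.

```python
def c_statistic(scores, outcomes):
    """Concordance for a binary outcome.

    Returns the fraction of (case, non-case) pairs in which the case has the
    higher score; tied scores count as half a concordant pair.
    outcomes must contain 1 for cases and 0 for non-cases.
    """
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    non_cases = [s for s, y in zip(scores, outcomes) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for case_score in cases:
        for other_score in non_cases:
            if case_score > other_score:
                concordant += 1.0
            elif case_score == other_score:
                concordant += 0.5
    return concordant / (len(cases) * len(non_cases))


def c_statistic_increment(baseline_scores, extended_scores, outcomes):
    """Change in C-statistic when moving from a baseline to an extended model."""
    return c_statistic(extended_scores, outcomes) - c_statistic(baseline_scores, outcomes)
```

A value of 0.5 corresponds to ranking no better than chance and 1.0 to perfect ranking, so an increment such as +0.045 is read against that 0.5 to 1.0 range.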
Background: The electronic health record (EHR) contains a tremendous amount of data that, if appropriately detected, can lead to earlier identification of disease states such as heart failure (HF). Using a novel text and data analytic tool, we explored the longitudinal EHR of over 50,000 primary care patients to identify the documentation of the signs and symptoms of HF in the years preceding its diagnosis. Methods and Results: This retrospective analysis included 4,644 incident HF cases and 45,981 group-matched controls. Documentation of Framingham HF signs and symptoms within encounter notes was identified using a previously validated natural language processing procedure. A total of 892,805 affirmed criteria were documented over an average observation period of 3.4 years. Among eventual HF cases, 85% had at least one criterion within a year prior to their HF diagnosis (as did 55% of controls). Substantial variability in the prevalence of individual signs and symptoms was found in both cases and controls. Conclusions: HF signs and symptoms are frequently documented in a primary care population as identified through automated text and data mining of EHRs. Their frequent identification demonstrates the rich data available within EHRs that will allow for future work on automated criterion identification to help develop predictive models for HF.