Genetic epidemiologists have taken the challenge to identify genetic polymorphisms involved in the development of diseases. Many have collected data on large numbers of genetic markers but are not familiar with available methods to assess their association with complex diseases. Statistical methods have been developed for analyzing the relation between large numbers of genetic and environmental predictors to disease or disease-related variables in genetic association studies. In this commentary we discuss logistic regression analysis, neural networks, including the parameter decreasing method (PDM) and genetic programming optimized neural networks (GPNN) and several non-parametric methods, which include the set association approach, combinatorial partitioning method (CPM), restricted partitioning method (RPM), multifactor dimensionality reduction (MDR) method and the random forests approach. The relative strengths and weaknesses of these methods are highlighted. Logistic regression and neural networks can handle only a limited number of predictor variables, depending on the number of observations in the dataset. Therefore, they are less useful than the non-parametric methods to approach association studies with large numbers of predictor variables. GPNN on the other hand may be a useful approach to select and model important predictors, but its performance to select the important effects in the presence of large numbers of predictors needs to be examined. Both the set association approach and random forests approach are able to handle a large number of predictors and are useful in reducing these predictors to a subset of predictors with an important contribution to disease. The combinatorial methods give more insight in combination patterns for sets of genetic and/or environmental predictor variables that may be related to the outcome variable.
As the non-parametric methods have different strengths and weaknesses we conclude that to approach genetic association studies using the case-control design, the application of a combination of several methods, including the set association approach, MDR and the random forests approach, will likely be a useful strategy to find the important genes and interaction patterns involved in complex diseases.
Scabies is a skin infestation with the mite Sarcoptes scabiei causing itch and rash and is a major risk factor for bacterial skin infections and severe complications. Here, we evaluated the treatment outcome of 2866 asylum seekers who received (preventive) scabies treatment before and during a scabies intervention programme (SIP) in the main reception centre in the Netherlands between January 2014 and March 2016. A SIP was introduced in the main national reception centre based on frequent observations of scabies and its complications amongst Eritrean and Ethiopian asylum seekers in the Netherlands. On arrival, all asylum seekers from Eritrea or Ethiopia were checked for clinical scabies signs and received ivermectin/permethrin either as prevention or treatment. A retrospective cohort study was conducted to compare the reinfestations and complications of scabies in asylum seekers who entered the Netherlands before and during the intervention and who received ivermectin/permethrin. In total, 2866 asylum seekers received treatment during the study period (January 2014–March 2016), of whom 1359 (47.4%) had clinical signs of scabies. During the programme, most of the asylum seekers with scabies were already diagnosed on arrival as part of the SIP screening (580 (64.7%) of the 897). The proportion of asylum seekers with more than one scabies episode fell from 42.0% (194/462) before the programme to 27.2% (243/897) during the programme (RR = 0.64, 95% CI = 0.55–0.75). Development of scabies complications later in the asylum procedure fell from 12.3% (57/462) to 4.6% (41/897). A scabies prevention and treatment programme at the start of the asylum procedure was feasible and effective in the Netherlands; patients were diagnosed early and the risk of reinfestations and complications was reduced. To achieve a further decrease of scabies, implementation of the programme in multiple asylum centres may be needed.
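For readers who want to see how a relative risk and its confidence interval follow from counts like those above, here is a minimal sketch. It assumes a crude (unadjusted) risk ratio with a Wald interval on the log scale; the abstract does not state how the published estimate was calculated, so the function name `relative_risk` and the choice of method are illustrative assumptions. With the reinfestation counts quoted above, the crude estimate lands close to the reported value, and small differences may reflect rounding or adjustment in the original analysis.

```python
import math

def relative_risk(events_a, n_a, events_b, n_b, z=1.959964):
    """Crude risk ratio of group A versus group B with a Wald CI on the log scale.

    This is an assumed, textbook method; the source does not say which
    method produced its published estimate.
    """
    risk_a = events_a / n_a
    risk_b = events_b / n_b
    rr = risk_a / risk_b
    # Standard error of log(RR) for two independent binomial proportions
    se_log_rr = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Reinfestation counts from the abstract: during the programme vs. before it
rr, lower, upper = relative_risk(243, 897, 194, 462)
print(f"crude RR = {rr:.3f}, 95% CI = {lower:.2f} to {upper:.2f}")
```

The same function could be applied to the complication counts (41/897 versus 57/462), for which the abstract reports percentages only.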
Nonparametric approaches have been developed that are able to analyze large numbers of single nucleotide polymorphisms (SNPs) in modest sample sizes. These approaches have different selection features and may not provide similar results when applied to the same dataset. Therefore, we compared the results of three approaches (set association, random forests and multifactor dimensionality reduction [MDR]) to select from a total of 93 candidate SNPs a subset of SNPs that are important in determining high-density lipoprotein (HDL)-cholesterol levels. The study population consisted of a random sample from a Dutch monitoring project for cardiovascular disease risk factors and was dichotomized into cases (low HDL-cholesterol, n = 533) and non-cases (high HDL-cholesterol, n = 545) based on gender-specific median values for HDL cholesterol. Clearly, all three approaches prioritized three SNPs as important (CETP Taq1B, CETP-629 C/A and LPL Ser447X). Two SNPs with weaker main effects were additionally prioritized by random forests (APOC3 3175 G/C and CCR2 Val62Ile), whereas MTHFR 677 C/T was selected in combination with CETP Taq1B as best model by MDR. Obtained p-values for the selected models were significant for the set association approach (p =.0019), random forests (p<.01) and MDR (p<.02). In conclusion, the application of a combination of multi-locus methods is a useful approach in genetic association studies to select a well-defined set of important SNPs for further statistical and epidemiological interpretation, providing increased confidence and more information compared with the application of only one method.
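The core idea behind MDR can be illustrated with a short sketch. It is a simplified version under stated assumptions, not the procedure used in the study: it scores every pair of SNPs by pooling multi-locus genotype cells into high-risk and low-risk groups and reports training accuracy only, whereas published MDR analyses also use cross-validation and permutation testing. The function name `mdr_best_pair` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

def mdr_best_pair(genotypes, labels):
    """Return (snp_pair, training_accuracy, high_risk_cells) for the best pair.

    genotypes: list of per-subject tuples of genotype codes
    labels:    list of 0/1 values (1 = case, 0 = non-case)
    Simplified sketch: no cross-validation, no permutation test.
    """
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    threshold = n_cases / n_controls  # overall case/control ratio
    best = None
    for a, b in combinations(range(len(genotypes[0])), 2):
        cases, controls = Counter(), Counter()
        for g, y in zip(genotypes, labels):
            (cases if y == 1 else controls)[(g[a], g[b])] += 1
        cells = set(cases) | set(controls)
        # A cell is high-risk when its case/control ratio exceeds the threshold
        high_risk = {c for c in cells if cases[c] > threshold * controls[c]}
        correct = sum(cases[c] for c in high_risk)
        correct += sum(controls[c] for c in cells - high_risk)
        accuracy = correct / len(labels)
        if best is None or accuracy > best[1]:
            best = ((a, b), accuracy, high_risk)
    return best

# Toy data (hypothetical): case status depends only on the
# interaction of SNP 0 and SNP 2, with no main effects.
genotypes = list(product((0, 1), repeat=3)) * 5
labels = [g[0] ^ g[2] for g in genotypes]
pair, accuracy, high_risk = mdr_best_pair(genotypes, labels)
print(pair, accuracy, sorted(high_risk))
```

In the toy data neither SNP 0 nor SNP 2 has a main effect, so a single-locus test would miss both; the pairwise pooling recovers them, which is the kind of interaction pattern MDR is designed to find.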
Rodenburg W, Heidema AG, Boer JM, Bovee-Oudenhoven IM, Feskens EJ, Mariman EC, Keijer J. A framework to identify physiological responses in microarray-based gene expression studies: selection and interpretation of biologically relevant genes.
Background: Biomarkers that allow detection of the onset of disease are of high interest since early detection would allow intervening with lifestyle and nutritional changes before the disease is manifested and pharmacological therapy is required. Our study aimed to improve the phenotypic characterization of overweight but apparently healthy subjects and to identify new candidate profiles for early biomarkers of obesity-related diseases such as cardiovascular disease and type 2 diabetes. Methodology/Principal Findings: In a population of 56 healthy, middle-aged overweight subjects Body Mass Index (BMI), fasting concentration of 124 plasma proteins and insulin were determined. The plasma proteins are implicated in chronic diseases, inflammation, endothelial function and metabolic signaling. Random Forest was applied to select proteins associated with BMI and plasma insulin. Subsequently, the selected proteins were analyzed by clustering methods to identify protein clusters associated with BMI and plasma insulin. Similar analyses were performed for a second population of 20 healthy, overweight older subjects to verify associations found in population I. In both populations similar clusters of proteins associated with BMI or insulin were identified. Leptin and a number of pro-inflammatory proteins, previously identified as possible biomarkers for obesity-related disease, e.g. Complement 3, C Reactive Protein, Serum Amyloid P, Vascular Endothelial Growth Factor clustered together and were positively associated with BMI and insulin. IL-3 and IL-13 clustered together with Apolipoprotein A1 and were inversely associated with BMI and might be potential new biomarkers. Conclusion/Significance: We identified clusters of plasma proteins associated with BMI and insulin in healthy populations. These clusters included previously reported biomarkers for obesity-related disease and potential new biomarkers such as IL-3 and IL-13.
These plasma protein clusters could have potential applications for improved phenotypic characterization of volunteers in nutritional intervention studies or as biomarkers in the early detection of obesity-linked disease development and progression.
To discriminate between breast cancer patients and controls, we used a three-step approach to obtain our decision rule. First, we ranked the mass/charge values using random forests, because it generates importance indices that take possible interactions into account. We observed that the top ranked variables consisted of highly correlated contiguous mass/charge values, which were grouped in the second step into new variables. Finally, these newly created variables were used as predictors to find a suitable discrimination rule. In this last step, we compared three different methods, namely Classification and Regression Tree (CART), logistic regression and penalized logistic regression. Logistic regression and penalized logistic regression performed equally well and both had a higher classification accuracy than CART. The model obtained with penalized logistic regression was chosen as we hypothesized that this model would provide a better classification accuracy in the validation set. The solution had a good performance on the training set with a classification accuracy of 86.3%, and a sensitivity and specificity of 86.8% and 85.7%, respectively.
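The second step, grouping contiguous top-ranked mass/charge values into new variables, might look roughly like the sketch below. The abstract does not say how the groups were formed or summarized, so treating adjacent positions as one group and taking the mean intensity within each group are both assumptions, and the helper names `group_contiguous` and `summarize_groups` are hypothetical.

```python
def group_contiguous(indices):
    """Group positions of top-ranked variables into runs of adjacent indices."""
    groups = []
    for i in sorted(set(indices)):
        if groups and i == groups[-1][-1] + 1:
            groups[-1].append(i)   # extends the current run
        else:
            groups.append([i])     # starts a new run
    return groups

def summarize_groups(spectrum, groups):
    """Build one new variable per group (assumed here: the mean intensity)."""
    return [sum(spectrum[i] for i in g) / len(g) for g in groups]

# Hypothetical example: positions of top-ranked mass/charge values
top_ranked = [7, 3, 4, 5, 12, 8]
groups = group_contiguous(top_ranked)

# Hypothetical intensities for one sample, indexed by position
spectrum = [float(i) for i in range(20)]
new_variables = summarize_groups(spectrum, groups)
print(groups, new_variables)
```

The resulting per-sample values would then serve as predictors for the discrimination rule in the final step.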
In this study, we applied the multivariate statistical tool Partial Least Squares (PLS) to analyze the relative importance of 83 plasma proteins in relation to coronary heart disease (CHD) mortality and the intermediate end points body mass index, HDL-cholesterol and total cholesterol. From a Dutch monitoring project for cardiovascular disease risk factors, men who died of CHD between initial participation (1987-1991) and end of follow-up (January 1, 2000) (N = 44) and matched controls (N = 44) were selected. Baseline plasma concentrations of proteins were measured by a multiplex immunoassay. With the use of PLS, we identified 15 proteins with prognostic value for CHD mortality and sets of proteins associated with the intermediate end points. Subsequently, sets of proteins and intermediate end points were analyzed together by Principal Components Analysis, indicating that proteins involved in inflammation explained most of the variance, followed by proteins involved in metabolism and proteins associated with total-C. This study is one of the first in which the association of a large number of plasma proteins with CHD mortality and intermediate end points is investigated by applying multivariate statistics, providing insight in the relationships among proteins, intermediate end points and CHD mortality, and a set of proteins with prognostic value.