We provide a detailed hands-on tutorial for the R add-on package mboost. The package implements boosting for optimizing general risk functions, utilizing component-wise (penalized) least squares estimates as base-learners for fitting various kinds of generalized linear and generalized additive models to potentially high-dimensional data. We give the theoretical background and demonstrate how mboost can be used to fit interpretable models of different complexity. As a running example, we use mboost to predict body fat based on anthropometric measurements throughout the tutorial.
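A minimal sketch of the kind of model the tutorial describes, assuming the mboost package and the bodyfat data shipped with the TH.data package (with DEXfat as the response, following the package's own examples):

```r
# Sketch: component-wise boosting of a linear model for body fat.
# Assumes the mboost and TH.data packages are installed.
library("mboost")
data("bodyfat", package = "TH.data")

# Boost a linear model with all anthropometric measurements as
# candidate covariates; each covariate gets its own base-learner,
# and only the best-fitting one is updated per iteration.
fit <- glmboost(DEXfat ~ ., data = bodyfat)

# Inspect which covariates were selected and their coefficients.
coef(fit, off2int = TRUE)

# The number of boosting iterations (mstop) is the main tuning
# parameter; cross-validated risk can be used to choose it.
cv <- cvrisk(fit)
fit[mstop(cv)]
```

Because boosting stops early, covariates whose base-learners are never selected drop out of the model, which is what makes the fits interpretable even with many candidate variables.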
Generalized additive models for location, scale and shape (GAMLSSs) are a popular semiparametric modelling approach that, in contrast with conventional generalized additive models, relates not only the expected mean but every distribution parameter (e.g. location, scale and shape) to a set of covariates. Current fitting procedures for GAMLSSs are infeasible for high-dimensional data set-ups and require variable selection based on (potentially problematic) information criteria. The present work describes a boosting algorithm for high-dimensional GAMLSSs that was developed to overcome these limitations. Specifically, the new algorithm was designed to allow the simultaneous estimation of predictor effects and variable selection. The algorithm proposed was applied to Munich rental guide data, which are used by landlords and tenants as a reference for the average rent of a flat depending on its characteristics and spatial features. The net rent predictions that resulted from the high-dimensional GAMLSSs were found to be highly competitive, and covariate-specific prediction intervals showed a major improvement over classical generalized additive models.
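A sketch of the boosted GAMLSS idea, assuming the gamboostLSS package; the data here are simulated for illustration, with both the mean and the standard deviation depending on covariates:

```r
# Sketch: boosting a Gaussian GAMLSS where mean (mu) and standard
# deviation (sigma) each get their own predictor. Assumes the
# gamboostLSS package is installed; data are simulated.
library("gamboostLSS")
set.seed(1)
n  <- 500
x1 <- runif(n)
x2 <- runif(n)
y  <- rnorm(n, mean = 2 * x1, sd = exp(0.5 * x2))
d  <- data.frame(y, x1, x2)

# One component-wise boosted linear model per distribution
# parameter; variable selection happens separately in each submodel.
fit <- glmboostLSS(y ~ x1 + x2, families = GaussianLSS(), data = d)

# Coefficients are returned per parameter (mu and sigma).
coef(fit, off2int = TRUE)
```

Modelling sigma explicitly is what yields the covariate-specific prediction intervals mentioned above: the interval width adapts to the covariates instead of being constant.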
Background: Modern biotechnologies often result in high-dimensional data sets with many more variables than observations (n ≪ p). These data sets pose new challenges to statistical analysis: variable selection becomes one of the most important tasks in this setting. Similar challenges arise in modern data sets from observational studies, e.g., in ecology, where flexible, non-linear models are fitted to high-dimensional data. We assess the recently proposed flexible framework for variable selection called stability selection. By the use of resampling procedures, stability selection adds finite-sample error control to high-dimensional variable selection procedures such as lasso or boosting. We consider the combination of boosting and stability selection and present results from a detailed simulation study that provide insights into the usefulness of this combination. We elaborate on the interpretation of the error bounds used and give guidance for practical data analysis.
Results: Stability selection with boosting was able to detect influential predictors in high-dimensional settings while controlling the given error bound in various simulation scenarios. We investigated the dependence on parameters such as the sample size, the number of truly influential variables, and the tuning parameters of the algorithm. We applied the method to phenotype measurements in patients with autism spectrum disorders, using a log-linear interaction model fitted by boosting; stability selection identified five differentially expressed amino acid pathways.
Conclusion: Stability selection is implemented in the freely available R package stabs (http://CRAN.R-project.org/package=stabs). It proved to work well in high-dimensional settings with more predictors than observations, for both linear and additive models. The original version of stability selection, which controls the per-family error rate, is quite conservative; this is much less the case for its improvement, complementary pairs stability selection. Nevertheless, care should be taken to specify the error bound appropriately.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0575-3) contains supplementary material, which is available to authorized users.
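A sketch of stability selection combined with boosting, assuming the stabs and mboost packages; the data are simulated (n = 100 observations, p = 200 covariates, of which three are truly influential):

```r
# Sketch: stability selection on a boosted linear model.
# Assumes the stabs and mboost packages are installed.
library("stabs")
library("mboost")
set.seed(1)
n <- 100; p <- 200
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("x", 1:p)))
y <- x[, 1] + x[, 2] + x[, 3] + rnorm(n)
d <- data.frame(y, x)

# Fit the boosting model, then run stability selection on it.
# cutoff is the selection-frequency threshold; PFER bounds the
# expected number of falsely selected variables (per-family
# error rate). Two of cutoff, q, and PFER must be specified.
mod  <- glmboost(y ~ ., data = d)
stab <- stabsel(mod, cutoff = 0.75, PFER = 1)

# Variables selected in at least 75% of the subsamples.
stab$selected
```

The PFER bound is the error control the abstract refers to: with PFER = 1, on average at most one noise variable is expected among the stably selected set.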
Master protocols have received growing interest in recent years. By assigning patients to specific substudies, they aim to focus and accelerate clinical development. Given their complexity, basket, umbrella, and platform designs have raised challenging regulatory and statistical questions, especially the control of multiplicity in confirmatory trials. In basket trials, regulatory assessment of the benefit/risk in pooled populations and the choice of the treatment indication are challenging. We provide here our perspectives on these topics. In master protocols, as long as the statistical hypotheses tested in the different substudies are independent, no supplementary adjustment for multiplicity over the different substudies should be required. Moreover, sharing a control arm within an umbrella or platform trial investigating different drugs would not require a correction of the type I error rate, although the chance of multiple false positive regulatory decisions should be recognized. In basket trials, pooling across substudies requires a rationale supporting the intended indication and should be preplanned. Assessment of the benefit/risk in pooled target populations can be complicated by differences in design or in efficacy/safety signals between the substudies. While trials governed by a master protocol can offer logistic and financial advantages, more experience is needed to gain a deeper insight into this novel framework.
Variable selection and model choice are of major concern in many statistical applications, especially in high-dimensional regression models. Boosting is a convenient statistical method that combines model fitting with intrinsic model selection. We investigate the impact of base-learner specification on the performance of boosting as a model selection procedure. We show that variable selection may be biased if the covariates are of different nature. Important examples are models combining continuous and categorical covariates, especially if the number of categories is large. In this case, least squares base-learners offer increased flexibility for the categorical covariate and lead to a preference for it even if it is noninformative. Similar difficulties arise when comparing linear and nonlinear base-learners for a continuous covariate: the additional flexibility in the nonlinear base-learner again yields a preference for the more complex modeling alternative. We investigate these problems from a theoretical perspective and suggest a framework for unbiased model selection based on a general class of penalized least squares base-learners. Making all base-learners comparable in terms of their degrees of freedom strongly reduces the selection bias observed in naive boosting specifications. The importance of unbiased model selection is demonstrated in simulations and an application to forest health models.
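A sketch of the equal-degrees-of-freedom idea in mboost notation, with simulated data for illustration; here z is a noninformative factor with many levels that a naive specification would tend to prefer:

```r
# Sketch: making base-learners comparable via equal degrees of
# freedom. Assumes the mboost package; data are simulated.
library("mboost")
set.seed(1)
n <- 200
x <- runif(n)
z <- factor(sample(1:10, n, replace = TRUE))  # noninformative
y <- 2 * x + rnorm(n)
d <- data.frame(y, x, z)

# Naive specification: the unpenalized base-learner for the
# 10-level factor has far more flexibility (df) than the linear
# term for x, biasing selection toward z.
naive <- gamboost(y ~ bols(x) + bols(z), data = d)

# Fair specification: ridge-penalize each base-learner down to
# one degree of freedom, so selection is driven by fit alone.
fair <- gamboost(y ~ bols(x, df = 1) + bols(z, df = 1), data = d)

# Which base-learner was chosen in each boosting iteration.
table(selected(fair))
```

The same device extends to comparing linear (bols) against smooth (bbs) base-learners for one covariate: giving both the same df lets boosting decide between a linear and a nonlinear effect without a built-in preference for the more complex one.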
The gathering of clinical data on fractures of dental restorations through prospective clinical trials is a labor- and time-consuming enterprise. Here, we propose an unconventional approach for collecting large datasets from which clinical information on indirect restorations can be retrospectively analyzed. The authors accessed the database of an industry-scale machining center in Germany and obtained information on 34,911 computer-aided design (CAD)/computer-aided manufacturing (CAM) all-ceramic posterior restorations. The fractures of bridges, crowns, onlays, and inlays fabricated from different all-ceramic systems over a period of 3.5 y were reported by dentists and entered in the database. Survival analyses and estimations of future life revealed differences in performance among ZrO2-based restorations, lithium disilicate, and leucite-reinforced glass-ceramics.