Willem Waegeman scite author profile

Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods

The notion of uncertainty is of major importance in machine learning and constitutes a key element of machine learning methodology. In line with the statistical tradition, uncertainty has long been perceived as almost synonymous with standard probability and probabilistic predictions. Yet, due to the steadily increasing relevance of machine learning for practical applications and related issues such as safety requirements, new problems and challenges have recently been identified by machine learning scholars, and these problems may call for new methodological developments. In particular, this includes the importance of distinguishing between (at least) two different types of uncertainty, often referred to as aleatoric and epistemic. In this paper, we provide an introduction to the topic of uncertainty in machine learning as well as an overview of attempts so far at handling uncertainty in general and formalizing this distinction in particular.
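
One widely used way to make the aleatoric/epistemic distinction concrete is an entropy decomposition over an ensemble of probabilistic classifiers. The sketch below is only an illustration under that assumption and is not claimed to be the method of the paper; the function names and the toy ensembles are hypothetical. Total uncertainty is the entropy of the averaged prediction, the aleatoric part is the average entropy of the individual members, and the epistemic part is their difference (the disagreement between members).

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def decompose_uncertainty(member_probs):
    """Split total predictive uncertainty of an ensemble into two parts.

    member_probs: list of class-probability lists, one per ensemble member.
    Returns (total, aleatoric, epistemic), all in nats.
    """
    n = len(member_probs)
    n_classes = len(member_probs[0])
    # Averaged prediction of the ensemble.
    mean = [sum(m[c] for m in member_probs) / n for c in range(n_classes)]
    total = entropy(mean)                                # entropy of the mean
    aleatoric = sum(entropy(m) for m in member_probs) / n  # mean of the entropies
    epistemic = total - aleatoric                        # disagreement between members
    return total, aleatoric, epistemic

# Hypothetical ensembles: members agree on a coin flip vs. members disagree confidently.
agree = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
disagree = [[1.0, 0.0], [0.0, 1.0]]

print(decompose_uncertainty(agree))
print(decompose_uncertainty(disagree))
```

Both toy ensembles produce the same averaged prediction, so their total uncertainty is identical; only the decomposition reveals that the first case is purely aleatoric and the second purely epistemic.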

On label dependence and loss minimization in multi-label classification

Dembczyński

et al. 2012

Most multi-label classification (MLC) methods proposed in recent years aim to exploit, in one way or another, dependencies between the class labels. When such a method is compared with simple binary relevance learning as a baseline, any gain in performance is normally explained by the fact that the baseline ignores such dependencies. Without questioning the correctness of such studies, one has to admit that a blanket explanation of that kind hides many subtle details, and indeed, the underlying mechanisms and true reasons for the improvements reported in experimental studies are rarely laid bare. Rather than proposing yet another MLC algorithm, this paper elaborates more closely on the idea of exploiting label dependence, thereby contributing to a better understanding of MLC. Adopting a statistical perspective, we claim that two types of label dependence should be distinguished, namely conditional and marginal dependence. Subsequently, we present three scenarios in which the exploitation of one of these types of dependence may boost the predictive performance of a classifier. In this regard, a close connection with loss minimization is established, showing that the benefit of exploiting label dependence also depends on the type of loss to be minimized. Concrete theoretical results are presented for two repre
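
The link between label dependence and the loss being minimized can be illustrated with a tiny brute-force example. The sketch below is an illustration only: the joint distribution is made up, and Hamming loss and subset 0/1 loss are chosen here simply as two common MLC losses. It computes the expected loss of every candidate prediction and shows that the best prediction differs between the two losses when the labels are dependent.

```python
from itertools import product

# Hypothetical conditional joint distribution P(y1, y2 | x) for one instance.
joint = {(0, 0): 0.4, (0, 1): 0.0, (1, 0): 0.3, (1, 1): 0.3}

def hamming_loss(y, h):
    """Fraction of labels predicted incorrectly."""
    return sum(yi != hi for yi, hi in zip(y, h)) / len(y)

def subset_zero_one_loss(y, h):
    """1 unless the whole label vector is predicted exactly."""
    return float(y != h)

def risk_minimizer(loss):
    """Candidate prediction with the smallest expected loss under `joint`."""
    candidates = list(product([0, 1], repeat=2))
    expected = {h: sum(p * loss(y, h) for y, p in joint.items()) for h in candidates}
    return min(expected, key=expected.get)

hamming_best = risk_minimizer(hamming_loss)         # follows the marginal modes
subset_best = risk_minimizer(subset_zero_one_loss)  # follows the joint mode

print("Hamming loss minimizer:", hamming_best)
print("Subset 0/1 loss minimizer:", subset_best)
```

In this toy distribution the marginals favour y1 = 1 and y2 = 0, while the single most probable label combination is (0, 0), so a per-label predictor is adequate for one loss and not for the other.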

Absolute quantification of microbial taxon abundances

et al. 2016

High-throughput amplicon sequencing has become a well-established approach for microbial community profiling. Correlating shifts in the relative abundances of bacterial taxa with environmental gradients is the goal of many microbiome surveys. As the abundances generated by this technology are semi-quantitative by definition, the observed dynamics may not accurately reflect those of the actual taxon densities. We combined the sequencing approach (16S rRNA gene) with robust single-cell enumeration technologies (flow cytometry) to quantify the absolute taxon abundances. A detailed longitudinal analysis of the absolute abundances resulted in distinct abundance profiles that were less ambiguous and expressed in units that can be directly compared across studies. We further provide evidence that the enrichment of taxa (increase in relative abundance) does not necessarily relate to the outgrowth of taxa (increase in absolute abundance). Our results highlight that both relative and absolute abundances should be considered for a comprehensive biological interpretation of microbiome surveys.

The ISME Journal (2017) 11, 584-587; doi:10.1038/ismej.2016 published online 9 September 2016

Recent advancements in high-throughput sequencing of marker genes, such as the 16S rRNA gene, have provided microbial ecologists the tools to accurately infer the relative composition of microbial communities (Franzosa et al., 2015). This resulted in a widespread application of the technology in longitudinal studies where shifts in community structure are related to environmental variables and functional outputs (Faust et al., 2015; Wilhelm et al., 2015). An inherent limitation of the sequencing technology is that the calculated taxon abundances comprise relative values (Widder et al., 2016). Hence, caution must be taken with the biological interpretation of these values, since inter-sample differences in cell density are not considered.

To our knowledge, there are no descriptive studies that assess the extent to which relative abundances deliver a skewed image of the actual microbial community dynamics. In this study, we combined robust cell density measurements from flow cytometry (Prest et al., 2013; Van Nevel et al., 2013) with the relative abundances derived from 16S rRNA gene amplicon sequencing. We performed two extensive longitudinal surveys on the central water reservoir of a cooling water system. This engineered freshwater ecosystem was subjected to highly controlled operational phases (Supplementary Information and data set). We quantified the absolute taxon abundances and assessed whether additional insights could be attained with the combined approach.

Based on the sample-specified total cell density, the absolute taxon abundances were calculated for each time point. Individual taxon densities ranged from 0.5 to 1 679 cells per μl. Several inter-taxon differences became apparent by performing ordinary least squares regression analysis between the relative and absolute abundances. We focused on the three most abundant taxa, which ...
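
The calculation described above, scaling each sample's relative abundances by its total cell density, can be sketched in a few lines. This is a minimal illustration with made-up numbers and hypothetical taxon names, not the study's data or pipeline. It shows how a taxon can be enriched (higher relative abundance) without any outgrowth (its absolute abundance actually falls).

```python
def absolute_abundances(relative, total_cells_per_ul):
    """Scale relative abundances (fractions of the community) by the
    sample's total cell density to obtain cells per microlitre."""
    return {taxon: frac * total_cells_per_ul for taxon, frac in relative.items()}

# Hypothetical samples: sequencing gives the fractions, flow cytometry the totals.
rel_early = {"taxon_A": 0.2, "taxon_B": 0.8}
rel_late = {"taxon_A": 0.4, "taxon_B": 0.6}

early = absolute_abundances(rel_early, total_cells_per_ul=1000.0)
late = absolute_abundances(rel_late, total_cells_per_ul=400.0)

enriched = rel_late["taxon_A"] > rel_early["taxon_A"]  # relative abundance rose
outgrown = late["taxon_A"] > early["taxon_A"]          # absolute abundance rose?

print("taxon_A cells/ul:", early["taxon_A"], "->", late["taxon_A"])
print("enriched:", enriched, "| outgrown:", outgrown)
```

Because the total cell density dropped between the two hypothetical samples, the doubling in relative abundance hides a decrease in actual cell numbers.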

Bacterial species identification from MALDI-TOF mass spectra through data analysis and machine learning

Bruyne

Slabbinck

Waegeman

et al. 2011

Systematic and Applied Microbiology

182

122

Vegetation anomalies caused by antecedent precipitation in most of the world

Papagiannopoulou

Miralles

Dorigo

et al. 2017

Environ. Res. Lett.

120

100

Quantifying environmental controls on vegetation is critical to predict the net effect of climate change on global ecosystems and the subsequent feedback on climate. Following a non-linear Granger causality framework based on a random forest predictive model, we exploit the current wealth of multi-decadal satellite data records to uncover the main drivers of monthly vegetation variability at the global scale. Results indicate that water availability is the most dominant factor driving vegetation globally: about 61% of the vegetated surface was primarily water-limited during 1981-2010. This included semiarid climates but also transitional ecoregions. Intra-annually, temperature controls Northern Hemisphere deciduous forests during the growing season, while antecedent precipitation largely dominates vegetation dynamics during the senescence period. The uncovered dependency of global vegetation on water availability is substantially larger than previously reported. This is due to the ability of the framework to (1) disentangle the collinearities between radiation/temperature and precipitation, and (2) quantify non-linear impacts of climate on vegetation. Our results reveal a prolonged effect of precipitation anomalies in dry regions: due to the long memory of soil moisture and the cumulative, non-linear response of vegetation, water-limited regions show sensitivity to the values of precipitation occurring three months earlier. Meanwhile, the impacts of temperature and radiation anomalies are more immediate and dissipate quickly, pointing to a higher resilience of vegetation to these anomalies. Despite being infrequent by definition, hydro-climatic extremes are responsible for up to 10% of the vegetation variability during the 1981-2010 period in certain areas, particularly in water-limited ecosystems.
Our approach is a first step towards a quantitative comparison of the resistance and resilience signature of different ecosystems, and can be used to benchmark Earth system models in their representations of past vegetation sensitivity to changes in climate.

An experimental comparison of cross-validation techniques for estimating the area under the ROC curve

Airola

Pahikkala

Waegeman

et al. 2011

Computational Statistics & Data Analysis

126

103

A non-linear Granger-causality framework to investigate climate–vegetation dynamics

et al. 2017

Satellite Earth observation has led to the creation of global climate data records of many important environmental and climatic variables. These come in the form of multivariate time series with different spatial and temporal resolutions. Data of this kind provide new means to further unravel the influence of climate on vegetation dynamics. However, as advocated in this article, commonly used statistical methods are often too simplistic to represent complex climate-vegetation relationships due to linearity assumptions. Therefore, as an extension of linear Granger-causality analysis, we present a novel non-linear framework consisting of several components, such as data collection from various databases, time series decomposition techniques, feature construction methods, and predictive modelling by means of random forests. Experimental results on global data sets indicate that, with this framework, it is possible to detect non-linear patterns that are much less visible with traditional Granger-causality methods. In addition, we discuss extensive experimental results that highlight the importance of considering non-linear aspects of climate-vegetation dynamics.
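
The core idea of a predictive, non-linear Granger-causality test can be sketched as follows: a driver x is said to Granger-cause y if adding lagged values of x to a model that already sees lagged values of y improves out-of-sample prediction. The sketch below is an illustration under stated assumptions, not the paper's pipeline. To stay dependency-light it swaps the random forest for a simple k-nearest-neighbour regressor written in NumPy, and the function names, the synthetic series, and all parameter values are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=10):
    """Non-linear stand-in model: average the targets of the k nearest training points."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def r2(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def granger_gain(y, x, lag=1, train_frac=0.7, k=10):
    """Compare a baseline model (lags of y only) with a full model (lags of y and x)."""
    target = y[lag:]
    own = np.column_stack([y[lag - j: len(y) - j] for j in range(1, lag + 1)])
    other = np.column_stack([x[lag - j: len(x) - j] for j in range(1, lag + 1)])
    full = np.hstack([own, other])
    split = int(train_frac * len(target))  # chronological train/test split
    r2_base = r2(target[split:], knn_predict(own[:split], target[:split], own[split:], k))
    r2_full = r2(target[split:], knn_predict(full[:split], target[:split], full[split:], k))
    return {"r2_baseline": r2_base, "r2_full": r2_full, "gain": r2_full - r2_base}

# Synthetic series: y depends non-linearly (quadratically) on the previous value of x.
rng = np.random.default_rng(0)
n = 600
x = rng.uniform(-2.0, 2.0, n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.3 * y[t - 1] + x[t - 1] ** 2 + 0.1 * rng.normal()

result = granger_gain(y, x)
print(result)
```

A linear model would largely miss this quadratic dependence because x and x squared are uncorrelated for a symmetric driver, which is the kind of pattern a non-linear predictive model is meant to pick up.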

Habitat prediction and knowledge extraction for spawning European grayling (Thymallus thymallus L.) using a broad range of species distribution models

Fukuda

Baets

Waegeman

et al. 2013

Environmental Modelling & Software

118
