2022
DOI: 10.1007/978-3-030-67024-5_14

Automating Data Science

Abstract: It has been observed that, in data science, a great part of the effort usually goes into various preparatory steps that precede model-building. The aim of this chapter is to focus on some of these steps. A comprehensive description of a given task to be resolved is usually supplied by the domain expert. Techniques exist that can process a natural language description to obtain task descriptors (e.g., keywords), determine the task type, the domain, and the goals. This in turn can be used to search for the require…

Cited by 7 publications (15 citation statements)
References 31 publications (18 reference statements)
“…For instance, does it make sense to smooth learning curves for model selection or curve extrapolation? Can meta-features [2], like the number of instances, be used to reliably predict (i) whether curves intersect, (ii) if they are monotone or convex, (iii) or which curve model will be accurate? Which nonparametric extrapolation techniques work best and in what case?…”
Section: Discussion
confidence: 99%
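To make the meta-features mentioned in the quote above (e.g., the number of instances) concrete, the following is a minimal sketch of how such dataset-level descriptors could be computed. The particular feature set and the use of scikit-learn's iris data are illustrative assumptions, not taken from the cited chapter.

import numpy as np
from sklearn.datasets import load_iris

def basic_meta_features(X, y):
    """Compute a few simple, commonly used dataset-level meta-features."""
    n_instances, n_features = X.shape
    _, class_counts = np.unique(y, return_counts=True)
    class_probs = class_counts / class_counts.sum()
    return {
        "n_instances": n_instances,                       # training-set size
        "n_features": n_features,                         # dimensionality
        "n_classes": len(class_counts),                   # number of target classes
        "class_entropy": float(-(class_probs * np.log2(class_probs)).sum()),
        "mean_feature_std": float(X.std(axis=0).mean()),  # average feature spread
    }

X, y = load_iris(return_X_y=True)
print(basic_meta_features(X, y))

Descriptors like these could then serve as inputs to a model that predicts properties of a learning curve, along the lines the quoted questions suggest.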
“…For example, they can be extrapolated to determine the value of gathering more data or can be used to speed up training by selecting a smaller dataset size that still reaches sufficient accuracy. In addition, learning curves can provide useful information for model selection [2,26]. Particularly important questions concern the performance in the limit and the training set size at which the learning curves of two algorithms cross, as this can tell us when one learning algorithm should be preferred over the other.…”
Section: Introduction
confidence: 99%
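As an illustration of the extrapolation idea in the quote above, the sketch below fits a power-law curve err(n) ≈ a·n^(−b) + c to observed errors at small training-set sizes and extrapolates it to a larger size. The functional form and the data points are assumptions chosen for illustration, not results from the cited chapter.

import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Simple parametric learning-curve model: error decays toward asymptote c."""
    return a * np.power(n, -b) + c

# Hypothetical observed (training size, error rate) pairs.
sizes = np.array([100, 200, 400, 800, 1600], dtype=float)
errors = np.array([0.30, 0.24, 0.20, 0.17, 0.155])

params, _ = curve_fit(power_law, sizes, errors, p0=(1.0, 0.5, 0.1), maxfev=10000)
a, b, c = params

# Extrapolate: predicted error with 10x more data, and the estimated asymptote.
print(f"predicted error at n=16000: {power_law(16000, a, b, c):.3f}")
print(f"estimated asymptotic error: {c:.3f}")

Comparing such extrapolated curves for two algorithms is one way to estimate where they cross and whether gathering more data is worthwhile.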
“…A good alternative to select the best ML algorithm for a new dataset is to use previous knowledge regarding the performance of a set of algorithms in previous learning experiences. This is the idea behind a particular approach for metalearning, defined in [36] as learning to learn. According to the authors, metalearning is a research area that investigates how to recommend the most suitable algorithm, or set of algorithms, for a new task.…”
Section: Metalearning
confidence: 99%
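A minimal sketch of the recommendation idea described in the quote above: represent each previously seen dataset by meta-features, record which algorithm performed best on it, and recommend for a new dataset whatever won on its nearest neighbours in meta-feature space. The meta-features, algorithm names, and numbers below are hypothetical placeholders, not results from the cited work.

import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors

# Meta-feature vectors (n_instances, n_features, class_entropy) of past datasets.
past_meta = np.array([
    [150, 4, 1.58],
    [1000, 20, 0.99],
    [50000, 8, 1.00],
    [300, 30, 2.32],
    [20000, 100, 0.72],
], dtype=float)
# Best-performing algorithm recorded for each past dataset.
past_winner = ["knn", "random_forest", "gradient_boosting", "svm", "gradient_boosting"]

def recommend(new_meta, k=3):
    """Recommend the algorithm that most often won on the k nearest past datasets."""
    # Standardize meta-features so no single one dominates the distance.
    mean, std = past_meta.mean(axis=0), past_meta.std(axis=0)
    nn = NearestNeighbors(n_neighbors=k).fit((past_meta - mean) / std)
    _, idx = nn.kneighbors([(np.asarray(new_meta, dtype=float) - mean) / std])
    votes = Counter(past_winner[i] for i in idx[0])
    return votes.most_common(1)[0][0]

print(recommend([5000, 15, 1.2]))

The nearest-neighbour vote is only one of several ways to exploit the stored performance knowledge; the quoted work frames the general problem rather than prescribing this particular scheme.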
“…To enable knowledge sharing across data sets, the scientific community has developed methods commonly referred to as meta-learning. 19 Whereas traditional machine learning models typically require an abundance of labeled data, meta-learning attempts to address this issue by asking how to learn to learn tasks. For this, meta-learning borrows intuition from how humans learn and solve problems.…”
Section: Introduction
confidence: 99%
“…Instead of learning each task independently and anew, humans approach each challenge with prior knowledge. 19,20 With the success of transfer learning techniques in natural language processing or image analysis, its potential use in QSAR modeling has been recognized. 21,22 We believe the use of these techniques could be beneficial in utilizing and predicting the many low-resource tasks inherent to aquatic toxicity.…”
Section: Introduction
confidence: 99%