Data Science through the looking glass and what we found there

Psallidas, Fotis; Zhu, Yunping; Karlaš, Bojan; Interlandi, Matteo; Floratou, Avrilia; Karanasos, Konstantinos; Wu, Wentao; Zhang, Ce; Krishnan, Subru; Curino, Carlo; Weimer, Markus

doi:10.48550/arxiv.1912.09536

Cited by 7 publications

(15 citation statements)

References 0 publications


“…Model cards are still scarcely adopted in practice. In all of GitHub with millions of public notebooks [31,33,37] and many repositories sharing learning code and learned models, we found only 24 models documented explicitly with model cards. Our best effort on finding model cards published by companies results in only 28 models.…”

Section: Discussion (mentioning)

confidence: 99%

Aspirations and Practice of Model Documentation: Moving the Needle with Nudging and Traceability

Bhat, Coursey, Hu et al. 2022

Preprint


Machine learning models have been widely developed, released, and adopted in numerous applications. Meanwhile, the documentation practice for machine learning models often falls short of established practices for traditional software components, which impedes model accountability, inadvertently abets inappropriate or misuse of models, and may trigger negative social impact. Recently, model cards, a template for documenting machine learning models, have attracted notable attention, but their impact on the practice of model documentation is unclear. In this work, we examine publicly available model cards and other similar documentation. Our analysis reveals a substantial gap between the suggestions made in the original model card work and the content in actual documentation. Motivated by this observation and literature on fields such as software documentation, interaction design, and traceability, we further propose a set of design guidelines that aim to support the documentation practice for machine learning models including (1) the collocation of documentation environment with the coding environment, (2) nudging the consideration of model card sections during model development, and (3) documentation derived from and traced to the source. We designed a prototype tool named DocML following those guidelines to support model development in computational notebooks. A lab study reveals the benefit of our tool to shift the behavior of data scientists towards documentation quality and accountability.



“…To do this, we analyzed over 480,000 pipelines with more than 920,000 operators, the result of searches carried out under a 5 minute execution time budget. For purposes of our analysis, we define an operator to be a single step in the pipeline, which can correspond to a data transformer or a predictor (a distinction presented in [49]). We calculate the amount of operators per pipeline for different sampling ratios, analyzing all the pipelines evaluated during the search procedure to account for changes during evolutions.…”

Section: RQ4: Pipeline Characteristics (mentioning)

confidence: 99%

“…For example, when we use a downsampling ratio of 0.0001 the average pipeline has 1.85 (0.30 sd) operators, while a full dataset results in an average pipeline with 1.60 (0.12 sd) operators. For context, a recent large scale pipeline analysis by Psallidas et al [49] found that most user-implemented scikit-learn (TPOT's target API) pipelines consist of 1-4 operators.…”

Section: RQ4: Pipeline Characteristics (mentioning)

confidence: 99%

“…Building and tuning well performing machine learning systems is a difficult task that benefits from domain and specialized data science knowledge [11,14,29,61]. Developing a machine learning pipeline requires users to identify the relevant algorithms, decide how to compose these, choose the key hyperparameters and their values for each algorithm, and then implement this pipeline (typically) using a third-party library [15,49]. To further increase the…”

Section: Introduction (mentioning)

confidence: 99%


Doing more with less

et al. 2021


Automated machine learning (AutoML) promises to democratize machine learning by automatically generating machine learning pipelines with little to no user intervention. Typically, a search procedure is used to repeatedly generate and validate candidate pipelines, maximizing a predictive performance metric, subject to a limited execution time budget. While this approach to generating candidates works well for small tabular datasets, the same procedure does not directly scale to larger tabular datasets with 100,000s of observations, often producing fewer candidate pipelines and yielding lower performance, given the same execution time budget. We carry out an extensive empirical evaluation of the impact that downsampling - reducing the number of rows in the input tabular dataset - has on the pipelines produced by a genetic-programming-based AutoML search for classification tasks.


“…Studies through empirical code analysis and qualitative studies offer different lenses into studying human-centered practices in developing ML workflows. Psallidas et al [18] analyzed publicly-available computational notebooks and enterprise data science code and pipelines to illustrate growing trends and usage behavior of data science tools. Other studies have employed qualitative, semi-structured interviews to study how different groups of users engage with ML development, including how software engineers [2] and non-experts [25] develop ML-based applications, and how ML practitioners iterate on their data in ML development [11].…”

Section: Related Work (mentioning)

confidence: 99%

Demystifying a Dark Art: Understanding Real-World Machine Learning Model Development

Lee, Xin, Lee et al. 2020

Preprint


It is well-known that the process of developing machine learning (ML) workflows is a dark-art; even experts struggle to find an optimal workflow leading to a high accuracy model. Users currently rely on empirical trial-and-error to obtain their own set of battle-tested guidelines to inform their modeling decisions. In this study, we aim to demystify this dark art by understanding how people iterate on ML workflows in practice. We analyze over 475k user-generated workflows on OpenML, an open-source platform for tracking and sharing ML workflows. We find that users often adopt a manual, automated, or mixed approach when iterating on their workflows. We observe that manual approaches result in fewer wasted iterations compared to automated approaches. Yet, automated approaches often involve more preprocessing and hyperparameter options explored, resulting in higher performance overall, suggesting potential benefits for a human-in-the-loop ML system that appropriately recommends a clever combination of the two strategies.
