Smallset Timelines: A Visual Representation of Data Preprocessing Decisions

Lucchesi, Lydia R.; Kuhnert, Petra M.; Davis, Jenny; Xie, Lexing

doi:10.1145/3531146.3533175

Cited by 26 publications

(7 citation statements)

References 35 publications

“…To bridge the expertise gap, systems should be designed to aid domain experts in understanding data science techniques in a manner that better matches their mental model of data [21]. Visualizations of data changes [17,24,32] have proved to be an effective means of supporting understanding beyond serving the function of documentation. Recent work in Explainable AI [7,9,11,18,50] has also demonstrated ways to make ML components more interpretable to non-experts through visualization and direct manipulation interactions [2,29,39].…”

Section: Related Work (mentioning)

confidence: 99%

“…Informed by the challenges and potential solutions identified in prior research, we explore the use of LLMs to supplement human effort in translating code into natural language and explainable visualizations. In CellSync, we build on the Smallset Timelines visualization technique [24] to select and visualize a digestible subset of rows and columns for domain experts. Further, we utilize LLMs' code summarization capabilities [1] and refine these summaries through targeted prompt engineering to tailor them towards domain experts.…”

Section: Related Work (mentioning)

confidence: 99%

“…Both the extension and the dashboard provide a chat feature for domain experts (Figure 1d) and data scientists (Figure 2) to exchange comments. Each SnapGrid, inspired by Smallset Timelines [24] which visualizes data changes in grid-based snapshots, shows a static 9-by-9 subset of the data; the subset is selected to maximize the coverage of the changes tied to the data operation. On each square that represents a cell in the dataframe, the cell's previous and new values are displayed with an arrow between them.…”

Section: User Experience (mentioning)

confidence: 99%

“…4.1.2 Computing Subset for SnapGrid. Building upon the Smallset Timelines subset computation [24], we devise an algorithm to select a subset of 9 rows by 9 columns that maximizes the coverage of value changes for SnapGrid. The algorithm selects the dataset rows with the highest number of changed values, and the columns that have been directly affected or are implicitly involved in column changes (e.g.…”

Section: Natural Language Code Description (mentioning)

confidence: 99%
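
The subset selection described in the statement above can be illustrated with a minimal sketch. It is not the CellSync or Smallset Timelines implementation: the function names `select_change_subset` and `render_cell`, the list-of-dicts data layout, and the tie-breaking rules are all assumptions made for illustration. It only covers columns whose values changed directly, not columns that are implicitly involved in a change.

```python
from collections import Counter


def select_change_subset(before, after, max_rows=9, max_cols=9):
    """Pick the rows and columns that contain the most changed cells.

    `before` and `after` are equally long lists of dicts (one dict per row,
    same keys). This is a simple heuristic, not the published algorithm.
    """
    row_changes = Counter()
    col_changes = Counter()
    for i, (old, new) in enumerate(zip(before, after)):
        for col in old:
            if old[col] != new.get(col):
                row_changes[i] += 1
                col_changes[col] += 1
    # Most-changed rows first; ties are broken by row index (an assumption).
    rows = sorted(row_changes, key=lambda i: (-row_changes[i], i))[:max_rows]
    # Most-changed columns first; ties are broken alphabetically (an assumption).
    cols = sorted(col_changes, key=lambda c: (-col_changes[c], c))[:max_cols]
    return rows, cols


def render_cell(old, new):
    """Show a cell as 'old → new' when it changed, otherwise just the value."""
    return str(old) if old == new else f"{old} → {new}"


before = [
    {"age": 30, "city": "NYC", "income": None},
    {"age": None, "city": "LA", "income": 50},
    {"age": 41, "city": "SF", "income": 70},
]
after = [
    {"age": 30, "city": "NYC", "income": 0},
    {"age": 35, "city": "la", "income": 50},
    {"age": 41, "city": "SF", "income": 70},
]

rows, cols = select_change_subset(before, after, max_rows=2, max_cols=2)
for i in rows:
    print(i, [render_cell(before[i][c], after[i][c]) for c in cols])
```

Ranking by change count is a greedy choice: it favours the densest rows and columns, but it does not guarantee that every kind of change appears in the grid.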

Leveraging Large Language Models to Enhance Domain Expert Inclusion in Data Science Workflows

Shih, Mohanty, Katsis et al. 2024

Extended Abstracts of the CHI Conference on Human Factors in Computing Systems

Domain experts can play a crucial role in guiding data scientists to optimize machine learning models while ensuring contextual relevance for downstream use. However, in current workflows, such collaboration is challenging due to differing expertise, abstract documentation practices, and lack of access and visibility into low-level implementation artifacts. To address these challenges and enable domain expert participation, we introduce CellSync, a collaboration framework comprising (1) a Jupyter Notebook extension that continuously tracks changes to dataframes and model metrics and (2) a Large Language Model powered visualization dashboard that makes those changes interpretable to domain experts. Through CellSync's cell-level dataset visualization with code summaries, domain experts can interactively examine how individual data and modeling operations impact different data segments. The chat features enable data-centric conversations and targeted feedback to data scientists. Our preliminary evaluation shows that CellSync provides transparency and promotes critical discussions about the intents and implications of data operations.

CCS Concepts: Human-centered computing → User interface programming; Collaborative and social computing systems and tools; Information visualization.

“…Standardised description of specific preprocessing steps is challenging due to the wide variety of possible data alterations. Moreover, as observed by Lucchesi et al (2022), definitions of data preprocessing vary with audience and context from highly specific lists of tasks, to broadly encompassing boundaries within a longer data pipeline. Existing provenance tools such as (Lucchesi et al 2022;Kai Xiong et al 2022;Wang et al 2022) attempt to achieve generality by comparing dataset snapshots at various points in a preprocessing pipeline.…”

Section: Data Provenance (mentioning)

confidence: 99%
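
The snapshot-comparison approach mentioned in the statement above can be sketched as follows. This is an illustrative assumption about how such a comparison might work, not the method of any of the cited tools: the helpers `snapshot_diff` and `run_with_snapshots`, the dict-of-rows layout keyed by row id, and the two example preprocessing steps are hypothetical.

```python
import copy


def snapshot_diff(prev, curr):
    """Summarise differences between two snapshots keyed by row id."""
    removed = sorted(set(prev) - set(curr))
    added = sorted(set(curr) - set(prev))
    changed = sorted(
        (rid, col)
        for rid in set(prev) & set(curr)
        for col in set(prev[rid]) | set(curr[rid])
        if prev[rid].get(col) != curr[rid].get(col)
    )
    return {"removed_rows": removed, "added_rows": added, "changed_cells": changed}


def run_with_snapshots(data, steps):
    """Apply each named step to a copy of the data and diff consecutive snapshots."""
    log = []
    current = copy.deepcopy(data)
    for name, step in steps:
        nxt = step(copy.deepcopy(current))
        log.append((name, snapshot_diff(current, nxt)))
        current = nxt
    return current, log


# Hypothetical preprocessing steps, used only to exercise the sketch.
def drop_missing_age(d):
    return {rid: row for rid, row in d.items() if row["age"] is not None}


def fill_income(d):
    return {
        rid: {**row, "income": 0 if row["income"] is None else row["income"]}
        for rid, row in d.items()
    }


data = {
    1: {"age": 30, "income": None},
    2: {"age": None, "income": 50},
    3: {"age": 41, "income": 70},
}
steps = [
    ("drop rows with missing age", drop_missing_age),
    ("fill missing income with 0", fill_income),
]
final, log = run_with_snapshots(data, steps)
for name, diff in log:
    print(name, diff)
```

Comparing snapshots stays general because it needs no knowledge of what each step does, but it records only what changed between two points, not the intent behind the change.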

A Unified Framework for Specification Tests of Continuous Treatment Effect Models

Huang, Zhang 2021

Journal of Business & Economic Statistics

Ex-post harmonisation is one of many data preprocessing processes used to combine the increasingly vast and diverse sources of data available for research and analysis. Documenting provenance and ensuring the quality of multi-source datasets is vital for ensuring trustworthy scientific research and encouraging reuse of existing harmonisation efforts. However, capturing and communicating statistically relevant properties of harmonised datasets is difficult without a universal standard for describing harmonisation operations. Our paper combines mathematical and computer science perspectives to address this need. The Crossmaps Framework defines a new approach for transforming existing variables collected under a specific measurement or classification standard to an imputed counterfactual variable indexed by some target standard. It uses computational graphs to separate intended transformation logic from actual data transformations, and avoid the risk of syntactically valid data manipulation scripts resulting in statistically questionable data. In this paper, we introduce the Crossmaps Framework through the example of ex-post harmonisation of aggregated statistics in the social sciences. We define a new provenance task abstraction, the crossmap transform, and formalise two associated objects, the shared mass array and the crossmap. We further define graph, matrix and list encodings of crossmaps and discuss resulting implications for understanding statistical properties of ex-post harmonisation and designing error minimising workflows.

Preregistration and Registered Reports in Sociology: Strengths, Weaknesses, and Other Considerations

Manago 2023

Am Soc

Both within and outside of sociology, there are conversations about methods to reduce error and improve research quality—one such method is preregistration and its counterpart, registered reports. Preregistration is the process of detailing research questions, variables, analysis plans, etc. before conducting research. Registered reports take this one step further, with a paper being reviewed on the merit of these plans, not its findings. In this manuscript, I detail preregistration’s and registered reports’ strengths and weaknesses for improving the quality of sociological research. I conclude by considering the implications of a structural-level adoption of preregistration and registered reports. Importantly, I do not recommend that all sociologists use preregistration and registered reports for all studies. Rather, I discuss the potential benefits and genuine limitations of preregistration and registered reports for the individual sociologist and the discipline.
