2021
DOI: 10.3390/info12100392

Understanding Collections of Related Datasets Using Dependent MMD Coresets

Abstract: Understanding how two datasets differ can help us determine whether one dataset under-represents certain sub-populations, and provides insights into how well models will generalize across datasets. Representative points selected by a maximum mean discrepancy (MMD) coreset can provide interpretable summaries of a single dataset, but are not easily compared across datasets. In this paper, we introduce dependent MMD coresets, a data summarization method for collections of datasets that facilitates comparison of d…

citations
Cited by 2 publications
(2 citation statements)
references
References 28 publications
“…Second, the technical work of data science projects is typically approached as the 'one-off application' of a statistical model to a given dataset (Polyzotis et al 2017). Built on an assumption of a 'largely stable world' (Marcus 2018), data science often views changes to data (or its underlying distribution), called 'data drift', as detrimental to model performance (Hohman et al 2020; Hoens et al 2012; Williamson and Henderson 2021). For data science activities to be sustainable, data needs to be managed to account for changes to data in the dynamic world (e.g., Amershi et al 2019b; Bopp et al 2017).…”
Section: Sustaining Data Science Activities By Domain Experts As the ...
Citation type: mentioning
Confidence: 99%
“…Second, the technical work of data science projects is typically approached as the 'one-off application' of a statistical model to a given dataset (Polyzotis et al 2017). Built on an assumption of a 'largely stable world' (Marcus 2018), data science often views changes to data (or its underlying distribution), called 'data drift', as detrimental to model performance (Hohman et al 2020; Hoens et al 2012; Williamson and Henderson 2021). For data science activities to be sustainable, data needs to be managed to account for changes to data in the dynamic world (e.g., Amershi et al 2019b; Bopp et al 2017).…”
Section: Sustaining Data Science Activities By Domain Experts As the ...
Citation type: mentioning
Confidence: 99%