2020
DOI: 10.48550/arxiv.2006.16923
Preprint

Large image datasets: A pyrrhic win for computer vision?

Abstract: In this paper we investigate problematic practices and consequences of large scale vision datasets. We examine broad issues such as the question of consent and justice as well as specific concerns such as the inclusion of verifiably pornographic images in datasets. Taking the ImageNet-ILSVRC-2012 dataset as an example, we perform a cross-sectional model-based quantitative census covering factors such as age, gender, NSFW content scoring, class-wise accuracy, human-cardinality-analysis, and the semanticity of t…

Cited by 22 publications (23 citation statements)
References 36 publications
“…The images were subjected to an array of automated filters designed to remove potentially offensive content. While certainly not perfect, this substantially reduces the issues that plague other large image datasets [8,55]. We construct a multi-label dataset using these images by converting all hashtags into their corresponding canonical targets (note that a single image may have multiple hashtags).…”
Section: Hashtag Dataset Collection (mentioning)
confidence: 99%
“…In the supplementary material we include experiments with an out-of-training face dataset Kärkkäinen & Joo (2019). Although we are aware of the ethical issues with ImageNet, and share the concerns over its nonconsensual content Prabhu & Birhane (2020), a direct comparison to existing results in the literature requires us to use the dataset.…”
Section: Data Preprocessing (mentioning)
confidence: 99%
“…Note, to our knowledge, these datasets are not known to contain personally identifiable information or offensive content. Although CIFAR-10 and CINIC-100 use images from the problematic ImageNet and Tiny Images [32], they contain manually selected subsets. The list of dataset-model combinations, or tasks, available in the trained model corpus can be seen in the first two rows of Table 1.…”
Section: Generalization Predictions: Experimental Setup (mentioning)
confidence: 99%