With the increasing importance and complexity of data pipelines, data quality has become one of the key challenges in modern software applications. The importance of data quality has been recognized beyond the field of data engineering and database management systems (DBMSs). For machine learning (ML) applications, too, high data quality standards are crucial to ensure robust predictive performance and responsible usage of automated decision making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and can have a devastating impact on downstream ML applications when not detected. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions remain scarce. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing both novel deep learning approaches and classical ML imputation methods when missing data affect either only the test data or both the training and test data. Each imputation method is evaluated on its imputation quality and on the impact imputation has on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope that our results help researchers and engineers choose data preprocessing methods for automated data quality improvement.
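The evaluation idea in the abstract above (corrupt known values, impute them, then score the imputation against the ground truth) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not name its methods or metrics, so mean imputation stands in as a simple classical baseline, values are removed completely at random, and the helper names `inject_mcar`, `mean_impute`, and `imputation_rmse` are hypothetical. The downstream-task half of the evaluation is not covered.

```python
import math
import random


def inject_mcar(values, fraction, rng):
    """Return a copy of `values` with a fraction of entries replaced by None.

    Entries are chosen uniformly at random, i.e. missing completely at random.
    This is an illustrative assumption, not the benchmark's missingness setup.
    """
    n_missing = int(round(fraction * len(values)))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def mean_impute(column):
    """Fill every None with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in column]


def imputation_rmse(truth, corrupted, imputed):
    """Root mean squared error, computed only on cells that were masked."""
    errors = [
        (t - p) ** 2
        for t, c, p in zip(truth, corrupted, imputed)
        if c is None
    ]
    return math.sqrt(sum(errors) / len(errors))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
truth = [rng.gauss(50.0, 10.0) for _ in range(200)]  # synthetic numeric column
corrupted = inject_mcar(truth, 0.3, rng)
imputed = mean_impute(corrupted)
rmse = imputation_rmse(truth, corrupted, imputed)
print(f"RMSE on the masked cells: {rmse:.2f}")
```

Scoring only the masked cells keeps the metric from being diluted by values that were never missing. A fuller comparison would swap other imputers into the place of `mean_impute` and also train a downstream model on the imputed data.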
A transition toward a sustainable way of living is more pressing than ever. One lever for achieving this transition is to increase the currently low level of sustainable consumption, and sustainability labeling has been shown to directly influence sustainable purchasing decisions. E-commerce retailers have recently begun informing online shoppers about sustainable alternatives by introducing third-party and private sustainability labels on their websites as nudging instruments. However, despite its increasing relevance in practice, research lacks evidence about the availability and credibility of sustainability labeling in online retail. Our study is guided by the question of how online retailers use sustainability labels to communicate information on the sustainability of products to consumers. Our empirical research is based on a large-scale dataset containing sustainability information on nearly 17,000 fashion products from the leading online retailers in Germany, Zalando and Otto. The results show that a large number of fashion products are tagged as sustainable, with two-thirds carrying a private label and one-third a third-party verified label. Only 14% of the tagged products, however, present credible third-party verified sustainability labels. This low percentage makes it challenging for consumers to comprehend to what degree a product is sustainable. The wide distribution of private labels indicates that most of the available sustainability information in the selected online shops addresses only single sustainability issues, preventing comparability. Furthermore, label heterogeneity can add to the confusion and uncertainty among consumers. Our practical recommendations support political initiatives that tackle the risk of greenwashing resulting from uncertified and weak sustainability information.
The production, shipping, usage, and disposal of consumer goods have a substantial impact on greenhouse gas emissions and the depletion of resources. Modern retail platforms rely heavily on machine learning (ML) for their search and recommender systems. Thus, ML can potentially support efforts towards more sustainable consumption patterns, for example, by accounting for sustainability aspects in product search or recommendations. However, leveraging ML's potential for reaching sustainability goals requires data on sustainability. Unfortunately, no open and publicly available database integrates sustainability information on a product-by-product basis. In this work, we present the GreenDB, which fills this gap. Based on search logs of millions of users, we prioritize which products users care about most. The GreenDB schema extends the well-known schema.org Product definition and can be readily integrated into existing product catalogs to improve the sustainability information available for search and recommendation experiences. We present our proof-of-concept implementation of a scraping system that creates the GreenDB dataset.