Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they study not only bugs, but also other concerns that are irrelevant to the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug-fixing commits. Methods: We use a crowdsourcing approach to manually label which changes contribute to the bug fix for each line in bug-fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug-fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to production code files, this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label, leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of the data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits are highly prevalent in bug fixes and can introduce a large amount of noise into the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptical and assume that unvalidated data is likely very noisy until proven otherwise.
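The consensus rule described in the abstract (four labelers per line, agreement of at least three required) can be sketched as follows; the function name and the example labels are illustrative, not taken from the study:

```python
from collections import Counter

def consensus_label(labels, quorum=3):
    """Return the consensus label for one line, or None if no label
    reaches the quorum. In the study, each line receives four labels
    and consensus requires at least three matching labels."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= quorum else None

# Three of four participants agree -> consensus on "bugfix"
print(consensus_label(["bugfix", "bugfix", "bugfix", "test"]))  # bugfix
# Active disagreement: no label reaches the quorum of three
print(consensus_label(["bugfix", "test", "doc", "refactor"]))   # None
```

Lines without consensus correspond to the roughly 11% of hard-to-label lines that produced active disagreements between participants.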
Research in software repository mining has grown considerably over the last decade. Due to the data-driven nature of this avenue of investigation, we identified several problems within the current state of the art that pose a threat to the external validity of results. The heavy reuse of data sets across many studies may invalidate results if problems with the data itself are identified. Moreover, for many studies the data and/or the implementations are not available, which hinders replication of the results and thereby decreases comparability between studies. Even if all information about a study is available, the diversity of the tooling used can still make replication very hard. In this paper, we discuss a potential solution to these problems: a cloud-based platform that integrates data collection and analytics. We created the prototype SmartSHARK, which implements our approach. Using SmartSHARK, we collected data from several projects and created different analytic examples. In this article, we present SmartSHARK and discuss our experiences regarding its use and the problems mentioned above.
Software mining is concerned with two primary goals: the extraction of basic facts from software repositories and the derivation of knowledge from the assessment of those facts. Fact extraction approaches rely on custom, task-specific infrastructures and tools. The resulting fact assets are usually represented in heterogeneous formats at a low level of abstraction. As a consequence, facts extracted from different sources are not well integrated, even when they are related. To manage this, existing infrastructures often aim to support an all-in-one information meta-structure that tries to integrate all facts into one connected whole. We propose a generic infrastructure that translates extracted facts into homogeneous high-level representations conforming to domain-specific metamodels, and then transforms these high-level model instances into instances of domain-specific models related to a particular assessment task, which can be incrementally enriched with additional facts as these become available or necessary. This allows researchers and practitioners to focus on the assessment task at hand, without being concerned with low-level representation details or complex data models containing large amounts of often irrelevant data. We present an example scenario with a concrete instantiation of the proposed infrastructure targeting the assessment of developer behaviour.