Software repositories are rich sources of information about the software development process. Mining the information stored in them has been shown to provide interesting insights into the history of software development and evolution. Several different types of information have been extracted and analyzed from different points of view. However, these types of information have not been sufficiently cross-examined to understand how they might complement each other. In this paper, we present a systematic analysis of four aspects of the software repository of an open-source project (source-code metrics, identifiers, return-on-investment estimates, and design differencing) to collect evidence about refactorings that may have happened during the project's development. In the context of this case study, we comparatively examine how informative each piece of information is towards understanding the refactoring history of the project and how costly it is to obtain.
Motivation and Introduction

Software repositories are rich sources of information about the software-development process, and mining this information has been shown to provide interesting insights into the lifecycle of a project and the design rationale underlying its evolution. Several different types of information have been extracted and analyzed to collect evidence about different system properties and various trends and events in the process through which it was developed and evolved.

For example, researchers have worked on assessing different system qualities. Bevan and Whitehead [1] developed a method for detecting "unstable" areas of software, i.e., areas modified more frequently than average, based on static dependence graphs. Inspired by chaos theory, Hassan and Holt [8] devised a system-complexity metric based on the software-development process: their study of the CVS history of several open-source projects showed that, indeed, a chaotic/complex development process negatively affects the quality of the source-code product.

A lot of work has also been devoted to recognizing "change patterns" in the software evolution history. Module co-evolution has been studied as a means for predicting the impact of changes [16, 20, 21]. Godfrey and Zou [7] have shown how to detect merging and splitting of files and functions in procedural code using origin analysis. Especially interesting to the research community are refactorings, i.e., behavior-preserving structural change patterns [5,10,14]. Demeyer's group has had a long-term focus on detecting refactorings. They initially proposed a set of heuristics for recognizing the general type of refactoring that a system has gone through based on changes in the source-code size [4]. They then proceeded to investigate the use of clone-detection to identify move and renaming refactorings [15].
In our own work with design differencing [17,19], we have shown how the UMLDiff algorithm for semantic tree differencing of UML class diagrams can reveal the elementary design changes between two software versions, which can then b...