Through software analytics, raw, low-value data is turned into valuable information that provides insights, enabling the support of claims that would otherwise be impossible to verify. The software development ecosystem offers many data sources that can help in understanding the quality of processes and products, but reaching that goal requires collecting and storing the data. This paper describes an infrastructure for collecting, storing and analysing data from software repositories. The research is scoped to an industrial case study, with its own specific tools and work methodology. The current solution collects information from the continuous delivery and deployment pipeline, drawing on data sources such as the source code repository (SVN), the static analysis tool (SonarQube), the continuous integration server (Jenkins jobs) and the continuous testing tool (an in-house tool called Cerberus). Future work includes implementing components to collect unstructured data from the bug-tracking system and the incident management tool. As stated in the literature, correlating the history of issues and incidents will allow the team to address, or at least identify, areas of improvement.
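To make the collection step more concrete, the following is a minimal sketch of collectors pulling quality and build data over the publicly documented SonarQube and Jenkins REST APIs. The host names, project key, job name and credentials are assumptions for illustration; the paper's actual infrastructure and storage layer are not reproduced here.

```python
# Sketch of two collectors for the CI/CD pipeline data sources.
# Hosts, project key, job name and tokens below are hypothetical.
import requests

SONARQUBE_URL = "https://sonarqube.example.com"  # assumed host
JENKINS_URL = "https://jenkins.example.com"      # assumed host


def collect_sonarqube_measures(project_key: str, token: str) -> dict:
    """Pull a few code-quality measures via SonarQube's REST API."""
    resp = requests.get(
        f"{SONARQUBE_URL}/api/measures/component",
        params={"component": project_key,
                "metricKeys": "bugs,code_smells,coverage"},
        auth=(token, ""),  # SonarQube accepts the token as the user name
    )
    resp.raise_for_status()
    return resp.json()


def collect_jenkins_last_build(job: str, user: str, api_token: str) -> dict:
    """Fetch the last build's metadata from Jenkins' JSON API."""
    resp = requests.get(
        f"{JENKINS_URL}/job/{job}/lastBuild/api/json",
        auth=(user, api_token),
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    # The collected payloads would then be persisted for later analysis.
    print(collect_sonarqube_measures("my-project", "sonar-token"))
    print(collect_jenkins_last_build("nightly-build", "ci-user", "jenkins-token"))
```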
Process mining (PM) is a distinctive approach to extracting workflow models of real-world activities, including those related to software development. To be efficient and produce reliable results, its algorithms require structured input data. However, real-world data originate from multiple heterogeneous sources; integration and normalization are therefore required preparatory steps before applying PM techniques. The problem is exacerbated by the need to perform this analysis in real time rather than off-line in a batch-style approach. In this paper, we show how Apache Kafka pipelines can be used to integrate and normalize the event logs from multiple sources into data streams that feed the process mining algorithms in real time. An application to the complex CI/CD pipeline of a major European e-commerce company is presented, showing that these techniques provide a means to monitor development processes and achieve higher observability of them.
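The following is a minimal sketch, using the kafka-python client, of how heterogeneous event logs might be consumed from Kafka topics and normalized into the case/activity/timestamp records that process mining algorithms expect. The topic names, field names and broker address are assumptions for illustration; the paper's actual pipeline topology is not reproduced here.

```python
# Sketch of an event-normalization consumer; topics, fields and broker
# address are hypothetical placeholders.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "jenkins-events", "scm-events",          # assumed source topics
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def normalize(topic: str, event: dict) -> dict:
    """Map a source-specific event onto the case_id / activity / timestamp
    triple used as input by process mining algorithms."""
    if topic == "jenkins-events":
        return {"case_id": event["build_id"],
                "activity": event["stage"],
                "timestamp": event["finished_at"]}
    # scm-events (and any further source) gets its own mapping
    return {"case_id": event["change_id"],
            "activity": event["action"],
            "timestamp": event["committed_at"]}


# Continuously forward normalized events to a single stream that the
# real-time process mining component can consume.
for message in consumer:
    producer.send("pm-event-log", normalize(message.topic, message.value))
```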