A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects

Simmons, Anj; Barnett, Scott; Rivera-Villicana, Jessica; Bajaj, Akshat; Vasa, Rajesh

doi:10.1145/3382494.3410680

Cited by 19 publications

(25 citation statements)

References 23 publications


“…This step is also performed by our analysis tool and concerns running the static code analysis tool Pylint (version 2.6.0) in its default configuration on all pure Python files in each project (but not on any of the dependencies). We choose Pylint for static code analysis as it is widely used and widely accepted in the Python community, as well as being highly configurable [6,10]. It is also well integrated into IDEs such as PyCharm and VS Code.…”

Section: Static Analysis (mentioning)

confidence: 99%
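The step described in this statement can be sketched roughly as follows. This is a minimal illustration, not the citing authors' tool: the function names and the list of excluded directory names are assumptions, and only the `python -m pylint --output-format=json` invocation is a real Pylint interface.

```python
import json
import subprocess
import sys
from pathlib import Path

# Directory names treated as dependencies or generated code. This list is an
# assumption; the quoted statement does not say which paths were excluded.
EXCLUDED_DIRS = {".git", ".venv", "venv", "env", "site-packages", "node_modules", "__pycache__"}


def find_project_python_files(root):
    """Return the project's own .py files, skipping likely dependency folders."""
    root = Path(root)
    files = []
    for path in sorted(root.rglob("*.py")):
        relative_parts = path.relative_to(root).parts
        if not any(part in EXCLUDED_DIRS for part in relative_parts):
            files.append(path)
    return files


def run_pylint(files):
    """Run Pylint without extra flags and return its messages as a list of dicts."""
    if not files:
        return []
    command = [sys.executable, "-m", "pylint", "--output-format=json", *map(str, files)]
    # Pylint exits with a non-zero status whenever it reports messages,
    # so a non-zero return code is not treated as a failure here.
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    return json.loads(completed.stdout or "[]")
```

Note that no `--rcfile` or `--disable` options are passed, but a `pylintrc` found in the working directory would still be picked up, so a strict "default configuration" run may need a clean working directory.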

“…Our study differs from [6] in that we do not compare against non-DS projects and in that we do not solely focus on the adherence to coding standards as [6] does. Our primary focus lies more on investigating obstructions to the maintainability and reproducibility of ML projects, which includes coding standards violations, but also entails recognising refactoring opportunities and other code smells [7].…”

Section: Introduction (mentioning)

confidence: 99%

“…Menzies also advocates for more SE experience in the field of AI and ML, stating that poor SE leads to poor AI while better SE leads to better AI [5]. The data scientists that write AI / ML code often come from non-SE backgrounds where SE best practices are unknown [6].…”

Section: Introduction (mentioning)

confidence: 99%

“…Within the Machine Learning ecosystem, we only found one paper by Simmons et al. [6] that performed static code analysis on a large dataset of Data Science (DS) projects. They also analysed non-DS projects with the goal of comparing the code quality and coding standard conformance of (open-source) DS projects versus non-DS projects, using Pylint in its default configuration as a metric.…”

Section: Introduction (mentioning)

confidence: 99%

“…Furthermore, Simmons et al. [6] simplified the installation of the projects' dependencies by using findimports to resolve all imports used in the projects, instead of relying on what projects' authors defined in their repositories, noting that "it was impractical to reliably determine and install dependencies for the projects analysed." However, if there is an inherent difficulty in resolving these dependencies within Python projects, then that is in itself an obstruction to the reproducibility and maintainability of these projects.…”

Section: Introduction (mentioning)

confidence: 99%
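To illustrate what resolving imports from source code (rather than from a requirements file) involves, here is a small sketch built on the standard-library `ast` module. It is a stand-in for the idea and not the findimports tool itself; both function names are hypothetical.

```python
import ast
import sys


def imported_top_level_modules(source):
    """Return the top-level module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 marks a relative import (e.g. "from . import utils"),
            # which points at project-local code rather than a dependency.
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return modules


def third_party_candidates(source):
    """Drop standard-library names; the remainder are candidate dependencies."""
    # sys.stdlib_module_names exists on Python 3.10+; on older versions
    # nothing is filtered out.
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return imported_top_level_modules(source) - set(stdlib)
```

An import name does not always match the name of the package that provides it (for example, `sklearn` is installed as `scikit-learn`), which is one reason this kind of resolution is hard to make reliable.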


The Prevalence of Code Smells in Machine Learning projects

Oort, Cruz, Aniche et al. 2021

2021 IEEE/ACM 1st Workshop on AI Engineering - Software Engineering for AI (WAIN)


Artificial Intelligence (AI) and Machine Learning (ML) are pervasive in the current computer science landscape. Yet, there still exists a lack of software engineering experience and best practices in this field. One such best practice, static code analysis, can be used to find code smells, i.e., (potential) defects in the source code, refactoring opportunities, and violations of common coding standards. Our research set out to discover the most prevalent code smells in ML projects. We gathered a dataset of 74 open-source ML projects, installed their dependencies and ran Pylint on them. This resulted in a top 20 of all detected code smells, per category. Manual analysis of these smells mainly showed that code duplication is widespread and that the PEP8 convention for identifier naming style may not always be applicable to ML code due to its resemblance with mathematical notation. More interestingly, however, we found several major obstructions to the maintainability and reproducibility of ML projects, primarily related to the dependency management of Python projects. We also found that Pylint cannot reliably check for correct usage of imported dependencies, including prominent ML libraries such as PyTorch.
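A "top 20 per category" ranking like the one mentioned in this abstract could be derived from Pylint's JSON output along these lines. This is a sketch under the assumption that each message is a dict carrying the `type` and `symbol` keys that Pylint's JSON reporter emits; the function name is hypothetical, and the authors' actual aggregation may differ.

```python
from collections import Counter, defaultdict


def top_smells_per_category(messages, n=20):
    """Rank message symbols by frequency within each Pylint category."""
    counts = defaultdict(Counter)
    for message in messages:
        # "type" is the category (e.g. convention, refactor, warning, error);
        # "symbol" is the readable message name (e.g. invalid-name).
        counts[message["type"]][message["symbol"]] += 1
    return {category: counter.most_common(n) for category, counter in counts.items()}
```

Counting by `symbol` rather than by message text keeps occurrences of the same smell together even when the reported identifiers differ.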


Comparative analysis of real issues in open-source machine learning projects

Lai, Simmons, Barnett et al. 2024

Empir Software Eng


Context: In the last decade of data-driven decision-making, Machine Learning (ML) systems reign supreme. Because of the different characteristics between ML and traditional Software Engineering systems, we do not know to what extent the issue-reporting needs are different, and to what extent these differences impact the issue resolution process. Objective: We aim to compare the differences between ML and non-ML issues in open-source applied AI projects in terms of resolution time and size of fix. This research aims to enhance the predictability of maintenance tasks by providing valuable insights for issue reporting and task scheduling activities. Method: We collect issue reports from GitHub repositories of open-source ML projects using an automatic approach, filter them using ML keywords and libraries, manually categorize them using an adapted deep learning bug taxonomy, and compare resolution time and fix size for ML and non-ML issues in a controlled sample. Result: 147 ML issues and 147 non-ML issues are collected for analysis. We found that ML issues take more time to resolve than non-ML issues; the median difference is 14 days. There is no significant difference in terms of size of fix between ML and non-ML issues. No significant differences are found between different ML issue categories in terms of resolution time and size of fix. Conclusion: Our study provided evidence that the life cycle for ML issues is stretched, and thus further work is required to identify the reason. The results also highlighted the need for future work to design custom tooling to support faster resolution of ML issues.
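The keyword filtering and median comparison described in the Method can be sketched as below. Everything here is illustrative: the keyword list and function names are hypothetical, the study also relied on library names and manual categorisation, and no statistical test is shown because the abstract does not name one.

```python
from statistics import median

# Hypothetical keyword list; the abstract does not give the actual keywords.
ML_KEYWORDS = ("model", "training", "tensor", "gradient")


def is_ml_issue(text, keywords=ML_KEYWORDS):
    """First-pass filter: flag an issue whose text mentions any ML keyword."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def median_resolution_gap(issues):
    """Given (text, days_to_resolve) pairs, return median ML days minus median non-ML days."""
    ml_days = [days for text, days in issues if is_ml_issue(text)]
    other_days = [days for text, days in issues if not is_ml_issue(text)]
    return median(ml_days) - median(other_days)
```

Substring matching of this kind is deliberately crude (it would also flag "remodel"), which is one reason a manual categorisation pass is useful after automatic filtering.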


MLSmellHound: A Context-Aware Code Analysis Tool

Kannan, Barnett, Cruz et al. 2022

2022 IEEE/ACM 44th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER)

Self Cite


Meeting the rise of industry demand to incorporate machine learning (ML) components into software systems requires interdisciplinary teams contributing to a shared code base. To maintain consistency, reduce defects and ensure maintainability, developers use code analysis tools to aid them in identifying defects and maintaining standards. With the inclusion of machine learning, tools must account for the cultural differences within the teams, which manifest as multiple programming languages and conflicting definitions and objectives. Existing tools fail to identify these cultural differences and are geared towards software engineering, which reduces their adoption in ML projects. In our approach we attempt to resolve this problem by exploring the use of context, which includes i) purpose of the source code, ii) technical domain, iii) problem domain, iv) team norms, v) operational environment, and vi) development lifecycle stage, to provide contextualised error reporting for code analysis. To demonstrate our approach, we adapt Pylint as an example and apply a set of contextual transformations to the linting results based on the domain of individual project files under analysis. This allows for contextualised and meaningful error reporting for the end user. CCS Concepts: Software and its engineering → Software maintenance tools.
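One way to picture "contextual transformations to the linting results" is a post-processing pass over Pylint messages that adjusts them according to the domain of the file they came from. The sketch below is an assumption-laden illustration and not MLSmellHound's implementation: the path-based domain heuristic, the rule table, and all names are hypothetical.

```python
from pathlib import Path

# Hypothetical heuristic: folder names that suggest ML-specific code.
ML_DIR_HINTS = {"models", "training", "notebooks"}

# Hypothetical rule table: per domain, message symbols whose category is rewritten.
TRANSFORMATIONS = {
    "ml": {"invalid-name": "info"},  # e.g. tolerate maths-style names such as X
}


def classify_domain(path):
    """Guess the domain of a file from the folders on its path."""
    return "ml" if ML_DIR_HINTS & set(Path(path).parts) else "general"


def contextualise(messages):
    """Return copies of Pylint messages with a domain tag and adjusted category."""
    result = []
    for message in messages:
        domain = classify_domain(message["path"])
        updated = dict(message, domain=domain)
        new_type = TRANSFORMATIONS.get(domain, {}).get(message["symbol"])
        if new_type is not None:
            updated["type"] = new_type
        result.append(updated)
    return result
```

Keeping the rules in a data table rather than in code makes it straightforward for a team to encode its own norms per domain without touching the transformation logic.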
