Source code classification (SCC) is the task of assigning source code to categories according to a criterion such as functionality, programming language, or vulnerability. Many source code archives are organized by programming language, so that desired code fragments can be found easily by searching within the archive. However, having field experts organize source code archives manually is labor-intensive and impractical because the amount of available source code is growing fast. Therefore, this study proposes new convolutional neural network (CNN) architectures to build source code classifiers that automatically identify the programming language of source code. This is the first study in which the performance of deep learning algorithms on programming language identification is compared on both image and text files. The experiments are performed on three source code datasets to identify eight programming languages: C, C++, C#, Go, Python, Ruby, Rust, and Java. The comparative results indicate that although the text-based and image-based SCC approaches achieve very high (> 93.5%) and similar accuracies, text-based classification performs significantly better in terms of execution time.
Software engineering is one of the research areas in which data mining can be applied most usefully. Developers have attempted to improve software quality by mining and analyzing software data. In every phase of the software development life cycle (SDLC), a huge amount of data is produced, and design, security, or software problems may occur. In the early phases of software development, analyzing software data helps to handle these problems and leads to more accurate and timely delivery of software projects. Various data mining and machine learning studies have been conducted to deal with software engineering tasks such as defect prediction and effort estimation. This study identifies the open issues in applying data mining and machine learning techniques to software engineering and presents related solutions and recommendations.
Multi-view learning (MVL) is a special type of machine learning that utilizes more than one view, where each view is a different description of a given sample. Traditionally, classification algorithms such as k-nearest neighbors (KNN) are designed for learning from single-view data. However, many real-world applications involve datasets with multiple views, and each view may contain different and partly independent information, which makes the traditional single-view classification approaches ineffective. Therefore, this article proposes an improved MVL algorithm, called Multi-View K-Nearest Neighbors (MVKNN), based on the existing KNN algorithm. The experiments conducted in this research show that, on multi-view data, the proposed MVKNN algorithm achieves a significant improvement over well-known machine learning algorithms (KNN, support vector machine, decision tree, and Naive Bayes). The results also show that the method outperforms state-of-the-art multi-view learning methods in terms of accuracy.
As a result of the continuous growth in the amount of geological data, machine learning (ML) offers an opportunity to contribute to solving problems in geosciences. However, digital geology applications introduce new challenges for machine learning due to the unique geoscience properties encountered in each problem, requiring novel research in ML. This paper proposes a novel machine learning method, called “Partial Decision Tree Forest (PART Forest)”, to overcome the challenges introduced by geoscience problems, and it offers potential advancements in both the machine learning and geoscience disciplines. The effectiveness of the proposed PART Forest method is illustrated on mineral classification. This study aims to build an intelligent ML model that automatically classifies minerals by crystal structure (triclinic, monoclinic, orthorhombic, tetragonal, hexagonal, and trigonal), taking into account their chemical compositions and their physical and optical properties. In the experiments, the proposed PART Forest method outperformed random forest, one of the well-known ensemble learning methods, in terms of accuracy, precision, recall, F-score, and AUC (area under the curve).
Background: Traditionally, machine learning algorithms have been applied to software defect prediction on single-view data, meaning that the input data contains a single feature vector. Nevertheless, different software engineering data sources may provide multiple, partially independent kinds of information, which makes the standard single-view approaches ineffective. Objective: To overcome the single-view limitation of current studies, this article proposes the use of a multi-view learning method for software defect classification problems. Method: The Multi-View k-Nearest Neighbors (MVKNN) method is applied in the software engineering field. In this method, base classifiers are first constructed to learn from each view, and the classifiers are then combined to create a robust multi-view model. Results: In the experimental studies, the MVKNN algorithm is compared with the standard k-nearest neighbors (KNN) algorithm on 50 datasets obtained from different software bug repositories. The experimental results demonstrate that the MVKNN method outperformed KNN on most of the datasets in terms of accuracy. The average accuracy values of MVKNN are 86.59%, 88.09%, and 83.10% for the NASA MDP, Softlab, and OSSP datasets, respectively. Conclusion: The results show that using multiple views (MVKNN) can usually improve classification accuracy compared to a single-view strategy (KNN) for software defect prediction.
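The two-step idea described in the Method part (one base classifier per view, then a combination of their outputs) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the base classifiers are combined, so majority voting is assumed here, and the function names `knn_predict` and `mvknn_predict`, the Euclidean distance, and the toy data are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Plain single-view KNN: majority label among the k nearest samples."""
    # Rank training samples by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def mvknn_predict(train_views, train_y, query_views, k=3):
    """Multi-view sketch: run KNN on each view, then combine the results.

    ASSUMPTION: the per-view predictions are combined by majority vote;
    the source only states that the base classifiers are combined.
    """
    per_view = [
        knn_predict(view_X, train_y, view_x, k)
        for view_X, view_x in zip(train_views, query_views)
    ]
    return Counter(per_view).most_common(1)[0][0]


# Toy data (made up): six modules described by three views.
labels = ["clean", "clean", "clean", "defective", "defective", "defective"]
views = [
    [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]],  # view 1: informative
    [[0.1], [0.2], [0.15], [0.9], [0.8], [0.85]],      # view 2: informative
    [[1], [9], [1], [9], [1], [9]],                    # view 3: noisy
]
query = [[8.5, 8.5], [0.88], [1]]

print(mvknn_predict(views, labels, query))  # defective
```

In this toy case the noisy third view alone votes "clean", but the two informative views outvote it, which is the kind of robustness the abstract attributes to combining views.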