2021
DOI: 10.1016/j.procs.2021.08.128
Standard vs. non-standard cross-validation: evaluation of performance in a space with structured distribution of datapoints

Cited by 8 publications (6 citation statements)
References 13 publications
“…Standard cross-validation approaches advocate the random selection of samples from a set, repeated a certain number of times, and averaging results over such folds [ 65 ]. However, in the case of stylometric input space, experiments show that random choice is not always the best way to go, even with increasing the number of folds above popular standards [ 66 ]. The problem lies in the specific distribution of datapoints in space, which is caused by the initial pre-processing of text samples.…”
Section: Proposed Methodology
confidence: 99%
“…To avoid falsely optimistic test results, for the evaluation of learnt patterns, samples should never be used based on the same texts that are used for training. Standard cross-validation, with its foundation of random choice of samples for folds, cannot be trusted to return a reliable estimation of classification accuracy [32]. Instead, non-standard cross-validation, with swapping whole groups of samples (instead of individual instances) could be attempted, but it results in highly increased computational costs.…”
Section: Construction Of Datasets
confidence: 99%
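The group-based splitting described in the excerpt above can be sketched in plain Python. This is an illustrative sketch, not the cited authors' implementation: it assumes each sample carries a group label identifying the source text it was derived from, and assigns whole groups to folds so that no source text contributes samples to both the training and the test side of a split.

```python
import random
from collections import defaultdict

def group_kfold(groups, k, seed=0):
    """Partition sample indices into k folds by whole group:
    every sample from the same source text lands in one fold,
    so train and test never share a source text."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    group_ids = sorted(by_group)
    rng = random.Random(seed)
    rng.shuffle(group_ids)
    folds = [[] for _ in range(k)]
    # distribute whole groups round-robin into folds
    for i, g in enumerate(group_ids):
        folds[i % k].extend(by_group[g])
    return folds

# six samples drawn from three source texts A, B, C (hypothetical data)
groups = ["A", "A", "B", "B", "C", "C"]
folds = group_kfold(groups, k=3)
```

With three groups and k = 3, each fold holds exactly one source text; testing on a fold then evaluates on texts the model never saw during training, which is the point of the non-standard scheme.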
“…The average of the partial results then leads to the final outcome. In the stylometric domain, this approach proves problematic due to the existing stratification of the input space [32]. Data points are grouped by the original long works they are based on.…”
Section: Evaluation Of Performance
confidence: 99%
“…A k-fold cross-validation method is used to validate each identification performance index from the result of the testing set of a CNN model. This research uses the value of k = 5, which is the standard k value for cross-validation [24,25]. In the experiments, both the training and testing sets are partitioned into five folds and randomly shuffled.…”
Section: K-fold Cross-validation
confidence: 99%
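The standard procedure the excerpt above describes, shuffling samples, splitting them into k = 5 folds, and averaging a per-fold score, can be sketched in plain Python as follows. The `score_fn` callback is a placeholder for whatever train/evaluate step a given study uses; it is not part of the cited work.

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Standard k-fold: shuffle sample indices at random,
    then deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(score_fn, n, k=5):
    """Use each fold once as the test set, train on the rest,
    and average the k partial scores into the final outcome."""
    folds = kfold_indices(n, k)
    scores = []
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(score_fn(train, test))
    return sum(scores) / k
```

Because the assignment of samples to folds is purely random, samples derived from the same source text can end up on both sides of a split, which is exactly the leakage the non-standard, group-based variant avoids.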