Hostility measure for multi-level study of data complexity

Lancho, Carmen; Diego, Isaac Martín de; Cuesta, Marina; Aceña, Víctor; Moguerza, Javier M.

doi:10.1007/s10489-022-03793-w

Cited by 7 publications

(4 citation statements)

References 36 publications

“…In addition, some of the original complexity measures have been adapted to the instance level [3]. Recently, a multi-level analysis of data complexity has been addressed, covering the instance, the class, and the dataset level with a new proposed complexity measure called hostility measure [17].…”

Section: State-of-the-art (mentioning)

confidence: 99%

“…However, the instance perspective of data complexity has fostered their use in tasks related to IS like noise filter or data sampling. For example, in [17], they filter a 10% and a 50% of the most complex points, reducing the error. In [34], data complexity is employed for curriculum learning.…”

Section: State-of-the-art (mentioning)

confidence: 99%

“…Consequently, the k would have to be increased to achieve a smoother measure, thus moving away from the recommended parameters. Indeed, the lack of smoothness resulting from the value of k is one of the criticisms of the kDN [17].…”

Section: Introduction (mentioning)

confidence: 99%

Dynamic Disagreeing Neighbors: A complexity measure for instance selection

Aceña, Lancho, de Diego et al. 2024

Preprint

Self Cite

Within the scope of Machine Learning (ML), Instance Selection (IS) is a sampling process that consists of filtering noise and removing redundant data. In classification problems, IS implies a compromise between maximizing performance and reducing the sample size used for training. Complexity measures provide relevant information about the difficulty of classifying instances, which makes them appropriate for IS since they can capture, for example, noisy or borderline points. This paper introduces the complexity measure DDN, defined at three levels: instance, class, and dataset. The Dynamic Disagreeing Neighbors (DDN) of an instance is defined as the percentage of its nearest neighbors that belong to other classes. The DDN is based on the Nearest Centroid Neighbors (NCN) neighborhood computation, which is dynamically adjusted to the data distribution. In addition, the distance of each neighbor is taken into account so that those farther away are less influential and those closer are more influential for the instance complexity. The validity of the proposal is evaluated through a series of experiments where it is compared with the widely known k-Disagreeing Neighbors (kDN) in terms of stability, correlation with classification error, and performance in IS. The DDN has shown competitive, stable, and robust results throughout the experiments, generally improving on those obtained with the alternative.
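As a rough illustration of the baseline the abstract compares against, the k-Disagreeing Neighbors (kDN) score of an instance can be sketched as the fraction of its k nearest neighbors that carry a different class label. This is not the DDN or the hostility measure: there is no NCN neighborhood and no distance weighting here. The function names `kdn` and `filter_hardest`, the Euclidean distance, and the default `k=5` are assumptions made for the example. The second helper shows how such a score could drive a simple noise filter that drops a chosen fraction of the most complex points.

```python
import math

def kdn(points, labels, k=5):
    """Fraction of each point's k nearest neighbors with a different label."""
    scores = []
    for i, p in enumerate(points):
        # Rank all other points by Euclidean distance to p (assumed metric).
        ranked = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors = [j for _, j in ranked[:k]]
        disagree = sum(labels[j] != labels[i] for j in neighbors)
        scores.append(disagree / len(neighbors))
    return scores

def filter_hardest(points, labels, scores, fraction):
    """Drop the given fraction of points with the highest complexity scores."""
    n_keep = len(points) - int(len(points) * fraction)
    # Stable sort by score, keep the easiest n_keep, restore original order.
    keep = sorted(sorted(range(len(points)), key=lambda i: scores[i])[:n_keep])
    return [points[i] for i in keep], [labels[i] for i in keep]
```

A point surrounded entirely by the other class scores 1.0, and a point deep inside its own class scores 0.0; the fixed `k` is what the citing text criticizes as a source of non-smooth scores.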

“…Hence, improving the dataset characteristics is crucial to enhance the performance of classification. These characteristics can include overlapping classes, linearity of bound decisions, and imbalance ratio in the dataset (Lancho et al 2023). Ho and Basu (2002) introduced a measurement to assess the dataset characteristics by examining the geometrical distribution of data.…”

Section: Introduction (mentioning)

confidence: 99%

Effectiveness of SMOTE-ENN to Reduce Complexity in Classification Model

Riantika, Sartono, Anwar Notodiputro 2024

IJSA

A failure to produce classification models with high performance might be caused by the dataset's characteristics, such as the between-class overlapping and the class imbalance. The higher the data complexity, the more complicated it is for the algorithm to find good models. Combining the issues of class imbalance and overlapping would make the problem more challenging. To deal with this problem, this research implemented a hybrid class-balancing technique named SMOTE-ENN. This technique adds observations to the minority class to balance the class frequencies. After that, it removes some observations to reduce the degree of overlapping. The research revealed that SMOTE-ENN succeeds in doing that. We employed a random forest method to evaluate it. In 28 out of 46 cases we investigated, the new datasets generated by SMOTE-ENN could produce models with higher accuracy.
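The two stages described in this abstract (oversample the minority class, then remove observations in overlapping regions) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the cited study's code or a library implementation: `smote` interpolates between a minority point and one of its same-class neighbors until the class sizes match, and `enn` drops any point whose label differs from the majority label of its k nearest neighbors. The helper names, Euclidean distance, `k=3`, and the fixed random seed are all assumptions, and the minority class is assumed to contain at least two points.

```python
import math
import random
from collections import Counter

def nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    ranked = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return [j for _, j in ranked[:k]]

def smote(points, labels, minority, k=3, seed=0):
    """Add synthetic minority points until the minority matches the largest class."""
    rng = random.Random(seed)
    minority_pts = [p for p, y in zip(points, labels) if y == minority]
    target = max(Counter(labels).values())
    new_points, new_labels = list(points), list(labels)
    while new_labels.count(minority) < target:
        i = rng.randrange(len(minority_pts))
        j = rng.choice(nearest(minority_pts, i, k))
        t = rng.random()  # position along the segment between the two points
        a, b = minority_pts[i], minority_pts[j]
        new_points.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
        new_labels.append(minority)
    return new_points, new_labels

def enn(points, labels, k=3):
    """Keep only points whose label matches the majority of their k neighbors."""
    keep = []
    for i in range(len(points)):
        votes = Counter(labels[j] for j in nearest(points, i, k))
        if votes.most_common(1)[0][0] == labels[i]:
            keep.append(i)
    return [points[i] for i in keep], [labels[i] for i in keep]
```

Calling `smote` and then `enn` on its output reproduces the order of operations the abstract describes: balance first, then clean the overlap.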
