2019
DOI: 10.1088/1757-899x/523/1/012070

The relationship between data skewness and accuracy of Artificial Neural Network predictive model

Abstract: The purpose of this study is to investigate the relationship between data skewness in the output variable and the accuracy of an artificial neural network predictive model. The artificial neural network predictive model is built using a multilayer perceptron with one output variable and six input variables, and the training algorithm is backpropagation. The data used in this study are generated by running simulations over 1000 cycles. Three categories of skewness used in the output variable are positive skew…
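The abstract does not include the authors' code; the sketch below is only a hypothetical illustration of the setup it describes, assuming scikit-learn's MLPRegressor stands in for the multilayer perceptron and SciPy's skew for measuring output-variable skewness. The data-generating process is invented for the example.

```python
import numpy as np
from scipy.stats import skew
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Simulated data: six input variables and one output variable,
# mirroring the model structure described in the abstract.
X = rng.normal(size=(1000, 6))
# A lognormal component makes the output variable positively skewed.
y = rng.lognormal(mean=0.0, sigma=1.0, size=1000) + X @ rng.normal(size=6)

print("output skewness:", skew(y))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Multilayer perceptron trained with a gradient-based (backpropagation) solver.
model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```

Repeating such a run for output variables of different skewness categories is one way to probe the skewness-accuracy relationship the study examines.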

Cited by 9 publications (2 citation statements)
References 4 publications
“…There were also 225 non-annotations, which were shuffled, split equally into three separate sets, and added to each annotator's dataset in order to train the models to distinguish deontic from non-deontic sentences. A normal data distribution and neutral skewness (skewness between -0.05 and 0.05) were preferred, as they generally lower the misclassification rate and bias of classifiers (Trafimow et al., 2018; Larasati et al., 2019; Liu et al., 2019). Since the datasets were highly skewed towards the 'Obligation' class (> 2 right-skewness), we randomly undersampled the obligations so that each dataset contained fewer than 100 instances of obligations, reducing the skewness to around 1.…”
Section: Data Collection and Annotation (mentioning)
confidence: 99%
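The cited passage checks label skewness and randomly undersamples the majority class until the skew drops; the sketch below is a generic illustration of that idea, not the authors' code. The label array, the integer coding of classes as a skewness proxy, and the target count of 99 are assumptions chosen for the example.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Hypothetical label array, heavily dominated by the 'Obligation' class,
# mirroring the imbalance described in the cited passage.
labels = np.array(['Obligation'] * 300 + ['Permission'] * 60 + ['Prohibition'] * 40)

def label_skewness(labels):
    # Skewness of the integer-coded labels: a rough proxy for how
    # lopsided the class distribution is.
    _, coded = np.unique(labels, return_inverse=True)
    return skew(coded)

print("skewness before undersampling:", label_skewness(labels))

# Randomly undersample the majority class to fewer than 100 instances.
obligation_idx = np.flatnonzero(labels == 'Obligation')
keep_obligation = rng.choice(obligation_idx, size=99, replace=False)
keep = np.concatenate([keep_obligation, np.flatnonzero(labels != 'Obligation')])
balanced = labels[rng.permutation(keep)]

print("skewness after undersampling:", label_skewness(balanced))
```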
“…The quality of data is an important aspect for data scientists and statisticians, who aim to understand the distribution(s) present in the data so that appropriate measures and procedures can be applied for better interpretation of the results (Varshney, 2020). While the Shapiro-Wilk normality test is one of the available data normality testing techniques (Malato, 2022; Royston, 1983; Royston, 1992; Yazici & Yolacan, 2007), herein we employed quantile-quantile (QQ) plots, which visualize the distribution of a random variable by plotting it on the y-axis against the normal distribution on the x-axis: if the quantile points lie along the straight line y = x, the distribution is normal; if the right side sits above the y = x line and the left side is around it, the distribution is right-skewed; and if the right side is around the line and the left side falls below it, the distribution is left-skewed (Chan, 2022; Larasati et al., 2019; Varshney, 2020). This determination indicates whether a data normalization procedure is required before applying analytical and modeling techniques that work best with Gaussian distributions and before calibrating the resultant models; remedies such as stratified sampling would only help if the issue were one of class imbalance.…”
Section: (ii) Dataset Quality Testing (mentioning)
confidence: 99%
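A minimal sketch of the two checks mentioned in the passage, assuming SciPy's shapiro test for normality and probplot for the QQ visualization; the sample data are invented for the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)

# Invented sample: a right-skewed variable, as in the passage's example case.
sample = rng.lognormal(mean=0.0, sigma=0.8, size=500)

# Shapiro-Wilk normality test: a small p-value suggests the data are not normal.
stat, p_value = stats.shapiro(sample)
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p_value:.4f}")

# QQ plot: ordered sample values (y-axis) against theoretical normal
# quantiles (x-axis). Points bending above the reference line on the
# right-hand side indicate right skew.
stats.probplot(sample, dist="norm", plot=plt)
plt.title("QQ plot against the normal distribution")
plt.show()
```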