2022
DOI: 10.1016/j.engappai.2022.104807
Influence of statistical feature normalisation methods on K-Nearest Neighbours and K-Means in the context of industry 4.0

Cited by 14 publications (5 citation statements)
References 36 publications
“…(where w, x, b are the weight vector, feature vector and bias term) is used to determine the class for each instance. The k-nearest neighbour classifier [20] uses the Euclidean distance to measure the similarity between instances. The Euclidean distance is defined as…”
Section: Machine Learning Algorithms (mentioning, confidence: 99%)
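The excerpt is truncated before the formula itself; the standard Euclidean distance it refers to is d(x, y) = sqrt(sum_i (x_i - y_i)^2). Below is a minimal Python sketch of a k-nearest-neighbour classifier built on that distance; the function names and toy data are illustrative assumptions, not taken from the cited paper.

```python
import numpy as np
from collections import Counter

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def knn_predict(X_train, y_train, x_query, k=3):
    # Rank training instances by distance to the query point
    dists = [euclidean(x, x_query) for x in X_train]
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of the k nearest neighbours
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Toy usage (illustrative data): two classes in 2-D feature space
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.2, 1.9]), k=3))  # -> 0
```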
“…The higher the number of clusters 𝑘, the finer the partitioning of the data: compactness among the members of each cluster increases and the corresponding SSE decreases. While the number of clusters is below the optimum, intra-cluster compactness improves substantially as 𝑘 increases, and the SSE falls sharply [19]. Once 𝑘 reaches the optimal value, further increases bring only small gains in compactness and the SSE gradient flattens; the elbow method exploits this change in slope to determine 𝑘, which also facilitates applying the K-means algorithm in the next step.…”
Section: K-means Clustering Algorithm (mentioning, confidence: 99%)
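As a sketch of the elbow procedure this excerpt describes: fit K-means over a range of 𝑘, record the SSE (exposed as inertia_ in scikit-learn), and pick the 𝑘 after which the decrease flattens. The synthetic blobs and the candidate range are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Illustrative data: three well-separated Gaussian blobs in 2-D
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

# SSE (inertia) for each candidate k; the "elbow" is where the drop slows
sse = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_

for k, v in sse.items():
    print(f"k={k}: SSE={v:.1f}")
# SSE falls steeply up to k=3, then flattens, so the elbow suggests k=3.
```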
“…Max-Min normalization, however, does not handle newly introduced outliers well, while Logistic normalization assumes the data are distributed around zero, which does not hold for our research dataset. We therefore chose the Z-Score standardization method, which removes the distortion introduced by features of different magnitudes and keeps the data points comparable [21].…”
Section: Data Preprocessing (mentioning, confidence: 99%)
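To make the trade-off concrete: Min-Max rescaling maps each feature to [0, 1] and is therefore dominated by any extreme value, while Z-Score standardization centres each feature on its mean and divides by its standard deviation, so features of very different magnitudes become comparable. A minimal sketch follows; the toy matrix is an illustrative assumption, not data from the cited paper.

```python
import numpy as np

def z_score(X):
    # Z-Score: (x - mean) / std per feature; result has mean 0, std 1
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max(X):
    # Max-Min: rescale each feature to [0, 1]; sensitive to new outliers
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Illustrative feature matrix with very different magnitudes per column;
# 9000 in the second column acts like an outlier
X = np.array([[1.0, 1000.0],
              [2.0, 1500.0],
              [3.0, 9000.0]])

print(z_score(X))   # columns centred and scaled independently
print(min_max(X))   # second column squashed toward 0 by the outlier
```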