2020
DOI: 10.21203/rs.3.rs-132775/v1
Preprint

Determining Threshold Value on Information Gain Feature Selection to Increase Speed and Prediction Accuracy of Random Forest

Abstract: Feature selection is a preprocessing technique that aims to remove unnecessary features and speed up the algorithm's work process. One feature selection technique is to calculate the information gain value of each feature in a dataset. From the information gain values obtained, a threshold value is then determined and used to perform the feature selection. Generally, the threshold value is chosen freely, or a value of 0.05 is used. This study proposed determining the threshold value using the standard…
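The procedure the abstract describes — score each feature by information gain, then keep those at or above a threshold such as 0.05 — can be sketched as follows. This is a minimal illustration on a toy discrete dataset, not the paper's implementation; the entropy-based IG formula is the standard one, and the data is invented.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """IG of a discrete feature: H(labels) minus the weighted
    conditional entropy H(labels | feature)."""
    total = entropy(labels)
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    cond = sum(w * entropy(labels[feature == v])
               for v, w in zip(values, weights))
    return total - cond

# Toy dataset: columns 0 and 1 are informative, column 2 is noise.
X = np.array([[0, 1, 0],
              [0, 1, 1],
              [1, 0, 0],
              [1, 0, 1]])
y = np.array([0, 0, 1, 1])

gains = [information_gain(X[:, j], y) for j in range(X.shape[1])]
threshold = 0.05  # the fixed cut-off mentioned in the abstract
selected = [j for j, g in enumerate(gains) if g >= threshold]
print(gains)     # columns 0 and 1 have IG = 1.0, column 2 has IG = 0.0
print(selected)  # [0, 1]
```

The paper's contribution is how the threshold itself is chosen (the abstract is truncated at that point); the fixed value 0.05 above is only the conventional baseline it compares against.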

Cited by 3 publications (3 citation statements)
References 26 publications (28 reference statements)
“…The information gain rate is the information gain divided by the amount of divided information [28,29]. The training data set S consists of s samples.…”
Section: C4.5 Algorithm: Inductive Learning Mechanism
Mentioning, confidence: 99%
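The citing work's definition — information gain divided by the split information, as used by C4.5 — can be written out in a short sketch. The names and toy data below are illustrative, not from the cited paper.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, labels):
    """Information gain divided by split information (C4.5's gain rate)."""
    total = entropy(labels)
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    gain = total - sum(w * entropy(labels[feature == v])
                       for v, w in zip(values, weights))
    # Split information: entropy of the partition the feature induces.
    split_info = -np.sum(weights * np.log2(weights))
    return gain / split_info if split_info > 0 else 0.0

feature = np.array([0, 0, 1, 1])
labels = np.array([0, 0, 1, 1])
print(gain_ratio(feature, labels))  # 1.0: gain = 1 bit, split info = 1 bit
```

Dividing by split information penalizes attributes that split the data into many small branches, which plain information gain tends to favor.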
“…To obtain the most appropriate words for predictive modeling, we applied the filter method to select features based on information gain (IG) [34,35]. Words whose IG scores are greater than or equal to 0.05 [34] are chosen as features. Each selected word (or feature) is then weighted by the term frequency-inverse document frequency (tf-idf) scheme.…”
Section: Figure 1 Methods Of Developing the Polarity Label Analyzer
Mentioning, confidence: 99%
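The pipeline this citation describes — keep only words passing the IG cut-off, then weight them by tf-idf — can be sketched with a hypothetical mini-corpus. The documents and the set of IG-selected words below are invented for illustration; only the tf-idf weighting of a pre-filtered vocabulary reflects the cited description.

```python
import math
from collections import Counter

# Hypothetical tokenized corpus; in the cited work the vocabulary is first
# filtered so only words with IG >= 0.05 remain.
docs = [["good", "service"], ["bad", "service"], ["good", "food"]]
selected = {"good", "bad"}  # assumed survivors of the IG cut-off

def tfidf(docs, vocab):
    """tf-idf weights restricted to a pre-selected vocabulary."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc) if w in vocab)
    weighted = []
    for doc in docs:
        tf = Counter(w for w in doc if w in vocab)
        weighted.append({w: (tf[w] / len(doc)) * math.log(n / df[w])
                         for w in tf})
    return weighted

for row in tfidf(docs, selected):
    print(row)
```

Words filtered out by IG ("service", "food") never enter the tf-idf matrix, which is the point of running the filter before weighting.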
“…Furthermore, to build the best attribute set, we need a cut-off value to pick attributes from the final ranked list obtained after the aggregation procedure. In this study, three different threshold values were utilized to reduce the data and pick the most appropriate attribute set [23]. The threshold values used are:…”
Section: Threshold Values
Mentioning, confidence: 99%
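Applying a cut-off to a ranked attribute list, as this citation describes, amounts to a simple filter. The excerpt is truncated before the three actual threshold values, so the list and cut-off below are placeholders only.

```python
# Hypothetical ranked list of (attribute, aggregated score), descending.
ranked = [("f3", 0.91), ("f1", 0.54), ("f7", 0.22), ("f2", 0.04)]
cutoff = 0.2  # illustrative value; the paper's three thresholds are not shown

# Keep every attribute whose score meets the cut-off.
chosen = [name for name, score in ranked if score >= cutoff]
print(chosen)  # ['f3', 'f1', 'f7']
```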