An improved algorithm of TFIDF combined with Naive Bayes

Zhang, Zhe; Wu, Zeyi; Shi, Zhiwei

doi:10.1145/3517077.3517104

Cited by 6 publications

(2 citation statements)

References 4 publications

“…Data mining is the process of extracting potentially useful information and knowledge from a large amount of incomplete, noisy, fuzzy, and random data. Data mining algorithms are broadly categorized into classification algorithms: the C4.5 algorithm is based on information theory and uses information entropy and information gain degree as the measure to achieve inductive classification of the data [21]; the Plain Bayesian algorithm is based on Bayes' theorem with the assumption of conditional independence of features as the classification method [22]; Support Vector Machines (SVMs) [23] map the points of the low-dimensional space into the high-dimensional space so that they become linearly divided and then use the principle of linear division to determine classification boundaries; such an approach also includes the K Nearest Neighbor classification algorithm (KNN) and Adaboost. The K-Means algorithm, when given a set of samples, divides the sample set into K clusters according to the size of the distance between the samples; each object is assigned to the closest clustering center [24]; The EM maximum expectation algorithm is an algorithm used for finding the maximum likelihood estimates of parameters in a probabilistic model, where the probabilistic model relies on unobservable hidden variables.…”

Section: Data Mining Algorithm (mentioning)

confidence: 99%

Analysis of Key Injury-Causing Factors of Object Strike Incident in Construction Industry Based on Data Mining Method

Yang,

2023

Sustainability

Incidents are caused by a variety of factors, and there are correlations between incident causative factors. How to effectively clarify the importance of incidental injury-causing factors and their correlations is the current technical challenge in the field of incident causation analysis. This paper takes the study of injury-causing factors and their relationships between object-striking incidents in the process of construction as an example, and it statistically analyzes the incident investigation reports of 126 cases of object-striking incidents in construction projects in China from 2016 to 2022; it screens out 52 categories of incident-causing factors. The Apriori algorithm and FP-growth algorithm are used to data mine the influencing factors obtained from the 126 object-striking incidents: 28 main incident causative items of object-striking incidents and the respective correlation degree between each factor are obtained. By analyzing the support degree of the main incident causation items, as well as comparing and analyzing the results of the incident causation support degree and association rules with Bayesian inference, 9 key injury-causing factors of object-striking incidents are identified. The research results put forward a new research idea for the analysis of the injury factors of object-striking incidents in construction, which can provide theoretical reference for improving the pertinence and effectiveness of incident prevention measures.
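The support degree and association-rule confidence that the abstract relies on can be illustrated with a small sketch. It enumerates itemsets by brute force rather than using the candidate pruning of Apriori or the tree structure of FP-growth, and the incident factors and the function names (`support`, `confidence`, `frequent_itemsets`) are hypothetical placeholders, not data or code from the cited study.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent:
    support(antecedent | consequent) / support(antecedent)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return support(both, transactions) / support(a, transactions)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force enumeration of itemsets up to `max_size` items whose
    support is at least `min_support` (no Apriori pruning)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            s = support(combo, transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result


# Hypothetical incident reports, each reduced to a set of causative factors.
REPORTS = [
    frozenset({"no_helmet", "falling_tool", "no_safety_net"}),
    frozenset({"no_helmet", "falling_tool"}),
    frozenset({"falling_tool", "no_safety_net"}),
    frozenset({"no_helmet", "poor_supervision"}),
]

if __name__ == "__main__":
    for itemset, s in frequent_itemsets(REPORTS, min_support=0.5).items():
        print(sorted(itemset), s)
    print(confidence({"no_helmet"}, {"falling_tool"}, REPORTS))
```

Brute-force enumeration is only practical for a handful of factors; with 52 factor categories, as in the study, a pruning algorithm such as Apriori or FP-growth is the reasonable choice.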

“…Yang, Z. et al [22] proposed Hierarchical attention networks (HAN) for document classification, which maintain a hierarchical structure of word to sentence (building sentence from words) and sentence to document (aggregating sentences to a document representation). Zhang, Z. et al [23] proved that the TFIDF algorithm with the combination of Naive Bayes has significance in the text classification task compared to many complex models.…”

Section: Literature Review (mentioning)

confidence: 99%

Corpus Statistics Empowered Document Classification

et al. 2022

In natural language processing (NLP), document classification is an important task that relies on the proper thematic representation of the documents. Gaussian mixture-based clustering is widespread for capturing rich thematic semantics but ignores emphasizing potential terms in the corpus. Moreover, the soft clustering approach causes long-tail noise by putting every word into every cluster, which affects the natural thematic representation of documents and their proper classification. It is more challenging to capture semantic insights when dealing with short-length documents where word co-occurrence information is limited. In this context, for long texts, we proposed Weighted Sparse Document Vector (WSDV), which performs clustering on the weighted data that emphasizes vital terms and moderates the soft clustering by removing outliers from the converged clusters. Besides the removal of outliers, WSDV utilizes corpus statistics in different steps for the vectorial representation of the document. For short texts, we proposed Weighted Compact Document Vector (WCDV), which captures better semantic insights in building document vectors by emphasizing potential terms and capturing uncertainty information while measuring the affinity between distributions of words. Using available corpus statistics, WCDV sufficiently handles the data sparsity of short texts without depending on external knowledge sources. To evaluate the proposed models, we performed a multiclass document classification using standard performance measures (precision, recall, f1-score, and accuracy) on three long- and two short-text benchmark datasets that outperform some state-of-the-art models. The experimental results demonstrate that in the long-text classification, WSDV reached 97.83% accuracy on the AgNews dataset, 86.05% accuracy on the 20Newsgroup dataset, and 98.67% accuracy on the R8 dataset. In the short-text classification, WCDV reached 72.7% accuracy on the SearchSnippets dataset and 89.4% accuracy on the Twitter dataset.
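For readers unfamiliar with the combination credited to Zhang et al. in the citation statement above, the following is a minimal sketch of the standard pairing of TF-IDF features with a multinomial Naive Bayes classifier. It is not the improved algorithm from the cited paper, whose details are not given on this page. The smoothed IDF formula, the Laplace smoothing constant `alpha`, the function names, and the toy training texts are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real pipelines do more)."""
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}


def tfidf(doc, idf):
    """Raw term count times IDF; terms unseen during training are dropped."""
    return {t: c * idf[t] for t, c in Counter(doc).items() if t in idf}


def train(texts, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on TF-IDF weights instead of raw counts."""
    docs = [tokenize(t) for t in texts]
    idf = fit_idf(docs)
    weights = defaultdict(lambda: defaultdict(float))  # class -> term -> weight
    for doc, label in zip(docs, labels):
        for term, w in tfidf(doc, idf).items():
            weights[label][term] += w
    class_counts = Counter(labels)
    log_prior = {c: math.log(k / len(docs)) for c, k in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Laplace smoothing keeps unseen (class, term) pairs at nonzero probability.
        total = sum(weights[c].values()) + alpha * len(idf)
        log_like[c] = {
            t: math.log((weights[c].get(t, 0.0) + alpha) / total) for t in idf
        }
    return idf, log_prior, log_like


def predict(text, model):
    """Return the class with the highest posterior log score."""
    idf, log_prior, log_like = model
    features = tfidf(tokenize(text), idf)
    scores = {
        c: log_prior[c] + sum(w * log_like[c][t] for t, w in features.items())
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy training data, invented for this sketch.
TEXTS = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project meeting notes",
]
LABELS = ["spam", "spam", "ham", "ham"]
MODEL = train(TEXTS, LABELS)

if __name__ == "__main__":
    print(predict("buy cheap pills", MODEL))
    print(predict("monday project meeting", MODEL))
```

Feeding fractional TF-IDF weights into the multinomial likelihood departs from the strict count-based model, but it is a common practical choice because it down-weights terms that appear in most documents.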