Abstract: Information filtering and information retrieval applications are based on web page classification methods. Web pages typically serve different functions or cover different topics or subjects. The diversity of web page content increases the need for automatic web page classification while also making it a challenging task. Considering that the main component of a web page's content is most often its text, and that text classification is a problem intensively studied in th…
“… seven studies that used a combination of HTML tag structure and text content, as shown in [10], [11], [19], [20], [24], [25], [27]; six studies that used images, as shown in [9], [12]–[14], [16], [17]; three studies that used the feature of HTML tag structure, as shown in [8], [15], [28]; two studies that used the feature of text content alone, as shown in [22], [23]; two studies that used URL features, as shown in [21],…”
The internet is frequently accessed by people using smartphones, laptops, or computers to search for information online. The growth of information on the web has caused the number of web pages to increase day by day. Automatic topic-based web page classification is used to manage this excessive number of web pages by assigning them to categories based on their content. Different machine learning algorithms have been employed as web page classifiers. However, there is a lack of studies that review the classification of web pages using deep learning. In this study, the automatic topic-based classification of web pages using deep learning, as proposed by key researchers, is reviewed. The relevant research papers were selected from reputable research databases. The review examined the dataset, features, algorithm, pre-processing, document representation technique, and performance of each web page classification model. The document representation technique used to represent web page features is an important aspect of web page classification, as it affects the performance of the classification model. The integral web page feature is the textual content. Based on the review, image-based web page classification showed higher performance than text-based web page classification. Owing to the lack of a matrix representation that can effectively handle long web page text content, a new document representation technique, the word cloud image, can be used to visualize the words extracted from the text content of a web page.
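As a minimal sketch of the first step of such a word-cloud representation (not taken from any of the reviewed papers; the stop-word list and sample text are illustrative), the word frequencies that a word-cloud renderer scales its font sizes by can be computed with the standard library:

```python
import re
from collections import Counter

# Illustrative stop-word list; real pipelines use fuller lists (e.g. NLTK's).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}

def word_frequencies(text, top_n=10):
    """Tokenise page text, drop stop words, and return the top-N counts.

    These frequencies are what a word-cloud renderer scales its font
    sizes by (e.g. the third-party `wordcloud` package's
    `WordCloud.generate_from_frequencies`).
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)

page_text = ("Deep learning models classify web pages. "
             "Web pages contain text, and the text drives classification.")
print(word_frequencies(page_text, top_n=3))
```

The resulting (word, count) pairs are the direct input to word-cloud rendering, where more frequent words are drawn larger.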
“…In the same context of multi-label classification, Artene et al. [51] used a CNN for multi-label, multi-language classification. This study is an extension of their work in 2021 [52].…”
“…In their first study in 2021 [52], their CNN model achieved a micro F1 score of 0.79. In their second work in 2022 [51], they divided the classification problem into two problems: functional classification and subject classification, and they increased the total dataset to 12,432 webpages to improve the results. The F1 scores for functional, subject, and all (functional + subject) were 0.88, 0.84, and 0.74, respectively.…”
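The micro F1 score reported above is the micro-averaged F1 for multi-label prediction: true positives, false positives, and false negatives are pooled across all samples and labels before F1 is computed. A minimal illustration (the label sets below are invented for the example, not taken from the cited studies):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label predictions.

    `y_true` and `y_pred` are lists of label sets, one set per sample.
    Counts are pooled over all samples and labels before computing
    precision, recall, and F1.
    """
    tp = fp = fn = 0
    for true, pred in zip(y_true, y_pred):
        tp += len(true & pred)   # labels predicted and actually present
        fp += len(pred - true)   # labels predicted but absent
        fn += len(true - pred)   # labels present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented example: three pages, multi-label ground truth vs. predictions.
y_true = [{"news", "sports"}, {"science"}, {"news"}]
y_pred = [{"news"}, {"science", "tech"}, {"news"}]
print(round(micro_f1(y_true, y_pred), 2))  # → 0.75
```

Because the counts are pooled, frequent labels dominate micro F1, which is why it is the usual choice for imbalanced multi-label web page datasets.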
“…In this study, we fine-tuned the pre-trained BERT and ALBERT models for our downstream text classification task (the legitimacy classification task). For the legitimacy criteria violation detection task, we evaluated CNN as a deep learning model, based on the results obtained in the first task and because CNN achieved good results for multi-label classification, as mentioned in the literature section in [51].…”
Predatory publishing venues publish questionable articles and pose a global threat to the integrity and quality of the scientific literature. They have given rise to the dark side of scholarly publishing, and their effects reach political, societal, economic, and health domains. Given their consequences and proliferation, several solutions have been developed to help detect them; however, these solutions are manual and time-consuming, while researchers, students, and readers need a tool that detects predatory venues and their violations automatically. In this study, we propose an intelligent framework that automatically detects predatory venues and their violations using different artificial intelligence techniques. This work contributes the following: (1) a dataset of 9,866 journals annotated as predatory or legitimate, and (2) an intelligent framework for classifying a venue as legitimate or predatory, with appropriate reasoning. Our framework was evaluated using seven different machine learning and deep learning models: Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Neural Networks (NNs), Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), Bidirectional Encoder Representations from Transformers (BERT), and A Lite BERT (ALBERT), together with different feature representation techniques. The results showed that the CNN model outperformed the other models in the journal classification task, with an F1 score of 0.96. For the task of providing appropriate reasoning, the SVM model achieved the best micro F1 score of 0.67.
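One classic feature representation technique of the kind evaluated alongside these models is TF-IDF, which weights a term by its frequency within a document and down-weights terms that appear in many documents. A self-contained sketch, not the paper's actual pipeline, with invented token lists:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF vectors (as dicts) for a list of tokenised documents.

    TF is the raw term count normalised by document length; IDF is
    log(N / df), where df is the number of documents containing the term.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc)
        vectors.append({t: (c / length) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

# Invented token lists standing in for journal-description text:
docs = [["fast", "review", "fake", "review"],
        ["peer", "review", "rigorous"],
        ["fast", "acceptance", "fee"]]
vecs = tfidf(docs)
```

Terms confined to few documents ("fake") receive more weight than terms spread across the corpus ("review"), which is what lets a linear model such as SVM separate the classes on these features.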
“…Both types of ancient glass have been found in archaeological sites around the world, and both have been used for a variety of purposes. Ancient glass is an important means for us to investigate the past and the way people lived in different cultures [3].…”
Sub-classification of ancient glasses is an essential component of archaeological research, as it assists researchers in grouping ancient glass finds. However, existing classification models concentrate on everyday domains and ignore ancient glass. Moreover, most existing classification models rely on machine learning algorithms that act as black boxes, lacking a reasonable explanation and an explicit mathematical calculation procedure for their classification results, and they may yield low accuracy when applied to a novel domain. Therefore, we propose a novel classification model that uses a clustering method to address the problem of ancient glass sub-classification. In this paper, we aim to sub-classify weathered ancient glass according to the distribution of its chemical composition. We collected 61 sets of sample data, each comprising fourteen chemical compositions, and used the k-means algorithm to sub-classify these sample glasses. Next, the cohesiveness and separability of the results were evaluated using the silhouette coefficient method, and the model accuracy was checked using the elbow method. Finally, the elbow method was suitably improved.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.