Multi-label classification of research articles using Word2Vec and identification of similarity threshold

Mustafa, Ghulam; Usman, Muhammad; Yu, Lisu; Afzal, Muhammad Tanvır; Sulaiman, Muhammad; Shahid, Abdul

doi:10.1038/s41598-021-01460-7

Cited by 21 publications

(14 citation statements)

References 31 publications

“…The smaller the hamming loss, the better the model performance. We also used average precision and average recall, where the partially correct concept is considered to calculate the average for all the samples. We used the Exact Match Accuracy, where the result would be considered correct when the predicted set of labels exactly matches the true label for each sample. We also calculated the AUROC score for each PFAS and calculated the average AUROC for each multilabel model (equation in SI Table S6). The development and evaluation of ML models were coded with the sklearn, xgboost, CatBoost, lightgbm, and PyTorch (TabNet) packages in Python.…”

Section: Methods (mentioning)

confidence: 99%

“…[42] We also used average precision and average recall, where the partially correct concept is considered to calculate the average for all the samples [43,44]. We used the Exact Match Accuracy, where the result would be considered correct when the predicted set of labels exactly matches the true label for each sample [45,42]. We also calculated the AUROC score for each PFAS and calculated the average AUROC for each multilabel model (equation in SI Table S6).…”

Section: Data Preprocessing To Train a Machine Learning (ML) Model (mentioning)

confidence: 99%
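To illustrate the multilabel metrics named in the statements above, here is a minimal Python sketch. It is not the citing authors' code. The function names and the toy label matrices are hypothetical. The per-sample precision and recall follow one common "example-based" reading of the "partially correct" concept, and returning 0.0 when a denominator is empty is an assumption.

```python
from typing import List, Sequence, Tuple

Labels = Sequence[Sequence[int]]  # one row per sample, one 0/1 entry per label


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return wrong / total


def exact_match_accuracy(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of samples whose full predicted label set equals the true set."""
    matches = sum(
        1 for true_row, pred_row in zip(y_true, y_pred)
        if list(true_row) == list(pred_row)
    )
    return matches / len(y_true)


def sample_avg_precision_recall(y_true: Labels, y_pred: Labels) -> Tuple[float, float]:
    """Per-sample precision and recall, averaged over samples (partial credit)."""
    precisions: List[float] = []
    recalls: List[float] = []
    for true_row, pred_row in zip(y_true, y_pred):
        hits = sum(1 for t, p in zip(true_row, pred_row) if t == 1 and p == 1)
        n_pred = sum(pred_row)
        n_true = sum(true_row)
        # Assumption: an empty denominator contributes 0.0 to the average.
        precisions.append(hits / n_pred if n_pred else 0.0)
        recalls.append(hits / n_true if n_true else 0.0)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n


# Toy data (hypothetical): 3 samples, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]

print("Hamming loss:", hamming_loss(y_true, y_pred))
print("Exact match accuracy:", exact_match_accuracy(y_true, y_pred))
print("Avg precision, avg recall:", sample_avg_precision_recall(y_true, y_pred))
```

On this toy data two of the nine label slots are wrong, so the Hamming loss is 2/9, while only one of the three samples matches exactly. The gap between the two shows why exact match is the stricter metric.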


Prediction of 35 Target Per- and Polyfluoroalkyl Substances (PFASs) in California Groundwater Using Multilabel Semisupervised Machine Learning

Dong, Tsai, Olivares. 2023. ACS EST Water.

Comprehensive monitoring of perfluoroalkyl and polyfluoroalkyl substances (PFASs) is challenging because of the high analytical cost and an increasing number of analytes. We developed a machine learning pipeline to understand environmental features influencing PFAS profiles in groundwater. By examining 23 public data sets (2016−2022) in California, we built a state-wide groundwater database (25,000 observations across 4200 wells) encompassing contamination sources, weather, air quality, soil, hydrology, and groundwater quality (PFASs and cocontaminants). We used supervised learning to prescreen total PFAS concentrations above 70 ng/L and multilabel semisupervised learning to predict 35 individual PFAS concentrations above 2 ng/L. Random forest with ADASYN oversampling performed the best for total PFASs (AUROC 99%). XGBoost with SMOTE oversampling achieved an AUROC of 73−100% for individual PFAS prediction. Contamination sources and soil variables contributed the most to accuracy. Individual PFASs were strongly correlated within each PFAS's subfamily (i.e., short- vs long-chain PFCAs, sulfonamides). These associations improved prediction performance using classifier chains, which predict a PFAS based on previously predicted species. We applied the model to reconstruct PFAS profiles in groundwater wells with missing data in previous years. Our approach can complement monitoring programs of environmental agencies to validate previous investigation results and prioritize sites for future PFAS sampling.

“…Once documents are tokenized, a text feature extraction method is applied to obtain the most distinguishing features of a text, reducing dimensionality [35]-[38]. Some of the broadly used feature extraction techniques in research article classification approaches are: 1) One Hot Encoding, 2) Bag of Words (BOW) or Term Frequency (TF), and 3) Term Frequency and Inverse Document Frequency (TF-IDF), and semantic-based approaches are: 1) GloVe, 2) FastText, and 3) Word2Vec [24].…”

Section: Related Work (mentioning)

confidence: 99%
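The count-based techniques listed in the statement above can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the cited paper. The function names and the toy documents are hypothetical, and the weighting used here (term frequency normalised by document length, times `log(N / df)`) is just one common TF-IDF variant; libraries often apply different smoothing.

```python
import math
from collections import Counter
from typing import Dict, List


def bag_of_words(tokens: List[str]) -> Counter:
    """Bag of Words / raw term frequency: count each token in one document."""
    return Counter(tokens)


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """TF-IDF weights per document, assuming tf = count / length, idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights: List[Dict[str, float]] = []
    for tokens in docs:
        counts = bag_of_words(tokens)
        length = len(tokens)
        weights.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy tokenized documents (hypothetical).
docs = [
    ["word", "embedding", "model"],
    ["word", "count", "model"],
    ["graph", "model"],
]

for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A term that occurs in every document, such as "model" here, gets an inverse document frequency of `log(1) = 0` and therefore carries no weight, which is how TF-IDF suppresses non-distinguishing features.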

A Comparison of Multi-Label Text Classification Models in Research Articles Labeled With Sustainable Development Goals

Morales-Hernandez, Gutiérrez, Becerra-Alonso. 2022. IEEE Access.

The classification of scientific articles aligned to Sustainable Development Goals is crucial for research institutions and universities when assessing their influence in these areas. Machine learning enables the implementation of massive text data classification tasks. The objective of this study is to apply Natural Language Processing techniques to articles from peer-reviewed journals to facilitate their classification according to the 17 Sustainable Development Goals of the 2030 Agenda. This article compares the performance of multi-label text classification models based on a proposed framework with datasets of different characteristics. Results reveal that a particular combination of a transformation method with a classifier algorithm dominates the performance results.


“…In multi-label text classification, the goal is to associate one or more labels to the input text. It is an important task that has applications in many tasks such as research article classification and metadata generation from documents [Mustafa et al. 2021; Sajid et al. 2011] that can be used for optimizing search engine indexing.…”

Section: Introduction (mentioning)

confidence: 99%

Exploiting Label Dependencies for Multi-Label Document Classification Using Transformers

Fallah, Bruno, Bellot, et al. 2023. Proceedings of the ACM Symposium on Document Engineering 2023.

We introduce in this paper a new approach to improve deep learning-based architectures for multi-label document classification. Dependencies between labels are an essential factor in the multi-label context. Our proposed strategy takes advantage of the knowledge extracted from label co-occurrences. The proposed method consists in adding a regularization term to the loss function used for training the model, in a way that incorporates the label similarities given by the label co-occurrences to encourage the model to jointly predict labels that are likely to co-occur, and not consider labels that are rarely present with each other. This allows the neural model to better capture label dependencies. Our approach was evaluated on three datasets: the standard AAPD dataset, a corpus of scientific abstracts and Reuters-21578, a collection of news articles, and a newly proposed multi-label dataset called arXiv-ACM. Our method demonstrates improved performance, setting a new state-of-the-art on all three datasets. CCS Concepts: • Applied computing → Document metadata; Digital libraries and archives; • Information systems → Digital libraries and archives; Content analysis and feature selection; Document collection models; • Computing methodologies → Neural networks.
