A novel oversampling method based on SeqGAN for imbalanced text classification

Luo, Yin; Feng, Haishan; Weng, Xuanlong; Huang, Ke; Zheng, Huang

doi:10.1109/bigdata47090.2019.9006138

Cited by 7 publications

(2 citation statements)

References 14 publications

“…From the original dataset, to prepare data for oversampling, the data is manually observed to seek for a patch of data that contains highest frequency of small quantity class distributions like "action", "substance", and "material" to gain more weightage for these classes. This study chooses 700 data in sequence as the patch data so that the training will be more accurate as in Balakrishnan and Lloyd-Yemoh [23] and Luo et al, [29]. The data is from data number 700 to 1400 which is partitioned into 563 training data and 140 test data.…”

Section: The Steps of HMM Construction (mentioning)

confidence: 99%

Named Entity Recognition of an Oversampled and Preprocessed Manufacturing Data Corpus

Nurul Hannah Mohd Yusof,

Nurul Adilla Mohd Subha,

Nurulaqilla Khamis

et al. 2023

ARASET

In today's manufacturing industry, improving the manufacturing process is of paramount importance. One area with great potential is the use of maintenance data. By leveraging this data effectively, manufacturers can optimize maintenance schedules, increasing efficiency, reducing costs, and minimizing downtime. The challenge lies in handling vast amounts of maintenance data that often come in various formats, which makes it difficult to extract valuable insights. Without proper analysis, this unprocessed data can result in unforeseen issues, costly disruptions, and extended downtime. To overcome this obstacle, modern manufacturing companies are turning to advanced technologies such as language modelling, text classification, machine translation, and Named Entity Recognition (NER). To the best of our knowledge, no investigation has assessed the impact of text preprocessing on NER performance. Improving the initial stage of NER, such as text preprocessing, can enhance NER performance and, in turn, the efficiency of model training. In this study, a Hidden Markov Model (HMM) is employed to improve NER performance by using oversampling and text preprocessing techniques. The study is performed without IOB labelling and considers seven specific entities; the text preprocessing tasks include tokenization, lemmatization, punctuation removal, stop-word removal, and elimination of long and short words. As a result, the HMM for NER with oversampling and preprocessed text outperformed the HMM without them by 20.10% and 27.59%, respectively, owing to the consideration of significant classes and words among the entity classes in preprocessed factory reports. This finding highlights the importance of selecting the text preprocessing method in NER and its capability to optimize maintenance schedules and reduce downtime.
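The preprocessing tasks listed in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the stop-word set, the lemma lookup table, the length thresholds, and the function name `preprocess` are all assumptions made for the example, and a real system would use a full stop-word list and a proper lemmatizer.

```python
import re

# Hypothetical, deliberately tiny resources (assumptions for this sketch).
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in"}
LEMMAS = {"bearings": "bearing", "replaced": "replace", "leaking": "leak"}


def preprocess(text, min_len=2, max_len=15):
    """Tokenize, strip punctuation, lemmatize, drop stop words and length outliers."""
    # Lowercase and keep alphanumeric runs: this both tokenizes and removes punctuation.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # A dictionary lookup stands in for lemmatization.
    tokens = [LEMMAS.get(t, t) for t in tokens]
    # Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Remove words that are too short or too long (thresholds are assumed).
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(preprocess("The bearings, was replaced!"))
```

The order of the steps here is one plausible choice; the abstract does not state the order the authors used.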

“…Balancing methods fall into three types: reducing the number of majority-class instances (undersampling), increasing the number of minority-class instances (oversampling), and hybrid methods. The first approach removes some majority-class data (see, for example, [18][19]); the second replicates existing minority-class instances or creates new ones [20][21][22]; and hybrid methods aim to combine the advantages of both approaches [23]. Modern neural network architectures for text analysis take both local and global contexts into account.…”
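As a rough illustration of the first two balancing families named in this statement, the sketch below implements plain random undersampling and random oversampling. It is an assumption-laden toy: the function names are hypothetical, and random duplication is used in place of any generative method (it is not the SeqGAN-based oversampling of the cited paper).

```python
import random
from collections import Counter, defaultdict


def _group_by_class(samples, labels):
    """Collect samples under their class label, preserving input order."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    return by_class


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes match the largest."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def random_undersample(samples, labels, seed=0):
    """Drop majority-class samples at random until all classes match the smallest."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        out_samples.extend(rng.sample(items, target))
        out_labels.extend([label] * target)
    return out_samples, out_labels


texts = ["a", "b", "c", "d"]
classes = ["maj", "maj", "maj", "min"]
print(Counter(random_oversample(texts, classes)[1]))
print(Counter(random_undersample(texts, classes)[1]))
```

A hybrid method, in this framing, would apply both functions with an intermediate target size rather than the maximum or minimum class count.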

Section: Introduction (unclassified)

Text sampling strategies for predicting missing bibliographic links

Краснов,

Смазневич,

Baskakova

2022

Proceedings of ISP RAS

The paper proposes various strategies for sampling text data when performing automatic sentence classification for the purpose of detecting missing bibliographic links. We construct samples based on sentences as semantic units of the text and add their immediate context which consists of several neighbouring sentences. We examine a number of sampling strategies that differ in context size and position. The experiment is carried out on the collection of STEM scientific papers. Including the context of sentences into samples improves the result of their classification. We automatically determine the optimal sampling strategy for a given text collection by implementing an ensemble voting when classifying the same data sampled in different ways. Sampling strategy taking into account the sentence context with hard voting procedure leads to the classification accuracy of 98% (F1-score). This method of detecting missing bibliographic links can be used in recommendation engines of applied intelligent information systems. Keywords: text sampling, sampling strategy, citation analysis, bibliographic link prediction, sentence classification.
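The two ideas in this abstract, sampling a sentence together with its neighbouring sentences and combining classifiers by hard voting, can be sketched as below. This is only an illustration under assumptions: the function names, the `before`/`after` parameters, and the example labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def sample_with_context(sentences, index, before=1, after=1):
    """Return the target sentence joined with up to `before`/`after` neighbours."""
    start = max(0, index - before)
    end = min(len(sentences), index + after + 1)
    return " ".join(sentences[start:end])


def hard_vote(predictions):
    """Return the label predicted most often across the classifiers."""
    return Counter(predictions).most_common(1)[0][0]


sentences = ["s0", "s1", "s2", "s3"]
# Three sampling strategies for the same target sentence (index 2).
variants = [
    sample_with_context(sentences, 2, before=1, after=1),
    sample_with_context(sentences, 2, before=2, after=0),
    sample_with_context(sentences, 2, before=0, after=1),
]
print(variants)
# Hypothetical labels from classifiers trained on each sampling strategy.
print(hard_vote(["link", "none", "link"]))
```

Each sampling strategy would feed its own classifier, and the majority label across those classifiers is taken as the final decision.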