Tuning the Turkish Text Classification Process Using Supervised Machine Learning-based Algorithms

Köksal, Ömer

doi:10.1109/inista49547.2020.9194669

Cited by 11 publications

(13 citation statements)

References 14 publications

“…Also, tuning BOW size is one of the most effective methods to improve classification accuracy [9]. To investigate the effect of BOW size, we perform detailed tests using BOW size starting from 100, 500, and 1000 to 500,000 incrementing 1000 applying all nine classifiers on the dataset. Figures 5 and 6 show the F1-score variation for the different BOW sizes for the TTC-3600 and TTC-4900 datasets.…”

Section: BOW Size

mentioning

confidence: 99%

Improving automated Turkish text classification with learning‐based algorithms

Köksal

Yılmaz

2022

Concurrency and Computation

Text classification is the process of determining categories or tags of a document depending on its content. Although text classification is a well‐known process, it has many steps that require tuning to improve mathematical models. This article provides a novel methodology and expresses key points to improve text classification performance using learning‐based algorithms and techniques. First, to check the effectiveness of the proposed methodology, we selected two public Turkish news benchmarking datasets. Then, we performed extensive testing using both supervised machine learning algorithms and state‐of‐art pre‐trained language models. The experimental results show that our methodology outperforms previous news classification studies on these benchmarking datasets improving categorization results based on F1‐score. Therefore, we conclude that the presented methodology efficiently improves the classification results and selects the feasible classifier for a given dataset.
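The BOW-size sweep described in the citation statement above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the toy documents, the function names (`build_vocab`, `train_nb`, `predict_nb`, `macro_f1`, `sweep_bow_sizes`), the single Naive Bayes classifier, and the candidate sizes are all hypothetical stand-ins for the nine classifiers and the TTC-3600 and TTC-4900 datasets used in the cited study.

```python
import math
from collections import Counter

def build_vocab(docs, bow_size):
    """Keep the bow_size most frequent tokens of the training documents."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    return {tok for tok, _ in counts.most_common(bow_size)}

def train_nb(docs, labels, vocab):
    """Collect per-class document and token counts, restricted to vocab."""
    class_docs = Counter(labels)
    tok_counts = {c: Counter() for c in class_docs}
    for doc, label in zip(docs, labels):
        tok_counts[label].update(t for t in doc.lower().split() if t in vocab)
    return class_docs, tok_counts

def predict_nb(model, vocab, doc):
    """Multinomial Naive Bayes prediction with add-one smoothing."""
    class_docs, tok_counts = model
    total_docs = sum(class_docs.values())
    best, best_score = None, -math.inf
    for c in sorted(class_docs):
        denom = sum(tok_counts[c].values()) + len(vocab)
        score = math.log(class_docs[c] / total_docs)
        for t in doc.lower().split():
            if t in vocab:
                score += math.log((tok_counts[c][t] + 1) / denom)
        if score > best_score:
            best, best_score = c, score
    return best

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)

def sweep_bow_sizes(train, test, sizes):
    """Train and score one model per candidate BOW size."""
    train_docs, train_labels = zip(*train)
    test_docs, test_labels = zip(*test)
    results = {}
    for size in sizes:
        vocab = build_vocab(train_docs, size)
        model = train_nb(train_docs, train_labels, vocab)
        preds = [predict_nb(model, vocab, d) for d in test_docs]
        results[size] = macro_f1(list(test_labels), preds)
    return results

# Hypothetical toy data; a real run would load a news dataset instead.
TRAIN = [
    ("takım maçı kazandı", "spor"),
    ("takım gol attı", "spor"),
    ("borsa bugün yükseldi", "ekonomi"),
    ("borsa dolar düştü", "ekonomi"),
]
TEST = [("takım gol", "spor"), ("dolar borsa", "ekonomi")]

if __name__ == "__main__":
    for size, f1 in sweep_bow_sizes(TRAIN, TEST, [1, 5, 100]).items():
        print(f"BOW size {size}: macro F1 = {f1:.3f}")
```

In a real experiment the loop over `sizes` would cover the range quoted above and would be repeated for each classifier, so that the F1-score curve per classifier can be compared across BOW sizes.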

“…This article is the extended version of our previous study [9]. In this study, we classify two public Turkish news datasets used for benchmarking in this domain, namely TTC-3600 [10] and TTC-4900 [11] datasets. These datasets include Turkish news data and are widely used as benchmarking datasets in several studies.…”

mentioning

confidence: 99%

Improving automated Turkish text classification with learning‐based algorithms

Köksal

Yılmaz

2022

Concurrency and Computation

“…Although some researchers apply different parsing techniques for multilingual data [22], we use two-step lemmatization and two-step removal of the stop words, as shown in Figure 4. Details of this process are given in our previous study [23]. Since we are dealing with two different languages (Turkish, English), we have two different steps to eliminate stop words and two different steps for lemmatizing.…”

Section: Effect Of Preprocessing On Bug Classification

mentioning

confidence: 99%

Automated Classification of Unstructured Bilingual Software Bug Reports: An Industrial Case Study Research

Köksal

Tekinerdoğan

2021

Applied Sciences

Software bug report classification is a critical process to understand the nature, implications, and causes of software failures. Furthermore, classification enables a fast and appropriate reaction to software bugs. However, for large-scale projects, one must deal with a broad set of bugs from multiple types. In this context, manually classifying bugs becomes cumbersome and time-consuming. Although several studies have addressed automated bug classification using machine learning techniques, they have mainly focused on academic case studies, open-source software, and unilingual text input. This paper presents our automated bug classification approach applied and validated in an industrial case study. In contrast to earlier studies, our study is applied to a commercial software system based on unstructured bilingual bug reports written in English and Turkish. The presented approach adopts and integrates machine learning (ML), text mining, and natural language processing (NLP) techniques to support the classification of software bugs. The approach has been applied within an industrial case study. Compared to manual classification, our results show that bug classification can be automated and even performs better than manual bug classification. Our study shows that the presented approach and the corresponding tools effectively reduce the manual classification time and effort.
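The two-step stop-word removal and two-step lemmatization mentioned in the citation statement above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the tiny stop-word sets, the dictionary-based lemma tables, the function names, and the order of the four steps are hypothetical, and the cited study's actual tools and ordering (its Figure 4) are not reproduced on this page.

```python
# Hypothetical, deliberately tiny resources; a real pipeline would use
# full stop-word lists and proper lemmatizers for each language.
TURKISH_STOP_WORDS = {"ve", "bir", "bu", "için"}
ENGLISH_STOP_WORDS = {"the", "a", "is", "and"}
TURKISH_LEMMAS = {"hatası": "hata", "ekranda": "ekran"}
ENGLISH_LEMMAS = {"crashes": "crash", "buttons": "button"}

def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the given stop-word set."""
    return [t for t in tokens if t not in stop_words]

def lemmatize(tokens, lemma_map):
    """Map each token to its lemma; unknown tokens pass through unchanged."""
    return [lemma_map.get(t, t) for t in tokens]

def preprocess(text):
    """Apply one stop-word pass and one lemmatization pass per language."""
    tokens = text.lower().split()
    tokens = remove_stop_words(tokens, TURKISH_STOP_WORDS)  # step 1: Turkish
    tokens = remove_stop_words(tokens, ENGLISH_STOP_WORDS)  # step 2: English
    tokens = lemmatize(tokens, TURKISH_LEMMAS)              # step 3: Turkish
    tokens = lemmatize(tokens, ENGLISH_LEMMAS)              # step 4: English
    return tokens

if __name__ == "__main__":
    print(preprocess("The login ekranda crashes ve bir hatası"))
```

Running each language's pass separately means a mixed Turkish and English report does not need per-token language detection: tokens that a given pass does not recognise are simply left for the other pass.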

“…[69] Examples of setups and applications are (but not limited to) social media [70], healthcare [71-73], information retrieval [74], sentiment analysis [75-79], content-based recommender systems [80], document summarization [81, 82], various business and marketing applications [83-85], and legal document categorization [86]. A variety of languages were targeted over time for the popular text classification task, including well-studied languages, such as Arabic [87, 88], Turkish [83, 89-91], French [71, 92], Spanish [72], and Indian [93], as well as underresourced languages, such as Romanian [94]. The applied classification techniques range from shallow methods, such as Logistic Regression [95], SVM [96], and Naïve Bayes [97], to more complex and resource-hungry deep neural networks, such as CNNs [62, 98], Hierarchical Attention Networks (HANs) [99], and the powerful transformer-based methods that started to dominate the landscape in recent years.…”

Section: Text Classification

mentioning

confidence: 99%

The unreasonable effectiveness of machine learning in Moldavian versus Romanian dialect identification

Găman

Ionescu

2021

Int J of Intelligent Sys

Motivated by the seemingly high accuracy levels of machine learning (ML) models in Moldavian versus Romanian dialect identification and the increasing research interest on this topic, we provide a follow-up on the Moldavian versus Romanian Cross-Dialect Topic Identification (MRC) shared task of the VarDial 2019 evaluation campaign. The shared task included two subtask types: one that consisted in discriminating between the Moldavian and Romanian dialects and one that consisted in classifying documents by topic across the two dialects of Romanian. Participants achieved impressive scores, for example, the top model for Moldavian versus Romanian dialect identification obtained a macro-F1 score of 0.895. We conduct a subjective evaluation by human annotators, showing that humans attain much lower accuracy rates compared with ML models. Hence, it remains unclear why the methods proposed by participants attain such high accuracy rates. Our goal is to understand (i) why the proposed methods work so well (by visualizing the discriminative features) and (ii) to what extent these methods can keep their high accuracy levels, for example, when we shorten the text samples to single sentences or when we use tweets at inference time. A secondary goal of our
