Analysis of preprocessing methods on classification of Turkish texts

Torunoglu, Dilara; Cakirman, Erhan; Ganiz, Murat Can; Akyokuş, Selim; Gürbüz, Mustafa Zahid

doi:10.1109/inista.2011.5946084

Cited by 49 publications

(30 citation statements)

References 13 publications


“…Reuters-21578 [3]-[10] and 20Newsgroups [5], [6] datasets, consisting of English text content, are widely used to provide a general evaluation related to applied methods. Datasets which are composed of different sources and languages such as e-mail [4], SMS [4], news text [11], [12], technical paper [9], medical journals [13] and chemical web pages [10] are used to reveal the effect of classification methods on the other languages. Datasets containing Turkish documents are limited in number and they are not regarded as standard datasets yet.…”

Section: A Related Work (mentioning)

confidence: 99%

“…Datasets containing Turkish documents are limited in number and they are not regarded as standard datasets yet. Some of them are as follows; 6-class 2 imbalanced datasets formed with news obtained from RSS source [11], and 5, 6 and 9-class 3 balanced datasets formed with columns and news [12]. Since there is not a standard dataset consisting of Turkish content, the evaluation of effects of the techniques on Turkish content cannot be done.…”

Section: A Related Work (mentioning)

confidence: 99%

“…Liu et al [9] used feature selection methods for term weighting in their studies. Furthermore, a feature selection may not have the same effect on all classification algorithms; a feature selection producing the best results for an algorithm may not necessarily produce the same results for another algorithm [12].…”

Section: A Related Work (mentioning)

confidence: 99%

“…The process of converting unstructured documents into structured form was completed with the numerical expression of terms in document as a result of weighting. This process starts with preprocessing and ends with term weighting and formation of document vectors [12].…”

Section: Term Weighting (mentioning)

confidence: 99%
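
The pipeline described in the statement above (preprocessing, then term weighting, then document vectors) can be illustrated with a minimal sketch. It is not the method of the cited paper: the stop-word list, the function names `preprocess` and `tfidf_vectors`, and the choice of a plain tf * log(N / df) weight are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; not taken from the cited papers.
STOP_WORDS = {"ve", "bir", "bu"}

def preprocess(text):
    """Lowercase, keep runs of letters as tokens, and drop stop words.

    Note: str.lower() does not apply Turkish casing rules (e.g. "I" becomes
    "i", not "ı"), so a real Turkish pipeline would need extra handling.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using the weight tf * log(N / df)."""
    tokenized = [preprocess(d) for d in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors
```

With this weighting, a term that occurs in every document gets weight zero, which is why such terms contribute nothing to distinguishing classes.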


Examining the Impact of Feature Selection Methods on Text Classification

Karaca, Bayir (2017), IJACSA

Abstract: Feature selection, which aims to determine and select the distinctive terms that best represent a document, is one of the most important steps of classification. With feature selection, the dimension of document vectors is reduced and, consequently, the duration of the process is shortened. In this study, feature selection methods were studied in terms of dimension reduction rates, classification success rates, and the relation between dimension reduction and classification success. As classifiers, kNN (k-Nearest Neighbors) and SVM (Support Vector Machines) were used. Five standard (Odds Ratio-OR, Mutual Information-MI, Information Gain-IG, Chi-Square-CHI and Document Frequency-DF), two combined (Union of Feature Selections-UFS and Correlation of Union of Feature Selections-CUFS) and one new (Sum of Term Frequency-STF) feature selection methods were tested. The application was performed by selecting 100 to 1000 terms (with an increment of 100 terms) from each class. It was seen that kNN produces much better results than SVM. STF was found to be the most successful feature selection method considering the average values in both datasets. It was also found that CUFS, a combined model, is the one that reduces the dimension the most; accordingly, CUFS classifies the documents more successfully, with fewer terms and in a shorter time, than many of the standard methods.
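
As a rough illustration of per-class feature selection as described in the abstract, the sketch below ranks terms by Document Frequency (DF), one of the standard methods it lists, keeps the top k terms of each class, and takes their union. This is an assumption-laden sketch and not the authors' implementation: the function name `select_by_document_frequency`, the alphabetical tie-breaking rule, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def select_by_document_frequency(docs, labels, k):
    """Pick the k terms with the highest document frequency in each class.

    docs: list of token lists; labels: one class label per document.
    Returns the union of the per-class selections as a sorted list.
    """
    df_per_class = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        # Count each term at most once per document.
        df_per_class[label].update(set(tokens))
    selected = set()
    for counts in df_per_class.values():
        # Descending DF; ties broken alphabetically (an assumption, for determinism).
        ranked = sorted(counts, key=lambda t: (-counts[t], t))
        selected.update(ranked[:k])
    return sorted(selected)

# Toy usage with hypothetical data: two classes, two documents each.
docs = [["gol", "maç", "gol"], ["maç", "takım"], ["borsa", "faiz"], ["borsa", "dolar"]]
labels = ["spor", "spor", "ekonomi", "ekonomi"]
features = select_by_document_frequency(docs, labels, k=1)
```

Only the selected terms would then be kept as dimensions of the document vectors, which is how selection reduces dimensionality before classification.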


“…In order to get good results, this step plays a very important role in our system. The impact of pre-processing in the field of text classification is extensively studied, and research on various languages like Arabic, Turkish, and Portuguese [14], [15], [16] support our motivation behind doing pre-processing at this step. It has already proven that preprocessing takes almost 80% of the total time in classification process [17].…”

Section: A Pre-processing Of Text (mentioning)

confidence: 99%

Study of Automatic Extraction, Classification, and Ranking of Product Aspects Based on Sentiment Analysis of Reviews

Rafi, Farooq, Noman et al. (2015), IJACSA

Abstract: It is very common for a customer to read reviews about a product before making a final decision to buy it. Customers are always eager to get the best and most objective information about the product they wish to purchase, and reviews are the major source of this information. Although reviews are easily accessible from the web, most of them carry ambiguous opinions and differ in structure, so it is often very difficult for a customer to filter the information he actually needs. This paper suggests a framework which provides a single user interface solution to this problem based on sentiment analysis of reviews. First, it extracts all the reviews from different websites with varying structure, and gathers information about relevant aspects of the product. Next, it performs sentiment analysis around those aspects and gives them sentiment scores. Finally, it ranks all extracted aspects and clusters them into positive and negative classes. The final output is a graphical visualization of all positive and negative aspects, which provides the customer with easy, comparable, and visual information about the important aspects of the product. The experimental results on five different products with 5000 reviews show 78% accuracy. Moreover, the paper also explains the effect of Negation, Valence Shifter, and Diminisher with a sentiment lexicon on sentiment analysis, and concludes that they are all independent of the case problem and have no effect on the accuracy of sentiment analysis.


Improving automated Turkish text classification with learning‐based algorithms

Köksal, Yılmaz (2022), Concurrency and Computation

Text classification is the process of determining categories or tags of a document depending on its content. Although text classification is a well‐known process, it has many steps that require tuning to improve mathematical models. This article provides a novel methodology and expresses key points to improve text classification performance using learning‐based algorithms and techniques. First, to check the effectiveness of the proposed methodology, we selected two public Turkish news benchmarking datasets. Then, we performed extensive testing using both supervised machine learning algorithms and state‐of‐art pre‐trained language models. The experimental results show that our methodology outperforms previous news classification studies on these benchmarking datasets improving categorization results based on F1‐score. Therefore, we conclude that the presented methodology efficiently improves the classification results and selects the feasible classifier for a given dataset.
