Abstract: In recent years, the problem of learning from imbalanced medical datasets has received growing attention, because medical datasets encountered in real life are frequently imbalanced. Many studies examining how classifiers behave in imbalanced settings have emphasized that the substantial loss in performance stems from the skewed class distribution in the datasets. In the literature, the Synthetic Minority Over-sampling Technique (SMOTE) algorithm has been proposed to address this skewness problem. In this study, an experimental analysis was carried out to predict, at a higher rate, whether patients admitted to hospital as suspected Covid-19 cases have a negative or positive SARS-CoV-2 test result, using SMOTE and an artificial neural network (ANN) model on commonly collected laboratory test results. Classifying the original dataset with the ANN yielded an accuracy of 0.86 and an F-measure of 0.48, whereas classifying the SMOTE-balanced dataset with the same ANN yielded an accuracy of 0.90 and an F-measure of 0.68. Classifying the SMOTE-balanced Covid-19 dataset with the ANN therefore produced better results. Finally, the original and SMOTE-balanced datasets were compared, and balancing was found to improve the classifier's other performance measures as well.
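The oversampling step named in this abstract can be illustrated with a minimal sketch of SMOTE-style interpolation. It is not the authors' implementation: the function name `smote`, its parameters (`n_new`, `k`, `seed`), and the toy points are assumptions made for illustration, and only the Python standard library is used.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    minority: list of equal-length numeric feature lists (minority class only).
    k: number of nearest minority neighbours considered per sample.
    All names here are illustrative, not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by Euclidean distance to x.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(x, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between x and the neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy minority class: four points on the corners of the unit square.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(points, n_new=10, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies.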
In this study, Turkish and English tweets collected through the Twitter Application Programming Interface (API) between 1 and 31 January 2021 are analyzed with respect to Covid-19. The collected tweets are preprocessed, labeled with the Vader Sentiment library, and then analyzed by topic modeling with Nonnegative Matrix Factorization. The analysis shows that the most frequently mentioned word after “Covid” is “vaccine/aşı”. The topics modeled in the study are grouped into themes, and the themes are found to be similar in both languages, which suggests that the Turkish and world agendas did not differ greatly in their themes during the pandemic. Moreover, hypothesis tests are conducted to understand whether language and time period are related to sentiment class. The results show that Turkish tweets were more neutral about Covid-19 than tweets from the rest of the world during the given period. In addition, independent of language, there are more negative and neutral tweets in the first half of January 2021, whereas there are more positive tweets in the second half of the month. To the best of our knowledge, this is the first study to analyze Covid-19 related tweets in two languages to compare the local and global agendas using topic modeling, sentiment analysis, and hypothesis testing methods.
In recent years, there have been great improvements in data classification using machine learning methods. As technology advances, the volume of data on the internet and in other environments also grows rapidly. With these developments, unbalanced and unclassified data have emerged. A dataset is imbalanced when one of the two classes has fewer samples than the other. Most datasets, especially those used in the medical field, have an unbalanced distribution, and such a distribution negatively affects the performance of classification algorithms. Many studies have been conducted to balance and classify such data. These studies work at the data level and the algorithm level and rely on undersampling and oversampling. In this study, the existing samples belonging to the minority class are resampled synthetically to balance the datasets. For the resampling process, the closest neighbors of every minority-class data point are determined among the minority-class samples using the Euclidean distance metric. Based on these neighbors, the desired number of new synthetic samples is created between each pair of samples using the Weighted Geometric Mean. In addition, the Random Undersampling (RUS), Random Oversampling (ROS), and Synthetic Minority Over-sampling Technique (SMOTE) methods are used to balance the datasets. The raw and balanced datasets are classified using the Random Forest algorithm, and the results are compared. An increase is observed in all performance values for the datasets balanced with the new resampling approach, which improves classification performance compared to both the raw datasets and the other balancing methods.
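The resampling idea described in this abstract (Euclidean nearest neighbors within the minority class, then a weighted geometric mean between a sample and a neighbor) might look roughly like the sketch below. The abstract does not state how the weights are chosen or how non-positive feature values are handled, so the uniform random weight `w`, the requirement of strictly positive features, and the names `weighted_geometric_mean` and `wgm_oversample` are all assumptions made for illustration rather than the published method.

```python
import math
import random


def weighted_geometric_mean(x, y, w):
    """Feature-wise weighted geometric mean x**w * y**(1 - w).

    Assumes strictly positive features, since the geometric mean
    is computed through logarithms.
    """
    if any(a <= 0 or b <= 0 for a, b in zip(x, y)):
        raise ValueError("features must be strictly positive")
    return [math.exp(w * math.log(a) + (1 - w) * math.log(b))
            for a, b in zip(x, y)]


def wgm_oversample(minority, per_sample=1, k=3, seed=0):
    """Create per_sample synthetic points for each minority sample."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for i, x in enumerate(minority):
        # Nearest minority neighbours of x by Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(x, p))
        for _ in range(per_sample):
            neighbour = rng.choice(others[:k])
            w = rng.random()  # assumed weighting: uniform in [0, 1)
            synthetic.append(weighted_geometric_mean(x, neighbour, w))
    return synthetic


# Toy minority class with strictly positive features.
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 1.0], [4.0, 3.0]]
new_points = wgm_oversample(points, per_sample=2, k=2)
```

For each feature, the weighted geometric mean of two positive values lies between them, so the synthetic points stay within the range spanned by a sample and its chosen neighbor, much as linear interpolation does in SMOTE.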