Abstract. The paper proposes the use of the Silhouette Coefficient (SC) as a ranking measure to perform instance selection in text classification. Our selection criterion was to keep instances with mid-range SC values while removing the instances with high and low SC values. We evaluated our hypothesis across three well-known datasets and various machine learning algorithms. The results show that our method helps to achieve the best trade-off between classification accuracy and training time.
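The selection criterion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that class labels play the role of clusters, that distances are Euclidean, and that "mid-range" means a closed interval [low, high]. The function names `silhouette_values` and `select_mid_range` and the thresholds are hypothetical, since the abstract does not state how the range was chosen.

```python
import math


def silhouette_values(points, labels):
    """Return the silhouette value s(i) of every point (Euclidean distance).

    Assumption: each class label is treated as one cluster.
    """
    n = len(points)
    classes = sorted(set(labels))
    values = []
    for i in range(n):
        # Mean distance from point i to the members of each class, excluding i.
        mean_dist = {}
        for c in classes:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                total = sum(math.dist(points[i], points[j]) for j in members)
                mean_dist[c] = total / len(members)
        own = labels[i]
        others = [d for c, d in mean_dist.items() if c != own]
        if own not in mean_dist or not others:
            # Singleton class, or only one class present: s(i) = 0 by convention.
            values.append(0.0)
            continue
        a = mean_dist[own]   # cohesion: mean distance within the own class
        b = min(others)      # separation: mean distance to the nearest other class
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def select_mid_range(points, labels, low, high):
    """Return the indices of instances whose silhouette value lies in [low, high].

    `low` and `high` are hypothetical thresholds, not values from the paper.
    """
    scores = silhouette_values(points, labels)
    return [i for i, s in enumerate(scores) if low <= s <= high]


if __name__ == "__main__":
    # Toy 2-D data: two compact classes plus one instance near the other class.
    pts = [(0, 0), (0, 2), (1, 1), (9, 1), (10, 0), (10, 2), (11, 1)]
    lab = [0, 0, 0, 0, 1, 1, 1]
    for idx, score in enumerate(silhouette_values(pts, lab)):
        print(idx, round(score, 3))
    print("kept:", select_mid_range(pts, lab, low=0.2, high=0.8))
```

This version computes all pairwise distances, so it costs O(n²) distance evaluations; for real text collections a vectorized library routine would be the more practical choice.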
“…I(x,y) represents the mutual information between x and y, H(x) and H(y) are the entropy of x and y. NMI is defined as shown in Eq. (14): SC is another evaluation index of clustering results, originally proposed by Peter J. Rousseeuw in 1986 [37]. It combines the two factors of intra-cluster and inter-cluster, which can be calculated as shown in Eqs.…”
To address the long turnaround, high cost, invasive sampling damage, and frequent emergence of drug resistance in lung cancer gene detection, a reliable and non-invasive prognostic method is proposed. Under the guidance of weakly supervised learning, deep metric learning and graph clustering are used to learn higher-level abstract features from CT imaging features. Unlabeled data are dynamically updated through a k-nearest label update strategy: unlabeled data are first converted into weakly labeled data and then progressively into strongly labeled data, which optimizes the clustering results and yields a classification model for predicting new imaging subtypes of lung cancer. Five imaging subtypes are confirmed on a lung cancer dataset containing CT, clinical, and genetic information downloaded from the TCIA lung cancer database. The resulting model classifies subtypes with high accuracy (ACC = 0.9793), and CT sequence images, gene expression, DNA methylation, and gene mutation data from a cooperating hospital in Shanxi Province are used to demonstrate the biomedical value of the method. The proposed method can also comprehensively evaluate intratumoral heterogeneity based on the correlation between the final lung CT imaging features and specific molecular subtypes.
“…This preprocessing type is known as instance selection. The silhouette coefficient (Dey et al 2011) was used as the criterion for detecting potentially noisy signals:…”
This paper describes a machine learning solution for detecting defective embedded bearings in home appliances by sound analysis. The bearings are installed deep inside the appliances at the beginning of the production process and cannot be physically accessed once the appliances are fully assembled. Before a home appliance is put on sale, it is turned on and passed through a sound-based sensor that produces an acoustic signal. Appliances with defective embedded bearings are detected by analyzing these signals. The task is very challenging, mainly because only a small number of sample signals is available and the noise level in the measurements is quite high. In fact, it is shown that the noise is strong enough to mask important components when traditional Fourier decomposition techniques are applied. Hence, a different approach is needed. Experimental results are reported on both laboratory and production line signals. Despite the difficulty of the task, these results are encouraging. Several classification methods were evaluated and most of them achieved acceptable performance. An interesting finding is that some of the best-performing classifiers are highly intuitive and easy to implement. Such methods are generally preferred in industry. The proposed solution is being implemented by the company that motivated this study.
“…This value is helpful in denoting the cohesiveness of the data in one cluster and the separation of data in one cluster from those in the other clusters. This coefficient has been used in text classification not only to analyze the quality of the clustering but also as a feature selection technique [Dey et al, 2011]. In clustering tasks, the SC is calculated for each of the documents in the clusters in order to evaluate the clustering solution.…”
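For reference, the cohesion and separation mentioned in this passage correspond to the two terms of the standard silhouette definition. The symbols below are the conventional ones and are an assumption here, since they are not taken from the quoted passage: a(i) is the mean distance from instance i to the other members of its own cluster, and b(i) is the smallest mean distance from i to the members of any other cluster.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

Values near 1 indicate an instance that sits well inside its own cluster, values near 0 an instance close to a boundary between clusters, and negative values an instance that is on average closer to another cluster than to its own.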