Topic models and neural networks can discover meaningful low-dimensional latent representations of text corpora; as such, they have become a key technology for document representation. However, such models treat all documents as non-discriminative, so the latent representation of each document depends on all other documents and cannot be discriminative. To address this problem, we propose a semi-supervised manifold-inspired autoencoder to extract meaningful latent representations of documents, taking the local perspective that the latent representations of nearby documents should be correlated. We first determine the set of discriminative neighbors using Euclidean distance in the observation space. Then, the autoencoder is trained by jointly minimizing the Bernoulli cross-entropy error between input and output and the sum of the squared errors between the neighbors of the input and the output. Results on two widely used corpora show that our method yields at least a 15% improvement in document clustering and a nearly 7% improvement in classification tasks compared with the baseline methods. The evidence demonstrates that our method can readily capture more discriminative latent representations of new documents. Moreover, some meaningful combinations of words can be efficiently discovered by activating features, which promotes the comprehensibility of the latent representation.
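The joint objective described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `weight` parameter that balances the two terms, and the purely distance-based neighbor selection (which ignores the label information a semi-supervised method would use) are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def nearest_neighbors(docs, index, k):
    """Indices of the k documents closest to docs[index], excluding itself.

    Assumption: neighbors are chosen by distance only; the abstract's
    "discriminative" neighbor set presumably also uses labels.
    """
    others = [j for j in range(len(docs)) if j != index]
    others.sort(key=lambda j: euclidean(docs[index], docs[j]))
    return others[:k]


def bernoulli_cross_entropy(x, x_hat, eps=1e-12):
    """Cross-entropy between a target x in [0, 1] and a reconstruction x_hat."""
    return -sum(
        xi * math.log(max(xh, eps)) + (1 - xi) * math.log(max(1 - xh, eps))
        for xi, xh in zip(x, x_hat)
    )


def joint_loss(x, x_hat, neighbors, weight=1.0):
    """Reconstruction error plus squared error between x_hat and each neighbor.

    `weight` is a hypothetical trade-off parameter, not taken from the source.
    """
    reconstruction = bernoulli_cross_entropy(x, x_hat)
    neighbor_term = sum(
        sum((ni - xh) ** 2 for ni, xh in zip(n, x_hat)) for n in neighbors
    )
    return reconstruction + weight * neighbor_term


docs = [[1, 0, 1], [1, 0, 0], [0, 1, 0]]
neighbor_ids = nearest_neighbors(docs, 0, k=1)
loss = joint_loss(docs[0], [0.5, 0.5, 0.5], [docs[j] for j in neighbor_ids])
```

In a real model, `x_hat` would be the decoder output and the loss would be minimized by gradient descent over the autoencoder weights; here it is evaluated for a fixed reconstruction only.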
Having a system to stratify individuals according to risk is key to clinical disease prevention. This allows individuals identified at different risk tiers to benefit from further investigation and intervention. However, the same risk score estimated for two different persons does not mean that they need the same further investigation, nor that the two persons have similar health conditions. Meanwhile, users still do not know a priori what most of the risk tiers are or how many tiers should be found in risk stratification. In this paper, the proposed pairwise and size constrained Kmeans (PSCKmeans) method simultaneously integrates limited supervised information and size constraints to screen the high-risk population based on similarity measurement, and obtains a feasible and balanced stratification solution that avoids clusters with few points. Results on the China Health and Nutrition Survey public dataset and a follow-up dataset show that the proposed PSCKmeans method can naturally grade the risk of diabetes into four tiers, and achieves a sensitivity of 73.8%, a specificity of 85.1%, and a ratio of minimum to expected of 0.95% on testing data. The proposed method compares favorably with eight previous semisupervised clustering methods; this demonstrates that semisupervised clustering that unifies multiple forms of constraints can guide a good partition that is more relevant for the domain and can find new categories through prior knowledge. Finally, this risk stratification model can provide a tool for risk stratification of clinical disease and can be used for further intervention for people with similar health conditions.
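The three evaluation measures reported in this abstract can be sketched as follows. This is a hedged illustration only: the abstract does not define "ratio of minimum to expected", so the code assumes it means the size of the smallest cluster divided by the size each cluster would have under a perfectly balanced partition (n / k). All function names are hypothetical.

```python
from collections import Counter


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity for binary labels (1 = high risk)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


def min_to_expected_ratio(cluster_labels, n_clusters):
    """Smallest cluster size divided by the balanced size n / k.

    Assumed definition; clusters are expected to be labeled 0..n_clusters-1.
    """
    counts = Counter(cluster_labels)
    smallest = min(counts.get(c, 0) for c in range(n_clusters))
    expected = len(cluster_labels) / n_clusters
    return smallest / expected


sens, spec = sensitivity_specificity(
    [1, 1, 1, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 1, 0, 1],
)
balance = min_to_expected_ratio([0, 0, 0, 1, 1, 2, 2, 2], n_clusters=3)
```

Under this assumed definition, a value near 1 indicates that no tier is much smaller than a balanced split would make it, which is the property the size constraints are meant to protect.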
Laser point cloud filtering is a fundamental step in various applications of light detection and ranging (LiDAR) data. The progressive triangulated irregular network (TIN) densification (PTD) filtering algorithm is a classic method and is widely used due to its robustness and effectiveness. However, the performance of the PTD filtering algorithm depends on the quality of the initial TIN-based digital terrain model (DTM). The filtering effect is also limited by the need to tune a number of parameters to cope with various terrains. Therefore, an improved PTD filtering algorithm based on a multiscale cylindrical neighborhood (PTD-MSCN) is proposed and implemented to enhance the filtering effect in complex terrains. In the PTD-MSCN algorithm, the multiscale cylindrical neighborhood is used to obtain and densify ground seed points to create a high-quality DTM. By linearly decreasing the radius of the cylindrical neighborhood and the distance threshold, the PTD-MSCN algorithm iteratively finds ground seed points and removes object points. To evaluate the performance of the proposed PTD-MSCN algorithm, it was applied to 15 benchmark LiDAR datasets provided by the International Society for Photogrammetry and Remote Sensing (ISPRS) commission. The experimental results indicated that the average total error can be decreased from 5.31% when the same parameter set is used to 3.32% when the parameters are optimized. Compared with five other published PTD filtering algorithms, the proposed PTD-MSCN algorithm is not only superior in accuracy but also more robust.
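The idea of linearly decreasing the neighborhood radius and the distance threshold across iterations can be sketched as below. This is a simplified stand-in and not the published PTD-MSCN algorithm: it involves no TIN, and the rule used here (keep a point if its height above the lowest point inside its vertical cylinder is within the threshold) is an assumption chosen only to show the multiscale schedule. All names and parameter values are hypothetical.

```python
import math


def linear_schedule(start, end, steps):
    """Values decreasing linearly from start to end over `steps` iterations."""
    if steps < 2:
        return [start]
    step = (start - end) / (steps - 1)
    return [start - i * step for i in range(steps)]


def cylinder_neighbors(points, center, radius):
    """Points whose horizontal (x, y) distance to center is within radius."""
    cx, cy = center[0], center[1]
    return [p for p in points if math.hypot(p[0] - cx, p[1] - cy) <= radius]


def filter_ground(points, radii, thresholds):
    """Iteratively drop points that sit too far above their local minimum.

    Each pass uses a smaller cylinder and a tighter height threshold.
    """
    ground = list(points)
    for radius, threshold in zip(radii, thresholds):
        kept = []
        for p in ground:
            # The point itself is always inside its own cylinder.
            lowest = min(q[2] for q in cylinder_neighbors(ground, p, radius))
            if p[2] - lowest <= threshold:
                kept.append(p)
        ground = kept
    return ground


# Toy (x, y, z) points: three near the ground and one elevated object point.
cloud = [(0, 0, 0.0), (1, 0, 0.1), (2, 0, 5.0), (3, 0, 0.2)]
radii = linear_schedule(3.0, 1.5, 2)
thresholds = linear_schedule(1.0, 0.5, 2)
ground_points = filter_ground(cloud, radii, thresholds)
```

Starting with a large cylinder makes it more likely that each neighborhood contains a true ground point, while the later, smaller scales apply a stricter threshold to the points that survive.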