Social media platforms such as Twitter connect billions of people by allowing them to exchange their thoughts via short-text communication. Topic modelling is a widely used technique for analysing short texts. Distance-based, density-based and dimensionality reduction-based methods struggle to discover topic clusters in short-text collections, because the high dimensionality and short length of the texts result in extremely sparse text representation matrices. We propose a ‘neighbourhood-based assistance’-driven non-negative matrix factorization (NMF) method that effectively projects high-dimensional, sparse short-text representations into a lower-dimensional space. We use NMF, which aligns with the natural non-negativity of text data, coupled with symmetric document affinity information to identify the topic distribution in short text. Neighbourhood information among documents is captured using Jaccard similarity to compensate for the information lost in the higher-to-lower-dimensional projection. Quantitative experiments on Twitter data sets show that the proposed approach attains higher accuracy than state-of-the-art methods, while qualitative analysis with case studies validates its ability to generate meaningful topic clusters.
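The two ingredients named above, a Jaccard-based symmetric document affinity and NMF, can be illustrated with a minimal sketch. This is not the authors' objective function: the helper names `jaccard_affinity` and `nmf`, the toy documents, whitespace tokenisation, and the choice to factorize the affinity matrix directly with standard Lee-Seung multiplicative updates are all assumptions made for illustration.

```python
import numpy as np

def jaccard_affinity(docs):
    """Symmetric document-document affinity from token-set overlap (illustrative helper)."""
    sets = [set(d.lower().split()) for d in docs]
    n = len(sets)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            union = sets[i] | sets[j]
            sim = len(sets[i] & sets[j]) / len(union) if union else 0.0
            S[i, j] = S[j, i] = sim
    return S

def nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Plain NMF (A ~ W @ H) via Lee-Seung multiplicative updates on the Frobenius loss."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k)) + eps
    H = rng.random((k, A.shape[1])) + eps
    for _ in range(iters):
        # Multiplicative updates keep both factors non-negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy corpus: two short texts per theme.
docs = [
    "nmf topic model for short text",
    "topic model for tweets with nmf",
    "football match score tonight",
    "tonight football match result",
]
S = jaccard_affinity(docs)   # neighbourhood information between documents
W, H = nmf(S, k=2)           # low-dimensional non-negative document factors
labels = W.argmax(axis=1)    # assign each document to its strongest factor
```

Each row of `W` can be read as a document's weight on each latent topic. The proposed method couples the affinity with the text representation rather than factorizing the affinity alone, so this sketch should be read only as a simplified stand-in.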
Outlier detection in text data collections has become significant due to the need to find anomalies in the myriad of text data sources. High feature dimensionality, together with the large size of these document collections, calls for outlier detection methods that are both accurate and efficient. Traditional outlier detection methods face several challenges when dealing with text data, including data sparseness, distance concentration, and the presence of a large number of sub-groups. In this article, we address these issues by developing novel concepts: representing documents with rare document frequency, finding a ranking-based neighborhood for similarity computation, and identifying sub-dense local neighborhoods in high dimensions. To improve the primary method based on rare document frequency, we present several novel ensemble approaches that use the ranking concept to reduce false identifications while finding a higher number of true outliers. Extensive empirical analysis shows that the proposed method and its ensemble variations improve the quality of outlier detection in document repositories and scale well compared to the relevant benchmarking methods.
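The idea of scoring documents by rare terms and then ranking them can be sketched as follows. The abstract does not define rare document frequency precisely, so the scoring rule here (the share of a document's distinct terms that appear in at most `max_df` documents), the function names, and the toy corpus are assumptions for illustration rather than the article's actual formulation.

```python
from collections import Counter

def rare_term_scores(docs, max_df=1):
    """Share of each document's distinct terms that occur in at most `max_df` documents.

    Assumed scoring rule for illustration; the article's definition may differ.
    """
    token_sets = [set(d.lower().split()) for d in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(t for s in token_sets for t in s)
    scores = []
    for s in token_sets:
        rare = sum(1 for t in s if df[t] <= max_df)
        scores.append(rare / len(s) if s else 0.0)
    return scores

def rank_outliers(docs, max_df=1):
    """Document indices ordered from most to least outlier-like."""
    scores = rare_term_scores(docs, max_df)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

# Hypothetical toy corpus: three finance snippets and one off-topic document.
docs = [
    "stock market prices rise",
    "stock prices fall as market dips",
    "market prices rise again",
    "recipe for lemon cake",
]
scores = rare_term_scores(docs)
ranking = rank_outliers(docs)
```

Because the score depends only on document frequencies, it avoids pairwise distance computations, which is one plausible reason a frequency-based signal suits sparse, high-dimensional text. The ranking-based neighborhoods and ensembles described above are not reproduced in this sketch.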