In this paper, we present Latent Dirichlet Allocation (LDA) in automatic text summarization to improve accuracy in document clustering. The experiments use a data set of 398 public blog articles collected with a Python Scrapy crawler and scraper. The clustering process in this research consists of several steps: preprocessing, automatic document compression using the feature method, automatic document compression using LDA, word weighting, and the clustering algorithm. The results show that automatic document summarization with LDA reaches 72% at LDA 40%, compared to the traditional k-means method, which reaches only 66%.
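The step sequence named in the abstract (compression, word weighting, clustering) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The sentence scorer is a plain term-frequency placeholder standing in for the LDA-based compression. The `ratio` parameter, its 0.4 default, and all function names are assumptions introduced here for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic tokens (a stand-in for preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


def compress(document, ratio):
    """Keep the top `ratio` share of sentences, in their original order.

    Assumption: sentences are ranked by average term frequency, a simple
    placeholder for the LDA-based compression described in the abstract.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    freq = Counter(tokenize(document))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    keep = max(1, round(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:keep]))


def tfidf_vectors(docs):
    """Word weighting: L2-normalised TF-IDF vectors over a shared vocabulary."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = [tf[t] * (math.log(len(docs) / df[t]) + 1.0) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def kmeans(vectors, k, iterations=20):
    """Plain k-means; assumes k <= len(vectors).

    The first k vectors serve as initial centroids so the result is deterministic.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_documents(docs, k, ratio=0.4):
    """Compress each document, weight its words, then cluster the summaries."""
    summaries = [compress(d, ratio) for d in docs]
    return kmeans(tfidf_vectors(summaries), k)
```

Calling `cluster_documents(docs, k=2)` returns one cluster label per document. Replacing the scorer inside `compress` with a topic-model-based one would bring the sketch closer to the approach the abstract describes.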
Keyword and keyphrase extraction is one of the initial foundations for several text processing operations, such as summarization and document clustering. YAKE is one of the techniques used for unsupervised and independent keyphrase extraction; it does not require a corpus or linguistic tools such as NER and POS tagging. However, applying YAKE to microblogging documents such as Twitter posts often yields keyphrases that are less representative because too few words are available for ranking. This paper offers a solution to this problem: during keyphrase extraction, it uses Doc2Vec to look for similar tweets so that more words are available to the YAKE ranking process. Covid-19-related tweets are used as the data set, since the topic is currently widely discussed on social media, to show that the proposed approach can improve keyphrase extraction performance.
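The expansion idea, pooling a short tweet with its most similar tweets before ranking, can be sketched as follows. This is a hedged illustration only. Bag-of-words cosine similarity stands in for Doc2Vec, and a plain frequency count stands in for YAKE's scoring. The stopword list, the function names, and the `k` and `top_n` parameters are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "is", "in", "of", "and", "to", "for"}


def tokenize(text):
    return [t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOPWORDS]


def bow_cosine(text_a, text_b):
    """Stand-in similarity: cosine between bag-of-words counts (not Doc2Vec)."""
    u, v = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def frequency_keywords(text, top_n):
    """Stand-in ranker: most frequent tokens, ties broken alphabetically (not YAKE)."""
    counts = Counter(tokenize(text))
    return [w for w, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]]


def extract_with_neighbours(tweets, index, k=2, top_n=3,
                            similarity=bow_cosine, extract=frequency_keywords):
    """Pool a tweet with its k most similar tweets, then rank keywords on the pool."""
    scored = [(similarity(tweets[index], t), i) for i, t in enumerate(tweets) if i != index]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]
    neighbours = [i for sim, i in ranked if sim > 0]  # skip unrelated tweets
    expanded = " ".join([tweets[index]] + [tweets[i] for i in neighbours])
    return extract(expanded, top_n)
```

The `similarity` and `extract` parameters are hooks, so an embedding-based similarity and a real keyphrase extractor could be passed in instead of the stand-ins while the pooling logic stays the same.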