Automating document flow, including the automatic classification of documents, is essential to the effective functioning of a university's educational environment. This article considers the classification of university documents by machine learning methods, with the aim of improving classification quality. Documents were preprocessed to isolate the significant words in each document, which increased classification accuracy. Two methods of extracting features from text, TF and TF-IDF, are described; both identify keywords from the frequency of the words in a document. A modification of the TF-IDF method is proposed in which the importance of a word depends on its part of speech. This improved classification quality by highlighting only the important, significant words in each document. A classification algorithm is proposed that uses the support vector method to reduce the number of documents involved in classification and the k-nearest neighbors method to perform the classification itself. The advantage of this algorithm over the described analogues is shown: it reduces the number of misclassified documents.
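As an illustration only, the part-of-speech modification of TF-IDF and the k-nearest neighbors step might be sketched as follows. This is not the authors' implementation: the weights in `POS_WEIGHTS`, the shifted IDF formula, the cosine similarity measure, the toy documents, and all function names are assumptions made for the example. Part-of-speech tags are assumed to be supplied by an external tagger, and the support-vector-based reduction of the training set is not shown.

```python
import math
from collections import Counter

# Hypothetical part-of-speech weights (assumed; the abstract gives no values).
POS_WEIGHTS = {"NOUN": 1.0, "VERB": 0.6, "ADJ": 0.4}
DEFAULT_WEIGHT = 0.1  # assumed weight for every other part of speech


def pos_weighted_tfidf(docs):
    """docs: list of documents, each a list of (word, pos_tag) pairs.
    Returns one sparse vector (dict: word -> weight) per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter()
    for doc in docs:
        df.update({word for word, _ in doc})
    vectors = []
    for doc in docs:
        counts = Counter(word for word, _ in doc)
        pos_of = {word: pos for word, pos in doc}  # last tag seen per word
        vec = {}
        for word, count in counts.items():
            tf = count / len(doc)
            # Shifted by 1 so words present in every document keep a weight.
            idf = math.log(n_docs / df[word]) + 1.0
            pos_weight = POS_WEIGHTS.get(pos_of[word], DEFAULT_WEIGHT)
            vec[word] = tf * idf * pos_weight
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, train_vectors, train_labels, k=3):
    """Majority label among the k training vectors most similar to query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy corpus (invented for the example): four labeled documents and one query.
train_docs = [
    [("order", "NOUN"), ("enrollment", "NOUN"), ("students", "NOUN"), ("the", "DET")],
    [("order", "NOUN"), ("expulsion", "NOUN"), ("students", "NOUN")],
    [("schedule", "NOUN"), ("lectures", "NOUN"), ("semester", "NOUN")],
    [("schedule", "NOUN"), ("exams", "NOUN"), ("semester", "NOUN"), ("the", "DET")],
]
train_labels = ["order", "order", "schedule", "schedule"]
query_doc = [("order", "NOUN"), ("transfer", "NOUN"), ("students", "NOUN"), ("the", "DET")]

# For simplicity the query is vectorized together with the training documents,
# so that all vectors share the same document frequencies.
vectors = pos_weighted_tfidf(train_docs + [query_doc])
predicted = knn_predict(vectors[-1], vectors[:-1], train_labels, k=3)
print(predicted)
```

In this sketch a determiner such as "the" receives one tenth of the weight of a noun with the same frequency, so it contributes little to the similarity between documents. In practice the weights would have to be tuned on real data.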