This research develops an architecture for investigating whether lexical divergences exist between articles categorized as Institute for Scientific Information (ISI) and non-ISI and, if such a difference is found, for proposing the best available classification method. Based on a collection of ISI- and non-ISI-indexed articles in the areas of business and computer science, three classification models are trained. A sensitivity analysis is applied to demonstrate the impact of words in different syntactical forms on the classification decision. The results demonstrate that the lexical domains of ISI and non-ISI articles are distinguishable by machine learning techniques. Our findings indicate that the support vector machine identifies ISI-indexed articles in both disciplines with higher precision than do the Naïve Bayesian and K-Nearest Neighbors techniques.
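To make the classification setup above concrete, here is a minimal sketch of one of the three model families the abstract names (Naïve Bayes), written in plain Python. It is not the study's implementation: the whitespace tokenizer, the add-one smoothing, the function names, and the four toy sentences with their labels are all assumptions invented for illustration, since the abstract does not describe its features or hyperparameters.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Fit a multinomial Naive Bayes model with add-one smoothing.

    labeled_docs: list of (text, label) pairs.
    Returns (log_priors, log_likelihoods, vocabulary).
    """
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)

    total_docs = sum(doc_counts.values())
    log_priors = {c: math.log(n / total_docs) for c, n in doc_counts.items()}
    log_likelihoods = {}
    for label in doc_counts:
        # Add-one smoothing: every vocabulary word gets a pseudo-count of 1.
        denom = sum(word_counts[label].values()) + len(vocabulary)
        log_likelihoods[label] = {
            w: math.log((word_counts[label][w] + 1) / denom) for w in vocabulary
        }
    return log_priors, log_likelihoods, vocabulary


def predict(model, text):
    """Return the label with the highest posterior log-score for text."""
    log_priors, log_likelihoods, vocabulary = model
    scores = {}
    for label, prior in log_priors.items():
        score = prior
        for token in tokenize(text):
            if token in vocabulary:  # out-of-vocabulary words are ignored
                score += log_likelihoods[label][token]
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy corpus: invented sentences, not data from the study.
TOY_DOCS = [
    ("we propose a robust framework and validate it empirically", "ISI"),
    ("results demonstrate significant empirical improvement", "ISI"),
    ("this paper talks about some ideas for a system", "non-ISI"),
    ("some ideas about a simple system are discussed", "non-ISI"),
]

model = train_naive_bayes(TOY_DOCS)
print(predict(model, "empirical results validate the framework"))
```

A comparison like the one the abstract reports would train the other two model families on the same features and compare per-class precision on held-out articles; the sketch only shows how word frequencies alone can drive the class decision.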
We investigate the identification and analysis of linguistic (lexico-grammatical) features that are characteristically used by articles of a specific year of publication. Linguistic features differ from shallow features because they represent authors' lexico-grammatical writing styles and do not rely on the well-known bag-of-words model. Current literature focuses on shallow features rather than on linguistic features, and existing methods for identifying linguistic features use well-known knowledge-structure based approaches. In contrast, we advance these existing methods by applying semantic clustering instead of knowledge-structure based approaches. For evaluation purposes, a linguistic feature-based prediction model is built to enable an automated assignment of articles to their years of publication. In a case study, the proposed methodology is applied to articles of the Springer book series 'Communications in Computer and Information Science' published from 2009 to 2013. The case study results show the feasibility of the proposed approach as compared to a frequently used baseline.
Keywords: Scientific articles, Linguistic features, Latent semantic indexing, Text mining.
INTRODUCTION
We investigate the occurrence of linguistic (lexico-grammatical) features in articles to show that they can be used for assigning articles to their years of publication. The literature shows related approaches that can be used to assign articles to a pre-defined class. A domain-specific vocabulary (key words) is often used for this classification task. Different domains can be well distinguished by the distribution of specific key words, as shown by existing bag-of-words approaches [1]-[5]. Further, trend analysis and bibliometric research also show that key word distributions can be used to identify a time period [6]. They trace topic changes over time within a domain. Thus, these approaches can estimate an article's publication year based on the topics used.
The approaches mentioned above are based on shallow (bag-of-words) features. They stand in contrast to linguistic features, such as specific word class distributions, that indicate authors' lexico-grammatical writing styles. The literature also shows the possibilities of using linguistic features for classification. [7] investigate the impact of linguistic features on different scientific disciplines and at different points in time. A further approach uses linguistic features for spam detection [8]. Both approaches are based on systemic functional linguistics, in which a knowledge-structure based classifier (e.g. support vector machine) is used.
We provide a new approach that identifies articles' linguistic features and investigates their usage at different points in time. In contrast to previous work, clustering is used instead of classification. Text classification assigns a text to given pre-defined classes. Classes are normally defined so that they cover all known linguistic features that are expected to occur within the given texts. Text clusteri...