Text classification is important for better understanding online media. A major obstacle to creating accurate text classifiers with machine learning is the small size of training sets, owing to the cost of annotation. We therefore investigated how SVM and NBSVM text classifiers should be designed to achieve high accuracy, and how training sets should be sized to use annotation labor efficiently. We used a four-way repeated-measures full-factorial design of 32 design factor combinations. For each combination, 22 training set sizes were examined; these training sets were subsets of seven public text datasets. We studied the statistical variance of accuracy estimates by randomly drawing new training sets, resulting in accuracy estimates for 98,560 experimental runs. Our major contribution is a set of empirically evaluated guidelines for creating online media text classifiers from small training sets. We recommend uni- and bi-gram features as the text representation, btc term weighting, and a linear-kernel NBSVM. Our results suggest that high classification accuracy can be achieved with a manually annotated dataset of only 300 examples.
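To illustrate the recommended design, the sketch below shows the two building blocks named in the abstract: uni- and bi-gram features and the log-count-ratio feature scaling at the core of NBSVM (in a full NBSVM these scaled features would then be fed to a linear SVM). This is a minimal, stdlib-only illustration on a toy dataset, not the authors' implementation; the function names and the toy examples are ours, and the btc term weighting is omitted for brevity.

```python
from collections import Counter
import math

def ngrams(tokens):
    # Uni- and bi-gram features, as recommended in the abstract.
    return list(tokens) + [" ".join(tokens[i:i + 2])
                           for i in range(len(tokens) - 1)]

def nb_log_count_ratios(docs, labels, alpha=1.0):
    # NBSVM core idea: for each feature f, compute the log ratio of its
    # smoothed relative frequency in positive vs. negative documents.
    # Features with r[f] > 0 indicate the positive class, r[f] < 0 the negative.
    pos, neg, vocab = Counter(), Counter(), set()
    for doc, y in zip(docs, labels):
        feats = set(ngrams(doc.split()))
        vocab |= feats
        (pos if y == 1 else neg).update(feats)
    p_total = sum(pos[f] + alpha for f in vocab)
    q_total = sum(neg[f] + alpha for f in vocab)
    return {f: math.log(((pos[f] + alpha) / p_total) /
                        ((neg[f] + alpha) / q_total)) for f in vocab}

# Toy example (hypothetical data): positive label 1, negative label 0.
docs = ["great product", "terrible service", "great service", "terrible product"]
labels = [1, 0, 1, 0]
r = nb_log_count_ratios(docs, labels)
print(r["great"] > 0, r["terrible"] < 0)  # -> True True
```

In the full NBSVM, each document vector is multiplied element-wise by these ratios before training the linear SVM, which is what gives the model its robustness on small training sets.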
Corporate reputation is an economic asset, and its accurate measurement is of increasing interest in practice and science. This measurement task is difficult because reputation depends on numerous factors and stakeholders. Traditional measurement approaches have relied on human ratings and surveys, which are costly, can be conducted only infrequently, and emphasize the financial aspects of a corporation. Nowadays, online media with comments on products, services, and corporations provide an abundant source for measuring reputation more comprehensively. Against this backdrop, we propose an information retrieval approach that automatically collects reputation-related text content from online media and analyzes it with machine learning-based sentiment analysis. We contribute an ontology for identifying corporations and a unique dataset of online media texts labelled by corporations' reputation. Our approach achieves an overall accuracy of 84.4%. Our results help corporations identify their reputation from online media quickly and at low cost.
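The retrieval step described above hinges on recognizing which corporation a text refers to. A minimal sketch of that idea, assuming a toy alias ontology of our own invention (the real ontology contributed by the paper is far richer), could look like this:

```python
import re

# Hypothetical mini-ontology: canonical corporation name -> known aliases.
ONTOLOGY = {
    "Volkswagen": ["volkswagen", "vw"],
    "Siemens": ["siemens"],
}

def identify_corporations(text):
    # Return every corporation whose alias appears as a whole word in the text.
    lowered = text.lower()
    found = set()
    for corp, aliases in ONTOLOGY.items():
        for alias in aliases:
            if re.search(r"\b" + re.escape(alias) + r"\b", lowered):
                found.add(corp)
    return sorted(found)

print(identify_corporations("My new VW broke down, unlike my Siemens fridge."))
# -> ['Siemens', 'Volkswagen']
```

Texts matched this way would then be passed to the sentiment classifier to score each corporation's reputation.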
In order to better align existing and future ICT implementations in the health domain with the strategic options defined by the National Plan for Health Development, the Ministry of Health (MoH) of Burundi initiated the development of a national e-health enterprise architecture in 2014, based on the TOGAF methodology. The first part of the development cycle consisted of a detailed analysis of regulatory documents and strategic plans related to the Burundian health system. In the second part, semi-structured interviews were organized with a representative sample of relevant MoH health structures. The study demonstrated the donor-driven, unequal distribution of hardware equipment across health administration components and health facilities. Internet connectivity remains problematic, and few health-oriented business applications have found their way into the Burundian health system. Paper-based instruments remain predominant in Burundi's health administration. The study also identified a series of problems introduced by the uncoordinated development of health ICT in Burundi, such as the lack of standardization, data security risks, varying data quality, inadequate ICT infrastructure, an unregulated e-health sector, and insufficient human capacity. The results confirm the challenging situation of the Burundian health information system, but they also expose a number of bright spots that provide hope for the future: a political will to reclaim MoH leadership in the health information management domain, the readiness to develop e-health education and training programs, and the opportunity to capitalize on the experiences with DHIS2 deployment, results-based financing monitoring, and hospital information management system implementation.
Purpose: Machine learning (ML) models are increasingly being used in industrial maintenance to predict system failures. However, less is known about how the time windows for reading data and making predictions affect performance. Therefore, the purpose of this research is to assess the impact of different sliding windows on prediction performance.
Design/methodology/approach: The authors conducted a factorial experiment using high-dimensional machine data covering two years of operation, taken from a real industrial case for the production of high-precision milled and turned parts. The impacts of different reading and prediction windows were tested for three ML algorithms (random forest, support vector machines and logistic regression) and four metrics (accuracy, precision, recall and F-score).
Findings: The results reveal (1) the critical role of the prediction window contingent upon the application domain, (2) a non-monotonic relationship between the reading window and performance, and (3) how sliding window selection can systematically be used to improve different facets of performance.
Originality/value: The study's findings advance the knowledge of ML-based failure prediction by highlighting how the systematic variation of two important yet understudied factors contributes to the development of more useful prediction models.
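The reading/prediction window setup the abstract varies can be sketched as follows: each training example pairs a reading window (the span of past data the model sees) with a prediction window (the horizon whose failure label it must predict). This is a generic illustration with hypothetical parameters, not the study's actual preprocessing code.

```python
def sliding_windows(series, read_len, pred_len, step=1):
    # Slide over the series, pairing each reading window (model input)
    # with the prediction window (horizon to be labeled) that follows it.
    pairs = []
    for start in range(0, len(series) - read_len - pred_len + 1, step):
        read = series[start:start + read_len]
        pred = series[start + read_len:start + read_len + pred_len]
        pairs.append((read, pred))
    return pairs

# Toy series standing in for two years of sensor readings.
data = list(range(10))
for read, pred in sliding_windows(data, read_len=4, pred_len=2, step=3):
    print(read, "->", pred)
# -> [0, 1, 2, 3] -> [4, 5]
# -> [3, 4, 5, 6] -> [7, 8]
```

Varying `read_len` and `pred_len` over a grid, as in the factorial experiment, is what exposes the non-monotonic effect of the reading window reported in the findings.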
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and describe whether the citing article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.