We examine whether a simple quantitative measure of language can be used to predict individual firms' accounting earnings and stock returns. Our three main findings are: (1) the fraction of negative words in firm-specific news stories forecasts low firm earnings; (2) firms' stock prices briefly underreact to the information embedded in negative words; and (3) the earnings and return predictability from negative words is largest for the stories that focus on fundamentals. Together these findings suggest that linguistic media content captures otherwise hard-to-quantify aspects of firms' fundamentals, which investors quickly incorporate into stock prices. Copyright (c) 2008 by The American Finance Association.
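The measure described above can be illustrated with a minimal sketch. The tiny word list below is a hypothetical stand-in for a full negative-word lexicon (the paper uses an established dictionary, which is not reproduced here):

```python
# Hedged sketch: compute the fraction of negative words in a news story.
# NEGATIVE_WORDS is an illustrative stand-in for a real lexicon.
NEGATIVE_WORDS = {"loss", "decline", "lawsuit", "weak", "failure"}

def negative_fraction(story: str) -> float:
    """Fraction of a story's words that appear in the negative-word list."""
    words = [w.strip(".,!?;:").lower() for w in story.split()]
    if not words:
        return 0.0
    return sum(w in NEGATIVE_WORDS for w in words) / len(words)

# "loss" and "weak" are 2 of the 6 words in this headline.
score = negative_fraction("Quarterly loss widens amid weak demand")
```

A higher score for a firm's news stories would, per the findings above, forecast lower earnings for that firm.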
Abstract. We analyze the Relational Neighbor (RN) classifier, a simple relational predictive model that predicts based only on the class labels of related neighbors, using no learning and no inherent attributes. We show that it performs surprisingly well by comparing it to more complex models, such as Probabilistic Relational Models and Relational Probability Trees, on three data sets from published work. We argue that a simple model such as this should be used as a baseline when assessing the performance of relational learners.

Motivation

In recent years, we have seen remarkable advances in algorithms for relational learning, especially statistically based algorithms. These algorithms have been developed in a wide variety of research fields and problem settings. Relational data differ from traditional data in that they violate the instance-independence assumption. Instances can be related, or linked, in various ways, and the label of an instance may depend on the instances it is related to, either directly or through arbitrarily long chains of relations. This relational structure further complicates matters: it makes it harder, if not impossible, to separate the data cleanly into training and test sets without losing much relational information. Recent work has begun to investigate foundational issues within relational learning, such as the dimensions along which learners can be compared [11,14,25] as well as issues of link dependencies [13].
We broaden these investigations by describing a baseline method against which relational learners should be compared when assessing how well they have extracted a useful model from the given relational structure, beyond what can be achieved by looking only at the known class labels of related neighbors.

Recent probabilistic relational learning algorithms, e.g., Probabilistic Relational Models (PRMs) [16,10,27], Relational Probability Trees (RPTs) [22] and Relational Bayesian Classifiers (RBCs) [23], search the relational space for useful attributes and relational structure of neighbors (possibly more than one link away). While other relational learning algorithms are available [7,9,6], we focus in this paper on the three named algorithms. We know from classical machine learning that even very simple statistical methods, such as naive Bayes, can perform remarkably well when compared to more complex methods. However, a question that has yet to receive much attention is how much of the performance of relational learners is due to their complexity and how much can
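The weighted-vote form of the Relational Neighbor classifier can be sketched in a few lines: each class is scored by the total edge weight of a node's already-labeled neighbors carrying that class. The data layout here (an edge list of `(u, v, weight)` triples and a label dictionary) is an illustrative choice, not the representation from the original work:

```python
from collections import defaultdict

def rn_classify(node, edges, labels):
    """Weighted-vote Relational Neighbor: score each class by the total
    edge weight of already-labeled neighbors carrying that class.
    Returns the top class and the normalized class scores."""
    scores = defaultdict(float)
    for u, v, w in edges:
        nbr = v if u == node else u if v == node else None
        if nbr is not None and nbr in labels:
            scores[labels[nbr]] += w
    if not scores:
        return None  # no labeled neighbors to vote
    total = sum(scores.values())
    return max(scores, key=scores.get), {c: s / total for c, s in scores.items()}
```

For example, a node whose labeled neighbors contribute weight 2.5 to class "pos" and 2.0 to class "neg" is classified "pos" with score 2.5/4.5. Note the model uses no node attributes and does no training, which is exactly why it makes a useful baseline.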
Active and semi-supervised learning are important techniques when labeled data are scarce. Recently a method was suggested for combining active learning with a semi-supervised learning algorithm that uses Gaussian fields and harmonic functions. This classifier is relational in nature: it relies on having the data presented as a partially labeled graph (also known as a within-network learning problem). That work showed yet again that empirical risk minimization (ERM) was the best method for finding the next instance to label, and provided an efficient way to compute ERM with the semi-supervised classifier. The computational problem with ERM is that it relies on computing the risk for all possible instances. If we could limit the candidates that should be investigated, then we could speed up active learning considerably. When the data are graphical in nature, we can leverage the graph structure to rapidly identify instances that are likely to be good candidates for labeling. This paper describes a novel hybrid approach: using community finding and social-network-analytic centrality measures to identify good candidates for labeling, and then using ERM to find the best instance within this candidate set. We show on real-world data that we can limit the ERM computations to a fraction of the instances with comparable performance.
Keywords: active learning, statistical relational learning, semi-supervised learning, social network analysis, betweenness centrality, closeness centrality, community finding, clustering, empirical risk minimization, within-network learning

MOTIVATION

Active learning and semi-supervised learning are both important techniques when labeled data are scarce and unlabeled data are abundant. Active learning targets the situation where obtaining labels is costly, so the question is which instance(s) to label in order to learn the best model. In such a scenario, we let the learning algorithm pick a set of unlabeled instances to be labeled by an oracle (e.g., a human), which will then be used as (or to augment) the labeled data set. In other words, we let the learning algorithm tell us which instances to label, rather than selecting them randomly. Active learning is so named because the learner actively asks for more labels in order to increase its efficacy, thereby minimizing the amount of labeled data needed to obtain a good model. Semi-supervised learning takes an approach orthogonal to active learning and instead uses unlabeled data to help supervised learning tasks. The name "...
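The hybrid scheme described above can be sketched as a two-stage selection. In this sketch, degree centrality is a deliberately cheap stand-in for the betweenness/closeness measures and community finding used in the paper, and `risk` is a caller-supplied stand-in for the ERM risk estimate:

```python
def select_query(adjacency, labeled, risk, k=3):
    """Hybrid active-learning sketch: shortlist central unlabeled nodes,
    then run the expensive ERM-style risk estimate only on the shortlist.

    adjacency: dict mapping node -> list of neighbor nodes
    labeled:   set of nodes whose labels are already known
    risk:      callable estimating expected error after labeling a node
               (stand-in for the true ERM computation)
    """
    unlabeled = [n for n in adjacency if n not in labeled]
    # Stage 1: cheap graph measure (degree centrality here) picks k candidates.
    candidates = sorted(unlabeled, key=lambda n: len(adjacency[n]), reverse=True)[:k]
    # Stage 2: ERM restricted to the shortlist picks the minimum-risk node.
    return min(candidates, key=risk)
```

The point of the design is that stage 2, the costly part, now runs over `k` nodes rather than all unlabeled instances, which is exactly the speed-up the paper targets.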