A growing body of research indicates that forecasting skill is a unique and stable trait: forecasters with a track record of high accuracy tend to maintain this record. But how does one identify skilled forecasters effectively? We address this question using data collected during two seasons of a longitudinal geopolitical forecasting tournament. Our first analysis, which compares psychometric traits assessed prior to forecasting, indicates that intelligence consistently predicts accuracy. Next, using methods adapted from classical test theory and item response theory, we model latent forecasting skill based on the forecasters’ past accuracy, while accounting for the timing of their forecasts relative to question resolution. Our results suggest these methods assess forecasting skill better than the simpler methods employed by many previous studies. By parsing the data at different time points during the competitions, we assess the relative importance of each information source over time. When past performance information is limited, psychometric traits are useful predictors of future performance, but, as more information becomes available, past performance becomes the stronger predictor of future accuracy. Finally, we demonstrate the predictive validity of these results on out-of-sample data, and their utility in producing performance weights for wisdom-of-crowds aggregations.
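A minimal sketch of how accuracy-based performance weights might feed a wisdom-of-crowds aggregate, assuming binary events scored with the Brier score; the inverse-score weighting, function names, and data below are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes (lower is better)."""
    return np.mean((np.asarray(forecasts) - np.asarray(outcomes)) ** 2)

def performance_weights(past_forecasts, past_outcomes):
    """One weight per forecaster: more accurate (lower Brier) forecasters get larger weights."""
    scores = np.array([brier_score(f, past_outcomes) for f in past_forecasts])
    raw = 1.0 / (scores + 1e-6)      # invert so accurate forecasters dominate
    return raw / raw.sum()           # normalize weights to sum to 1

def weighted_crowd_forecast(new_forecasts, weights):
    """Performance-weighted average of the crowd's probabilities for a new question."""
    return float(np.dot(weights, new_forecasts))

# Example: three forecasters, four resolved binary questions, one new question.
past = [[0.9, 0.2, 0.7, 0.1],   # forecaster A
        [0.6, 0.5, 0.5, 0.4],   # forecaster B
        [0.8, 0.1, 0.9, 0.2]]   # forecaster C
resolved = [1, 0, 1, 0]
w = performance_weights(past, resolved)
print(weighted_crowd_forecast([0.7, 0.5, 0.8], w))
```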
Human forecasts and other probabilistic judgments can be improved by elicitation and aggregation methods. Recent work on elicitation shows that deriving probability estimates from relative judgments (the ratio method) is advantageous, whereas other recent work on aggregation shows that it is beneficial to transform probabilities into coherent sets (coherentization) and to weight judges' assessments by their degree of coherence. We report an experiment that links these areas by examining the effect of coherentization and multiple forms of coherence weighting, under both direct and ratio elicitation, on the accuracy of probability judgments (both forecasts and events with known distributions). We found that coherentization invariably improves accuracy. Moreover, judges' levels of probabilistic coherence are related to their judgment accuracy. Therefore, coherence weighting can improve judgment accuracy, but the strength of the effect varies among elicitation and weighting methods. In addition, the benefit of coherence weighting is stronger for “calibration” items that served as a basis for establishing the weights than for unrelated “test” items. Finally, echoing earlier research, we found overconfidence in judgment, and the degree of overconfidence was comparable between the two elicitation methods.
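A minimal sketch of coherentization and coherence weighting for the simplest case, a set of mutually exclusive and exhaustive events whose probabilities should sum to 1; proportional renormalization and the distance-from-coherence weighting below are illustrative choices, not necessarily the procedures used in the experiment.

```python
import numpy as np

def coherentize(judged):
    """Force probabilities of mutually exclusive, exhaustive events to sum to 1
    by proportional renormalization (one simple form of coherentization)."""
    judged = np.asarray(judged, dtype=float)
    return judged / judged.sum()

def incoherence(judged):
    """How far the raw judgments are from additive coherence."""
    return abs(np.sum(judged) - 1.0)

def coherence_weighted_aggregate(all_judgments):
    """Average the judges' coherentized probabilities, giving more coherent
    judges (smaller incoherence) larger weights."""
    coherent = np.array([coherentize(j) for j in all_judgments])
    weights = np.array([1.0 / (1.0 + incoherence(j)) for j in all_judgments])
    weights /= weights.sum()
    return weights @ coherent

# Three judges assess the same three-outcome partition of events.
judgments = [[0.5, 0.3, 0.3],   # sums to 1.1 -> mildly incoherent
             [0.4, 0.4, 0.2],   # already coherent
             [0.7, 0.5, 0.4]]   # sums to 1.6 -> heavily incoherent, downweighted
print(coherence_weighted_aggregate(judgments))
```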
In most forecasting contexts, each target event has a resolution time point at which the “ground truth” is revealed or determined. It is reasonable to expect that as time passes, and information relevant to the event resolution accrues, the accuracy of individual forecasts will improve. For example, we expect forecasts about stock prices on a given date to become more accurate as that date approaches, or forecasts about sport tournament winners to become more accurate as the tournament progresses. This time dependence presents several issues for extracting the wisdom of crowds, and for optimizing differential weights when members of the crowd forecast the same event at different times. In this chapter, we discuss the challenges associated with this time dependence and survey various solutions, evaluating their quality in terms of collective accuracy. To illustrate, we use data from the Hybrid Forecasting Competition, where volunteer non-professional forecasters predicted multiple geopolitical events with time horizons of several weeks or months, as well as data from the European Central Bank’s Survey of Professional Forecasters, which covers only a few select macroeconomic indices but much longer time horizons (in some cases, several years). We address the problem of forecaster assessment by showing how model-based methods can be used as an alternative to proper scoring rules for evaluating the accuracy of individual forecasters and how information aggregation can balance forecast recency against sufficient crowd size; we also explore the relationship between crowd size, forecast timing, and aggregate accuracy. Finally, we provide recommendations both for managers seeking to select the best analysts from the crowd and for aggregators looking to make the most of the overall crowd wisdom.
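A minimal sketch of one way aggregation could trade off forecast recency against crowd size, assuming older forecasts are down-weighted exponentially with their age at resolution; the half-life constant and the effective-crowd-size fallback are illustrative assumptions, not the chapter's specific procedure.

```python
import numpy as np

def recency_weighted_mean(probs, days_before_resolution, half_life_days=10.0,
                          min_effective_crowd=3.0):
    """Aggregate probability forecasts made at different times for one event.

    Older forecasts decay exponentially; if too little recent information is
    available (small effective crowd size), fall back to an unweighted mean.
    """
    probs = np.asarray(probs, dtype=float)
    age = np.asarray(days_before_resolution, dtype=float)
    weights = 0.5 ** (age / half_life_days)              # halve the weight every half_life_days
    effective_crowd = weights.sum() ** 2 / (weights ** 2).sum()
    if effective_crowd < min_effective_crowd:
        return float(probs.mean())                       # too few recent forecasts: weight everyone equally
    return float(np.dot(weights, probs) / weights.sum())

# Five forecasts for the same event, made 1 to 60 days before resolution.
print(recency_weighted_mean([0.8, 0.75, 0.6, 0.5, 0.4], [1, 3, 10, 30, 60]))
```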
Who is good at prediction? Addressing this question is key to recruiting and cultivating accurate crowds and effectively aggregating their judgments. Recent research on superforecasting has demonstrated the importance of individual, persistent skill in crowd prediction. This chapter takes stock of skill-identification measures in probability estimation tasks and complements the review with original analyses that compare such measures directly within the same dataset. We classify all measures into five broad categories: 1) accuracy-related measures, such as proper scores, model-based estimates of accuracy, and excess volatility scores; 2) intersubjective measures, including proxy, surrogate, and similarity scores; 3) forecasting behaviors, including activity, belief updating, extremity, coherence, and linguistic properties of rationales; 4) dispositional measures of fluid intelligence, cognitive reflection, numeracy, personality, and thinking styles; and 5) measures of expertise, including demonstrated knowledge, confidence calibration, and biographical and self-rated expertise. Among non-accuracy-related measures, we report a median correlation with outcomes of r = 0.20. In the absence of accuracy data, we find that intersubjective and behavioral measures are most strongly correlated with forecasting accuracy. These results hold in a LASSO machine-learning model with automated variable selection. Two focal applications provide context for these assessments: long-term, existential risk prediction and corporate forecasting tournaments.
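A minimal sketch of the kind of LASSO model with automated variable selection mentioned above, assuming a table of per-forecaster predictor measures and realized accuracy scores; the feature set and data here are synthetic placeholders, not the chapter's dataset.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic per-forecaster measures standing in for the categories reviewed above:
# e.g., an intersubjective score, belief-updating frequency, fluid intelligence, self-rated expertise.
X = rng.normal(size=(n, 4))
# Synthetic accuracy outcome (mean Brier score; lower is better), loosely driven by two predictors.
y = 0.25 - 0.03 * X[:, 0] - 0.02 * X[:, 1] + rng.normal(scale=0.02, size=n)

# LassoCV chooses the regularization strength by cross-validation and shrinks
# uninformative predictors' coefficients to exactly zero (automated variable selection).
model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(X, y)
print(model.named_steps["lassocv"].coef_)   # zeroed entries were dropped by the LASSO
```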
Forecasting geopolitical events is a notoriously difficult task, with experts failing to significantly outperform a random baseline across many types of events. One successful way to improve forecasting performance is crowdsourcing: leveraging many forecasts from non-expert users. Simultaneously, advances in machine learning have produced models that can generate reasonable, although not perfect, forecasts for many tasks. Recent efforts have shown that forecasts can be further improved by "hybridizing" human forecasters: pairing them with machine models in an effort to combine the unique advantages of both. In this demonstration, we present Synergistic Anticipation of Geopolitical Events (SAGE), a platform for human/computer interaction that facilitates human reasoning with machine models.
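A minimal sketch of the hybridization idea in its simplest form, blending a human crowd's aggregate forecast with a machine model's forecast; the fixed mixing weight is an illustrative assumption and not SAGE's actual combination method.

```python
def hybrid_forecast(human_probs, machine_prob, machine_weight=0.4):
    """Blend the human crowd's mean probability with a machine model's probability.

    machine_weight is an illustrative fixed mixing weight between 0 and 1.
    """
    crowd = sum(human_probs) / len(human_probs)
    return machine_weight * machine_prob + (1 - machine_weight) * crowd

# Crowd of four forecasters plus one statistical model for the same binary event.
print(hybrid_forecast([0.7, 0.6, 0.8, 0.65], machine_prob=0.55))
```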