
The Counterintuitive Finding
In a preprint posted on arXiv on May 22, 2026, researchers Nick Merrill, Jaeho Lee, and Ezra Karger present a startling result: more capable large language models (LLMs) make worse forecasts precisely when those forecasts matter most. The paper, titled 'Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most,' examines LLM performance on prediction tasks where the cost of error is high. According to the study, as model capability—measured by standard benchmarks—increases, calibration on consequential forecasts declines, leading to overconfident and less accurate predictions. This finding runs counter to the prevailing assumption that scaling model size and training data uniformly improves all downstream behaviors.
Why Capability May Be a Liability

The researchers designed a series of forecasting tasks drawn from domains such as geopolitical events, financial markets, and public health outcomes. They compared several LLMs of varying capability, including models from the GPT and LLaMA families, and measured both accuracy and calibration (the alignment between predicted confidence and actual correctness). The results show a clear inverted U-shape: models of moderate capability achieve the best calibration, while the most capable models become overconfident, producing high-confidence predictions that are no more accurate, and often less accurate, than those of smaller models. The effect is most pronounced in high-stakes scenarios, where miscalibrated confidence is most costly. The authors note that this pattern is consistent across different model families and prompt formats, suggesting it is not an artifact of a particular architecture.
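To make the distinction between accuracy and calibration concrete, here is a minimal Python sketch of two common calibration measures, the Brier score and a binned expected calibration error (ECE). It is an illustration only, not the paper's evaluation code: the function names, the ten-bin default, and the toy forecasts are all assumptions made for this example.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average gap between mean forecast and observed frequency per bin."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, o))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(o for _, o in members) / len(members)
        ece += (len(members) / len(probs)) * abs(mean_p - freq)
    return ece


# Hypothetical data (not from the paper): 7 of 10 events occurred.
outcomes = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
overconfident = [0.95] * 10  # always says "95% likely"
hedged = [0.70] * 10         # always says "70% likely"

for name, probs in [("overconfident", overconfident), ("hedged", hedged)]:
    print(
        name,
        round(brier_score(probs, outcomes), 4),
        round(expected_calibration_error(probs, outcomes), 4),
    )
```

Both toy forecasters would be right 70% of the time if their forecasts were thresholded at 0.5, so an accuracy benchmark cannot tell them apart. The calibration measures can: the overconfident forecaster scores a Brier score of about 0.27 and an ECE of about 0.25, against roughly 0.21 and 0 for the hedged one (lower is better for both).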
Possible Explanations
The paper offers several hypotheses for why increased capability might degrade forecasting performance. One possibility is that more capable models have been extensively fine-tuned to appear confident and helpful, which can lead to overconfidence in novel or ambiguous situations. Another is that the training data for large models includes a disproportionate amount of confident assertions, reinforcing a bias toward certainty. The authors also point to the phenomenon of 'global competence' masking local weaknesses: a model that excels at many tasks may be trusted to output reliable probabilities even in domains where its actual knowledge is limited. The study includes error analyses showing that the most capable models often generate plausible-sounding but factually incorrect reasoning to justify their forecasts, a behavior reminiscent of the 'bullshit' problem identified in earlier LLM research.

Implications for AI Deployment
If confirmed by further research, this finding has significant implications for deploying LLMs in high-stakes decision-making roles. Systems used for geopolitical forecasting, medical diagnosis, or financial risk assessment often assume that more powerful models yield better predictions. The Merrill et al. study suggests the opposite may be true in precisely the cases that matter most. The authors recommend that practitioners prioritize calibration-aware evaluation over raw accuracy benchmarks when deploying LLMs for forecasting. They also caution against using 'capability' as a proxy for trustworthiness in critical applications. The paper aligns with a growing body of work on the risks of over-alignment and sycophancy in large models, reinforcing the need for robust uncertainty quantification in production systems.
What's Next
The preprint is under review for a major AI conference, and the researchers have released code and evaluation frameworks to allow replication and extension of their results. Future work is likely to include interventions to improve calibration in capable models, such as specialized training objectives, uncertainty-aware prompting, or post-hoc recalibration. For now, the study serves as a cautionary tale: in the rush to scale, the AI community must not assume that bigger and better automatically means more reliable. The most useful forecast may come from the model that knows what it does not know.
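Of the interventions mentioned above, post-hoc recalibration is the simplest to illustrate. The Python sketch below uses temperature scaling on the log-odds of a binary forecast, fitted by grid search on already-resolved questions. This is one generic recalibration technique chosen for illustration, not a method taken from the paper; the function names, the grid range, and the toy data are assumptions.

```python
import math
from typing import Optional, Sequence


def _logit(p: float, eps: float = 1e-6) -> float:
    p = min(max(p, eps), 1 - eps)  # keep log-odds finite at 0 and 1
    return math.log(p / (1 - p))


def _sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))


def rescale(p: float, temperature: float) -> float:
    """Divide the forecast's log-odds by a temperature; T > 1 pulls it toward 0.5."""
    return _sigmoid(_logit(p) / temperature)


def log_loss(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Average negative log-likelihood of the 0/1 outcomes under the forecasts."""
    eps = 1e-12
    total = 0.0
    for p, o in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total -= math.log(p) if o == 1 else math.log(1 - p)
    return total / len(probs)


def fit_temperature(
    probs: Sequence[float],
    outcomes: Sequence[int],
    grid: Optional[Sequence[float]] = None,
) -> float:
    """Pick the temperature from a grid that minimises log loss on held-out data."""
    if grid is None:
        grid = [0.5 + 0.05 * i for i in range(91)]  # 0.5, 0.55, ..., 5.0
    return min(
        grid,
        key=lambda t: log_loss([rescale(p, t) for p in probs], outcomes),
    )


# Hypothetical held-out set: the model said 95% each time, but only 7 of 10 occurred.
held_out_probs = [0.95] * 10
held_out_outcomes = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

temperature = fit_temperature(held_out_probs, held_out_outcomes)
print(temperature, rescale(0.95, temperature))
```

On this toy set the fitted temperature comes out well above 1, and a raw 95% forecast is pulled back to roughly 70%, matching the observed hit rate. In practice the temperature would be fitted on one set of resolved questions and applied to new ones; a single scalar cannot fix a model that is overconfident in some domains and underconfident in others.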