If you ask a human to describe an image, they might do so in a thousand different ways. Image captioning models, on the other hand, are traditionally trained to generate a single "best" (most reference-like) caption. Unfortunately, doing so encourages captions that are informationally impoverished: such captions often focus on only a subset of the possible details, while ignoring other potentially useful information in the scene. In this work, we introduce a simple, yet novel, method: "Image Captioning by Committee Consensus" (IC³), designed to generate a single caption that captures details from multiple viewpoints, by sampling from the learned semantic space of a base captioning model and carefully leveraging a large language model to synthesize these samples into a single comprehensive caption. Our evaluations show that humans rate captions produced by IC³ as more helpful than those produced by SOTA models more than two-thirds of the time, and that IC³ improves the performance of SOTA automated recall systems by up to 84%, outperforming single human-generated reference captions and indicating significant improvements over SOTA approaches for visual description. Code and resources are available at https://davidmchan.github.io/caption-by-committee.
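
To make the two-stage pipeline concrete, the following is a minimal sketch of what sampling a "committee" of captions and synthesizing them might look like. It assumes BLIP (via HuggingFace Transformers) as the base captioning model and the OpenAI chat API as one possible summarizer; the specific models, sampling parameters, and prompt shown here are illustrative assumptions, not the paper's exact configuration.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from openai import OpenAI

# Illustrative base captioner; the actual paper's base model may differ.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)


def sample_committee_captions(image: Image.Image, k: int = 10) -> list[str]:
    """Draw k diverse captions from the base model's learned semantic space."""
    inputs = processor(images=image, return_tensors="pt")
    outputs = captioner.generate(
        **inputs,
        do_sample=True,           # nucleus sampling for diversity, not beam search
        top_p=0.9,
        num_return_sequences=k,
        max_new_tokens=40,
    )
    return [processor.decode(o, skip_special_tokens=True) for o in outputs]


def synthesize_caption(captions: list[str]) -> str:
    """Ask an LLM to fuse the committee's samples into one comprehensive caption."""
    bullets = "\n".join(f"- {c}" for c in captions)
    prompt = (
        "The following captions all describe the same image, each noticing "
        "different details. Write a single comprehensive caption that "
        f"combines them:\n{bullets}"
    )  # hypothetical prompt; the paper's actual prompt may differ
    client = OpenAI()  # any chat-completion LLM could be substituted here
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


image = Image.open("example.jpg")
print(synthesize_caption(sample_committee_captions(image)))
```

The key design point is that diversity comes from sampling the captioner (rather than taking its single highest-likelihood output), while coherence comes from the LLM's summarization step.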