The visual system can rapidly summarize multiple objects in the form of ensemble statistics: for example, people can easily estimate the average size of apples on a tree. To accomplish this, it is not always enough to summarize all the available visual information. If a scene contains various types of objects, the visual system must select the relevant subset: only the apples, not the leaves. Here, we ask: what is the representational basis of ensemble selection, i.e., what kind of visual information makes a ‘good’ ensemble that can be selectively attended to in order to provide an accurate summary estimate? We tested three candidate representations: basic features, preattentive object files, and full-fledged bound objects. In four experiments, we presented a target set and several distractor sets of differently colored objects. We found that conditions in which the target ensemble had at least one unique color (a basic feature) yielded ensemble averaging performance comparable to that in baseline displays without distractors. When the target subset was defined by a conjunction of two colors, or of color and shape, that was partly shared with distractors (so that the subsets could be differentiated only as preattentive object files), subset averaging was also possible but less accurate than in the baseline and feature conditions. Finally, performance was very poor when the target subset was defined by an exact feature relationship, such as a spatial conjunction of two colors (a spatially bound object). Overall, these results suggest that distinguishable features and, to a lesser degree, preattentive object files can serve as the representational basis of ensemble selection, whereas bound objects cannot.