The evaluation of cache-based systems demands careful simulation of entire benchmarks, so simulation efficiency is essential to realistic evaluation. For systems with large caches and large numbers of processors, simulation is often too slow to be practical. In particular, optimizing the design of a cache for a multiprocessor is very complex with current techniques. This paper addresses these problems. First, we introduce necessary and sufficient conditions for cache inclusion in systems with invalidations. Second, under cache inclusion, we show that an accurate trace for a given processor or for a cluster of processors can be extracted from a multiprocessor trace. With this methodology, possible cache architectures for a processor or for a cluster of processors are evaluated independently of the rest of the system, resulting in a drastic reduction of the trace length and simulation complexity. Moreover, many important system-wide metrics can be estimated with good accuracy by extracting the traces of a set of randomly selected processors, an approach we call processor sampling. We demonstrate the accuracy and efficiency of these techniques by applying them to three 64-processor traces.
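As a rough illustration of trace extraction and processor sampling, the following Python sketch filters a multiprocessor trace down to the records that can affect one processor's cache. It is a minimal sketch under stated assumptions, not the method developed in this paper: the record layout `Ref`, the helper names `extract_trace` and `sample_processors`, and the write-invalidate rule (every remote write invalidates the block) are all hypothetical choices made for the example.

```python
import random
from typing import NamedTuple


class Ref(NamedTuple):
    """One trace record (hypothetical layout, assumed for this sketch)."""
    proc: int   # id of the issuing processor
    op: str     # "R" (read), "W" (write), or "INV" (extracted invalidation)
    block: int  # memory block address


def extract_trace(trace, target):
    """Reduce a multiprocessor trace to the records relevant to `target`.

    Assumes a write-invalidate protocol: a remote write matters only if
    the target may still hold a copy of the written block.
    """
    out = []
    maybe_cached = set()  # blocks referenced by target since their last invalidation
    for ref in trace:
        if ref.proc == target:
            out.append(ref)
            maybe_cached.add(ref.block)
        elif ref.op == "W" and ref.block in maybe_cached:
            # Keep the remote write as an invalidation of the target's copy.
            out.append(Ref(ref.proc, "INV", ref.block))
            maybe_cached.discard(ref.block)
    return out


def sample_processors(num_procs, k, seed=0):
    """Pick k distinct processor ids uniformly at random (processor sampling)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_procs), k))
```

Each sampled processor's extracted trace can then be fed to a single-cache simulator, and a system-wide metric estimated from the per-processor results of the sample. How well this simplified filter matches the extraction conditions established in the paper is not claimed here.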