The increasing density of silicon processes, coupled with the development of ever more energy and space efficient embedded core designs, has led to multi-processor system-on-chip (MPSoC) designs becoming increasingly attractive for use in embedded systems. Unfortunately this increase in core count gives rise to an explosion in design space possibilities, especially when heterogeneous designs are considered. To address this problem, new techniques in simulation are required to increase the simulation performance of these systems, while maintaining the accuracy needed to make good design decisions, and to verify the performance characteristics for realtime systems. We present a new high-speed, near cycle-accurate simulator, addressing an important but neglected category of multicore systems: deeply-embedded cacheincoherent MPSoCs. We take advantage of the unique properties of these systems to relax synchronisation constraints and increase the parallelism of the simulation. In doing so we achieve performance not possible using previous simulation techniques, without compromising the accuracy of the results. Quantitative performance results are presented across a large range of simulated MPSoC designs, comprising 1-64 cores, on average we simulate at 5.7 MIPS, with simulation speeds reaching 377 MIPS in the best case. Comparing against FPGA implementations we demonstrate that the simulator manages this with an average timing error of only 2.1%. Applying some of B Christopher Thompson