Auditory scene analysis involves the simultaneous grouping and parsing of acoustic data into separate mental representations (i.e., objects). In two experiments, we examined the sequence of neural processes underlying concurrent sound segregation by recording human middle-latency auditory evoked responses. Participants were presented with complex sounds comprising several harmonics, one of which could be mistuned such that it was not an integer multiple of the fundamental frequency. In both experiments, Na (approximately 22 ms) and Pa (approximately 32 ms) waves were reliably generated for all classes of stimuli. For stimuli with a fundamental frequency of 200 Hz, the mean Pa amplitude was significantly larger when the third harmonic was mistuned by 16% of its original value than when it was tuned. The enhanced Pa amplitude was associated with an increased likelihood of reporting the presence of concurrent auditory objects. Our results are consistent with a low-level stage of auditory scene analysis in which acoustic properties such as mistuning act as preattentive segregation cues that can subsequently lead to the perception of multiple auditory objects.
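
As a minimal sketch of the stimulus manipulation described above, the code below synthesizes a harmonic complex tone with a 200 Hz fundamental and shifts the third harmonic upward by 16% of its original value (600 Hz to 696 Hz). The number of harmonics, duration, sampling rate, and equal harmonic amplitudes are illustrative assumptions only and are not taken from the experiments.

```python
import numpy as np

def complex_tone(f0=200.0, n_harmonics=10, mistuned_harmonic=3,
                 mistuning=0.0, duration=0.15, fs=48000):
    """Synthesize a harmonic complex tone; optionally shift one harmonic
    away from an integer multiple of the fundamental (illustrative values)."""
    t = np.arange(int(duration * fs)) / fs
    tone = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        f = k * f0
        if k == mistuned_harmonic:
            f *= (1.0 + mistuning)   # e.g., 600 Hz -> 696 Hz at 16% mistuning
        tone += np.sin(2 * np.pi * f * t)
    return tone / n_harmonics        # normalize overall amplitude

tuned = complex_tone(mistuning=0.0)      # all harmonics at integer multiples of 200 Hz
mistuned = complex_tone(mistuning=0.16)  # third harmonic raised by 16%
```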