Advances in computing technologies have broadened the horizon for vision-based surveillance, monitoring and control. However, most existing vision-based crowd analysis and classification systems are constrained by complex and unreliable feature learning over visual artefacts or video streams, especially under extreme conditions. Retrieving event-sensitive or crowd-type-sensitive spatio-temporal features for different crowd types under extreme conditions is a highly complex task, and despite numerous efforts in vision-based approaches, the absence of acoustic cues often creates ambiguity in crowd classification. In this research, a novel audio-based feature learning model is developed for crowd analysis and classification. The audio samples (extracted from the input video) were processed with static (fixed-size) sampling, pre-emphasis, block framing and Hann windowing, followed by extraction of acoustic features including GTCC, GTCC-Delta, GTCC-Delta-Delta, MFCC, spectral entropy, spectral flux, spectral slope and harmonics-to-noise ratio (HNR). Finally, the extracted acoustic features were classified using a random forest ensemble classifier. The audio-based classification model yielded a classification accuracy of 92.67%, precision of 93.80%, sensitivity of 82.91%, specificity of 90.48% and an F-measure of 0.9239.
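To make the described pipeline concrete, the following is a minimal sketch, assuming librosa and scikit-learn are available; it covers pre-emphasis, Hann-windowed framing, a subset of the listed features (MFCC with deltas, spectral entropy, spectral flux) and the random forest stage. GTCC, spectral slope and HNR are omitted here since they require a gammatone filterbank and a pitch-based harmonicity estimate not bundled with librosa; the function name extract_features and all parameter values are illustrative assumptions, not the paper's settings.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def extract_features(path, sr=16000, frame_len=1024, hop=512):
    """Pre-emphasis, Hann-windowed framing, then per-frame acoustic
    features aggregated into one fixed-size vector per audio clip."""
    y, sr = librosa.load(path, sr=sr)
    y = librosa.effects.preemphasis(y, coef=0.97)   # pre-emphasis filter

    # Hann-windowed STFT (librosa applies a Hann window by default).
    S = np.abs(librosa.stft(y, n_fft=frame_len, hop_length=hop))

    # MFCCs with first- and second-order deltas (the GTCC deltas in the
    # paper would be computed analogously on gammatone cepstra).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=frame_len, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)

    # Spectral entropy per frame from the normalized power spectrum.
    p = S ** 2
    p = p / (p.sum(axis=0, keepdims=True) + 1e-12)
    entropy = -(p * np.log2(p + 1e-12)).sum(axis=0)

    # Spectral flux: frame-to-frame change in the magnitude spectrum.
    flux = np.sqrt((np.diff(S, axis=1) ** 2).sum(axis=0))

    # Aggregate frame-level features into clip-level mean/std statistics.
    def stats(x):
        return np.hstack([x.mean(axis=1), x.std(axis=1)])

    return np.hstack([stats(mfcc), stats(d1), stats(d2),
                      stats(entropy[None, :]), stats(flux[None, :])])

# Hypothetical usage: X stacks one feature vector per clip, y_labels holds
# the crowd-type labels; the ensemble size is an illustrative choice.
# X = np.vstack([extract_features(p) for p in clip_paths])
clf = RandomForestClassifier(n_estimators=200, random_state=0)
# clf.fit(X_train, y_train); y_pred = clf.predict(X_test)
```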