The present study investigates the test-retest reliability of category loudness scaling with pure tones for each of the scaling categories 'very soft', 'soft', 'OK', 'loud', 'very loud' and 'too loud' at the audiometric frequencies 0.5, 1, 2 and 4 kHz. Category loudness scaling was obtained in two sessions separated by 1 to 4 weeks from 16 normal-hearing subjects, all of whom had normal otoscopy, present acoustic reflexes at the audiometric frequencies 0.5-4 kHz and middle ear pressure within +/-50 daPa. Intra-subject between-session reliability was found not to be frequency dependent, and comparison with other studies revealed that reliability does not depend on the stimulus signal applied. Test-retest reliability varied between the categories: for 'very soft', 'loud', 'very loud' and 'too loud', reliability was in the same range as that found for hearing threshold determination, whereas for 'soft' and 'OK' it was poorer. The greater uncertainty at intermediate levels should be considered when using category loudness scaling, e.g. for calculating hearing aid parameters.