1 Summary

Gloss perception is a challenging visual inference that requires disentangling the contributions of reflectance, lighting, and shape to the retinal image [1-3]. Learning to see gloss must somehow proceed without labelled training data, as no other sensory signals can provide the 'ground truth' required for supervised learning [4-6]. We reasoned that, paradoxically, we may learn to infer distal scene properties, like gloss, by learning to compress and predict spatial structure in proximal image data. We hypothesised that such unsupervised learning might explain both successes and failures of human gloss perception, where classical 'inverse optics' cannot. To test this, we trained unsupervised neural networks to model the pixel statistics of renderings of glossy surfaces and compared the resulting representations with human gloss judgments. The trained networks spontaneously cluster images according to underlying scene properties such as specular reflectance, shape and illumination, despite receiving no explicit information about them. More importantly, we find that linearly decoding specular reflectance from the model's internal code predicts human perception and misperception of glossiness on an image-by-image basis better than the true physical reflectance does, better than supervised networks explicitly trained to estimate specular reflectance, and better than alternative image-statistic and dimensionality-reduction models. Indeed, the unsupervised networks correctly predict well-known illusions of gloss perception caused by interactions between surface relief and lighting [7,8], which the supervised models entirely fail to predict.
Our findings suggest that unsupervised learning may explain otherwise inexplicable errors in surface perception, with broader implications for how biological brains learn to see the outside world.

2 Highlights

- We trained unsupervised neural networks to synthesise images of glossy surfaces
- They spontaneously learned to encode gloss, lighting and other scene factors
- The networks correctly predict both errors and successes of human gloss perception
- The findings provide new insights into how the brain likely learns to see

3 Results and Discussion

The central intuition behind our findings is that learning to compress the complex image structure created by reflections from glossy surfaces into a highly compact code forces the brain to discover representations that partially, but imperfectly, disentangle the distal physical factors responsible for variations within and between images. This potentially explains both the broad successes and the specific pattern of errors known to occur in gloss perception [2,3,7-12]. To test this, we rendered 10,000 images from a virtual world of bumpy frontoparallel surfaces with either high or low specular reflectance, random colour and depth of surface relief, illuminated by six natural light fields (Figure 1A-B). We trained ten instances of an unsupervised PixelVAE network [13,14] on this dataset. The model's training objecti...
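The linear-decoding analysis described above can be illustrated with a minimal sketch: fit a linear read-out that maps a network's latent codes to the binary reflectance label (high vs. low gloss). The latent codes below are synthetic, and all names, dimensions, and parameter values are illustrative assumptions, not the actual analysis code.

```python
import numpy as np

# Hypothetical illustration of linearly decoding a binary specular-reflectance
# label (high vs. low gloss) from a network's latent codes. The codes here are
# simulated: gloss information is partially disentangled, i.e. concentrated in
# a few latent dimensions, with noise in all of them.
rng = np.random.default_rng(0)

n_per_class, latent_dim = 200, 10
mu_low = np.zeros(latent_dim)
mu_high = np.zeros(latent_dim)
mu_high[:3] = 1.5  # assumed: gloss signal carried by 3 of 10 dimensions

z_low = rng.normal(mu_low, 1.0, size=(n_per_class, latent_dim))
z_high = rng.normal(mu_high, 1.0, size=(n_per_class, latent_dim))

Z = np.vstack([z_low, z_high])
y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])

# Fit a linear read-out by least squares (with a bias column), then
# classify each image by the sign of the decoded value.
X = np.hstack([Z, np.ones((Z.shape[0], 1))])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
decoded = X @ w
accuracy = np.mean(np.sign(decoded) == y)
print(f"decoding accuracy: {accuracy:.2f}")
```

Because the simulated codes only partially separate the two reflectance classes, the read-out is imperfect, mirroring the idea that a compact unsupervised code disentangles gloss approximately rather than exactly.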