Perceptual illusions across multiple modalities, such as the rubber-hand illusion, show how dynamically the brain adapts its body image and determines what is part of it (the self) and what is not (others). Several studies have shown that redundancy and contingency among sensory signals are essential for perceiving the illusion, and that a lag of 200–300 ms is the critical limit beyond which the brain no longer attributes a signal to one's own body. In an experimental setup with an artificial skin, we replicate this visuo-tactile illusion within artificial neural networks. Our model is composed of an associative map and a recurrent map of spiking neurons that learn to predict the contingent activity across the visuo-tactile signals. Depending on the temporal delay introduced between the visuo-tactile signals, or on the spatial distance between two distinct stimuli, the two maps detect contingency differently. Spiking neurons organized into complex networks, together with synchrony detection at different temporal intervals, can explain multisensory integration with respect to the self-body.
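The abstract does not spell out the model's internals, but its core mechanism, deciding whether visual and tactile spikes arrive within a roughly 200–300 ms tolerance window, can be illustrated with a single leaky integrate-and-fire coincidence detector. The sketch below is a hypothetical illustration, not the paper's architecture; the function name and all parameter values (`tau`, `w`, `theta`) are assumptions chosen so that the coincidence window lands near the cited 200–300 ms limit.

```python
import numpy as np

def lif_coincidence(visual, tactile, dt=1.0, tau=120.0, w=0.9, theta=1.0):
    """Hypothetical leaky integrate-and-fire coincidence detector.

    visual, tactile: binary spike trains (0/1 per dt-millisecond step).
    A single input spike (w = 0.9) stays below threshold (theta = 1.0),
    so the neuron fires only when visual and tactile spikes arrive close
    enough in time for their decaying potentials to overlap. With
    tau = 120 ms the window is about tau * ln(w / (theta - w)) ~ 264 ms,
    near the 200-300 ms tolerance cited in the abstract.
    """
    v = 0.0
    out = np.zeros(len(visual))
    for t in range(len(visual)):
        v += dt / tau * (-v) + w * (visual[t] + tactile[t])  # leak + input
        if v >= theta:        # synchrony detected: emit an output spike
            out[t] = 1.0
            v = 0.0           # reset membrane potential after firing
    return out

# Tactile spike 100 ms after the visual one: inside the window -> detected.
T = 600
vis, tac = np.zeros(T), np.zeros(T)
vis[50], tac[150] = 1, 1
print(lif_coincidence(vis, tac).sum())       # 1.0 (contingent, "self")

# Tactile spike 400 ms later: outside the window -> no output spike.
tac_late = np.zeros(T)
tac_late[450] = 1
print(lif_coincidence(vis, tac_late).sum())  # 0.0 (not contingent)
```

A population of such detectors with different `tau` values would respond to synchrony at different temporal intervals, which is one plausible reading of how the two maps in the abstract could detect contingency differently.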