It is commonly believed that human perceptual experiences can be, and usually are, multimodal. Moreover, a stronger thesis is often proposed: that the phenomenal character of some multimodal experiences cannot be described simply as a conjunction of unimodal phenomenal elements. If this is the case, then a question arises: what additional mode of combination is required to adequately describe the phenomenal structure of multimodal experiences? The paper investigates which types of audio-visual experiences have a phenomenal character that cannot be analysed as a mere conjunction of visual and auditory elements, and how the required additional mode of perceptual combination can be properly characterised. Three main modes of combination are considered: (a) instantiation, (b) parthood, and (c) grouping. It is argued that some phenomena involving intermodal relations, such as spatial and temporal ventriloquism, can be analysed in terms of audio-visual perceptual grouping. Cases of intermodal binding, on the other hand, need a different treatment. Experiences involving audio-visual binding should be analysed as experiences presenting objects or events which instantiate, or which have a proper part instantiating, both visually and auditorily determined properties.

In contemporary philosophy of perception, it is commonly believed that human perceptual experiences can be, and usually are, multimodal (see Briscoe 2016;