When a robot is brought into a new environment, it has very limited knowledge of its surroundings and of what it can do. Whether to navigate in the world or to interact with humans, the robot must be able to learn complex states from its sensor inputs. For navigation tasks, visual information is commonly used for localisation. Other signals can also be exploited: ultrasound, laser range data and path integration are all inputs that can be taken into account. For human-robot interaction, proprioceptive information, such as joint values and the state of the gripper, provides additional degrees of freedom that can be incorporated into the analysis. All these signals have different dynamics, and the system must cope with these differences. One solution is to introduce several learning levels. An architecture able to build complex multimodal contexts, to recognise them and to reuse them afterwards in higher-level strategies would give the robot the capacity to resolve such situations. In this paper, a model is introduced for complex categorisation problems, patterning and chunk learning. It is used within a larger architecture to create multimodal states for a cognitive map and to resolve ambiguities. The model is tested both in a simulated navigation task and in a complex arm-manipulation and localisation experiment.