Through constant exposure to adult input in dialogue, children's language gradually develops into rich linguistic constructions in which multiple cross-modal elements are subtly combined to serve coherent communicative functions. In this chapter, we retrace children's pathways into multimodal language acquisition in a scaffolding interactional environment. We begin with the first multimodal buds children produce, which contain both gestural and vocal elements, and show how adults' input, including reformulations and recasts, provides children with embedded model utterances they can internalize. We then show how these buds blossom into more complex constructions, focusing on the importance of creative non-standard forms. Finally, children's productions bloom into full, intricate multimodal productions. In the last part, we focus on argument structure, Tense, Mood and Aspect, and the complexification of co-verbal gestures as they are coordinated with speech.