Event extraction from multimodal documents is an important yet under-explored problem. A key challenge for this task is the scarcity of paired image-text data, which makes it difficult to fully exploit the strong representation power of multimodal language models. In this paper, we present Theia, an end-to-end multimodal event extraction framework that can be trained on incomplete data. Specifically, we couple a generation-based event extraction model with a customised image synthesizer that generates images from text. Our model leverages the capabilities of pre-trained vision-language models and can be trained on incomplete (i.e., text-only) data. Experimental results on existing multimodal datasets demonstrate that our approach outperforms state-of-the-art methods in both synthesising missing data and extracting events.
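To illustrate the idea of coupling an extraction model with an image synthesizer so that text-only examples can still be used for training, the sketch below shows one possible training step under simplified assumptions. The module names (`ImageSynthesizer`, `EventExtractor`), feature dimensions, and the classification-style loss are hypothetical stand-ins for illustration only; they are not the paper's actual architecture or API.

```python
# Minimal sketch: fill in the missing image modality with a synthesizer
# so that incomplete (text-only) examples can still train the extractor.
import torch
import torch.nn as nn

class ImageSynthesizer(nn.Module):
    """Stand-in for a text-conditioned image generator (hypothetical)."""
    def __init__(self, text_dim=768, image_dim=1024):
        super().__init__()
        self.proj = nn.Linear(text_dim, image_dim)

    def forward(self, text_emb):
        # Produce a synthetic image representation from the text embedding.
        return self.proj(text_emb)

class EventExtractor(nn.Module):
    """Stand-in for a multimodal event extraction head (hypothetical)."""
    def __init__(self, text_dim=768, image_dim=1024, num_event_types=34):
        super().__init__()
        self.classifier = nn.Linear(text_dim + image_dim, num_event_types)

    def forward(self, text_emb, image_emb):
        # Fuse the two modalities and predict an event type.
        return self.classifier(torch.cat([text_emb, image_emb], dim=-1))

def training_step(batch, synthesizer, extractor):
    """One step over a batch that may lack gold images (text-only examples)."""
    text_emb = batch["text_emb"]          # (B, 768) pre-computed text features
    image_emb = batch.get("image_emb")    # (B, 1024) or None when images are missing
    if image_emb is None:
        # Incomplete data: substitute synthetic image features for the missing modality.
        image_emb = synthesizer(text_emb)
    logits = extractor(text_emb, image_emb)
    return nn.functional.cross_entropy(logits, batch["event_labels"])
```

In this simplified view, the synthesizer only needs text as input, so every text-only example still yields a (text, image) pair at training time, which is what allows the multimodal extractor to be trained end-to-end on incomplete data.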