Semantic image synthesis aims to generate photorealistic images from a given semantic layout. Existing methods build a single-scale style encoder over semantic regions; because they inject style at only one feature level, they cannot extract rich style information. In particular, when a semantic region contains several instance objects, single-scale networks tend to give them all the same style and offer little control over the style of each one. To address this issue, we propose a Multi-Scale Instance-level image synthesis method (MSIN). To learn more discriminative representations from different feature levels within each instance, we design a multi-scale style encoder that replaces the traditional single-scale encoder and adopts a "pyramid" structure to combine contextual information, allowing it to extract finer details. In addition, to synthesize visually pleasing and photorealistic images, MSIN applies a region-style fusion mechanism in the adaptive normalization layer, which enables instance-wise, object-to-object generation of multiple styles simultaneously. Compared with previous methods, MSIN generates images with finer details and controls style at the level of individual instance objects, yielding results that are more semantically reasonable and more diverse across instances. Experimental results show that MSIN outperforms existing methods on semantic image synthesis, particularly in instance-level generation and diversity.
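
As a rough illustration of the idea of extracting one style code per instance from several feature levels, the sketch below pools each level of a feature pyramid inside each instance mask and concatenates the results. This is not the MSIN implementation: the function name `instance_style_codes`, the use of plain average pooling, the nearest-neighbour mask resizing, and the concatenation across levels are all assumptions made for illustration only.

```python
import numpy as np


def instance_style_codes(feature_pyramid, instance_map):
    """Average-pool every pyramid level inside every instance mask.

    Hypothetical helper, not taken from the paper.

    feature_pyramid: list of arrays shaped (C_k, H_k, W_k), one per scale.
    instance_map: integer array shaped (H, W), one id per instance object.
    Returns a dict {instance_id: 1-D style vector of length sum(C_k)}.
    """
    height, width = instance_map.shape
    codes = {}
    for inst_id in np.unique(instance_map):
        per_level = []
        for feats in feature_pyramid:
            channels, h, w = feats.shape
            # Nearest-neighbour resize of the instance map to this level.
            rows = np.arange(h) * height // h
            cols = np.arange(w) * width // w
            mask = instance_map[np.ix_(rows, cols)] == inst_id
            if mask.any():
                # Mean over the pixels of this instance, per channel.
                per_level.append(feats[:, mask].mean(axis=1))
            else:
                # The instance vanished at this resolution (assumption:
                # fall back to a zero vector for that level).
                per_level.append(np.zeros(channels, dtype=feats.dtype))
        # Assumption: levels are fused by simple concatenation.
        codes[int(inst_id)] = np.concatenate(per_level)
    return codes
```

Because the pooling is keyed on instance ids rather than semantic class ids, two objects of the same class receive separate style vectors, which is the property the abstract contrasts with region-level, single-scale encoders.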