Conceptual design evaluation is an indispensable component of innovation in the early stage of engineering design. Properly assessing the effectiveness of conceptual design requires a rigorous evaluation of the outputs. Traditional methods to evaluate conceptual designs are slow, expensive, and difficult to scale because they rely on human expert input. An alternative approach is using computational methods to evaluate design concepts. However, most existing methods have limited utility because they are constrained to unimodal design representations (e.g., texts or sketches). To overcome these limitations, we propose an attention-enhanced multimodal learning-based model to predict five design metrics: drawing quality, uniqueness, elegance, usefulness, and creativity. The proposed model utilizes knowledge from large external datasets through transfer learning, simultaneously processes text and sketch data from early-phase concepts, and effectively fuses the multimodal information through a mutual cross-attention mechanism. To study the efficacy of multimodal learning (MML) and attention-based information fusion, we compare (1) a baseline MML model and the unimodal models and (2) the attention-enhanced models with baseline models in terms of their explanatory power for the variability of the design metrics. The results show that MML improves model explanatory power by 5%-12% and the mutual cross-attention mechanism further increases the explanatory power by 5%-9%, leading to a highest explanatory power of 44% for drawing quality, 60% for uniqueness, 45% for elegance, 43% for usefulness, and 32% for creativity. Our findings highlight the benefit of using multimodal representations for design metric assessment.