Explainable AI (XAI) methods generate model explanations to help people understand and interpret machine learning models. To study XAI methods from the human perspective, we propose a human-based benchmark dataset, the human saliency benchmark (HSB), for evaluating saliency-based XAI methods. Unlike existing human saliency annotations, in which class-related features are labeled manually and subjectively, this benchmark collects more objective human attention on visual information using a precise eye-tracking device and a novel crowdsourcing experiment. Considering the labor cost of human experiments, we further explore the potential of using a prediction model trained on HSB to mimic human saliency annotation. To this end, we formulate a dense prediction problem and propose an encoder-decoder architecture that combines multi-modal and multi-scale features to produce human saliency maps. Accordingly, we design a pretraining-finetuning method to train this model. The resulting model, trained on HSB, is named the human saliency imitator (HSI). Through extensive evaluation, we show that HSI successfully predicts human saliency on our HSB dataset, and that the HSI-generated human saliency dataset on ImageNet can be used to benchmark XAI methods both qualitatively and quantitatively.