This study evaluates and compares the performance of Coupled Model Intercomparison Project Phase 6 (CMIP6) and Phase 5 (CMIP5) models in simulating runoff at the global scale and in eight large-scale basins over the period 1981–2005, using percent bias (PBIAS), the correlation coefficient (CC), root mean square error (RMSE), the Theil-Sen median trend, and the Taylor diagram. The CMIP models are ranked by a comprehensive rating index (MR) derived from three metrics: PBIAS, CC, and RMSE. LORA, GRUN, and ERA5-Land were selected as reference data sets. LORA was used as the main reference data set to evaluate the historical CMIP runoff results from 1981 to 2012 in three respects: trend, PBIAS, and uncertainty. The results reveal that: (i) CMIP6 models clearly overestimate runoff at the global scale and in the basins (except the Amazon and Lena basins), and this overestimation is more pronounced in arid and semi-arid areas (the Murray-Darling and Nile basins). (ii) Compared with CMIP5 models, CMIP6 models have less uncertainty at the global scale, but they show no marked improvement at the basin scale. (iii) The CMIP6 multi-model ensemble mean (CMIP6_MMEs) simulates runoff better than most individual models and reduces the uncertainty among models to some extent. (iv) Trends and PBIAS differ among the three reference data sets at both the global and basin scales. However, the interannual fluctuations of the three data sets are largely consistent and highly correlated (except for ERA5-Land at the global scale and in the Nile basin), which indicates that the LORA data set is highly reliable. The global comprehensive rating metric (GR) of CMIP6_MMEs was better than that of CMIP5_MMEs in all metrics, but this was not the case in the eight basins. This shows that CMIP6 models simulate global runoff and related diagnostic indicators better than CMIP5 models, whereas further improvements are needed in runoff simulation capability at the basin scale.
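For readers unfamiliar with the evaluation metrics named above, the following is a minimal Python sketch of PBIAS, CC, RMSE, and the Theil-Sen median slope for a pair of annual runoff series. It is an illustration under stated assumptions, not the procedure used in this study: the function names and sample values are hypothetical, CC is taken to be the Pearson correlation coefficient, and PBIAS is written with the convention that positive values indicate overestimation by the model (some authors use the opposite sign). The comprehensive rating indices (MR, GR) are not reproduced here because their formulas are not given in this text.

```python
import math
from itertools import combinations
from statistics import mean, median


def pbias(sim, obs):
    """Percent bias. Assumed convention: positive = simulation overestimates."""
    return 100.0 * sum(s - o for s, o in zip(sim, obs)) / sum(obs)


def rmse(sim, obs):
    """Root mean square error between simulated and reference values."""
    return math.sqrt(mean([(s - o) ** 2 for s, o in zip(sim, obs)]))


def cc(sim, obs):
    """Pearson correlation coefficient (assumed meaning of CC)."""
    ms, mo = mean(sim), mean(obs)
    cov = sum((s - ms) * (o - mo) for s, o in zip(sim, obs))
    var_s = sum((s - ms) ** 2 for s in sim)
    var_o = sum((o - mo) ** 2 for o in obs)
    return cov / math.sqrt(var_s * var_o)


def theil_sen_slope(years, values):
    """Theil-Sen trend: median of the slopes over all pairs of time steps."""
    slopes = [
        (values[j] - values[i]) / (years[j] - years[i])
        for i, j in combinations(range(len(years)), 2)
    ]
    return median(slopes)


if __name__ == "__main__":
    # Hypothetical annual runoff values (mm/yr), for illustration only.
    years = [1981, 1982, 1983, 1984, 1985]
    obs = [300.0, 310.0, 295.0, 305.0, 290.0]   # reference series
    sim = [330.0, 341.0, 324.5, 335.5, 319.0]   # model series (10% too high)

    print("PBIAS (%):", pbias(sim, obs))
    print("CC:", cc(sim, obs))
    print("RMSE:", rmse(sim, obs))
    print("Theil-Sen slope of reference:", theil_sen_slope(years, obs))
```

In this toy case the simulated series is the reference scaled by 1.1, so the correlation is perfect while PBIAS and RMSE are not zero, which illustrates why the three metrics are considered jointly rather than individually.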