Summary
In this paper, probabilistic modeling based on the resilient distributed dataset (RDD) of the Spark platform is proposed to efficiently process large-scale sample data of renewable energy sources (RES). Based on Spark and the Hadoop distributed file system, a parallel and distributed framework compatible with existing RES data storage systems is first designed for fast probabilistic modeling of RES. On the basis of this framework, novel parallel estimation algorithms for the Wakeby distribution and for kernel density estimation are developed based on RDD. With the in-memory parallel computing and fault-tolerant characteristics of RDD, the proposed algorithms significantly enhance the parallel execution performance of probabilistic modeling. In addition, an approximate analytical relationship is derived among the time consumption of the proposed algorithms, two important adjustable parameters of the Spark platform (the degree of parallelism and the number of partitions), and the sample size of RES. This relationship is helpful for predicting computational time, setting hardware configuration, and tuning programs on the Spark platform. Simulation results with sample sizes ranging from 7.3 × 10⁶ to 3.6 × 10⁹ demonstrate the correctness and effectiveness of the proposed techniques.
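To illustrate why kernel density estimation suits partition-wise parallel execution, the following is a minimal sketch in plain Python and not the algorithm developed in the paper. It assumes a Gaussian kernel, a fixed bandwidth, and a non-empty sample list, and it emulates RDD partitions with ordinary lists; the function names (`split_into_partitions`, `partial_kernel_sums`, `partitioned_kde`) are hypothetical. Each partition contributes partial kernel sums on an evaluation grid, and the partial sums are then added and normalized once.

```python
import math
from functools import reduce


def split_into_partitions(samples, num_partitions):
    """Split samples into contiguous chunks (a stand-in for RDD partitions)."""
    size = math.ceil(len(samples) / num_partitions)
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def partial_kernel_sums(partition, grid, bandwidth):
    """Map step: unnormalized Gaussian kernel sums of one partition on the grid."""
    sums = [0.0] * len(grid)
    for x in partition:
        for j, g in enumerate(grid):
            u = (g - x) / bandwidth
            sums[j] += math.exp(-0.5 * u * u)
    return sums


def add_vectors(a, b):
    """Reduce step: element-wise addition of two partial-sum vectors."""
    return [p + q for p, q in zip(a, b)]


def partitioned_kde(samples, grid, bandwidth, num_partitions):
    """Estimate the density on the grid by combining per-partition sums."""
    partitions = split_into_partitions(samples, num_partitions)
    partials = [partial_kernel_sums(p, grid, bandwidth) for p in partitions]
    totals = reduce(add_vectors, partials)
    # Normalize once, using the total sample count rather than partition sizes.
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return [t / norm for t in totals]


if __name__ == "__main__":
    # Synthetic illustrative values, not RES measurements.
    data = [0.1 * i for i in range(50)]
    print(partitioned_kde(data, [0.0, 2.5, 5.0], bandwidth=0.5, num_partitions=4))
```

Because the kernel sum is additive over samples, the estimate does not depend on how the data are partitioned, apart from floating-point rounding. On Spark, the map and reduce steps of this sketch would plausibly correspond to per-partition and aggregation operations on an RDD, although the paper's actual implementation may differ.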