Big data applications (e.g., Extract, Transform, and Load (ETL) applications) are designed to handle great volumes of data. However, processing such great volumes of data is time-consuming. There is a need to construct small yet effective test data sets during agile development of big data applications. In this paper, we apply a combinatorial test data generation approach to two real-world ETL applications at Medidata. In our approach, we first create Input Domain Models (IDMs) automatically by analyzing the original data source and incorporating constraints manually derived from requirements. Next, the IDMs are used to create test data sets that achieve t-way coverage, which has shown to be very effective in detecting software faults. The generated test data sets also satisfy all the constraints identified in the first step. To avoid creating IDMs from scratch when there is a change to the original data source or constraints, our approach extends the original IDMs with additional information. The new IDMs, which we refer to as Adaptive IDMs (AIDMs), are updated by comparing the changes against the additional information, and are then used to generate new test data sets. We implement our approach in a tool, called comBinatorial bIg daTa Test dAta Generator (BIT-TAG). Our experience shows that combinatorial testing can be effectively applied to big data applications. In particular, the test data sets created using our approach for the two ETL applications are only a small fraction of the original data source, but we were able to detect all the faults found with the original data source. CCS Concepts •Software and its engineering → Software testing and debugging;