Scheduling tasks close to their associated data is crucial in distributed systems to minimize network traffic and latency. Some Big Data frameworks, such as Apache Spark, employ locality functions and job allocation algorithms to reduce network traffic and execution times. However, these frameworks rely on centralized mechanisms in which the master node determines data locality by allocating tasks to the available workers with minimal data transfer time, ignoring differences in worker configurations and availability. To address these limitations, we propose a decentralized approach to locality-driven scheduling that grants workers autonomy in the job allocation process while factoring in their configurations, such as differences in network and CPU speed. Our approach is developed and evaluated on Crossflow, a distributed stream processing platform with data-aware independent worker nodes. Preliminary experiments indicate that our approach can yield up to 3.57x faster execution times than the baseline centralized approach, in which the master controls data locality.
CCS CONCEPTS
• Computing methodologies → Distributed algorithms; • Software and its engineering → Development frameworks and environments.