This article presents an optimization of a spanning-tree-based parallelism extraction algorithm that automatically exploits parallelism and determines the execution order of multiple kernel programs in a distributed environment. In stream-based computing, efficient parallel execution requires careful scheduling of kernel invocations. By mapping each kernel to a node and each I/O stream to an edge, the entire stream process can be treated as a spanning tree. The spanning tree, which allows feedback and feedforward edges, is effective for expressing the dependencies among kernels. In the spanning tree, nodes at the same depth have no edges between them and can therefore be executed in parallel once their parent nodes have been executed. A series of nodes can be executed in a pipelined manner. Thus, the proposed algorithm can extract both spatial and temporal parallelism. However, if the algorithm is applied to feedback edges as is, time is wasted waiting for the loop among the nodes to complete. To solve this problem, the parallel pattern can be optimized in the step that generates the communication pattern, increasing the degree of parallelism. In addition, because execution time differs among kernels, load balancing can be considered as a further optimization of the algorithm. To evaluate the effectiveness of the optimized algorithm, a k-means application was developed and parallelized, with particular attention to feedback processing. The results show that parallel execution on two nodes of a graphics processing unit (GPU) cluster achieved a 1.5 times speedup. With load balancing, parallel execution on four nodes of the cluster achieved up to a 3.5 times speedup in 2D-FFT and a 3.0 times speedup in LU decomposition, compared to execution on a single GPU.
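The depth-based grouping described above can be illustrated with a short Python sketch. It is not the article's implementation: the function name `depth_levels`, the kernel names, and the choice of a breadth-first traversal to assign depths are all assumptions made for illustration. Kernels are nodes, streams are directed edges, and an edge that points back to a node at the same or a shallower depth is reported as a feedback edge rather than used to build the tree.

```python
from collections import deque


def depth_levels(edges, root):
    """Group kernels by breadth-first depth from `root`.

    Returns (levels, feedback): levels[d] lists the kernels at depth d,
    which are candidates for parallel execution; feedback lists edges
    that point to a kernel at the same or a shallower depth.
    """
    # Build an adjacency list; every kernel gets an entry, even sinks.
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)
        children.setdefault(dst, [])
    children.setdefault(root, [])

    depth = {root: 0}
    feedback = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in children[node]:
            if nxt not in depth:
                # Tree edge: first time this kernel is reached.
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
            elif depth[nxt] <= depth[node]:
                # Non-tree edge going backwards (or sideways): feedback.
                feedback.append((node, nxt))
            # Otherwise: an extra dependency on the next level; no action.

    levels = [[] for _ in range(max(depth.values()) + 1)]
    for node, d in depth.items():
        levels[d].append(node)
    return levels, feedback


# Hypothetical k-means-like stream graph with one feedback stream.
edges = [
    ("load", "assign_a"),
    ("load", "assign_b"),
    ("assign_a", "update"),
    ("assign_b", "update"),
    ("update", "load"),  # feedback: next iteration
]
levels, feedback = depth_levels(edges, "load")
for d, kernels in enumerate(levels):
    print(d, kernels)
print("feedback:", feedback)
```

In this toy graph, `assign_a` and `assign_b` share a depth and could run concurrently (spatial parallelism), while the levels in sequence form the stages of a pipeline (temporal parallelism). The feedback edge is only detected here; how it should be scheduled to avoid idle waiting is the subject of the optimization the article proposes.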