Solving production scheduling problems is a difficult yet indispensable task for manufacturers with a push-oriented planning approach. In this study, we tackle a novel production scheduling problem from household appliance production at Miele & Cie. KG: a two-stage permutation flow shop scheduling problem (PFSSP) with a finite buffer and sequence-dependent setup efforts. The objective is to minimize idle times and setup efforts in lexicographic order. For realistic instance sizes, exact solutions cannot be computed due to the combinatorial complexity of the problem. We therefore developed a reinforcement learning (RL) approach based on the Proximal Policy Optimization (PPO) algorithm that integrates domain knowledge through reward shaping, action masking, and curriculum learning to solve this PFSSP. Benchmarks against a state-of-the-art genetic algorithm (GA) show that our approach is significantly superior. Our work thus provides a successful example of the applicability of RL to real-world production planning, demonstrating not only its practical utility but also the technical and methodological integration of the agent with a discrete event simulation (DES). We also conducted experiments to investigate the impact of individual algorithmic elements and of a reward-function hyperparameter on the overall solution quality.
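Two of the ingredients named above lend themselves to a compact illustration: action masking (excluding infeasible scheduling actions from the policy's distribution) and the lexicographic objective (idle time strictly dominates setup effort). The sketch below is a hedged, generic illustration under assumed conventions, not the paper's actual implementation; `masked_softmax` and the example values are illustrative.

```python
import numpy as np

def masked_softmax(logits, mask):
    """Action masking: set the logits of infeasible actions (e.g. jobs
    already scheduled, or blocked by the finite buffer) to a large
    negative value, so the softmax assigns them (near-)zero probability."""
    masked = np.where(mask, logits, -1e9)
    exp = np.exp(masked - masked.max())  # subtract max for numerical stability
    return exp / exp.sum()

# Four candidate jobs; jobs 1 and 3 are infeasible in this state (illustrative).
logits = np.array([2.0, 1.0, 0.5, 3.0])
mask = np.array([True, False, True, False])
probs = masked_softmax(logits, mask)

# Lexicographic objective: Python's tuple ordering already compares the
# first component (idle time) first and breaks ties on the second
# (setup effort), so (10, 99) is preferred over (11, 0).
schedule_a = (10.0, 99.0)  # (idle time, setup effort) -- illustrative values
schedule_b = (11.0, 0.0)
a_is_better = schedule_a < schedule_b
```

In a PPO agent, such a mask is typically applied to the policy network's output logits each step before sampling an action, which prevents the agent from ever proposing an infeasible schedule entry.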