Current beam orientation optimization algorithms for radiotherapy, such as column generation (CG), are typically heuristic or greedy because of the size of the combinatorial problem, and can therefore yield suboptimal solutions. We propose a reinforcement learning strategy based on Monte Carlo tree search (MCTS) that can find a better beam orientation set in less time than CG. The method uses a reinforcement learning structure in which a supervised learning network guides the MCTS as it explores the decision space of the beam orientation selection problem. We previously trained a deep neural network (DNN) that takes in the patient anatomy, organ weights, and current beams, then approximates beam fitness values to indicate the next best beam to add. Here, we use this DNN to probabilistically guide the traversal of the branches of the Monte Carlo decision tree when adding a new beam to the plan. To assess the feasibility of the algorithm, we generated five-beam plans for a test set of 13 prostate cancer patients, distinct from the 57 patients originally used to train and validate the DNN. To show the strength of the guided MCTS (GTS) relative to other search methods, we also report the performance of guided search, uniform tree search, and random search algorithms. On average, GTS outperformed all other methods: it found a better solution than CG in 237 s, compared to 360 s for CG, and it reached a lower objective function value than every other method in less than 1000 s. GTS maintained planning target volume (PTV) coverage within 1% error, similar to CG, while reducing the organ-at-risk mean dose for the body, rectum, and left and right femoral heads; the mean dose to the bladder was 1% higher with GTS than with CG.
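The idea of letting a learned policy probabilistically guide the choice of each new beam can be illustrated with a minimal Python sketch. It is not the implementation evaluated here: it performs simple policy-guided sampling of beam sets and keeps the best one, without the node statistics and backpropagation of a full MCTS. The names `toy_policy`, `toy_objective`, and `guided_search`, the 36 candidate beams, and the toy objective (which merely rewards well-separated beams) are all hypothetical stand-ins for the trained DNN and the real plan objective.

```python
import math
import random


def toy_policy(selected, n_candidates):
    """Hypothetical stand-in for the trained DNN.

    Returns a dict mapping each unselected beam index to a probability.
    This toy version favors beams far (on the circle) from those already chosen.
    """
    scores = {}
    for beam in range(n_candidates):
        if beam in selected:
            continue
        if not selected:
            scores[beam] = 1.0  # uniform when the plan is still empty
        else:
            gaps = (abs(beam - s) for s in selected)
            scores[beam] = float(min(min(g, n_candidates - g) for g in gaps))
    total = sum(scores.values())
    return {beam: score / total for beam, score in scores.items()}


def toy_objective(selected, n_candidates):
    """Hypothetical objective (lower is better): penalizes closely spaced beams."""
    value = 0.0
    for i, a in enumerate(selected):
        for b in selected[i + 1:]:
            gap = abs(a - b)
            value += 1.0 / min(gap, n_candidates - gap)
    return value


def guided_search(policy, objective, n_candidates, n_beams, n_rollouts, seed=0):
    """Sample beam sets one beam at a time, weighting each choice by the policy."""
    rng = random.Random(seed)
    best_set, best_value = None, math.inf
    for _ in range(n_rollouts):
        selected = []
        for _ in range(n_beams):
            probs = policy(selected, n_candidates)
            beams = list(probs)
            weights = [probs[b] for b in beams]
            selected.append(rng.choices(beams, weights=weights, k=1)[0])
        value = objective(selected, n_candidates)
        if value < best_value:  # keep the best plan seen so far
            best_set, best_value = sorted(selected), value
    return best_set, best_value


if __name__ == "__main__":
    beams, value = guided_search(toy_policy, toy_objective,
                                 n_candidates=36, n_beams=5, n_rollouts=200)
    print("best beam indices:", beams, "objective:", round(value, 4))
```

Because each choice is sampled rather than taken greedily, repeated rollouts can escape the single path a greedy method would follow, which is the property the guided search relies on; a full tree search would additionally reuse statistics from earlier rollouts to bias later ones.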