Generalized-ensemble Monte Carlo simulations such as the multicanonical method and similar techniques are among the most efficient approaches for simulations of systems undergoing discontinuous phase transitions or with rugged freeenergy landscapes. As Markov chain methods, they are inherently serial computationally. It was demonstrated recently, however, that a combination of independent simulations that communicate weight updates at variable intervals allows for the efficient utilization of parallel computational resources for multicanonical simulations. Implementing this approach for the many-thread architecture provided by current generations of graphics processing units (GPUs), we show how it can be efficiently employed with of the order of 10 4 parallel walkers and beyond, thus constituting a versatile tool for Monte Carlo simulations in the era of massively parallel computing. We provide the fully documented source code for the approach applied to the paradigmatic example of the two-dimensional Ising model as starting point and reference for practitioners in the field. Solution method: The code uses a parallel variant of the multicanonical method employing many parallel walkers that accumulate a common histogram. The resulting histogram is used to determine the weight function for the next iteration. Once the iteration has converged, simulations visit all possible energies with the same probability. Restrictions: The system size and size of the population of replicas are limited depending on the memory of the GPU device used. Code repository at https://github.com/CQT-Leipzig/ cudamuca. Running time: Depends on system size (approx. 40 s for L=32 on a Tesla K20m)