Cognitive radio technology enables licensed users (primary users, PUs) to trade surplus spectrum and to temporarily transfer spectrum usage rights to unlicensed users (secondary users, SUs) in exchange for a reward. The rented spectrum is used to establish a secondary network. However, the amount of rented spectrum affects both the quality of service (QoS) of the PU and the reward it gains. The PU therefore needs a resource management scheme that optimally allocates a given amount of offered spectrum among multiple service classes and adapts to changes in network conditions. The PU should support different classes of SUs that pay different prices for their spectrum usage. We propose a novel approach that maximizes the PU reward while maintaining QoS for the PUs and for the different classes of SUs. These complex, conflicting objectives are embedded in a reinforcement learning (RL) model that derives resource adaptations to changing network conditions, so that the PUs' profit is continuously maximized. The available spectrum is managed by the PU, which executes the optimal control policy extracted using RL. Performance evaluation shows that the proposed RL scheme adapts to different conditions, guarantees the required QoS for PUs, and maintains QoS for multiple classes of SUs while maximizing PU profits. The results also show that a cognitive mesh network can support additional SU traffic while still ensuring PU QoS. In our model, PUs exchange channels based on spectrum demand and traffic load. The solution is extended to networks with multiple PUs, for which a new distributed algorithm is proposed to dynamically manage spectrum allocation among PUs.
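The abstract does not specify the states, actions, or rewards of the RL model. Purely as an illustration of the general idea, the sketch below uses tabular Q-learning for a single PU that decides whether to admit SU requests from two price classes onto a fixed pool of offered channels. Everything concrete here is a hypothetical assumption rather than the authors' formulation: the state `(occupied, su_class)`, the constants `CAPACITY`, `PRICES`, `CLASS_PROBS`, `DEPART_PROB`, and `QOS_PENALTY`, and the helper functions `step`, `train`, and `policy`.

```python
import random

# Hypothetical parameters, chosen only for illustration (not from the paper).
CAPACITY = 5              # channels the PU offers to SUs
PRICES = (1.0, 3.0)       # reward per admitted request for SU class 0 and class 1
CLASS_PROBS = (0.7, 0.3)  # probability that an arriving request belongs to each class
DEPART_PROB = 0.1         # per-step probability that each admitted SU releases its channel
QOS_PENALTY = 5.0         # assumed cost of admitting an SU when no offered channel is free
REJECT, ACCEPT = 0, 1


def sample_class(rng):
    """Draw the class of the next arriving SU request."""
    return 0 if rng.random() < CLASS_PROBS[0] else 1


def step(occupied, su_class, action, rng):
    """Apply the admission decision, then let SUs depart. Returns (reward, next_occupied)."""
    reward = 0.0
    if action == ACCEPT:
        if occupied < CAPACITY:
            reward = PRICES[su_class]
            occupied += 1
        else:
            reward = -QOS_PENALTY  # admission would break PU QoS; the request is not served
    departures = sum(1 for _ in range(occupied) if rng.random() < DEPART_PROB)
    return reward, occupied - departures


def train(episodes=3000, horizon=50, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    # q[(occupied, su_class)] -> [value of REJECT, value of ACCEPT]
    q = {(n, k): [0.0, 0.0] for n in range(CAPACITY + 1) for k in range(len(PRICES))}
    for _ in range(episodes):
        occupied = rng.randint(0, CAPACITY)  # random start so every occupancy level is explored
        su_class = sample_class(rng)
        for _ in range(horizon):
            state = (occupied, su_class)
            if rng.random() < epsilon:
                action = rng.choice((REJECT, ACCEPT))
            else:
                action = ACCEPT if q[state][ACCEPT] > q[state][REJECT] else REJECT
            reward, occupied = step(occupied, su_class, action, rng)
            su_class = sample_class(rng)
            # Q-learning target: immediate reward plus discounted best next-state value
            target = reward + gamma * max(q[(occupied, su_class)])
            q[state][action] += alpha * (target - q[state][action])
    return q


def policy(q):
    """Extract the greedy admission policy from the learned Q-table."""
    return {s: ("accept" if v[ACCEPT] > v[REJECT] else "reject") for s, v in q.items()}


if __name__ == "__main__":
    learned = policy(train())
    for n in range(CAPACITY + 1):
        print(n, learned[(n, 0)], learned[(n, 1)])
```

With these assumed parameters, the learned table should reject requests when all offered channels are busy (admitting would incur the PU QoS penalty) and admit high-price requests when channels are free. Whether it also reserves capacity by refusing low-price requests near saturation depends on the chosen prices, departure rate, and discount factor.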