In this paper, a novel reinforcement learning (RL) approach with cell sectoring is proposed to solve the channel and power allocation problem for a device-to-device (D2D)-enabled cellular network when prior traffic information is not known to the base station (BS). Further, this paper explores an optimal policy for resource and power allocation among users, with the aim of maximizing the sum-rate of the overall system. Because the behavior of the wireless channel and the traffic requests of users in the system are stochastic in nature, we employ an actor-critic RL technique that learns the best policy through continuous interaction with the environment. The proposed work comprises four phases: cell splitting, clustering, a queuing model, and simultaneous channel and power allocation using actor-critic RL. The implementation of cell splitting with a novel clustering technique increases the network coverage, reduces co-channel interference, and minimizes the transmission power of nodes, whereas the queuing model addresses the waiting time of users in priority-based data transmission. Operating on a continuous state-action space, the actor-critic RL algorithm based on policy gradient improves the overall system sum-rate as well as the D2D throughput. The actor adopts a parameter-based stochastic policy that outputs continuous actions, while the critic evaluates the policy and criticizes the actor for the action taken. This reduces the high variance of the policy gradient. Through numerical simulations, the benefit of our resource sharing scheme over existing traditional schemes is verified.

KEYWORDS
actor-critic reinforcement learning, cell sectoring, device-to-device communication, k-means clustering, queuing model, resource allocation

Int J Commun Syst. 2020;33:e4315.
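The actor-critic interplay summarized in the abstract can be illustrated with a minimal sketch. It is not the algorithm proposed in this paper: it assumes a single user, a single fixed channel state, one-step episodes (so the temporal-difference error reduces to reward minus value), and a hypothetical reward equal to spectral efficiency minus a linear power cost. All names (`rate_reward`, `train_actor_critic`, `optimal_power`) and all parameter values are illustrative assumptions.

```python
import math
import random


def rate_reward(power, gain=1.0, price=0.5):
    """Hypothetical reward: log2(1 + gain * power) minus a linear power cost."""
    return math.log2(1.0 + gain * power) - price * power


def optimal_power(gain=1.0, price=0.5):
    """Closed-form maximizer of rate_reward, used only to check the result."""
    return 1.0 / (price * math.log(2.0)) - 1.0 / gain


def train_actor_critic(steps=5000, sigma=0.3, lr_actor=0.02, lr_critic=0.1,
                       p_max=4.0, seed=0):
    """One-step actor-critic with a Gaussian policy over transmit power."""
    rng = random.Random(seed)
    mean_power = 1.0  # actor parameter: mean of the stochastic policy
    value = 0.0       # critic parameter: estimate of the expected reward
    for _ in range(steps):
        # Actor: sample a continuous action from the parameterized policy.
        action = rng.gauss(mean_power, sigma)
        # The environment only accepts powers in [0, p_max].
        power = min(max(action, 0.0), p_max)
        reward = rate_reward(power)
        # Critic: the error between observed reward and current estimate
        # "criticizes" the action and serves as a variance-reducing baseline.
        td_error = reward - value
        value += lr_critic * td_error
        # Policy-gradient step: d/d(mean) log N(action; mean, sigma^2).
        score = (action - mean_power) / sigma ** 2
        mean_power += lr_actor * td_error * score
    return mean_power, value


if __name__ == "__main__":
    learned, value = train_actor_critic()
    print(f"learned mean power: {learned:.3f}")
    print(f"closed-form optimum: {optimal_power():.3f}")
    print(f"critic value estimate: {value:.3f}")
```

In this toy setting the critic's estimate acts as a baseline: subtracting it from the reward leaves the expected policy gradient unchanged while shrinking its variance, which is the effect the abstract attributes to the critic.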
wileyonlinelibrary.com/journal/dac KHUNTIA ET AL.

D2D communication allows direct communication between two users in close proximity, without the involvement of the base station (BS). In an underlay cellular network, D2D users (D2Ds) reuse the radio resources allocated to cellular users (CUs). When a D2D user reuses the resources of a CU, the two interfere with each other [1,2]. Therefore, selecting a suitable resource and power allocation scheme plays a vital role in reducing interference; in particular, a suitable transmission power must be chosen for each CU and D2D. Thus, as a central entity, the BS determines the transmission power of each user and the interference level using various scheduling algorithms [3]. However, traditional resource allocation methods do not yield an optimal outcome if the complete traffic information is not known to the BS a priori. There are various conventional D2D schemes that aim to maximize the network throughput. These include graph-based methods, fractional frequency reuse [4], Lagrange multipliers [5], and optimization algorithms, e.g., particle swarm optimization and the genetic algorithm (GA). ...
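The interference trade-off under resource reuse described above can be made concrete with a small numerical sketch. It assumes the standard Shannon-rate model for one CU and one D2D pair sharing a single uplink channel; the function names, gain symbols, and numerical values are hypothetical and are not taken from this paper.

```python
import math


def sum_rate(p_cu, p_d2d, g_cu_bs=1.0, g_d2d=1.0, g_d2d_bs=1.0,
             g_cu_d2d=1.0, noise=1.0):
    """Sum spectral efficiency (bit/s/Hz) of a CU and a D2D pair on one channel.

    g_cu_bs  : CU -> BS gain (desired cellular link)
    g_d2d    : D2D transmitter -> D2D receiver gain (desired D2D link)
    g_d2d_bs : D2D transmitter -> BS gain (interference at the BS)
    g_cu_d2d : CU -> D2D receiver gain (interference at the D2D receiver)
    """
    sinr_cu = p_cu * g_cu_bs / (noise + p_d2d * g_d2d_bs)
    sinr_d2d = p_d2d * g_d2d / (noise + p_cu * g_cu_d2d)
    return math.log2(1.0 + sinr_cu) + math.log2(1.0 + sinr_d2d)


def best_d2d_power(p_cu, p_max=2.0, n_steps=20, **gains):
    """Grid search for the D2D power in [0, p_max] that maximizes the sum-rate."""
    grid = [p_max * i / n_steps for i in range(n_steps + 1)]
    return max(grid, key=lambda p: sum_rate(p_cu, p, **gains))


if __name__ == "__main__":
    # Strong cross-links: reuse hurts, so the D2D pair should stay silent.
    print(best_d2d_power(3.0, g_d2d=0.1, g_d2d_bs=10.0, g_cu_d2d=1.0))
    # Weak cross-links: reuse helps, so the D2D pair can use full power.
    print(best_d2d_power(3.0, g_d2d=1.0, g_d2d_bs=0.01, g_cu_d2d=0.01))
```

An exhaustive grid search like this is only workable for a single pair with known gains; it is meant to show why the transmission power of each user has to be chosen jointly with the reuse decision.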