2020
DOI: 10.1109/access.2020.3038923
Deep Reinforcement Based Power Allocation for the Max-Min Optimization in Non-Orthogonal Multiple Access

Cited by 12 publications (10 citation statements)
References 34 publications
“…MDPs consist of states, actions, and a reward function definition. In [21], the state of an MDP model includes the power-coefficient values, the data rate of users, and vectors indicating which of the power coefficients can be increased or decreased. In [14], the channel conditions and transmission power are considered as state and action, respectively, and both of them have continuous space.…”
Section: A. Motivation and Contribution (mentioning)
confidence: 99%
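As a rough illustration of the MDP state described for [21] in the statement above, the sketch below groups the power coefficients, per-user data rates, and the increase/decrease indicator vectors into one state object. The class name, field names, and shapes are assumptions made here for illustration, not definitions taken from the cited paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class NomaMdpState:
    """Illustrative MDP state (names and shapes are assumptions)."""
    power_coeffs: np.ndarray   # current power-allocation coefficient of each user
    data_rates: np.ndarray     # current data rate of each user
    can_increase: np.ndarray   # boolean mask: which coefficients may still be increased
    can_decrease: np.ndarray   # boolean mask: which coefficients may still be decreased

    def as_vector(self) -> np.ndarray:
        # Flatten into a single feature vector, as a DRL agent would consume it.
        return np.concatenate([
            self.power_coeffs,
            self.data_rates,
            self.can_increase.astype(float),
            self.can_decrease.astype(float),
        ])

# Hypothetical two-user example.
state = NomaMdpState(
    power_coeffs=np.array([0.7, 0.3]),
    data_rates=np.array([1.2, 0.9]),
    can_increase=np.array([False, True]),
    can_decrease=np.array([True, False]),
)
print(state.as_vector())
```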
“…The design of the reward ρ(s_n, a_n) is related to the goal of the original design problem (P1), which attempts to maximize the worst sum rate of WNs. Here, we consider three kinds of reward designs for ρ(s_n, a_n), namely the worst accumulated sum rate among users (WASR), the difference of the worst accumulated sum rate among users at two adjacent time slots (DWASR) [29], and the instantaneous average sum rate of users (ISR), which are in turn defined as follows:…”
Section: Reward Design (mentioning)
confidence: 99%
“…represents the accumulated sum data rate for the kth WN up to the (n−1)th time, and R_{k,n} is calculated by (16) with respect to the action a_n and the state s_n at the time instant n. The idea of the WASR method is to simply use the worst accumulated sum rate of the WNs as the reward, according to the objective function of the optimization problem (P1). The DWASR method is conceptualized in accordance with [29], for which the difference of the worst accumulated sum rate at two adjacent time slots n and n − 1 is computed as the reward. The ISR method is proposed in this paper to take the instantaneous average sum rate of all WNs as the reward.…”
Section: Reward Design (mentioning)
confidence: 99%
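A minimal sketch of the three reward designs quoted above (WASR, DWASR, ISR), under the assumption that `acc_rates_prev` holds the accumulated sum rate of each WN up to slot n−1 and `inst_rates` the instantaneous rates R_{k,n} at slot n; the function and argument names are illustrative, and the exact expressions in the paper's equations may differ.

```python
import numpy as np

def wasr_reward(acc_rates_prev: np.ndarray, inst_rates: np.ndarray) -> float:
    """WASR: worst accumulated sum rate among users at slot n."""
    return float(np.min(acc_rates_prev + inst_rates))

def dwasr_reward(acc_rates_prev: np.ndarray, inst_rates: np.ndarray) -> float:
    """DWASR: difference of the worst accumulated sum rate at slots n and n-1."""
    return float(np.min(acc_rates_prev + inst_rates) - np.min(acc_rates_prev))

def isr_reward(inst_rates: np.ndarray) -> float:
    """ISR: instantaneous average sum rate over all WNs at slot n."""
    return float(np.mean(inst_rates))
```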
“…The number of neurons in the input layer for the DNNs is 2(1 + L) + K, as we explained in the description of the state s^t_{n,k}; see (17). We set L = 20; therefore, there are 52 neurons in the input layer.…”
Section: B. Hyper-parameter Selection (mentioning)
confidence: 99%
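A quick arithmetic check of the quoted input-layer size 2(1 + L) + K: with L = 20, the stated 52 input neurons implies K = 10 WNs. That value of K is an inference from the excerpt, not an explicit statement in it.

```python
# Input-layer size check for the DNNs: 2 * (1 + L) + K neurons.
L = 20                            # value quoted in the excerpt
K = 10                            # inferred so that the total matches the quoted 52
input_neurons = 2 * (1 + L) + K
assert input_neurons == 52
print(f"DNN input-layer size: {input_neurons} neurons")
```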
“…In reinforcement learning, the idea is to determine how agents ought to take actions by observing an environment such that a cumulative reward is maximized [13]. Several deep reinforcement learning (DRL)-based methods have been applied to solve power optimization problems in cellular networks, e.g., [14]- [17]. However, none of these works considered massive MIMO systems.…”
Section: Introduction (mentioning)
confidence: 99%