2015
DOI: 10.1007/s40595-015-0045-x
A multi-agent cooperative reinforcement learning model using a hierarchy of consultants, tutors and workers

Abstract: The hierarchical organisation of distributed systems can provide an efficient decomposition for machine learning. This paper proposes an algorithm for cooperative policy construction for independent learners, named Q-learning with aggregation (QA-learning). The algorithm is based on a distributed hierarchical learning model and utilises three specialisations of agents: workers, tutors and consultants. The consultant agent incorporates the entire system in its problem space, which it decomposes into sub-problem…
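The abstract describes a three-level hierarchy in which worker agents learn sub-problems and tutors and consultants combine their results. Because the abstract is truncated above, the exact aggregation rule is not reproduced here; the following is only a minimal sketch of the general idea, where the function names and the max-based merge are my own assumptions rather than the paper's method.

```python
import numpy as np

# Illustrative sketch only: how a tutor/consultant might merge the Q-tables
# of the agents below it. The max-based merge and all names are assumptions,
# not the QA-learning rule from the paper.
def aggregate_q_tables(q_tables):
    """Merge several agents' Q-tables for the same sub-problem,
    keeping the element-wise maximum estimate."""
    return np.maximum.reduce(q_tables)

n_states, n_actions = 6, 4
tutor_a = aggregate_q_tables([np.random.rand(n_states, n_actions) for _ in range(3)])
tutor_b = aggregate_q_tables([np.random.rand(n_states, n_actions) for _ in range(3)])
consultant = aggregate_q_tables([tutor_a, tutor_b])  # consultant merges its tutors
print(consultant.shape)  # (6, 4)
```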

Cited by 43 publications (16 citation statements)
References 25 publications
“…To the best of our knowledge, there is no RL-based method that attempts to infer the optimal behavior of UVs. Therefore, we built R using the Q-learning algorithm, estimating the cumulative reward of each situation-action pair in a Q-table, which is a basic method of RL [30]. Its performance largely depends on the environment design, and we utilized the same settings as those of AM and MF.…”
Section: B. Experiments Setting (mentioning)
confidence: 99%
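The method referred to in this excerpt is standard tabular Q-learning. A minimal, self-contained sketch follows; the toy environment and hyperparameters are illustrative assumptions, not the citing paper's settings.

```python
import random

# Tabular Q-learning on a toy chain environment (all settings illustrative).
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(state, action):
    """Deterministic toy dynamics: reward 1 for reaching the last state."""
    nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == n_states - 1 else 0.0)

for _ in range(1000):                # episodes
    s = 0
    for _ in range(20):              # steps per episode
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])
        s_next, r = step(s, a)
        # Core update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next
```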
“…An MDP comprises a set of states S = {s_1, ..., s_n} … indicates that the transition is invalid. The immediate expected reward for executing this transition is the deterministic reward R(s_x, a_z) [3]. It is important to note that the application of Q-learning to stochastic MDPs is beyond the scope of this paper.…”
Section: Q-learning (mentioning)
confidence: 99%
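For the deterministic reward R(s_x, a_z) described in this excerpt, the corresponding Q-learning update can be written in its standard deterministic form. This is a textbook formulation rather than a formula quoted from the citing paper, and the treatment of invalid transitions (excluding them from the maximisation) is my reading of the excerpt.

```latex
% Deterministic Q-learning update (textbook form, not quoted from [3]).
% \delta(s_x, a_z) denotes the deterministic successor state; invalid
% transitions are assumed to be excluded from the maximisation.
Q(s_x, a_z) \leftarrow R(s_x, a_z) + \gamma \max_{a'} Q\bigl(\delta(s_x, a_z), a'\bigr)
```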
“…As a consequence, AVE-Q may produce an incorrect policy, because it does not remove the bad Q-values at the interaction stage [3].…”
Section: Related Work (mentioning)
confidence: 99%
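To see why averaging without filtering can matter, consider a toy two-agent example; the numbers are made up for illustration and are not taken from either paper. A single agent's overestimated Q-value survives the average and ends up dominating the merged policy.

```python
# Toy illustration of the critique: averaging Q-values keeps a single
# agent's bad estimate in the merged table. All numbers are made up.
q_agent_1 = {"a": 1.0, "b": 0.5}    # well-explored estimates
q_agent_2 = {"a": 1.0, "b": 9.0}    # "b" is badly overestimated

averaged = {k: (q_agent_1[k] + q_agent_2[k]) / 2 for k in q_agent_1}
best_action = max(averaged, key=averaged.get)
print(best_action)  # 'b': the overestimate wins under plain averaging
```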