2022
DOI: 10.48550/arxiv.2202.08417
Preprint

Retrieval-Augmented Reinforcement Learning

Abstract: Most deep reinforcement learning (RL) algorithms distill experience into parametric behavior policies or value functions via gradient updates. While effective, this approach has several disadvantages: (1) it is computationally expensive, (2) it can take many updates to integrate experiences into the parametric model, (3) experiences that are not fully integrated do not appropriately influence the agent's behavior, and (4) behavior is limited by the capacity of the model. In this paper we explore an alternative…

Cited by 3 publications (7 citation statements)
References 39 publications

“…We randomly zero-out a subset of retrieved neighbors during training ("neighbor dropout"), and/or more adversarially, randomly replace a subset of retrieved neighbors with the neighbors of a different observation ("neighbor randomisation"). Inspired by [10], we also explore using a loss to regularise the embedding produced by the neighbor retrieval towards the embedding produced with the observation alone ("neighbor regularisation"). Further details are given in Sec.…”
Section: Regularisation
confidence: 99%
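
As a minimal illustrative sketch only (not the cited paper's implementation), the "neighbor dropout", "neighbor randomisation", and "neighbor regularisation" described above could be expressed roughly as follows, assuming the retrieved neighbors are held as a (batch, k, dim) tensor of embeddings; all function names and hyperparameters here are hypothetical.

```python
import torch
import torch.nn.functional as F

def perturb_neighbors(neighbors: torch.Tensor,
                      dropout_p: float = 0.2,
                      randomise_p: float = 0.1) -> torch.Tensor:
    """Apply neighbor dropout and neighbor randomisation during training.

    neighbors: (batch, k, dim) embeddings of the k retrieved neighbors.
    """
    batch, k, _ = neighbors.shape

    # Neighbor dropout: zero out a random subset of each observation's neighbors.
    keep_mask = (torch.rand(batch, k, 1, device=neighbors.device) > dropout_p).float()
    neighbors = neighbors * keep_mask

    # Neighbor randomisation: replace a random subset of neighbors with the
    # neighbors retrieved for a different observation in the same batch.
    swap_mask = torch.rand(batch, k, 1, device=neighbors.device) < randomise_p
    shuffled = neighbors[torch.randperm(batch, device=neighbors.device)]
    return torch.where(swap_mask, shuffled, neighbors)

def neighbor_regularisation_loss(emb_with_neighbors: torch.Tensor,
                                 emb_obs_only: torch.Tensor) -> torch.Tensor:
    # Neighbor regularisation: pull the neighbor-conditioned embedding towards
    # the embedding computed from the observation alone.
    return F.mse_loss(emb_with_neighbors, emb_obs_only.detach())
```

In this sketch the perturbations are applied only at training time, so the agent learns not to over-rely on any particular retrieved neighbor.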
“…through specifying the agent's action-value directly in terms of previously generated value estimates [6,13,15,28], or a model from observed transitions [38]) but rather learn end-to-end how the data can support better predictions within the parametric model. A recent approach by Goyal et al [10] has considered an attention mechanism to select where and what to use from available trajectories, but over a small retrieval batch of data rather than the full available experience data. Another class of method to leverage a transition dataset is to replay the data at training time in order to perform more gradient steps per experience, this is a widespread technique in modern RL algorithms [21,22,24,35] but it does not benefit the agent at test time, requires additional learning steps to adapt to new data, and does not allow end-to-end learning of how to relate past experience to new situations.…”
Section: Related Work
confidence: 99%
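
To make the attention-based retrieval mentioned above concrete, the following is a hedged sketch (not Goyal et al.'s code) of how a current observation embedding might attend over a small retrieval batch of trajectory embeddings, using PyTorch's standard multi-head attention; the class name and fusion step are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RetrievalAttention(nn.Module):
    """Cross-attention from the current observation to retrieved trajectories."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, obs_emb: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        """obs_emb: (batch, dim); retrieved: (batch, m, dim) trajectory embeddings."""
        query = obs_emb.unsqueeze(1)                       # (batch, 1, dim)
        attended, _ = self.attn(query, retrieved, retrieved)
        # Fuse what was read from the retrieval batch with the observation itself.
        return obs_emb + attended.squeeze(1)
```

The point of contrast drawn in the citation is that such attention operates over a small retrieval batch, whereas the cited work learns end-to-end how to query the full experience dataset.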
“…The hallmark of machine learning is to be able to develop models that can quickly adapt to new tasks once trained on sufficiently diverse tasks [5,54]. There are multiple ways to transfer information from one task to another: (1) transfer information via the transfer of the neural network weights (when trained on source tasks); (2) reuse raw data as in retrieval-based methods [6,37,32,33,19,51,17]; or (3), via knowledge distillation [23]. Each approach implies inevitable trade-offs: When directly transferring neural network weights, previous information about the data may be lost in the finetuning process, while transfer via raw data may be prohibitively expensive as there can be hundreds of thousands of past experiences.…”
Section: Related Work
confidence: 99%