Proceedings of the 2nd Workshop on Structured Prediction for Natural Language Processing 2017
DOI: 10.18653/v1/w17-4304

Structured Prediction via Learning to Search under Bandit Feedback

Abstract: We present an algorithm for structured prediction under online bandit feedback. The learner repeatedly predicts a sequence of actions, generating a structured output. It then observes feedback for that output and no others. We consider two cases: a pure bandit setting in which it only observes a loss, and more fine-grained feedback in which it observes a loss for every action. We find that the fine-grained feedback is necessary for strong empirical performance, because it allows for a robust variance-reduction…
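To make the setting concrete, here is a minimal sketch of the learning loop the abstract describes: a policy rolls out a sequence of actions, observes only a loss for the produced output (bandit feedback), and uses a running-average baseline as a simple variance-reduction device. The linear policy, feature map, and toy loss below are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: structured prediction under bandit feedback with a baseline.
# All names (features, bandit_loss, the linear policy) are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, SEQ_LEN, DIM = 5, 8, 16
theta = np.zeros((N_ACTIONS, DIM))           # linear policy parameters

def features(t, prev_action):
    """Hypothetical feature map for the state at step t."""
    x = np.zeros(DIM)
    x[t % DIM] = 1.0
    x[(prev_action + 1) % DIM] += 1.0
    return x

def sample_sequence(temperature=1.0):
    """Roll out one structured output; temperature controls exploration."""
    actions, states = [], []
    prev = -1
    for t in range(SEQ_LEN):
        x = features(t, prev)
        logits = theta @ x / temperature
        probs = np.exp(logits - logits.max()); probs /= probs.sum()
        a = rng.choice(N_ACTIONS, p=probs)
        actions.append(a); states.append((x, probs))
        prev = a
    return actions, states

def bandit_loss(actions):
    """Only the loss of the produced output is observed (pure bandit feedback)."""
    target = [t % N_ACTIONS for t in range(SEQ_LEN)]   # hidden reference output
    return sum(a != y for a, y in zip(actions, target)) / SEQ_LEN

baseline, lr, beta = 0.0, 0.1, 0.9
for episode in range(200):
    actions, states = sample_sequence()
    loss = bandit_loss(actions)
    advantage = loss - baseline                  # baseline = variance reduction
    for a, (x, probs) in zip(actions, states):
        grad_logp = -probs[:, None] * x[None, :]
        grad_logp[a] += x                        # grad of log pi(a | x)
        theta -= lr * advantage * grad_logp      # REINFORCE-style update
    baseline = beta * baseline + (1 - beta) * loss
```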

Cited by 9 publications (9 citation statements)
References 16 publications (14 reference statements)
“…However, the analysis by Choshen et al. (2020) missed a few crucial aspects of RL that have led to empirical success in previous works: First, variance reduction techniques such as the average reward baseline were already proposed with the original Policy Gradient by Williams (1992), and proved effective for NMT (Nguyen et al., 2017). Second, the exploration-exploitation trade-off can be controlled by modifying the sampling function (Sharaf and Daumé III, 2017), which in turn influences the peakiness.…”
Section: Introduction
confidence: 99%
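The second point in the statement above, controlling exploration by modifying the sampling function, can be illustrated in a few lines: a temperature (or an epsilon mixture with the uniform distribution) flattens or sharpens the policy's softmax, which is exactly the "peakiness" being traded off against exploitation. The snippet is a generic illustration with hypothetical values, not code from Sharaf and Daumé III (2017) or the other cited works.

```python
# Illustrative sketch: how a modified sampling function controls exploration.
import numpy as np

def softmax(logits, temperature=1.0):
    """Standard softmax; temperature > 1 flattens, temperature < 1 sharpens."""
    z = logits / temperature
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def epsilon_greedy(probs, epsilon=0.1):
    """Alternative sampling function: mix the policy with a uniform distribution."""
    return (1 - epsilon) * probs + epsilon / len(probs)

logits = np.array([3.0, 1.0, 0.5, 0.1])
print(softmax(logits))                  # peaked: mostly exploits the argmax action
print(softmax(logits, temperature=5))   # flatter: more exploration
print(epsilon_greedy(softmax(logits)))  # uniform mixing keeps every action possible
```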
“…Hence, they naturally generate an ordered sequence of frames, while the attention mechanism fuses the multi-modal information to select the next best frame satisfying diversity, query-relevance and visual coherence (Figure 3). We train the Pointer Network in our model using reinforcement learning, as it is useful for tasks with limited labeled data [4,5,19,30,36,40,56], as in the case of QAMVS.…”
Section: Pointer Network
confidence: 99%
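As a rough illustration of the selection mechanism described in the statement above, the sketch below runs one pointer-network-style decoding loop: additive attention over candidate frame encodings scores each frame, already-selected frames are masked, and the highest-scoring frame is appended to the ordered summary. The encoder states, weight shapes, and greedy decoding are assumptions for illustration, not the cited model's code.

```python
# Hypothetical pointer-network decoding sketch over pre-computed frame encodings.
import numpy as np

rng = np.random.default_rng(0)
N_FRAMES, HIDDEN = 10, 32
frame_enc = rng.normal(size=(N_FRAMES, HIDDEN))   # e.g., fused visual + query features
W_ref = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1
W_dec = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1
v = rng.normal(size=HIDDEN) * 0.1

def pointer_step(dec_state, selected):
    """Additive (Bahdanau-style) attention used as a pointer over input frames."""
    scores = np.tanh(frame_enc @ W_ref + dec_state @ W_dec) @ v
    for i in selected:
        scores[i] = -np.inf                # a frame cannot be selected twice
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()

selected, dec_state = [], np.zeros(HIDDEN)
for _ in range(3):                         # build an ordered 3-frame summary
    probs = pointer_step(dec_state, selected)
    nxt = int(np.argmax(probs))            # greedy here; RL training would sample
    selected.append(nxt)
    dec_state = frame_enc[nxt]             # feed the chosen frame back to the decoder
print(selected)
```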
“…Imitation learning algorithms are a great fit for training agents in simulated environments: access to ground-truth information about the environments allows optimal actions to be computed in many situations. The "teacher" in standard imitation learning algorithms (Daumé III et al., 2009; Ross et al., 2011; Ross and Bagnell, 2014; Chang et al., 2015; Sun et al., 2017; Sharaf and Daumé III, 2017) et al., 2019) models an advisor who is always present to help but speaks simple, templated language. CVDN (Thomason et al., 2019b) contains natural conversations in which a human assistant aids another human in navigation tasks but offers limited language interaction simulation, as language assistance is not available when the agent deviates from the collected trajectories and tasks.…”
Section: Related Work
confidence: 99%
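The "teacher with ground-truth access" idea in this passage can be made concrete with a small DAgger-style loop: the learner executes its own policy, an oracle that knows the goal labels every visited state with the optimal action, and the aggregated state-action pairs retrain the learner. The 1-D navigation task and counting "classifier" below are toy assumptions, not any of the cited algorithms verbatim.

```python
# Toy DAgger-style sketch: an oracle teacher labels states visited by the learner.
import numpy as np

rng = np.random.default_rng(0)
GOAL, ACTIONS = 7, (-1, +1)            # the oracle knows the goal; the learner does not

def oracle(pos):
    """Teacher with ground-truth access: the optimal action at any state."""
    return 0 if pos > GOAL else 1      # index into ACTIONS (move toward the goal)

counts = np.ones((20, 2))              # learner: per-state action counts

def learner(pos):
    return int(np.argmax(counts[pos]))

dataset = []
for it in range(10):                   # DAgger iterations
    pos = int(rng.integers(0, 20))
    for _ in range(15):                # roll out the *learner's* own policy
        dataset.append((pos, oracle(pos)))                 # teacher labels the visited state
        pos = int(np.clip(pos + ACTIONS[learner(pos)], 0, 19))
    counts[:] = 1                      # retrain on the aggregated dataset
    for s, a in dataset:
        counts[s, a] += 1

print([learner(s) for s in range(20)]) # visited states now point toward GOAL from both sides
```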