Reinforcement Mechanism Design for e-commerce

Cai, Qingpeng; Filos-Ratsikas, Aris; Tang, Pingzhong; Zhang, Yiwei

doi:10.1145/3178876.3186039

Cited by 70 publications

(65 citation statements)

References 33 publications

(42 reference statements)

“…Since the bidding strategies optimization in online advertising could be modeled as a sequential decision problem, several works utilized RL methods to solve it. Cai et al [4] formulated the impression allocation problem as an MDP and solved it by an actor-critic policy gradient algorithm based on DDPG. Cai et al [3] formulated a Markov Decision Process framework to learn sequential allocation of campaign budgets.…”

Section: RL Methods for Bidding Strategies (mentioning)

confidence: 99%

Learning to Advertise for Organic Traffic Maximization in E-Commerce Product Feeds

Chen

Jin

Zhang

et al. 2019

Proceedings of the 28th ACM International Conference on Information and Knowledge Management

Most e-commerce product feeds provide blended results of advertised products and recommended products to consumers. The underlying advertising and recommendation platforms share similar if not exactly the same set of candidate products. Consumers behaviors on the advertised results constitute part of the recommendation model's training data and therefore can influence the recommended results. We refer to this process as Leverage. Considering this mechanism, we propose a novel perspective that advertisers can strategically bid through the advertising platform to optimize their recommended organic traffic. By analyzing the real-world data, we first explain the principles of Leverage mechanism, i.e., the dynamic models of Leverage. Then we introduce a novel Leverage optimization problem and formulate it with a Markov Decision Process. To deal with the sample complexity challenge in model-free reinforcement learning, we propose a novel Hybrid Training Leverage Bidding (HTLB) algorithm which combines the real-world samples and the emulator-generated samples to boost the learning speed and stability. Our offline experiments as well as the results from the online deployment demonstrate the superior performance of our approach.

“…But because $\pi$ is not permutation invariant, we find a policy $\pi^*(P(c)) = \pi((P^* P^T P)(c))$ that is permutation invariant, where $P^* = \arg\max_{P \in \mathcal{P}} R(P(c)_{\pi(P(c))})$, then $R(c_{\pi^*(c)}) = R(P^*(c)_{\pi(P^*(c))}) > \frac{1}{|\mathcal{P}|} \sum_{P \in \mathcal{P}} R(P(c)_{\pi(P(c))})$, (8) which leads to a contradictory to (6) and (7). So it must be that Lemma 1 holds.…”

Section: Definition 1 (Permutation Invariant Policy) (mentioning)

confidence: 99%

“…The state is then transitioned into the next state. Such a model is tailored for a wide range of important realistic applications such as personalized recommender systems where users' preferences are regarded as states and items are regarded as items with contexts [20,26], and e-commerce where the private information (e.g., cost, reputation) of sellers can be viewed as states and different commercial strategies are regarded as contexts [7].…”

Section: Introduction (mentioning)

confidence: 99%

Policy Gradients for Contextual Recommendations

Pan

Cai

Tang

et al. 2019

The World Wide Web Conference

Self Cite

Decision making is a challenging task in online recommender systems. The decision maker often needs to choose a contextual item at each step from a set of candidates. Contextual bandit algorithms have been successfully deployed to such applications, for the tradeoff between exploration and exploitation and the state-of-art performance on minimizing online costs. However, the applicability of existing contextual bandit methods is limited by the over-simplified assumptions of the problem, such as assuming a simple form of the reward function or assuming a static environment where the states are not affected by previous actions. In this work, we put forward Policy Gradients for Contextual Recommendations (PGCR) to solve the problem without those unrealistic assumptions. It optimizes over a restricted class of policies where the marginal probability of choosing an item (in expectation of other items) has a simple closed form, and the gradient of the expected return over the policy in this class is in a succinct form. Moreover, PGCR leverages two useful heuristic techniques called Time-Dependent Greed and Actor-Dropout. The former ensures PGCR to be empirically greedy in the limit, and the latter addresses the trade-off between exploration and exploitation by using the policy network with Dropout as a Bayesian approximation. PGCR can solve the standard contextual bandits as well as its Markov Decision Process generalization. Therefore it can be applied to a wide range of realistic settings of recommendations, such as personalized advertising. We evaluate PGCR on toy datasets as well as a real-world dataset of personalized music recommendations. Experiments show that PGCR enables fast convergence and low regret, and outperforms both classic contextual-bandits and vanilla policy gradient methods.

“…The previous name is: Learning to Advertise with Adaptive Exposure via Constrained Two-Level Reinforcement Learning. successful applications of DRL techniques to optimize the decision-making process in E-commerce from different aspects including online recommendation [11], impression allocation [10,41], advertising bidding strategies [19,37,40] and product ranking [16].…”

Section: Introduction (mentioning)

confidence: 99%

Learning Adaptive Display Exposure for Real-Time Advertising

Wang

Jin

Hao

et al. 2019

Proceedings of the 28th ACM International Conference on Information and Knowledge Management

In E-commerce advertising, where product recommendations and product ads are presented to users simultaneously, the traditional setting is to display ads at fixed positions. However, under such a setting, the advertising system loses the flexibility to control the number and positions of ads, resulting in sub-optimal platform revenue and user experience. Consequently, major e-commerce platforms (e.g., Taobao.com) have begun to consider more flexible ways to display ads. In this paper, we investigate the problem of advertising with adaptive exposure: can we dynamically determine the number and positions of ads for each user visit under certain business constraints so that the platform revenue can be increased? More specifically, we consider two types of constraints: request-level constraint ensures user experience for each user visit, and platform-level constraint controls the overall platform monetization rate. We model this problem as a Constrained Markov Decision Process with per-state constraint (psCMDP) and propose a constrained two-level reinforcement learning approach to decompose the original problem into two relatively independent sub-problems. To accelerate policy learning, we also devise a constrained hindsight experience replay mechanism. Experimental evaluations on industry-scale real-world datasets demonstrate the merits of our approach in both obtaining higher revenue under the constraints and the effectiveness of the constrained hindsight experience replay mechanism.
