Feiyang Pan, Jia He, Dandan Tu, Qing He
It is a popular belief that model-based Reinforcement Learning (RL) is more sample efficient than model-free RL, but in practice, it is not always true due to overweighed model errors. In complex and noisy settings, model-based RL tends to have trouble using the model if it does not know when to trust the model.
In this work, we find that better model usage can make a huge difference. We propose Masked Model-based Actor-Critic (M2AC), a novel policy optimization algorithm that maximizes a model-based lower-bound of the true value function. Experiments on continuous control benchmarks demonstrate that M2AC has strong performance even when using long model rollouts in very noisy environments, and it significantly outperforms previous state-of-the-art methods. [Paper]
Feiyang Pan, Qingpeng Cai, An-Xiang Zeng, Chun-Xiang Pan, Qing Da, Hualin He, Qing He, Pingzhong Tang
Model-free reinforcement learning methods such as the Proximal Policy Optimization algorithm (PPO) have successfully applied in complex decision-making problems such as Atari games. However, these methods suffer from high variances and high sample complexity. On the other hand, model-based reinforcement learning methods that learn the transition dynamics are more sample efficient, but they often suffer from the bias of the transition estimation. How to make use of both model-based and model-free learning is a central problem in reinforcement learning.
In this paper, we present a new technique to address the tradeoff between exploration and exploitation, which regards the difference between model-free and model-based estimations as a measure of exploration value. We apply this new technique to the PPO algorithm and arrive at a new policy optimization method, named Policy Optimization with Modelbased Explorations (POME). Experiments show that POME outperforms PPO on 33 games out of 49 games. [Paper]
Feiyang Pan, Qingpeng Cai, Pingzhong Tang, Fuzhen Zhuang, Qing He
Decision making is a challenging task in online recommender systems. However, the applicability of existing contextual bandit methods is limited by the over-simplified assumptions of the problem, such as assuming a simple form of the reward function or assuming a static environment where the states are not affected by previous actions. In this work, we put forward Policy Gradients for Contextual Recommendations (PGCR) to solve the problem without those unrealistic assumptions. PGCR can solve the standard contextual bandits as well as its Markov Decision Process generalization. We evaluate PGCR on toy datasets as well as a real-world dataset of personalized music recommendations. Experiments show that PGCR enables fast convergence and low regret, and outperforms both classic contextual-bandits and vanilla policy gradient methods. [Paper]
Web page was made with Mobirise