Introduction The paper proposes an efficient method for online adaptation. The algorithm trains a global model that can use its recent experiences to adapt quickly, achieving fast online adaptation in dynamic environments.
The authors evaluate two versions of the approach on stochastic continuous control tasks:
(1) Recurrence-Based Adaptive Learner (ReBAL)
(2) Gradient-Based Adaptive Learner (GrBAL)
Objective Setting-Up To adapt to a dynamic environment, we require a learned model to adapt, using an update rule, after seeing M data points from some new “task”....
Part III - From AlphaGo to MuZero
[draft]
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model This is the paper that proposes MuZero. MuZero is quite famous as I write this note (Jan 2021). Many people have tried to reproduce the paper's incredible performance. Some well-known implementations, such as muzero-general, give a clear and modular implementation of MuZero. If you are interested in MuZero, you can play with it. Well, let’s dive into the paper....
Part II - From AlphaGo to MuZero
[draft]
Mastering the game of Go without human knowledge The paper proposes AlphaGo Zero, which is known for learning through self-play without human knowledge.
Reinforcement learning in AlphaGo Zero
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm The paper proposes AlphaZero, which is known for using self-play to compete in any kind of board game....
Simple Guide Of VDN And QMIX
[draft]
Value-Decomposition Network(VDN) QMIX Problem Setup And Assumption Constraint QMIX improves on the VDN algorithm by giving a more general form of the constraint. It defines the constraint as
$$\frac{\partial Q_{tot}}{\partial Q_a} \geq 0, \quad \forall a$$
where $Q_{tot}$ is the joint value function and $Q_a$ is the value function for each agent.
An intuitive explanation is that we want the weight of every individual value function to be non-negative. If the weight of an individual value function were negative, it would discourage the agent from cooperating, since the higher $Q_a$, the lower the joint value $Q_{tot}$....
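The idea that every individual value function should carry a non-negative weight can be sketched in a few lines of Python. This is only a simplified linear illustration, not the QMIX mixing network itself (which is nonlinear and conditioned on the global state); the function names `mix_monotonic` and `mix_vdn` are hypothetical, and taking absolute values is assumed here as one simple way to keep the weights non-negative.

```python
def mix_monotonic(agent_qs, raw_weights, bias=0.0):
    """Combine per-agent values into a joint value using non-negative weights.

    Hypothetical helper: a linear stand-in for a monotonic mixing function.
    """
    if len(agent_qs) != len(raw_weights):
        raise ValueError("need exactly one weight per agent")
    # abs() forces every weight to be >= 0, so raising any single agent's
    # value can never lower the joint value.
    weights = [abs(w) for w in raw_weights]
    return sum(w * q for w, q in zip(weights, agent_qs)) + bias


def mix_vdn(agent_qs):
    """VDN-style mixing: the joint value is the plain sum of agent values."""
    return sum(agent_qs)


if __name__ == "__main__":
    qs = [1.0, 2.0, 3.0]
    raw = [-0.5, 2.0, 0.0]  # a negative raw weight is made non-negative
    print(mix_monotonic(qs, raw))
    print(mix_vdn(qs))
```

In this sketch the VDN-style sum is the special case where every weight equals 1, which is why the non-negative-weight form can be read as the more general constraint.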
Part I - From AlphaGo to MuZero
[draft]
AlphaGo was quite famous when I was a college freshman. It is part of the reason I became addicted to Reinforcement Learning, so our journey through model-based RL will start here. Although it is not the first work to propose model-based RL, I still believe it gives a big picture of the field.
Mastering the game of Go with deep neural networks and tree search Introduction AlphaGo combines two kinds of models: a policy network and a value network....
An Introduction to Multi-Armed Bandit Problem
[draft]
Multi-Armed Bandit Problem Imagine you are in a casino and face multiple slot machines. Each machine is configured with an unknown probability of giving you a reward on a single play. The question is: what strategy yields the highest long-term reward?
An illustration of the multi-armed bandit problem, from Lil’Log, The Multi-Armed Bandit Problem and Its Solutions
Definition Upper Confidence Bounds(UCB) The UCB algorithm gives a relation between the upper bound and the confidence probability....
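To make the slot-machine setting concrete, here is a minimal sketch of a Bernoulli bandit played with an upper-confidence-bound rule. It assumes the standard UCB1 bonus, sqrt(2 ln t / n), which may differ from the exact form used in the post; the names `ucb1_select` and `run_bandit` and the reward probabilities are hypothetical.

```python
import math
import random


def ucb1_select(counts, values, t):
    """Return the arm with the highest upper confidence bound at step t >= 1."""
    # Play every arm once first so each has a defined estimate.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    # Estimated mean plus an exploration bonus that shrinks as an arm is played.
    scores = [
        values[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=lambda a: scores[a])


def run_bandit(probs, steps, seed=0):
    """Simulate slot machines with the given (hidden) reward probabilities."""
    rng = random.Random(seed)
    counts = [0] * len(probs)    # number of plays per arm
    values = [0.0] * len(probs)  # running mean reward per arm
    for t in range(1, steps + 1):
        arm = ucb1_select(counts, values, t)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values


if __name__ == "__main__":
    counts, values = run_bandit([0.1, 0.9], steps=2000)
    print(counts, values)
```

The bonus term is what trades off exploration against exploitation: rarely played arms get a larger bonus, so they are still tried occasionally even when their current estimate is low.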