Introduction The paper proposes an efficient method for online adaptation. The algorithm trains a global model that can use its recent experience to adapt quickly, achieving fast online adaptation in dynamic environments.
They evaluate two versions of the approach on stochastic continuous control tasks:
(1) Recurrence-Based Adaptive Learner (ReBAL)
(2) Gradient-Based Adaptive Learner (GrBAL)
Objective Setting-Up To adapt to a dynamic environment, we require a learned model $p_{\theta}^{*}$ to adapt, using an update rule $u_{\psi}^{*}$, after seeing M data points from some new “task”....
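Here is a minimal sketch of what "adapt after seeing M data points" can look like. It assumes the update rule is a single gradient step on the most recent M points, which is what the name GrBAL suggests. The 1-D linear model, the squared-error loss, and the names `predict`, `mse`, and `gradient_update` are all hypothetical stand-ins for illustration, not the paper's actual dynamics model.

```python
def predict(theta, x):
    """Toy 1-D 'dynamics model': the prediction is theta * x (an assumption)."""
    return theta * x


def mse(theta, data):
    """Mean squared error of the model on a list of (x, y) pairs."""
    return sum((predict(theta, x) - y) ** 2 for x, y in data) / len(data)


def gradient_update(theta, recent, lr=0.1):
    """One gradient step on the M most recent data points.

    This plays the role of the update rule: it maps the global
    parameters to adapted parameters using only recent experience.
    """
    grad = sum(2.0 * (predict(theta, x) - y) * x for x, y in recent) / len(recent)
    return theta - lr * grad


# M = 2 recent points from a "new task" whose true slope is 2.0
recent = [(1.0, 2.0), (2.0, 4.0)]
theta = 1.0                      # global (pre-adaptation) parameter
adapted = gradient_update(theta, recent)
```

In this toy setting the adapted parameter moves toward the new task's slope, so the error on the recent points drops after one step.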
Part III - From AlphaGo to MuZero
[draft]
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model This is the paper that proposes MuZero. MuZero was quite famous when I wrote this note (Jan 2021). Many people have tried to reproduce the paper's incredible performance. Some well-known implementations, such as muzero-general, give a clear and modular implementation of MuZero. If you are interested in MuZero, you can play with it. Well, let’s dive into the paper....
Part II - From AlphaGo to MuZero
[draft]
Mastering the game of Go without human knowledge The paper proposes AlphaGo Zero, which is known for learning through self-play without human knowledge.
Reinforcement learning in AlphaGo Zero $$ (p, v) = f_{\theta}(s) $$
$$ l = (z - v)^2 - \pi^T \log p + c\lVert\theta\rVert^2 $$
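To make the loss concrete, here is a small sketch that evaluates it for a single position. I read the symbols as in the paper: $z$ is the game outcome, $v$ the predicted value, $\pi$ the search probabilities, $p$ the network's move probabilities, and $\theta$ the network weights. The function name `alphago_zero_loss`, the flat-list representation of $\theta$, and the example numbers are my own assumptions for illustration.

```python
import math


def alphago_zero_loss(z, v, pi, p, theta, c=1e-4):
    """l = (z - v)^2 - pi^T log(p) + c * ||theta||^2 for one position.

    pi and p are lists of probabilities over moves; theta is a flat list
    of weights (a simplification of a real network's parameters).
    """
    value_loss = (z - v) ** 2
    # Cross-entropy between search probabilities and network policy;
    # moves with pi_a == 0 contribute nothing, so they are skipped.
    policy_loss = -sum(pi_a * math.log(p_a) for pi_a, p_a in zip(pi, p) if pi_a > 0)
    l2_penalty = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2_penalty


# Hypothetical example: two legal moves, a win (z = 1), and two weights.
loss = alphago_zero_loss(z=1.0, v=0.5, pi=[0.5, 0.5], p=[0.5, 0.5],
                         theta=[1.0, 2.0], c=0.1)
```

The three terms pull in different directions: the first fits the value head to the game result, the second fits the policy head to the search, and the third keeps the weights small.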
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm The paper proposes AlphaZero, which is known for using self-play to compete in any kind of board game....
Simple Guide Of VDN And QMIX
[draft]
Value-Decomposition Network(VDN) QMIX Problem Setup And Assumption Constraint QMIX improves on the VDN algorithm by giving a more general form of the constraint. It defines the constraint as
$$\frac{\partial Q_{tot}}{\partial Q_{a}} \geq 0, \forall a$$
where $Q_{tot}$ is the joint value function and $Q_{a}$ is the value function for each agent.
An intuitive explanation is that we want the weight of every individual value function $Q_{a}$ to be non-negative. If the weight of an individual value function $Q_{a}$ were negative, it would discourage the agent from cooperating, since the higher $Q_{a}$ is, the lower the joint value $Q_{tot}$ becomes....
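The monotonicity constraint can be sketched with a toy mixer. This is a deliberately simplified, single-layer linear version: the real QMIX mixing network is deeper and produces its weights from the global state, which I leave out here. The names `mix` and `partial_derivative` are hypothetical. The only idea kept is that taking the absolute value of the raw weights forces every $\partial Q_{tot} / \partial Q_{a}$ to be non-negative.

```python
def mix(q_values, raw_weights, bias=0.0):
    """Toy monotonic mixer: Q_tot = sum_a |w_a| * Q_a + bias."""
    weights = [abs(w) for w in raw_weights]  # enforce non-negative weights
    return sum(w * q for w, q in zip(weights, q_values)) + bias


def partial_derivative(q_values, raw_weights, agent, eps=1e-6):
    """Finite-difference estimate of dQ_tot / dQ_a for one agent."""
    bumped = list(q_values)
    bumped[agent] += eps
    return (mix(bumped, raw_weights) - mix(q_values, raw_weights)) / eps


q_values = [1.0, 2.0, 3.0]
raw_weights = [-0.7, 0.3, -1.2]   # negative raw weights are made harmless by abs()
derivatives = [partial_derivative(q_values, raw_weights, a) for a in range(3)]

# With all weights equal to 1 the mixer reduces to the plain VDN sum.
vdn_total = mix(q_values, [1.0, 1.0, 1.0])
```

Even though two of the raw weights are negative, every estimated derivative is non-negative, so raising any agent's $Q_{a}$ never lowers $Q_{tot}$.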
Part I - From AlphaGo to MuZero
[draft]
AlphaGo was quite famous when I was a college freshman. It is somehow the reason I became addicted to Reinforcement Learning. Thus our journey through model-based RL will start here. Although it is not the first work to propose model-based RL, I still believe it gives a big picture of model-based RL.
Mastering the game of Go with deep neural networks and tree search Introduction AlphaGo combines two kinds of models: a policy network and a value network....
An Introduction to Multi-Armed Bandit Problem
[draft]
Multi-Armed Bandit Problem Imagine you are in a casino and face multiple slot machines. Each machine is configured with an unknown probability of giving a reward on one play. The question is: what is the strategy to get the highest long-term reward?
An illustration of the multi-armed bandit problem; see Lil’Log, The Multi-Armed Bandit Problem and Its Solutions
Definition Upper Confidence Bounds(UCB) The UCB algorithm gives a relation between the upper bound and the confidence probability....
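As a rough illustration, here is a sketch of UCB1, one common member of the UCB family. I am assuming this variant, with the bonus term $\sqrt{2 \ln t / n_a}$ and Bernoulli slot machines; the excerpt above does not say which form it derives. The function names `ucb1_select` and `run_ucb1` and the arm probabilities are hypothetical.

```python
import math
import random


def ucb1_select(counts, values, t):
    """Pick the arm with the highest upper confidence bound at step t."""
    # Play every arm once first so each bound is well defined.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [values[a] + math.sqrt(2.0 * math.log(t) / counts[a])
              for a in range(len(counts))]
    return max(range(len(counts)), key=scores.__getitem__)


def run_ucb1(probs, steps, seed=0):
    """Simulate UCB1 on Bernoulli arms with the given reward probabilities."""
    rng = random.Random(seed)
    counts = [0] * len(probs)      # number of plays per arm
    values = [0.0] * len(probs)    # running mean reward per arm
    for t in range(1, steps + 1):
        arm = ucb1_select(counts, values, t)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values


counts, values = run_ucb1(probs=[0.2, 0.5, 0.8], steps=500)
```

The bonus shrinks as an arm is played more often, so rarely played arms keep getting a second look while well-explored arms are judged mostly by their mean reward.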