RL | Golden Hat

A Paper Review: Learning to Adapt
^[draft]

Introduction: The paper proposes an efficient method for online adaptation. The algorithm trains a global model that can use its recent experience to adapt quickly, achieving fast online adaptation in dynamic environments. The authors evaluate two versions of the approach on stochastic continuous control tasks: (1) Recurrence-Based Adaptive Learner (ReBAL) and (2) Gradient-Based Adaptive Learner (GrBAL). Objective Setup: To adapt to a dynamic environment, we require a learned model $p_{\theta}^{*}$ to adapt, using an update rule $u_{\psi}^{*}$, after seeing M data points from some new “task”....

Part III - From AlphaGo to MuZero
^[draft]

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model: This is the paper that proposes MuZero. MuZero is quite famous as I write this note (Jan 2021). Many people have tried to reproduce the paper's remarkable performance. Some well-known implementations, such as muzero-general, give a clear and modular implementation of MuZero. If you are interested in MuZero, you can play with it. Well, let’s dive into the paper....

Part II - From AlphaGo to MuZero
^[draft]

Mastering the game of Go without human knowledge: The paper proposes AlphaGo Zero, which learns purely through self-play, without human knowledge. Reinforcement learning in AlphaGo Zero: $$ (p, v) = f_{\theta}(s) $$ $$ l = (z - v)^2 - \pi^T \log(p) + c\|\theta\|^2 $$ Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm: The paper proposes AlphaZero, which uses self-play to compete at many kinds of board games....
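As a minimal sketch of the AlphaGo Zero loss quoted above, the following Python function evaluates $l = (z - v)^2 - \pi^T \log(p) + c\|\theta\|^2$ for a single position. The function name `alphago_zero_loss`, the flat list standing in for the parameters $\theta$, and the default value of `c` are illustrative assumptions, not code from the paper.

```python
import math


def alphago_zero_loss(z, v, pi, p, theta, c=1e-4):
    """Loss for one position: value error + policy cross-entropy + L2 penalty.

    z     : game outcome from the current player's view (e.g. -1 or +1)
    v     : value predicted by the network
    pi    : search probabilities (training target for the policy)
    p     : move probabilities predicted by the network
    theta : network parameters, flattened into a list (assumption for this sketch)
    c     : L2 regularization coefficient (hypothetical default)
    """
    value_loss = (z - v) ** 2
    # Cross-entropy -pi^T log(p); terms with pi_a == 0 contribute nothing.
    policy_loss = -sum(
        pi_a * math.log(p_a) for pi_a, p_a in zip(pi, p) if pi_a > 0
    )
    l2_penalty = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2_penalty


if __name__ == "__main__":
    loss = alphago_zero_loss(
        z=1.0, v=0.5, pi=[0.5, 0.5], p=[0.5, 0.5], theta=[1.0, 2.0], c=0.1
    )
    print(loss)
```

The three terms are kept separate so each one can be checked against the corresponding term in the formula.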

Simple Guide Of VDN And QMIX
^[draft]

Value-Decomposition Network (VDN). QMIX. Problem Setup and Assumption. Constraint: QMIX improves on the VDN algorithm by giving a more general form of the constraint. It defines the constraint as $$\frac{\partial Q_{tot}}{\partial Q_{a}} \geq 0, \forall a$$ where $Q_{tot}$ is the joint value function and $Q_{a}$ is the value function of each agent. An intuitive explanation is that we want the weight on every individual value function $Q_{a}$ to be non-negative. If the weight on an individual value function $Q_{a}$ were negative, it would discourage that agent from cooperating, since the higher $Q_{a}$ is, the lower the joint value $Q_{tot}$ becomes....
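The monotonicity constraint above can be illustrated with a deliberately simplified mixer. This sketch assumes a single linear mixing layer whose raw weights are passed through an absolute value, so that $\partial Q_{tot} / \partial Q_{a} \geq 0$ holds by construction. QMIX itself uses a mixing network whose weights are produced from the global state; the function names `mix_monotonic` and `vdn_mix` are hypothetical.

```python
def mix_monotonic(agent_qs, raw_weights, bias=0.0):
    """Combine per-agent values into Q_tot with non-negative weights.

    Taking abs() of each raw weight guarantees dQ_tot/dQ_a = |w_a| >= 0,
    which is the constraint stated above.
    """
    return sum(abs(w) * q for w, q in zip(raw_weights, agent_qs)) + bias


def vdn_mix(agent_qs):
    """VDN as the special case: Q_tot is the plain sum of agent values."""
    return sum(agent_qs)


if __name__ == "__main__":
    raw_weights = [-0.5, 2.0]  # the negative raw weight becomes 0.5 after abs()
    before = mix_monotonic([1.0, 2.0], raw_weights)
    after = mix_monotonic([1.5, 2.0], raw_weights)  # agent 0 improves its value
    print(before, after)  # Q_tot does not decrease when one agent's Q_a increases
    print(vdn_mix([1.0, 2.0]))
```

Because every effective weight is non-negative, an agent that greedily maximizes its own $Q_{a}$ never lowers $Q_{tot}$.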

Part I - From AlphaGo to MuZero
^[draft]

AlphaGo was quite famous when I was a college freshman. It is, in a way, the reason I became addicted to reinforcement learning. So our journey through model-based RL starts here. Although it is not the first work to propose model-based RL, I still believe it gives a big picture of model-based RL. Mastering the game of Go with deep neural networks and tree search. Introduction: AlphaGo combines two kinds of models: a policy network and a value network....

An Introduction to Multi-Armed Bandit Problem
^[draft]

Multi-Armed Bandit Problem: Imagine you are in a casino and face multiple slot machines. Each machine is configured with an unknown probability of giving a reward on each play. The question is: what strategy achieves the highest long-term reward? An illustration of the multi-armed bandit problem; see Lil’Log, The Multi-Armed Bandit Problem and Its Solutions. Definition. Upper Confidence Bounds (UCB): The UCB algorithm gives a relation between the upper bound and the confidence probability....
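The excerpt does not say which UCB variant the post uses. As an assumption, the sketch below uses the common UCB1 rule, which plays the arm maximizing the empirical mean reward plus the bonus $\sqrt{2 \ln t / N_a}$, on simulated Bernoulli slot machines. The names `select_arm` and `run_ucb1` and the example probabilities are hypothetical.

```python
import math
import random


def select_arm(counts, values, t):
    """Return the arm with the highest UCB1 score at step t (t >= 1)."""
    # Play every arm once before comparing confidence bounds.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        values[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=lambda a: scores[a])


def run_ucb1(probs, steps, seed=0):
    """Simulate UCB1 on Bernoulli arms with the given success probabilities."""
    rng = random.Random(seed)
    counts = [0] * len(probs)    # number of pulls per arm
    values = [0.0] * len(probs)  # running mean reward per arm
    for t in range(1, steps + 1):
        arm = select_arm(counts, values, t)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the empirical mean.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values


if __name__ == "__main__":
    counts, values = run_ucb1([0.1, 0.5, 0.9], steps=2000)
    print(counts)
    print(values)
```

The bonus term shrinks as an arm is pulled more often, so rarely tried arms keep getting explored while the arm with the best estimated reward is played most of the time.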