Part I - Toward NNGP and NTK
  [draft]

Neural Network Gaussian Process (NNGP) Model the neural network as a GP, aka a neural network Gaussian Process (NNGP). Intuitively, the NNGP kernel computes the similarity between the output vectors of 2 input data points. We define the following functions as a neural network with fully-connected layers: $$z_{i}^{1}(x) = b_i^{1} + \sum_{j=1}^{N_1} W_{ij}^{1}x_j^{1}(x), \ \ x_{j}^{1}(x) = \phi\Big(b_j^{0} + \sum_{k=1}^{d_{in}} W_{jk}^{0}x_k\Big)$$ where $b_i^{1}$ and $W_{ij}^{1}$ are the biases and weights of the second layer (the one producing $z^{1}$), $b_j^{0}$ and $W_{jk}^{0}$ are the biases and weights of the first (input) layer, the function $\phi$ is the activation function, and $x$ is the input data of the neural network....
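
To make the intuition concrete, here is a minimal NumPy sketch (my own illustration, not code from the post) of the forward pass above and a Monte Carlo estimate of the NNGP kernel $K(x, x') = \mathbb{E}[z_i^{1}(x)\, z_i^{1}(x')]$, where the expectation is over random weights and biases; the variance scalings and all function names are assumptions.

```python
import numpy as np

def forward(x, W0, b0, W1, b1, phi=np.tanh):
    """One hidden layer matching the equations: x -> phi(W0 x + b0) -> W1 (.) + b1."""
    x1 = phi(W0 @ x + b0)           # x_j^1(x)
    return W1 @ x1 + b1             # z_i^1(x)

def nngp_kernel_mc(x, x_prime, n_hidden=512, n_out=1, n_samples=200,
                   sigma_w=1.0, sigma_b=1.0, seed=0):
    """Estimate K(x, x') by averaging z^1(x) . z^1(x') over random parameter draws."""
    rng = np.random.default_rng(seed)
    d_in, acc = len(x), 0.0
    for _ in range(n_samples):
        W0 = rng.normal(0.0, sigma_w / np.sqrt(d_in), (n_hidden, d_in))
        b0 = rng.normal(0.0, sigma_b, n_hidden)
        W1 = rng.normal(0.0, sigma_w / np.sqrt(n_hidden), (n_out, n_hidden))
        b1 = rng.normal(0.0, sigma_b, n_out)
        acc += forward(x, W0, b0, W1, b1) @ forward(x_prime, W0, b0, W1, b1) / n_out
    return acc / n_samples

print(nngp_kernel_mc(np.array([1.0, -0.5]), np.array([0.8, 0.2])))
```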

March 15, 2021 · 1 min · SY Chou

Part III - From AlphaGo to MuZero
  [draft]

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model This is the paper that proposes MuZero. MuZero is quite famous as I write this note (Jan 2021). Lots of people have tried to reproduce the incredible performance reported in the paper. Some well-known implementations, like muzero-general, give a clear and modular implementation of MuZero. If you are interested in MuZero, you can play with it. Well, let's dive into the paper....

March 4, 2021 · 4 min · SY Chou

Part II - From AlphaGo to MuZero
  [draft]

Mastering the game of Go without human knowledge The paper proposes AlphaGo Zero, which is known for self-play without human knowledge. Reinforcement learning in AlphaGo Zero $$ (p, v) = f_{\theta}(s) $$ $$ l = (z - v)^2 - \pi^T \log(p) + c\|\theta\|^2 $$ Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm The paper proposes AlphaZero, which extends self-play to compete in several kinds of board games....
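
For concreteness, here is a minimal NumPy sketch (my own, not the post's or DeepMind's code) of that loss for a single position; the function and argument names are hypothetical.

```python
import numpy as np

def alphago_zero_loss(p, v, pi, z, theta, c=1e-4):
    """l = (z - v)^2 - pi^T log(p) + c * ||theta||^2.
    p: predicted move probabilities, v: predicted value in [-1, 1],
    pi: MCTS visit-count policy target, z: game outcome, theta: flattened weights."""
    value_loss = (z - v) ** 2
    policy_loss = -pi @ np.log(p + 1e-12)   # cross-entropy against the search policy
    l2_penalty = c * np.sum(theta ** 2)     # weight regularization
    return value_loss + policy_loss + l2_penalty
```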

March 4, 2021 · 1 min · SY Chou

Simple Guide Of VDN And QMIX
  [draft]

Value-Decomposition Network (VDN) QMIX Problem Setup And Assumption Constraint QMIX improves on the VDN algorithm by giving a more general form of the constraint. It defines the constraint as $$\frac{\partial Q_{tot}}{\partial Q_{a}} \geq 0, \ \forall a$$ where $Q_{tot}$ is the joint value function and $Q_{a}$ is the value function of each agent. An intuitive explanation is that we want the weight on every individual value function $Q_{a}$ to be positive. If the weight on an individual value function $Q_{a}$ were negative, it would discourage that agent from cooperating, since the higher $Q_{a}$ is, the lower the joint value $Q_{tot}$ would be....
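
As a toy illustration (my own sketch, not QMIX's actual mixing network), one way to satisfy this constraint is to force the mixing weights to be non-negative, e.g. by taking their absolute value, so that $Q_{tot}$ is monotonically increasing in every $Q_{a}$; the names below are hypothetical.

```python
import numpy as np

def mix(q_agents, w, b):
    """Combine per-agent values Q_a into Q_tot with non-negative weights,
    so that dQ_tot/dQ_a >= 0 for every agent a."""
    w_pos = np.abs(w)               # clamp mixing weights to be non-negative
    return w_pos @ q_agents + b

q = np.array([1.2, -0.3, 0.7])      # Q_a for three agents
print(mix(q, w=np.array([0.5, -1.0, 2.0]), b=0.1))
```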

February 26, 2021 · 2 min · SY Chou

A Guide Of Variational Lower Bound
  [draft]

Problem Setup The Variational Lower Bound is also known as the Evidence Lower Bound (ELBO) or VLB. It is quite useful because we can derive a lower bound for a model containing a hidden variable. Furthermore, we can even maximize the bound in order to maximize the log probability. We assume that $X$ are the observations (data) and $Z$ are the hidden/latent variables, which are unobservable. In general, we can also view $Z$ as a parameter, and the relationship between $Z$ and $X$ is represented as the following...
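
For reference, the standard derivation of the bound (written here in my own notation, which may differ slightly from the post's) follows from Jensen's inequality:

$$\log p(X) = \log \int p(X, Z)\, dZ = \log \mathbb{E}_{q(Z)}\left[\frac{p(X, Z)}{q(Z)}\right] \geq \mathbb{E}_{q(Z)}\left[\log \frac{p(X, Z)}{q(Z)}\right] = \mathcal{L}$$

Maximizing $\mathcal{L}$ over the variational distribution $q(Z)$ therefore pushes up a lower bound on the log probability of the data.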

February 23, 2021 · 4 min · SY Chou

Some Intuition Of MLE, MAP, and Bayesian Estimation
  [draft]

The main difference between the 3 kinds of estimation is What do we assume for the prior? Maximum Likelihood Estimation (MLE) doesn't use any prior but only maximizes the probability according to the samples. On the other hand, MAP and Bayesian estimation both use priors to estimate the probability. Maximum A Posteriori (MAP) estimation only uses the probability of a single event, while Bayesian estimation treats the prior as a full distribution. To be continued…...
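
As a worked comparison (in my own notation), with data $D$ and parameter $\theta$:

$$\hat{\theta}_{MLE} = \arg\max_{\theta} p(D \mid \theta), \quad \hat{\theta}_{MAP} = \arg\max_{\theta} p(D \mid \theta)\, p(\theta), \quad p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta')\, p(\theta')\, d\theta'}$$

MLE ignores the prior, MAP keeps only a single point (the posterior mode), and Bayesian estimation keeps the whole posterior distribution.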

February 19, 2021 · 1 min · SY Chou

Part I - From AlphaGo to MuZero
  [draft]

AlphaGo was quite famous when I was a freshman in college. It is somehow the reason that I became addicted to Reinforcement Learning. Thus our journey through model-based RL will start here. Although it is not the first work to propose model-based RL, I still believe it gives a big picture of model-based RL. Mastering the game of Go with deep neural networks and tree search Introduction AlphaGo combines 2 kinds of models: a policy network and a value network....

February 19, 2021 · 5 min · SY Chou

A Glimpse of Distributional RL
  [draft]

Introduction In traditional reinforcement learning, an agent predicts a single value for each state-action pair. Distributional RL predicts a distribution of values for the pair. The advantage of distributional RL is that the agent can improve its estimates with more information and do so more quickly. At the same time, the agent can be made sensitive to the risk of an action. This is very useful for applications like safe reinforcement learning, self-driving cars, etc…...
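
A toy NumPy sketch (my own, not from the post) of the difference: a categorical value distribution over a fixed support versus the single scalar that traditional RL would keep, plus a crude quantile-based risk statistic; the support and probabilities are made up.

```python
import numpy as np

atoms = np.linspace(-10.0, 10.0, 51)          # support of the return distribution
probs = np.full(51, 1.0 / 51)                 # predicted probabilities for one (s, a)
q_value = probs @ atoms                       # traditional RL keeps only this expectation
worst_5pct = atoms[np.searchsorted(np.cumsum(probs), 0.05)]  # crude 5% quantile (risk)
print(q_value, worst_5pct)
```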

February 16, 2021 · 1 min · SY Chou

An Introduction to Multi-Armed Bandit Problem
  [draft]

Multi-Armed Bandit Problem Imagine you are in a casino and face multiple slot machines. Each machine is configured with an unknown probability of how likely you are to get a reward in one play. The question is What’s the strategy to get the highest long-term reward? An illustration of the multi-armed bandit problem; see Lil’Log’s The Multi-Armed Bandit Problem and Its Solutions. Definition Upper Confidence Bounds (UCB) The UCB algorithm gives a relation between the upper bound and the confidence probability....
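
A minimal Python sketch of that idea (my own, assuming Bernoulli rewards and the standard UCB1 bonus $\sqrt{2 \ln t / N(a)}$; all names and parameters are illustrative):

```python
import numpy as np

def ucb1(true_probs=(0.2, 0.5, 0.7), horizon=1000, seed=0):
    """Pick the arm with the highest empirical mean plus an exploration bonus."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    for t in range(1, horizon + 1):
        if t <= n_arms:                               # play each arm once first
            a = t - 1
        else:
            a = int(np.argmax(means + np.sqrt(2 * np.log(t) / counts)))
        reward = float(rng.random() < true_probs[a])  # Bernoulli reward
        counts[a] += 1
        means[a] += (reward - means[a]) / counts[a]   # incremental mean update
    return counts, means

print(ucb1())
```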

February 16, 2021 · 2 min · SY Chou