From EM To VBEM

1. Introduction When we use K-Means or a GMM to solve a clustering problem, the most important hyperparameter is the number of clusters. It is hard to choose and it affects performance significantly. Meanwhile, K-Means also cannot handle unbalanced datasets well. However, the variational Bayesian Gaussian mixture model (VB-GMM) can address both issues. VB-GMM is a Bayesian model that places priors over the parameters of a GMM. Thus, VB-GMM can be optimized by variational Bayesian expectation maximization (VBEM) and finds the optimal number of clusters automatically....
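A minimal sketch of this idea, assuming scikit-learn's `BayesianGaussianMixture` (not the code from the post) and a made-up toy dataset: we deliberately over-specify the number of components and let VBEM shrink the weights of the unneeded ones.

```python
# Sketch: a variational Bayesian GMM pruning unused components automatically.
# Uses scikit-learn's BayesianGaussianMixture; the toy data is made up.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Three true clusters, one of them small (unbalanced), but we ask for up to 10.
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(300, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(300, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(50, 2)),
])

vbgmm = BayesianGaussianMixture(
    n_components=10,                  # upper bound on K
    weight_concentration_prior=1e-2,  # small prior favors fewer active components
    max_iter=500,
    random_state=0,
).fit(X)

# Components with non-negligible weight are the ones VBEM actually kept.
print(np.round(vbgmm.weights_, 3))
print("effective clusters:", np.sum(vbgmm.weights_ > 0.01))
```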

July 9, 2021 · 6 min · SY Chou

A Review of SVM and SMO

Note: the full code is on my GitHub. 1. Abstract In this article, I will derive the SMO algorithm and the Fourier kernel approximation, which are well-known algorithms for kernel machines. SMO solves the optimization problem of the SVM efficiently, and the Fourier kernel approximation is a kind of kernel approximation that speeds up the computation of the kernel matrix. In the last section, I will conduct an evaluation of my manually implemented SVM on a simulated dataset and the “Women’s Clothing E-Commerce Review Dataset”....
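Not the post's derivation, but a quick sketch of one common Fourier kernel approximation, random Fourier features, with made-up data: the RBF kernel is approximated by an inner product of explicit random feature maps.

```python
# Sketch of random Fourier features: the RBF kernel
# k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) is approximated by z(x)^T z(y).
import numpy as np

def rff_features(X, n_features=500, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))  # spectral samples
    b = rng.uniform(0, 2 * np.pi, size=n_features)           # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Z = rff_features(X)

exact = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)  # sigma = 1
approx = Z @ Z.T
print(np.max(np.abs(exact - approx)))  # shrinks as n_features grows
```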

July 8, 2021 · 17 min · SY Chou

Part II - Toward NNGP and NTK

Neural Tangent Kernel (NTK) “In short, the NTK represents the change of the weights before and after a gradient descent update.” Let’s start the journey of revealing black-box neural networks. Set Up a Neural Network First of all, we need to define a simple neural network with 2 hidden layers $$y(x, w)$$ where $y$ is the neural network with weights $w \in \mathbb{R}^m$, and $\{ x, \bar{y} \}_N$ is the dataset, a set of input and output data with $N$ data points....
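To make the quoted intuition concrete, here is a rough sketch (my own, not the post's code) of the empirical NTK of a tiny network, computed as the inner product of parameter gradients; the architecture and finite-difference gradients are assumptions for illustration only.

```python
# Sketch: empirical NTK of a small scalar-output network,
# Theta(x, x') = <d y(x,w)/dw, d y(x',w)/dw>, gradients by finite differences.
import numpy as np

def init_params(d_in=3, width=16, seed=0):
    rng = np.random.default_rng(seed)
    return [rng.normal(scale=1/np.sqrt(d_in), size=(d_in, width)),
            rng.normal(scale=1/np.sqrt(width), size=(width, width)),
            rng.normal(scale=1/np.sqrt(width), size=(width, 1))]

def forward(x, params):
    h = x
    for W in params[:-1]:
        h = np.tanh(h @ W)
    return float(h @ params[-1])

def grad_flat(x, params, eps=1e-5):
    # numerical gradient of y(x, w) w.r.t. all weights, flattened into one vector
    flat = np.concatenate([W.ravel() for W in params])
    g = np.zeros_like(flat)
    base = forward(x, params)
    for i in range(flat.size):
        bumped = flat.copy()
        bumped[i] += eps
        ps, k = [], 0
        for W in params:                      # rebuild the parameter list
            ps.append(bumped[k:k + W.size].reshape(W.shape))
            k += W.size
        g[i] = (forward(x, ps) - base) / eps
    return g

params = init_params()
x1, x2 = np.ones(3), np.array([1.0, -1.0, 0.5])
print(grad_flat(x1, params) @ grad_flat(x2, params))  # Theta(x1, x2)
```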

February 19, 2021 · 10 min · SY Chou

A Very Brief Introduction to Gaussian Process and Bayesian Optimization

Gaussian Process Big Picture and Background Intuitively, a Gaussian distribution defines a state space, while a Gaussian process defines a function space. Before we introduce the Gaussian process, we should understand the Gaussian distribution first. A RV (random variable) $X$ that follows the Gaussian distribution $\mathcal{N}(0, 1)$ looks like the following image: The P.D.F. should be $$x \sim \mathcal{N}(\mu, \sigma^2), \quad p(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{- \frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2}$$ As for the multivariate Gaussian distribution, given 2 RVs $x$ and $y$, both following the Gaussian distribution $\mathcal{N}(0, 1)$, we can illustrate it as...
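A small sketch of the "function space" intuition, assuming an RBF kernel and a toy input grid (both my choices, not the post's): drawing samples from a GP prior gives whole functions rather than single points.

```python
# Sketch: a Gaussian distribution gives random points, a Gaussian process gives
# random functions. Draw a few functions from a zero-mean GP prior with an RBF kernel.
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    sq = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq / length_scale ** 2)

x = np.linspace(-5, 5, 100)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))   # jitter for numerical stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=3)
print(samples.shape)  # (3, 100): three random functions evaluated on the grid
```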

February 16, 2021 · 12 min · SY Chou

Toward VB-GMM
  [draft]

Note: the code in R is on my GitHub. 3. Variational Bayesian Gaussian Mixture Model (VB-GMM) 3.1 Graphical Model Gaussian Mixture Model & Clustering The variational Bayesian Gaussian mixture model (VB-GMM) can be represented as the above graphical model. We see each data point as drawn from a Gaussian mixture distribution with $K$ components. We also denote the number of data points as $N$. Each $x_n$ follows a Gaussian mixture distribution, with a weight $\pi_n$ corresponding to each data point....
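As a rough companion to the graphical model (the post's code is in R; this numpy sketch and its priors are only illustrative), the generative story can be simulated directly:

```python
# Sketch of the generative process the graphical model encodes, with made-up priors.
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 500                                   # K components, N data points

pi = rng.dirichlet(alpha=np.ones(K))            # mixing weights  pi ~ Dir(alpha)
mu = rng.normal(scale=5.0, size=(K, 2))         # component means (toy prior)

z = rng.choice(K, size=N, p=pi)                 # latent assignment z_n ~ Cat(pi)
x = mu[z] + rng.normal(scale=0.5, size=(N, 2))  # x_n ~ N(mu_{z_n}, sigma^2 I)

print(pi.round(3), np.bincount(z, minlength=K))
```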

July 9, 2021 · 11 min · SY Chou

Part I - Toward NNGP and NTK
  [draft]

Neural Network Gaussian Process (NNGP) Model the neural network as a GP, a.k.a. the neural network Gaussian process (NNGP). Intuitively, the kernel of the NNGP computes the similarity between the output vectors of 2 input data points. We define the following functions as a neural network with fully-connected layers: $$z_{i}^{1}(x) = b_i^{1} + \sum_{j=1}^{N_1} W_{ij}^{1} x_j^{1}(x), \ \ x_{j}^{1}(x) = \phi\Big(b_j^{0} + \sum_{k=1}^{d_{in}} W_{jk}^{0} x_k\Big)$$ where $b_i^{1}$ and $W_{ij}^{1}$ are the $i$-th bias and weights of the second layer (the one after the first hidden layer), $b_j^{0}$ and $W_{jk}^{0}$ are those of the first layer (fed by the input), $\phi$ is the activation function, and $x$ is the input data of the neural network....
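One way to see the kernel concretely (a sketch under assumed widths, variances, and activation, not the post's setup) is a Monte Carlo estimate of $K(x, x') = \mathbb{E}[z(x)\,z(x')]$ over random initializations of the layer equations above:

```python
# Sketch: Monte Carlo estimate of an NNGP kernel entry K(x, x') = E[z(x) z(x')],
# averaging over random initializations of a one-hidden-layer network.
# Width, weight/bias variances, and the tanh activation are assumptions.
import numpy as np

def sample_output(X, rng, width=1000, sigma_w=1.0, sigma_b=0.1):
    d_in = X.shape[-1]
    W0 = rng.normal(scale=sigma_w / np.sqrt(d_in), size=(d_in, width))
    b0 = rng.normal(scale=sigma_b, size=width)
    W1 = rng.normal(scale=sigma_w / np.sqrt(width), size=(width, 1))
    b1 = rng.normal(scale=sigma_b, size=1)
    h = np.tanh(X @ W0 + b0)        # x^1(x) = phi(W^0 x + b^0)
    return (h @ W1 + b1).ravel()    # z^1(x) = W^1 x^1(x) + b^1

rng = np.random.default_rng(0)
X = np.stack([np.array([1.0, 0.5]), np.array([-0.5, 1.0])])

outs = np.array([sample_output(X, rng) for _ in range(2000)])  # (2000, 2)
print(np.round(outs.T @ outs / len(outs), 3))  # 2x2 empirical NNGP kernel
```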

March 15, 2021 · 1 min · SY Chou

Some Intuition Of MLE, MAP, and Bayesian Estimation
  [draft]

The main difference between the 3 kinds of estimation is: what do we assume for the prior? Maximum likelihood estimation (MLE) doesn’t use any prior; it only maximizes the likelihood of the observed samples. On the other hand, MAP and Bayesian estimation both use priors. Maximum a posteriori (MAP) estimation keeps only the single most probable parameter value (the posterior mode), while Bayesian estimation keeps a whole distribution over the parameter. To be continued…...
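A tiny worked example with a Beta-Bernoulli coin (the prior hyperparameters and counts are made up) shows the contrast:

```python
# Sketch: estimating a coin's head probability theta under a Beta(a, b) prior.
# MLE ignores the prior, MAP takes the posterior mode, Bayesian estimation
# keeps the whole posterior Beta(a + heads, b + tails).
heads, tails = 7, 3
a, b = 2.0, 2.0                     # assumed Beta prior hyperparameters

mle = heads / (heads + tails)                            # argmax likelihood
map_est = (heads + a - 1) / (heads + tails + a + b - 2)  # posterior mode
post_mean = (heads + a) / (heads + tails + a + b)        # mean of full posterior

print(f"MLE={mle:.3f}  MAP={map_est:.3f}  posterior mean={post_mean:.3f}")
```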

February 19, 2021 · 1 min · SY Chou

An Introduction to Multi-Armed Bandit Problem
  [draft]

Multi-Armed Bandit Problem Imagine you are in a casino and face multiple slot machines. Each machine is configured with an unknown probability of giving you a reward on a single play. The question is: what is the strategy to get the highest long-term reward? An illustration of the multi-armed bandit problem; see Lil’Log, “The Multi-Armed Bandit Problem and Its Solutions”. Definition Upper Confidence Bounds (UCB) The UCB algorithm gives a relation between the upper bound and the probability confidence....
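A short sketch of the standard UCB1 rule on made-up Bernoulli arms (my own illustration, not the post's code): each round, pull the arm whose optimistic estimate "mean reward + exploration bonus" is highest.

```python
# Sketch: UCB1 on Bernoulli-reward slot machines (arm probabilities are made up).
# Each round we pull the arm maximizing  mean_reward + sqrt(2 * ln(t) / n_pulls).
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.2, 0.5, 0.7])      # unknown to the player
counts = np.zeros(len(true_p))          # pulls per arm
sums = np.zeros(len(true_p))            # total reward per arm

for t in range(1, 5001):
    if t <= len(true_p):                # play every arm once first
        arm = t - 1
    else:
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))
    reward = rng.random() < true_p[arm]
    counts[arm] += 1
    sums[arm] += reward

print("pull counts:", counts)           # most pulls go to the best arm
print("estimated means:", np.round(sums / counts, 3))
```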

February 16, 2021 · 2 min · SY Chou