Neural Tangent Kernel (NTK)
“In short, the NTK captures the change in the weights before and after a gradient descent update.”
Let’s start the journey of opening up the black box of neural networks.
Set Up a Neural Network
First of all, we define a simple fully-connected neural network with 2 hidden layers

$$f(x; w) = W^{(3)}\,\sigma\!\left(W^{(2)}\,\sigma\!\left(W^{(1)} x\right)\right)$$

where $\sigma$ is a nonlinear activation function, $W^{(1)}, W^{(2)}, W^{(3)}$ are the weight matrices (collected into a single parameter vector $w$), and each hidden layer has width $n$.

Suppose we have a regression task on the network with the squared-error loss

$$L(w) = \frac{1}{2}\sum_{i=1}^{N}\bigl(f(x_i; w) - y_i\bigr)^2$$

where $\{(x_i, y_i)\}_{i=1}^{N}$ is the training set.
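Before analyzing it, here is a minimal PyTorch sketch of such a setup; the width, activation, and toy data are illustrative placeholders rather than the exact configuration discussed in this article.

```python
import torch
import torch.nn as nn

# A fully-connected network with 2 hidden layers of width n on a toy
# 1-D regression task. Width, activation, and data are placeholder choices.
n = 512
f = nn.Sequential(
    nn.Linear(1, n), nn.ReLU(),
    nn.Linear(n, n), nn.ReLU(),
    nn.Linear(n, 1),
)

x = torch.linspace(-1, 1, 20).unsqueeze(1)   # training inputs x_i
y = torch.sin(3 * x)                         # regression targets y_i

loss = 0.5 * ((f(x) - y) ** 2).sum()         # squared-error loss L(w)
print(loss.item())
```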
To The Limit of Infinite Width
In order to measure how much the weights move during training, we define a normalized metric as follows

$$\Delta = \frac{\lVert w_T - w_0 \rVert}{\lVert w_0 \rVert}$$

where $w_0$ denotes the weights at initialization and $w_T$ the weights after training.

As we can see, the relative change of the weights during training decreases as the width of the network grows. As a result, the trained weights stay very close to the initial weights.
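A rough way to check this observation yourself (a sketch under the toy setup above, not the original experiment) is to train networks of different widths with full-batch gradient descent and compare their relative weight change:

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector

# Train 2-hidden-layer networks of increasing width and report
# ||w_T - w_0|| / ||w_0||; the ratio shrinks as the width grows.
def relative_weight_change(width, steps=500):
    lr = 0.1 / width   # scale the step size down with width to keep full-batch GD stable
    net = nn.Sequential(nn.Linear(1, width), nn.ReLU(),
                        nn.Linear(width, width), nn.ReLU(),
                        nn.Linear(width, 1))
    x = torch.linspace(-1, 1, 20).unsqueeze(1)
    y = torch.sin(3 * x)
    w0 = parameters_to_vector(net.parameters()).detach().clone()
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = 0.5 * ((net(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    wT = parameters_to_vector(net.parameters()).detach()
    return ((wT - w0).norm() / w0.norm()).item()

for width in [16, 64, 256, 1024]:
    print(width, relative_weight_change(width))
```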
Apply Taylor Expansion
We know that the first-order Taylor expansion of a function $g$ around a point $a$ is

$$g(x) \approx g(a) + g'(a)\,(x - a)$$

A function can therefore be approximated by a linear function in a neighborhood of $a$. It is clear that if $x$ stays close to $a$, the higher-order terms are negligible and the approximation is accurate.
Apply this to the network, expanding in the weights around the initialization $w_0$

$$f(x; w) \approx f(x; w_0) + \nabla_w f(x; w_0)^\top (w - w_0)$$

where $w_0$ denotes the initial weights. Since the weights of a very wide network barely move during training, $w$ stays close to $w_0$ and the expansion applies. Thus, the Taylor expansion of the network is approximately linear in the weights $w$, even though it remains nonlinear in the input $x$.
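The quality of this linearization is easy to probe numerically. The sketch below (with an arbitrary small architecture and perturbation size) compares the true output at perturbed weights with the first-order Taylor prediction:

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

# Compare f(x; w0 + dw) with its linearization f(x; w0) + grad_w f(x; w0) . dw
# for a small random perturbation dw. Architecture and step size are arbitrary.
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
x = torch.tensor([[0.5]])

w0 = parameters_to_vector(net.parameters()).detach().clone()
f0 = net(x).sum()
grad = parameters_to_vector(torch.autograd.grad(f0, net.parameters()))

dw = 1e-3 * torch.randn_like(w0)                  # small weight perturbation
vector_to_parameters(w0 + dw, net.parameters())   # move the weights to w0 + dw
f_true = net(x).sum().item()                      # exact output at the new weights
f_lin = (f0 + grad @ dw).item()                   # first-order Taylor prediction
print(f_true, f_lin)                              # the two values nearly coincide
```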
However, the most difficult part is: how can we guarantee that this approximation is accurate enough? The rigorous argument is too involved to include in this article, but the following sections give an intuitive explanation of what the NTK means. Please keep reading if you are interested.
A Simpler Explanation Without Flow
For simplicity, we only consider a network $f(x; w)$ with a single scalar parameter $w$ (a one-dimensional network).

First of all, let’s define the loss function of the neural network

$$L(w) = \frac{1}{2}\sum_{i=1}^{N}\bigl(f(x_i; w) - y_i\bigr)^2$$

The gradient descent update is

$$w_{t+1} = w_t - \eta\,\nabla_w L(w_t)$$

where $\eta$ is the learning rate and $w_t$ denotes the weights at step $t$.
The NTK captures the change of the weights before and after a gradient descent update. Thus, the change of the weights in a single step can be written as

$$w_{t+1} - w_t = -\eta\,\nabla_w L(w_t) = -\eta \sum_{i=1}^{N}\bigl(f(x_i; w_t) - y_i\bigr)\,\nabla_w f(x_i; w_t)$$

To simplify the notation, let $\Delta w = w_{t+1} - w_t$.

We can derive the corresponding change of the network output at any input $x$ with a first-order Taylor expansion

$$f(x; w_{t+1}) - f(x; w_t) \approx \nabla_w f(x; w_t)\,\Delta w$$

Suppose the learning rate $\eta$ is small, so that $\Delta w$ is small and the expansion is accurate.

We can get

$$f(x; w_{t+1}) - f(x; w_t) \approx -\eta \sum_{i=1}^{N}\bigl(f(x_i; w_t) - y_i\bigr)\,\nabla_w f(x; w_t)\,\nabla_w f(x_i; w_t)$$

Since the weights almost do not change during training, let $w_t \approx w_0$ for every step $t$.

Since the gradients are then evaluated at the fixed initial weights $w_0$, the product $\nabla_w f(x; w_0)\,\nabla_w f(x_i; w_0)$ is a fixed kernel $K(x, x_i)$, and the update of the output becomes

$$f(x; w_{t+1}) - f(x; w_t) \approx -\eta \sum_{i=1}^{N}\bigl(f(x_i; w_0) - y_i\bigr)\,K(x, x_i)$$

This kernel is exactly the neural tangent kernel.
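To make the derivation concrete, here is a tiny numerical check with a single-parameter model; the function $\sin(wx)$ and the data are arbitrary stand-ins for the one-dimensional network above:

```python
import numpy as np

# One small gradient-descent step on f(x; w) = sin(w * x), a toy model with a
# single parameter w. The change of the output should match
# -lr * sum_i (f(x_i) - y_i) * K(x, x_i), where K(x, x') = df/dw(x) * df/dw(x').
f     = lambda x, w: np.sin(w * x)
df_dw = lambda x, w: x * np.cos(w * x)

xs = np.array([0.2, 0.5, 0.9])           # training inputs
ys = np.array([0.1, 0.4, 0.7])           # training targets
w, lr, x_test = 1.5, 1e-3, 0.7

# One gradient-descent step on L(w) = 0.5 * sum_i (f(x_i; w) - y_i)^2
grad_L = np.sum((f(xs, w) - ys) * df_dw(xs, w))
w_new = w - lr * grad_L

actual   = f(x_test, w_new) - f(x_test, w)
ntk_pred = -lr * np.sum((f(xs, w) - ys) * df_dw(x_test, w) * df_dw(xs, w))
print(actual, ntk_pred)                  # nearly identical for a small learning rate
```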
Flow And Vector Field
So far, we’ve shown the neural tangent kernel on a network with a single parameter. To move forward to the infinitely wide network, we need 2 tools that help us analyze the process of gradient descent in high dimensions. As a result, before diving into the NTK more deeply, we need to understand what Gradient Flow and Vector Field are.
Vector Field
Define a space $\Omega \subseteq \mathbb{R}^n$ and a differentiable function (a “hyperplane”) $F: \Omega \to \mathbb{R}$.

The gradient of the hyperplane at a point $x$ is

$$\nabla F(x) = \left(\frac{\partial F}{\partial x_1}, \dots, \frac{\partial F}{\partial x_n}\right)$$

Then, we define a vector field as a function $v: \Omega \to \mathbb{R}^n$ that assigns a vector to every point of the space. In particular, the gradient $\nabla F$ itself defines a vector field on $\Omega$.

A hyperplane and its gradients can be illustrated as in the following figure: the orange surface represents the hyperplane $F$, and the arrows represent the gradient vectors at each point.

Then we introduce another variable: time. Let $x(t)$ denote the position of a point that moves through the space over time $t$ by following the vector field.

As a result, we know

$$\frac{dx(t)}{dt} = v\bigl(x(t)\bigr)$$

where the left-hand side is the velocity of the point and the right-hand side is the vector assigned to its current position.
Gradient Flow
The gradient flow is defined as the trajectory $x(t)$ that follows the negative gradient field

$$\frac{dx(t)}{dt} = -\nabla F\bigl(x(t)\bigr)$$

In other words, the gradient flow describes how the point, and the gradient it follows, change along time.
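As a quick illustration, the gradient flow can be simulated by taking Euler steps with a very small step size, which is just gradient descent with a tiny learning rate; the loss below is an arbitrary toy example:

```python
import numpy as np

# Gradient flow dw/dt = -grad L(w) simulated with Euler steps, i.e. gradient
# descent with an infinitesimally small learning rate, on a toy 1-D loss.
grad_L = lambda w: 2 * (w - 3.0)             # gradient of L(w) = (w - 3)^2

w, dt = 10.0, 1e-3                           # initial point and (small) time step
trajectory = [w]
for _ in range(5000):                        # integrate the flow up to t = 5
    w = w - dt * grad_L(w)                   # Euler step == tiny gradient-descent step
    trajectory.append(w)

print(trajectory[0], trajectory[-1])         # w(t) flows toward the minimum at w = 3
```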
Combined With Gradient Flow
We’ve known the gradient descent update is

$$w_{t+1} = w_t - \eta\,\nabla_w L(w_t)$$

Let the function $w(t)$ describe the weights as a function of continuous time $t$, and let the learning rate $\eta$ shrink toward zero so that each discrete update becomes an infinitesimally small step.

Actually, the meaning of the gradient flow here is exactly this limit: gradient descent with an infinitesimally small learning rate,

$$\frac{dw(t)}{dt} = -\nabla_w L\bigl(w(t)\bigr)$$

We expand the gradient of the loss function with the chain rule

$$\nabla_w L(w) = \sum_{i=1}^{N}\bigl(f(x_i; w) - y_i\bigr)\,\nabla_w f(x_i; w)$$

Now we can derive the flow of the network outputs themselves. For any training input $x_j$,

$$\frac{d f\bigl(x_j; w(t)\bigr)}{dt} = \nabla_w f\bigl(x_j; w(t)\bigr)^\top \frac{dw(t)}{dt} = -\sum_{i=1}^{N}\bigl(f(x_i; w(t)) - y_i\bigr)\,\nabla_w f\bigl(x_j; w(t)\bigr)^\top \nabla_w f\bigl(x_i; w(t)\bigr)$$

To simplify the notation, we replace the dynamics of the individual outputs with the vector $u(t) = \bigl(f(x_1; w(t)), \dots, f(x_N; w(t))\bigr)$ and collect the targets into $y = (y_1, \dots, y_N)$.

Thus, we get

$$\frac{du(t)}{dt} = -K(t)\,\bigl(u(t) - y\bigr)$$

However, we’ve known the mathematical form of the flow of the weights, $\frac{dw(t)}{dt} = -\nabla_w L(w(t))$; the equation above is simply that flow pushed through the network by the chain rule.

Since the dynamics depend on the weights only through the matrix $K(t)$ of inner products between gradients, this matrix fully determines how the outputs evolve.
Actually, we are now very close to the neural tangent kernel (NTK). The NTK is the kernel matrix defined as

$$K_{ij}(t) = \nabla_w f\bigl(x_i; w(t)\bigr)^\top \nabla_w f\bigl(x_j; w(t)\bigr)$$

Since the weights of the infinitely wide network do not (noticeably) change during training, the gradients can be evaluated at the initial weights $w_0$.

We get a kernel that stays constant throughout training,

$$K_{ij}(t) \approx K_{ij}(0) = \nabla_w f(x_i; w_0)^\top \nabla_w f(x_j; w_0)$$

Again, each entry of the kernel is an inner product between two tangent vectors, i.e. the gradients of the network output at two different inputs.
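Under the assumptions above, the whole pipeline can be sketched numerically: compute the empirical NTK Gram matrix of a reasonably wide network with autograd, then integrate the output dynamics. The architecture, width, and data below are again placeholder choices, not the original post's setup:

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector

# Empirical NTK Gram matrix K_ij = grad_w f(x_i) . grad_w f(x_j) of a wide
# network, then Euler integration of the output dynamics du/dt = -K (u - y).
width = 1024
net = nn.Sequential(nn.Linear(1, width), nn.ReLU(),
                    nn.Linear(width, width), nn.ReLU(),
                    nn.Linear(width, 1))
x = torch.linspace(-1, 1, 10).unsqueeze(1)    # training inputs
y = torch.sin(3 * x).squeeze(1)               # training targets

def grad_vector(xi):
    out = net(xi.unsqueeze(0)).squeeze()
    return parameters_to_vector(torch.autograd.grad(out, net.parameters()))

grads = torch.stack([grad_vector(xi) for xi in x])   # one gradient per training point
K = grads @ grads.T                                  # NTK Gram matrix at initialization

u = net(x).squeeze(1).detach()                       # initial network outputs u(0)
dt = 1.0 / K.trace()                                 # conservative step size for stability
res0 = torch.norm(u - y)
for _ in range(2000):
    u = u - dt * (K @ (u - y))                       # Euler step of du/dt = -K (u - y)
print(res0.item(), torch.norm(u - y).item())         # the residual shrinks toward 0
```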
The way to compare 2 tangents here is the cosine similarity, computed with the inner product. The cosine of two identical vectors is 1, while the cosine of two orthogonal vectors, which are totally different, is 0. With an additional minus sign, we can regard the negative similarity as a kind of distance.
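For reference, a minimal sketch of this similarity measure; the vectors here are arbitrary examples rather than actual network gradients:

```python
import numpy as np

# Cosine similarity between two (tangent) vectors, and its negation used as a
# distance-like score, as described above.
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

g1 = np.array([1.0, 2.0, 3.0])
g2 = np.array([-2.0, 1.0, 0.0])
print(cosine_similarity(g1, g1))    # identical vectors  -> 1.0
print(cosine_similarity(g1, g2))    # orthogonal vectors -> 0.0
print(-cosine_similarity(g1, g1))   # negative similarity as a "distance" -> -1.0
```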
To summarize, the weights of an infinitely wide network almost do not change during training. As a result, the kernel always stays almost the same. We can use the NTK to analyze many properties of neural networks, and neural networks are no longer black boxes.
Papers
- NNGP: Deep Neural Networks as Gaussian Processes (Lee et al., 2018)
- NTK: Neural Tangent Kernel: Convergence and Generalization in Neural Networks (Jacot et al., 2018)
Reference
Sincere thanks to the following posts and people.
Gaussian Distribution
NNGP
- Deep Gaussian Processes
- Gonzalo Mateo García - Deep Neural Networks as Gaussian Processes
- Deep Gaussian Processes for Machine Learning
NTK
- Understanding the Neural Tangent Kernel By Rajat’s Blog
- Code for the blog rajatvd/NTK
- Ultra-Wide Deep Nets and the Neural Tangent Kernel (NTK)
- CMU ML Blog: Ultra-Wide Deep Nets and the Neural Tangent Kernel (NTK)
- Some Intuition on the Neural Tangent Kernel
- An intuitive understanding of the Neural Tangent Kernel (直观理解 Neural Tangent Kernel)
Flow
Taylor Expansion