Shannon Entropy
For a discrete random variable $X$ with possible outcomes $\{ x_1, \dots, x_n \}$ and probability mass function $P(X)$, we define the Shannon entropy $H(X)$ as
$$H(X) = E[-\log_b P(X)] = - \sum_{i=1}^{n} P(x_i) \log_b P(x_i)$$
where $b$ is the base of the logarithm. The unit of Shannon entropy is the bit for $b = 2$ and the nat for $b = e$.
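As a quick sanity check of the definition, here is a minimal sketch (not from the original text) that computes $H(X)$ with NumPy; the function name `shannon_entropy` and the example distributions are my own illustrative choices.

```python
import numpy as np

def shannon_entropy(p, base=2):
    """H(X) = -sum_i p_i * log_b(p_i); zero-probability outcomes contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                   # drop zeros: 0 * log 0 := 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(shannon_entropy([0.5, 0.5]))                 # fair coin -> 1.0 bit
print(shannon_entropy([0.5, 0.5], base=np.e))      # same coin -> ~0.693 nat
```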
The Venn Diagram Perspective
We can illustrate the relations among joint entropy, conditional entropy, and mutual information with the following figure,
where $H(X)$ and $H(Y)$ are the Shannon entropies of the random variables $X$ and $Y$ respectively, $H(X, Y)$ is the joint entropy, $I(X; Y)$ is the mutual information, and $H(X | Y)$ and $H(Y | X)$ are the conditional entropies given $Y$ and given $X$ respectively.
Joint Entropy
For two discrete random variables $X$ and $Y$ with joint distribution $P(X, Y)$, the joint entropy is defined as
$$H(X, Y) = E[-\log P(X, Y)] = - \sum_{i=1}^{n} \sum_{j=1}^{n} P(x_i, y_j) \log P(x_i, y_j)$$
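A short sketch of the same computation for the joint case, assuming the joint distribution is given as a table `P_xy[i, j] = P(x_i, y_j)`; the table values are illustrative only.

```python
import numpy as np

def joint_entropy(P_xy, base=2):
    """H(X, Y) = -sum_{i,j} P(x_i, y_j) * log_b P(x_i, y_j)."""
    P = np.asarray(P_xy, dtype=float).ravel()
    P = P[P > 0]
    return -np.sum(P * np.log(P)) / np.log(base)

P_xy = np.array([[0.25, 0.25],
                 [0.25, 0.25]])     # two independent fair coins
print(joint_entropy(P_xy))          # -> 2.0 bits
```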
Conditional Entropy
The conditional entropy of $X$ given $Y$ is defined as
$$H(X | Y) = H(X, Y) - H(Y)$$
(symmetrically, $H(Y | X) = H(X, Y) - H(X)$). Expanding both terms with the joint distribution and using the marginalization $P(y_j) = \sum_{i=1}^{n} P(x_i, y_j)$,
$$= - \sum_{i=1}^{n} \sum_{j=1}^{n} P(x_i, y_j) \log P(x_i, y_j) + \sum_{j=1}^{n} P(y_j) \log P(y_j)$$
$$= - \sum_{i=1}^{n} \sum_{j=1}^{n} P(x_i, y_j) \log P(x_i, y_j) + \sum_{i=1}^{n} \sum_{j=1}^{n} P(x_i, y_j) \log P(y_j)$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{n} P(x_i, y_j) \left( \log P(y_j) - \log P(x_i, y_j) \right)$$
$$= - \sum_{i=1}^{n} \sum_{j=1}^{n} P(x_i, y_j) \log \frac{P(x_i, y_j)}{P(y_j)}$$
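The chain-rule form and the final double-sum form should agree numerically. Below is a minimal sketch (not from the original text) that checks this with NumPy on an illustrative joint table of my own choosing.

```python
import numpy as np

def H(p, base=2):
    """Shannon entropy of any (possibly multi-dimensional) probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])          # illustrative joint table P(x_i, y_j)
P_y = P_xy.sum(axis=0)                 # marginal P(y_j) = sum_i P(x_i, y_j)

h_chain = H(P_xy) - H(P_y)             # H(X | Y) = H(X, Y) - H(Y)

mask = P_xy > 0                        # direct form of the final double sum
h_direct = -np.sum(P_xy[mask] * np.log2((P_xy / P_y)[mask]))

print(h_chain, h_direct)               # both ~0.722 bits
```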
Mutual Information (MI)
Intuitively, MI is the "intersection" in the Venn diagram, i.e. the information shared by the two random variables $X$ and $Y$.
Let $(X, Y)$ be a pair of discrete random variables with values in the space $\mathcal{X} \times \mathcal{Y}$. If their joint distribution is $P(X, Y)$ and the marginal distributions are $P_X(X)$ and $P_Y(Y)$, the mutual information is defined as
$$I(X; Y) = H(X) + H(Y) - H(X, Y)$$
$$= - \sum_{i=1}^{n} P_X(x_i) \log P_X(x_i) - \sum_{j=1}^{n} P_Y(y_j) \log P_Y(y_j) + \sum_{i=1}^{n} \sum_{j=1}^{n} P(x_i, y_j) \log P(x_i, y_j)$$
$$= - \sum_{i=1}^{n} \sum_{j=1}^{n} P(x_i, y_j) \log P_X(x_i) - \sum_{i=1}^{n} \sum_{j=1}^{n} P(x_i, y_j) \log P_Y(y_j) + \sum_{i=1}^{n} \sum_{j=1}^{n} P(x_i, y_j) \log P(x_i, y_j)$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{n} P(x_i, y_j) \log \frac{P(x_i, y_j)}{P_X(x_i) \, P_Y(y_j)}$$
where the second step uses the marginalizations $P_X(x_i) = \sum_{j=1}^{n} P(x_i, y_j)$ and $P_Y(y_j) = \sum_{i=1}^{n} P(x_i, y_j)$.
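Likewise, the final double-sum expression for $I(X; Y)$ can be checked against the entropy form $H(X) + H(Y) - H(X, Y)$; a minimal sketch with the same illustrative joint table as before.

```python
import numpy as np

def H(p, base=2):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])              # illustrative joint table
P_x = P_xy.sum(axis=1)                     # P_X(x_i)
P_y = P_xy.sum(axis=0)                     # P_Y(y_j)

mask = P_xy > 0                            # direct double-sum definition
mi_direct = np.sum(P_xy[mask] * np.log2(P_xy[mask] / np.outer(P_x, P_y)[mask]))

mi_entropy = H(P_x) + H(P_y) - H(P_xy)     # I(X; Y) = H(X) + H(Y) - H(X, Y)

print(mi_direct, mi_entropy)               # both ~0.278 bits
```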
From the set-theoretic (Venn diagram) point of view, MI can also be written as
$$I(X; Y) = H(X) + H(Y) - H(X, Y)$$
$$ = H(X) - H(X | Y) = H(Y) - H(Y | X) $$
$$ = H(X, Y) - H(X | Y) - H(Y | X) $$
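All four expressions above should give the same value; here is a quick numerical check (my own sketch, same illustrative table as above).

```python
import numpy as np

def H(p, base=2):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])                  # illustrative joint table
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)

H_x, H_y, H_xy = H(P_x), H(P_y), H(P_xy)
H_x_given_y = H_xy - H_y                       # H(X | Y)
H_y_given_x = H_xy - H_x                       # H(Y | X)

print(H_x + H_y - H_xy)                        # H(X) + H(Y) - H(X, Y)
print(H_x - H_x_given_y)                       # H(X) - H(X | Y)
print(H_y - H_y_given_x)                       # H(Y) - H(Y | X)
print(H_xy - H_x_given_y - H_y_given_x)        # H(X, Y) - H(X|Y) - H(Y|X)
```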