Value-Decomposition Network (VDN)

QMIX

Problem Setup And Assumption

Constraint

QMIX improves on the VDN algorithm by allowing a more general form of the constraint. It defines the constraint as

$$\frac{\partial Q_{tot}}{\partial Q_{a}} \geq 0, \forall a$$

where $Q_{tot}$ is the joint value function and $Q_{a}$ is the value function for each agent.

An intuitive explanation is that we want the weight on every individual value function $Q_{a}$ to be positive. If the weight on an individual value function $Q_{a}$ were negative, it would discourage that agent from cooperating, since a higher $Q_{a}$ would lead to a lower joint value $Q_{tot}$. The constraint of VDN is simply the special case $\frac{\partial Q_{tot}}{\partial Q_{a}} = 1$.
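To make the special case explicit: VDN factorises the joint value as a plain sum of the individual utilities, so every partial derivative is constant:

$$Q_{tot}(\tau, u) = \sum_{a=1}^{n} Q_a(\tau^a, u^a) \quad \Rightarrow \quad \frac{\partial Q_{tot}}{\partial Q_{a}} = 1, \ \forall a$$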

Network Architecture

The architecture of QMIX is shown in the following figure.

For each agent $a$ and time step $t$, there is one agent network that represents its individual value function $Q_a(\tau^a, u_t^a)$. The agent networks are represented as DRQNs that receive the current individual observation $o_t^a$ and the last action $u_{t-1}^a$ as input at each time step.
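As a concrete illustration, here is a minimal PyTorch sketch of one such agent network; the class name, layer sizes, and the choice of a GRU cell are assumptions for illustration, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class AgentDRQN(nn.Module):
    """Per-agent recurrent Q-network (illustrative sketch, not the reference implementation)."""
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        # input is the current observation concatenated with the one-hot last action
        self.fc_in = nn.Linear(obs_dim + n_actions, hidden_dim)
        # recurrent core that summarises the action-observation history tau^a
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        # outputs Q_a(tau^a, u) for every action u
        self.fc_out = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, last_action_onehot, hidden):
        x = torch.relu(self.fc_in(torch.cat([obs, last_action_onehot], dim=-1)))
        h = self.rnn(x, hidden)      # updated hidden state
        q = self.fc_out(h)           # individual action values
        return q, h
```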

The mixing network is a feed-forward neural network that takes the agent network outputs $Q_a(\tau^a, u_t^a)$ as input and mixes them monotonically, producing the value of $Q_{tot}$. To enforce the monotonicity constraint, the weights (but not the biases) of the mixing network are restricted to be non-negative and are produced by separate hypernetworks.

Each hypernetwork consists of a single linear layer, followed by an absolute value activation function, which ensures that the mixing network weights are non-negative. Since we assume that the multi-agent problem can be solved by a joint value function, the hypernetworks take the current state $s_t$ as input.
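A minimal sketch of the mixing network and its hypernetworks is given below; the layer sizes, names, and two-layer structure are illustrative assumptions. The key point is that taking the absolute value of the hypernetwork outputs keeps the mixing weights non-negative, which enforces $\frac{\partial Q_{tot}}{\partial Q_{a}} \geq 0$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixingNetwork(nn.Module):
    """Monotonic mixing network (illustrative sketch)."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # hypernetworks that map the state s_t to the mixing-network weights
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        # biases are not constrained, so they can be produced directly
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2      # non-negative weights => monotone in each Q_a
        return q_tot.view(-1, 1)
```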

Loss Function

QMIX can be trained end-to-end; the loss function is defined as

$$L(\theta) = \sum_{i = 1}^{b}[(y_i^{tot} - Q_{tot}(\tau, u, s; \theta))^2]$$

where $b$ is the batch size of transitions sampled from the replay buffer, $y^{tot} = r + \gamma \max_{u'} Q_{tot}(\tau', u', s'; \theta^-)$, and $\theta^-$ are the parameters of a target network, as in DQN.
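As a rough sketch of how this loss can be computed for a sampled batch (tensor names and shapes are assumptions, and the target values are assumed to be precomputed with the target network):

```python
import torch

def qmix_td_loss(q_tot, rewards, q_tot_target_next, dones, gamma=0.99):
    # q_tot:             Q_tot(tau, u, s; theta) for the actions actually taken, shape (b, 1)
    # q_tot_target_next: max_u' Q_tot(tau', u', s'; theta^-) from the target network, shape (b, 1)
    targets = rewards + gamma * (1.0 - dones) * q_tot_target_next   # y^tot
    td_error = q_tot - targets.detach()                             # no gradient through theta^-
    return (td_error ** 2).sum()                                    # summed squared TD error over the batch
```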

Weighted QMIX

Papers

IQL

Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning

VDN

Value-Decomposition Networks For Cooperative Multi-Agent Learning

QMIX

QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

Weighted QMIX

Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

Reference