EM Algorithm Convergence Proof with KL Divergence


In my previous blog, I introduced the derivation of the EM algorithm and a proof of its convergence via an auxiliary function. Here I will introduce another way to prove the convergence of EM, using the KL divergence.

First, let us denote by $X$ and $Z$ the observed variable and the hidden variable, by $D$ the collected i.i.d. samples, and by $\theta$ the parameters.

$$
\begin{aligned}
P_{X,Z}(D,z;\theta)&=P_{Z|X}(z|D;\theta)P_X(D;\theta) \\
\log P_X(D;\theta)&=\log P_{X,Z}(D,z;\theta)-\log P_{Z|X}(z|D;\theta)
\end{aligned}
$$

Take expectations of both sides with respect to $P_{Z|X}(z|D;\theta^n)$, using the fact that the left-hand side does not depend on $Z$:

$$
\begin{aligned}
\log P_X(D;\theta)&=E_{Z|X;\theta^n}(\log P_{X,Z}(D,z;\theta))-E_{Z|X;\theta^n}(\log P_{Z|X}(z|D;\theta)) \\
&=Q(\theta,\theta^n)+H(\theta,\theta^n)
\end{aligned}
$$
where the $Q$ function is what the EM algorithm actually optimizes, and $Q$ is a lower bound on the log likelihood. Note that:

$$
\begin{aligned}
H(\theta,\theta^n)&=-E_{Z|X;\theta^n}(\log P_{Z|X}(z|D;\theta)) \\
&=-\int P_{Z|X}(z|D;\theta^n) \log P_{Z|X}(z|D;\theta) \, dz \\
&=-\int p_n(z) \log p(z) \, dz \\
&=KL[p_n||p]+H[p_n]
\end{aligned}
$$
where $KL[p||q]$ is the Kullback-Leibler divergence between $p$ and $q$, and $H[p]$ is the entropy of $p$ itself. Since the KL divergence is non-negative and the entropy $H[p_n]$ is also non-negative for a discrete hidden variable, $H(\theta,\theta^n) \geq 0$ and therefore $\log P_X(D;\theta) \geq Q(\theta,\theta^n)$, i.e. $Q$ really is a lower bound on the log likelihood.
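
To make the decomposition concrete, here is a minimal numerical sketch with a toy example of my own (not from the derivation above): a two-component 1-D Gaussian mixture with unit variance, for which both identities, $\log P_X(D;\theta)=Q(\theta,\theta^n)+H(\theta,\theta^n)$ and $H(\theta,\theta^n)=KL[p_n||p]+H[p_n]$, are checked numerically for an arbitrary pair $(\theta,\theta^n)$.

```python
# Toy example (assumed, not from the post): two-component 1-D Gaussian mixture, unit variance.
# Check log P_X(D; theta) = Q(theta, theta^n) + H(theta, theta^n)
# and   H(theta, theta^n) = KL[p_n || p] + H[p_n].
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D = np.concatenate([rng.normal(-2, 1, 50), rng.normal(3, 1, 50)])   # observed i.i.d. samples

def log_joint(x, pi, mu):
    """log P(X=x, Z=k; theta) for every sample and component, shape (N, K)."""
    return np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=1.0)

def posterior(x, pi, mu):
    """Responsibilities P(Z=k | X=x; theta), shape (N, K)."""
    lj = log_joint(x, pi, mu)
    return np.exp(lj - np.logaddexp.reduce(lj, axis=1, keepdims=True))

theta_n = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]))   # current guess theta^n
theta   = (np.array([0.4, 0.6]), np.array([-2.0, 2.0]))   # an arbitrary theta

r_n = posterior(D, *theta_n)    # p_n(z) = P(z | D; theta^n)
r   = posterior(D, *theta)      # p(z)   = P(z | D; theta)

log_lik = np.logaddexp.reduce(log_joint(D, *theta), axis=1).sum()   # log P_X(D; theta)
Q  = np.sum(r_n * log_joint(D, *theta))                             # E_{Z|X;theta^n} log P_{X,Z}
H  = -np.sum(r_n * np.log(r))                                       # -E_{Z|X;theta^n} log P_{Z|X}
KL = np.sum(r_n * (np.log(r_n) - np.log(r)))                        # KL[p_n || p]
Hp = -np.sum(r_n * np.log(r_n))                                     # entropy H[p_n]

assert np.isclose(log_lik, Q + H)   # log P_X(D; theta) = Q + H
assert np.isclose(H, KL + Hp)       # H(theta, theta^n) = KL[p_n||p] + H[p_n]
```
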
To prove the convergence of EM, we need to show:

$$
\begin{aligned}
&\log P_X(D;\theta^{n+1}) \geq \log P_X(D;\theta^n) \\
&\log P_X(D;\theta^{n+1}) - \log P_X(D;\theta^n) \\
&=Q(\theta^{n+1},\theta^n)+H(\theta^{n+1},\theta^{n})-Q(\theta^{n},\theta^n)-H(\theta^{n},\theta^{n}) \\
&=Q(\theta^{n+1},\theta^n)-Q(\theta^{n},\theta^n)+KL[p_n||p_{n+1}]+H[p_n]-KL[p_n||p_n]-H[p_n] \\
&=Q(\theta^{n+1},\theta^n)-Q(\theta^{n},\theta^n)+KL[p_n||p_{n+1}]
\end{aligned}
$$
where $p_{n+1}(z)=P_{Z|X}(z|D;\theta^{n+1})$ and $KL[p_n||p_n]=0$. In the M step we optimize $Q$, so $Q(\theta^{n+1},\theta^n) \geq Q(\theta^{n},\theta^n)$; since $KL[p_n||p_{n+1}] \geq 0$ as well, it follows that $\log P_X(D;\theta^{n+1}) \geq \log P_X(D;\theta^n)$. This proves that EM monotonically increases the log likelihood, but note that there is no guarantee of convergence to a global maximum.
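
This inequality chain can also be checked numerically. Below is a small sketch (again assuming the toy two-component unit-variance Gaussian mixture from the earlier sketch): one E-step at $\theta^n$, one closed-form M-step, and assertions that both $Q$ and the log likelihood increase.

```python
# One EM update on an assumed toy model (two-component 1-D Gaussian mixture, unit variance):
# check Q(theta^{n+1}, theta^n) >= Q(theta^n, theta^n) and that the log likelihood increases.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
D = np.concatenate([rng.normal(-2, 1, 50), rng.normal(3, 1, 50)])

def log_joint(x, pi, mu):                      # log P(x, Z=k; theta), shape (N, K)
    return np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=1.0)

def log_lik(x, pi, mu):                        # log P_X(D; theta)
    return np.logaddexp.reduce(log_joint(x, pi, mu), axis=1).sum()

def Qfun(x, r_n, pi, mu):                      # Q(theta, theta^n), responsibilities r_n fixed
    return np.sum(r_n * log_joint(x, pi, mu))

pi_n, mu_n = np.array([0.5, 0.5]), np.array([-1.0, 1.0])   # theta^n

# E-step: responsibilities under theta^n
lj = log_joint(D, pi_n, mu_n)
r_n = np.exp(lj - np.logaddexp.reduce(lj, axis=1, keepdims=True))

# M-step: closed-form maximizer of Q(., theta^n) for this model
pi_next = r_n.mean(axis=0)
mu_next = (r_n * D[:, None]).sum(axis=0) / r_n.sum(axis=0)

assert Qfun(D, r_n, pi_next, mu_next) >= Qfun(D, r_n, pi_n, mu_n)   # Q increases in the M step
assert log_lik(D, pi_next, mu_next) >= log_lik(D, pi_n, mu_n)       # so the likelihood increases
```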

Since the proof only requires that the $Q$ function increases rather than being maximized, any M-step update that increases $Q$ is sufficient. This is very useful when the M-step is itself non-trivial (see the sketch after this list):
* e.g. if there is no closed-form solution, one has to resort to numerical methods such as gradient ascent
* this can be computationally intensive, with many iterations per M-step
* in these cases it is usually better to just perform a few iterations and move on to the next E-step
* there is no point in precisely optimizing the M-step if everything is going to change when we compute the new E-step
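
As a sketch of this "generalized" M-step idea, consider the same assumed toy mixture, with the simplification (mine, not from the original derivation) that the mixing weights are held fixed and only the means are updated. The M-step below takes only a few gradient-ascent steps on $Q$ instead of maximizing it, yet the log likelihood still never decreases:

```python
# Generalized EM sketch (assumed toy model: two-component 1-D Gaussian mixture, unit variance,
# mixing weights fixed). The M-step only takes a few gradient-ascent steps on Q.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
D = np.concatenate([rng.normal(-2, 1, 50), rng.normal(3, 1, 50)])
pi = np.array([0.5, 0.5])                       # kept fixed for simplicity
mu = np.array([-1.0, 1.0])                      # initial means

def log_joint(x, mu):                           # log P(x, Z=k; theta), shape (N, K)
    return np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=1.0)

def log_lik(x, mu):                             # log P_X(D; theta)
    return np.logaddexp.reduce(log_joint(x, mu), axis=1).sum()

prev = -np.inf
for it in range(20):
    # E-step: responsibilities under the current parameters theta^n
    lj = log_joint(D, mu)
    r = np.exp(lj - np.logaddexp.reduce(lj, axis=1, keepdims=True))

    # Partial M-step: a few gradient-ascent steps on Q; for unit variance,
    # dQ/dmu_k = sum_t r_{tk} (x_t - mu_k)
    lr = 1e-2
    for _ in range(3):
        mu = mu + lr * np.sum(r * (D[:, None] - mu), axis=0)

    assert log_lik(D, mu) >= prev - 1e-9        # the likelihood still never decreases
    prev = log_lik(D, mu)
```
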

Deriving the Form for Discrete Hidden Variables

In the previous part, I derived the following form of the $Q$ function for discrete hidden variables.

$$
Q(\theta,\theta^n)=\sum_{t=1}^{N}\sum_{i=0}^{C}P(Z=i|X=x_t;\theta^n) \log P(X=x_t,Z=i;\theta)
$$

Next, I will show how to derive it from the general expectation form of the $Q$ function above.

$$
\begin{aligned}
E_{Z \sim P(z|D;\theta^n)}(\log P_{X,Z}(D,z;\theta))
&=E_{Z \sim P(z|D;\theta^n)}\Big(\sum_{t=1}^{N}\log P_{X,Z}(x_t,z_t;\theta)\Big) \\
&=\sum_{t=1}^{N} E_{Z \sim P(z|D;\theta^n)}(\log P_{X,Z}(x_t,z_t;\theta)) \\
&=\sum_{t=1}^{N} E_{z_t \sim P(z_t|x_t;\theta^n)}(\log P_{X,Z}(x_t,z_t;\theta))
\end{aligned}
$$
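
Since each $z_t$ takes only the discrete values $0,\dots,C$, the inner expectation is the finite sum $\sum_{i=0}^{C} P(Z=i|X=x_t;\theta^n)\log P(X=x_t,Z=i;\theta)$, which recovers the double-sum form above. As a sanity check, the sketch below uses a hypothetical two-component Gaussian mixture with only $N=4$ samples, so that all $2^N$ hidden configurations can be enumerated, and verifies that the brute-force expectation over $z$ matches the per-sample double sum.

```python
# Brute-force check of the last step (hypothetical setup: two-component 1-D Gaussian mixture,
# unit variance, only N = 4 samples so all 2^N hidden configurations can be enumerated).
import itertools
import numpy as np
from scipy.stats import norm

D = np.array([-2.1, -1.7, 2.4, 3.0])                        # a few observed samples
pi_n, mu_n = np.array([0.5, 0.5]), np.array([-1.0, 1.0])    # theta^n
pi,   mu   = np.array([0.4, 0.6]), np.array([-2.0, 2.0])    # theta at which Q is evaluated

def log_joint(x, pi_, mu_):                 # log P(x, Z=k; theta), shape (N, K)
    return np.log(pi_) + norm.logpdf(x[:, None], loc=mu_, scale=1.0)

# Per-sample posteriors P(Z=i | x_t; theta^n)
lj_n = log_joint(D, pi_n, mu_n)
r_n = np.exp(lj_n - np.logaddexp.reduce(lj_n, axis=1, keepdims=True))

lj = log_joint(D, pi, mu)                   # log P(x_t, Z=i; theta)

# Double-sum form: sum_t sum_i P(Z=i | x_t; theta^n) log P(x_t, Z=i; theta)
Q_sum = np.sum(r_n * lj)

# Full expectation over all configurations z = (z_1, ..., z_N), z_t in {0, 1}
Q_full = 0.0
for z in itertools.product(range(2), repeat=len(D)):
    post_z = np.prod([r_n[t, z[t]] for t in range(len(D))])        # P(z | D; theta^n)
    log_pxz = sum(lj[t, z[t]] for t in range(len(D)))              # log P_{X,Z}(D, z; theta)
    Q_full += post_z * log_pxz

assert np.isclose(Q_sum, Q_full)
```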