Kullback–Leibler ($\mathrm{KL}$) loss
(Discrete) For discrete probability distributions $F(x)$ and $G(x)$, the Kullback–Leibler ($\mathrm{KL}$) loss from $F(x)$ to $G(x)$ is defined [5] to be
$$\mathrm{KL}\{F(x)\|G(x)\} = \sum_{i=1}^{n} F(x_i)\log\frac{F(x_i)}{G(x_i)},$$
where $x_1,\dots,x_n$ are the values on which the distributions are supported.
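As a minimal sketch of the discrete definition, the sum can be evaluated directly from two probability vectors. The function name `kl_discrete` and the example vectors `p` and `q` are illustrative assumptions, and the code adopts the usual conventions that a term with $F(x_i)=0$ contributes nothing and that the loss is infinite when $G(x_i)=0$ while $F(x_i)>0$.

```python
import math


def kl_discrete(p, q):
    """KL loss from pmf p to pmf q, both given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


# Hypothetical example distributions on two points.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_discrete(p, q))  # loss from p to q
print(kl_discrete(q, p))  # loss from q to p, generally a different number
```

Swapping the arguments generally changes the value, which is why the definition speaks of the loss "from" one distribution "to" the other.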
(Continuous) For distributions $F(x)$ and $G(x)$ of a continuous random variable, the Kullback–Leibler ($\mathrm{KL}$) loss is defined to be
$$\mathrm{KL}\{F(x)\|G(x)\} = \int_{-\infty}^{\infty} f(x)\log\frac{f(x)}{g(x)}\,dx,$$
where $f(x)$ and $g(x)$ are the density functions of $F(x)$ and $G(x)$, respectively.
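The integral can be approximated numerically when no closed form is at hand. The sketch below is one possible approach, not a definitive implementation: it applies a midpoint rule on a truncated interval and works with log-densities to avoid dividing by very small numbers. The helper names `normal_logpdf` and `kl_continuous`, the choice of two univariate normal densities, and the integration limits are all assumptions made for illustration.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log-density of a univariate normal with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def kl_continuous(logf, logg, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of f * log(f / g) on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        lf = logf(x)
        total += math.exp(lf) * (lf - logg(x))
    return total * h


# Loss from N(0, 1) to N(1, 2^2); the interval [-12, 12] is assumed wide
# enough that the neglected tails of f are negligible.
kl_value = kl_continuous(
    lambda x: normal_logpdf(x, 0.0, 1.0),
    lambda x: normal_logpdf(x, 1.0, 2.0),
    lo=-12.0,
    hi=12.0,
)
print(kl_value)
```

The truncation interval has to cover essentially all of the mass of $f$; for heavier-tailed densities a wider interval or a different quadrature scheme would be needed.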
The Kullback–Leibler loss is always non-negative, that is,
$$\mathrm{KL}\{F(x)\|G(x)\}\geqslant 0.$$
The Kullback–Leibler ($\mathrm{KL}$) loss $\mathrm{KL}\{F(x)\|G(x)\}$ is convex in the pair of probability mass functions $(f,g)$, i.e. if $(f_1,g_1)$ and $(f_2,g_2)$ are two pairs of probability mass functions, then
$$\mathrm{KL}\{\lambda f_1+(1-\lambda)f_2\,\|\,\lambda g_1+(1-\lambda)g_2\}\leq \lambda\,\mathrm{KL}\{f_1\|g_1\}+(1-\lambda)\,\mathrm{KL}\{f_2\|g_2\}$$
for $0\leq\lambda\leq 1$.
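A quick numerical spot check can make the convexity inequality concrete. The sketch below draws random pairs of probability mass functions and a random $\lambda$, then records the right-hand side minus the left-hand side; this is a sanity check under assumed random inputs, not a proof. All helper names (`kl`, `random_pmf`, `mix`, `convexity_gap`) are hypothetical.

```python
import math
import random


def kl(p, q):
    """Discrete KL loss from p to q; q is assumed strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def random_pmf(n, rng):
    """Random strictly positive probability vector of length n."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def mix(a, b, lam):
    """Pointwise mixture lam * a + (1 - lam) * b."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def convexity_gap(f1, g1, f2, g2, lam):
    """Right-hand side minus left-hand side of the convexity inequality."""
    lhs = kl(mix(f1, f2, lam), mix(g1, g2, lam))
    rhs = lam * kl(f1, g1) + (1.0 - lam) * kl(f2, g2)
    return rhs - lhs


rng = random.Random(0)  # fixed seed so the check is reproducible
gaps = []
for _ in range(1000):
    f1, g1, f2, g2 = (random_pmf(4, rng) for _ in range(4))
    gaps.append(convexity_gap(f1, g1, f2, g2, rng.random()))

# Every gap should be non-negative up to floating-point rounding.
print(min(gaps))
```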
Example: multivariate normal distributions
Suppose that we have two multivariate normal distributions $\mathcal{N}_0$ and $\mathcal{N}_1$, with means $\mu_0,\mu_1$ and with (nonsingular) covariance matrices $\Sigma_0,\Sigma_1$. If the two distributions have the same dimension $k$, then the Kullback–Leibler ($\mathrm{KL}$) loss from $\mathcal{N}_0$ to $\mathcal{N}_1$ is as follows:
$$\mathrm{KL}(\mathcal{N}_0\|\mathcal{N}_1)=\frac{1}{2}\left\{\mathrm{tr}\left(\Sigma_1^{-1}\Sigma_0\right)+\left(\mu_1-\mu_0\right)^{\mathrm{T}}\Sigma_1^{-1}\left(\mu_1-\mu_0\right)-k+\log\left(\frac{\det\Sigma_1}{\det\Sigma_0}\right)\right\}.$$
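The closed form translates almost term by term into NumPy. The following is a sketch under the stated assumption that both covariance matrices are nonsingular; the function name `kl_mvn` is hypothetical. It uses `np.linalg.solve` rather than forming $\Sigma_1^{-1}$ explicitly and `np.linalg.slogdet` for the log-determinants, which tends to be numerically safer than taking the logarithm of a determinant ratio.

```python
import numpy as np


def kl_mvn(mu0, sigma0, mu1, sigma1):
    """KL loss from N(mu0, sigma0) to N(mu1, sigma1), same dimension k."""
    mu0 = np.asarray(mu0, dtype=float)
    mu1 = np.asarray(mu1, dtype=float)
    sigma0 = np.asarray(sigma0, dtype=float)
    sigma1 = np.asarray(sigma1, dtype=float)
    k = mu0.shape[0]
    diff = mu1 - mu0
    # tr(Sigma1^{-1} Sigma0), computed without an explicit inverse
    trace_term = np.trace(np.linalg.solve(sigma1, sigma0))
    # (mu1 - mu0)^T Sigma1^{-1} (mu1 - mu0)
    quad_term = diff @ np.linalg.solve(sigma1, diff)
    # log(det Sigma1 / det Sigma0) as a difference of log-determinants
    _, logdet0 = np.linalg.slogdet(sigma0)
    _, logdet1 = np.linalg.slogdet(sigma1)
    return 0.5 * (trace_term + quad_term - k + logdet1 - logdet0)


# Hypothetical example: two 2-dimensional normals with identity covariance
# whose means differ by one unit along the first axis.
identity = np.eye(2)
print(kl_mvn([0.0, 0.0], identity, [1.0, 0.0], identity))
```

With equal covariances the trace term equals $k$ and the log-determinant term vanishes, so only half the quadratic term in the mean difference remains.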