Jhoan K. Hoyos-Osorio, Luis G. Sanchez-Giraldo
Abstract: Quantifying the difference between probability distributions is crucial in machine learning. However, estimating statistical divergences from empirical samples is challenging due to unknown underlying distributions. This work proposes the representation Jensen-Shannon divergence (RJSD), a novel measure inspired by the traditional Jensen-Shannon divergence. Our approach embeds data into a reproducing kernel Hilbert space (RKHS), representing distributions through uncentered covariance operators. We then compute the Jensen-Shannon divergence between these operators, thereby establishing a proper divergence measure between probability distributions in the input space. We provide estimators based on kernel matrices and empirical covariance matrices using Fourier features. Theoretical analysis reveals that RJSD is a lower bound on the Jensen-Shannon divergence, enabling variational estimation. Additionally, we show that RJSD is a higher-order extension of the maximum mean discrepancy (MMD), providing a more sensitive measure of distributional differences. Our experimental results demonstrate RJSD's superiority in two-sample testing, distribution shift detection, and unsupervised domain adaptation, outperforming state-of-the-art techniques. RJSD's versatility and effectiveness make it a promising tool for machine learning research and applications.
What problem does this paper attempt to address?
The paper addresses the problem of quantifying differences between probability distributions in machine learning. Because the underlying distributions are usually unknown in practice, estimating statistical divergences such as the Jensen-Shannon divergence from empirical samples is difficult. The paper proposes a new divergence measure, the representation Jensen-Shannon divergence (RJSD). The method embeds the data into a reproducing kernel Hilbert space (RKHS) and represents each distribution by its uncentered covariance operator, which yields a proper divergence between probability distributions in the input space. The paper also provides estimators based on kernel matrices and on empirical covariance matrices built from Fourier features, and it proves that RJSD is a lower bound on the classical Jensen-Shannon divergence, which enables variational estimation. According to the reported experiments, RJSD outperforms state-of-the-art techniques in two-sample testing, distribution shift detection, and unsupervised domain adaptation.
### Formula and Symbol Explanation
- **Jensen-Shannon Divergence**:
\[
D_{\text{JS}}(P, Q) = H\left(\frac{P + Q}{2}\right)-\frac{1}{2}\left(H(P)+H(Q)\right)
\]
where \(H(P)\) represents the Shannon entropy of the probability distribution \(P\).
- **Representation Jensen-Shannon Divergence (RJSD)**:
\[
D_{\text{H}}^{\text{JS}}(P, Q) = D_{\text{JS}}(C_P, C_Q) = S\left(\frac{C_P + C_Q}{2}\right)-\frac{1}{2}\left(S(C_P)+S(C_Q)\right)
\]
where \(C_P\) and \(C_Q\) are the uncentered covariance operators of the probability distributions \(P\) and \(Q\) in the RKHS, respectively, and \(S(C) = -\operatorname{Tr}(C \log C)\) is the von Neumann entropy of a covariance operator \(C\).
- **Maximum Mean Discrepancy (MMD)**:
\[
\text{MMD}_\kappa^2(P, Q)=\|\mu_P-\mu_Q\|^2_H
\]
where \(\mu_P\) and \(\mu_Q\) are the mean embeddings of the probability distributions \(P\) and \(Q\) in the RKHS, respectively.
- **Kernel Matrix**:
\[
K_X=\left[\kappa(x_i, x_j)\right]_{i, j = 1}^n
\]
where \(\kappa\) is a kernel function and \(X = \{x_1, x_2,\ldots, x_n\}\) is a sample set.
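
The definitions above can be turned into a small plug-in estimate. The following is a minimal sketch, not the authors' implementation. It assumes a Gaussian kernel (so that \(\kappa(x, x) = 1\) and \(K_X / n\) has unit trace) and equal sample sizes, and it relies on the fact that the nonzero eigenvalues of an empirical uncentered covariance operator coincide with those of \(K_X / n\). The function names and the bandwidth `sigma` are illustrative choices.

```python
import numpy as np


def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix [kappa(a_i, b_j)]; kappa(x, x) = 1 by construction."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def kernel_entropy(K):
    """Von Neumann entropy -Tr(A log A) of A = K / n, computed from eigenvalues."""
    n = K.shape[0]
    eig = np.linalg.eigvalsh(K / n)
    eig = eig[eig > 1e-12]  # drop numerically zero eigenvalues (0 log 0 = 0)
    return float(-np.sum(eig * np.log(eig)))


def rjsd_kernel(X, Y, sigma=1.0):
    """Plug-in RJSD estimate from two samples of equal size (assumption)."""
    if len(X) != len(Y):
        raise ValueError("this sketch assumes equal sample sizes")
    # With equal sizes, the covariance of the pooled sample is (C_X + C_Y) / 2.
    Z = np.vstack([X, Y])
    s_mix = kernel_entropy(gaussian_kernel(Z, Z, sigma))
    s_x = kernel_entropy(gaussian_kernel(X, X, sigma))
    s_y = kernel_entropy(gaussian_kernel(Y, Y, sigma))
    return s_mix - 0.5 * (s_x + s_y)
```

Calling `rjsd_kernel(X, Y)` on two arrays of shape `(n, d)` returns a value between 0 and \(\log 2\) up to numerical error; identical samples give a value close to zero.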
### Main Contributions
1. **Extension of Jensen-Shannon Divergence**: Extends the traditional Jensen-Shannon divergence to infinite-dimensional covariance operators and defines RJSD.
2. **Avoiding Density Estimation**: Maps data to an RKHS and represents distributions by uncentered covariance operators, which avoids estimating the underlying density functions directly.
3. **Estimators**: Proposes a sample-based RJSD estimator and discusses its consistency.
4. **Relationship with MMD**: Establishes the connection between RJSD and MMD and proves that MMD can be regarded as a special case of RJSD.
5. **Variational Estimation**: Proves that RJSD is a lower bound on the classical Jensen-Shannon divergence, so that a variational estimator can be constructed.
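
To illustrate the difference between covariance-based and mean-based comparisons, the sketch below computes both quantities in the same explicit feature space. It is an illustration under stated assumptions, not the estimator from the paper: it uses random Fourier features that approximate a Gaussian kernel, and the names `fourier_features`, `rjsd_and_mmd`, `n_features`, and `sigma` are hypothetical. The features are scaled to unit norm so that each empirical covariance matrix has trace 1.

```python
import numpy as np


def fourier_features(X, W):
    """Map each row x to [cos(Wx), sin(Wx)] / sqrt(D); every feature vector has norm 1."""
    proj = X @ W.T
    D = W.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)


def matrix_entropy(C):
    """Von Neumann entropy -Tr(C log C) of a unit-trace PSD matrix."""
    eig = np.linalg.eigvalsh(C)
    eig = eig[eig > 1e-12]  # drop numerically zero eigenvalues (0 log 0 = 0)
    return float(-np.sum(eig * np.log(eig)))


def rjsd_and_mmd(X, Y, n_features=64, sigma=1.0, seed=0):
    """Return (RJSD estimate, squared MMD estimate) in a shared feature space."""
    rng = np.random.default_rng(seed)
    # Frequencies drawn from N(0, I / sigma^2) approximate a Gaussian kernel.
    W = rng.normal(scale=1.0 / sigma, size=(n_features, X.shape[1]))
    Fx, Fy = fourier_features(X, W), fourier_features(Y, W)

    # Uncentered empirical covariance matrices (second-order statistics).
    Cx = Fx.T @ Fx / len(Fx)
    Cy = Fy.T @ Fy / len(Fy)
    rjsd = matrix_entropy(0.5 * (Cx + Cy)) - 0.5 * (
        matrix_entropy(Cx) + matrix_entropy(Cy)
    )

    # Empirical mean embeddings (first-order statistics).
    diff = Fx.mean(axis=0) - Fy.mean(axis=0)
    mmd2 = float(diff @ diff)
    return rjsd, mmd2
```

Because the mixture is formed directly as `0.5 * (Cx + Cy)`, this version does not require the two samples to have the same size. The squared MMD depends only on the difference of the feature means, whereas the RJSD estimate depends on the full spectra of the covariance matrices.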
### Experimental Results
The reported experiments show that RJSD outperforms existing state-of-the-art techniques in two-sample testing, distribution shift detection, and unsupervised domain adaptation.