Policy Embedding via Random Fourier Features
For a policy $\pi$, its occupancy measure $\rho^{\pi}$ captures the expected discounted visitation frequency of state-action pairs:
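One common normalized form of this definition, stated here as an assumption with discount factor $\gamma \in [0, 1)$ (a symbol not introduced in the surrounding text), is:

$$
\rho^{\pi}(\mathbf{s}, \mathbf{a}) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t} \, \Pr(\mathbf{s}_t = \mathbf{s}, \mathbf{a}_t = \mathbf{a} \mid \pi)
$$

The factor $(1 - \gamma)$ makes $\rho^{\pi}$ integrate to one; some texts omit it and work with the unnormalized sum instead.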
We embed policies into a finite-dimensional space using random Fourier features $\phi(\mathbf{s}, \mathbf{a})$ of the state-action pairs they visit:
where $\mathbf{w}_i \sim \mathcal{N}(0, \sigma^{-2}I)$ and $\mathbf{b}_i \sim U(0, 2\pi)$. The policy embedding is computed using $n$ rollouts of the policy as:
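The following is a minimal NumPy sketch of this embedding, not the authors' implementation. It assumes the standard cosine form of random Fourier features, $\sqrt{2/D}\cos(\mathbf{w}_i^{\top}\mathbf{x} + \mathbf{b}_i)$ applied to the concatenated state-action vector $\mathbf{x}$, and it assumes that each rollout contributes a discounted, $(1-\gamma)$-normalized sum of features. The function names (`make_rff`, `rff`, `embed_policy`), the number of features `num_features`, and the discount `gamma` are all hypothetical.

```python
import numpy as np

def make_rff(input_dim, num_features, sigma, rng):
    """Sample RFF parameters: w_i ~ N(0, sigma^{-2} I), b_i ~ U(0, 2*pi)."""
    # Variance sigma^{-2} corresponds to standard deviation 1 / sigma.
    W = rng.normal(0.0, 1.0 / sigma, size=(num_features, input_dim))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return W, b

def rff(x, W, b):
    """Map each row of x (a state-action vector) to sqrt(2/D) * cos(W x + b)."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(x @ W.T + b)

def embed_policy(rollouts, W, b, gamma=0.99):
    """Average the discounted feature sums over n rollouts.

    Each rollout is a (T, state_dim + action_dim) array of concatenated
    state-action pairs. Discounting and (1 - gamma) scaling are assumptions.
    """
    per_rollout = []
    for traj in rollouts:
        weights = (1.0 - gamma) * gamma ** np.arange(len(traj))
        per_rollout.append(weights @ rff(traj, W, b))
    return np.mean(per_rollout, axis=0)

# Usage with synthetic data standing in for real rollouts.
rng = np.random.default_rng(0)
W, b = make_rff(input_dim=6, num_features=256, sigma=1.0, rng=rng)
rollouts = [rng.normal(size=(50, 6)) for _ in range(8)]
embedding = embed_policy(rollouts, W, b)
```

The same `W` and `b` must be reused for every policy being compared; resampling them per policy would make the resulting embeddings incomparable.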
MMD Distance Approximation
The $\ell_2$ distance between two policy embeddings approximates the Maximum Mean Discrepancy (MMD) between the corresponding occupancy measures, and so serves as a measure of behavioral difference:
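As a rough numerical check of this approximation, the sketch below compares the $\ell_2$ distance between mean random-Fourier-feature embeddings with a biased sample estimate of the MMD under a Gaussian kernel of bandwidth $\sigma$. It is illustrative only: the two sample sets are synthetic stand-ins for visited state-action pairs, samples are weighted uniformly rather than by discount, and the helper names (`gaussian_kernel`, `mmd_exact`, `mmd_rff`) are hypothetical.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd_exact(X, Y, sigma):
    """Biased (V-statistic) MMD estimate between sample sets X and Y."""
    val = (gaussian_kernel(X, X, sigma).mean()
           + gaussian_kernel(Y, Y, sigma).mean()
           - 2.0 * gaussian_kernel(X, Y, sigma).mean())
    return np.sqrt(max(val, 0.0))

def mmd_rff(X, Y, W, b):
    """l2 distance between the mean random Fourier feature embeddings."""
    D = W.shape[0]
    feat = lambda Z: np.sqrt(2.0 / D) * np.cos(Z @ W.T + b)
    return np.linalg.norm(feat(X).mean(axis=0) - feat(Y).mean(axis=0))

rng = np.random.default_rng(0)
sigma, dim, D = 1.0, 4, 2000
W = rng.normal(0.0, 1.0 / sigma, size=(D, dim))   # w_i ~ N(0, sigma^{-2} I)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)         # b_i ~ U(0, 2*pi)
X = rng.normal(0.0, 1.0, size=(500, dim))         # samples from "policy 1"
Y = rng.normal(1.0, 1.0, size=(500, dim))         # samples from "policy 2"

exact = mmd_exact(X, Y, sigma)
approx = mmd_rff(X, Y, W, b)
print(f"kernel MMD: {exact:.3f}, RFF distance: {approx:.3f}")
```

The gap between the two values shrinks as the number of random features `D` grows, while the cost of the embedding-based distance stays linear in the number of samples instead of quadratic.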