AutoQD: Automatic Discovery of Diverse Behaviors with Quality-Diversity Optimization

University of Southern California
Figure 1: AutoQD automatically discovers behavioral descriptors by embedding policy occupancy measures, enabling Quality-Diversity optimization without hand-crafted features.

Abstract

Quality-Diversity (QD) algorithms have shown remarkable success in discovering diverse, high-performing solutions, but rely heavily on hand-crafted behavioral descriptors that constrain exploration to predefined notions of diversity. Leveraging the equivalence between policies and occupancy measures, we present a theoretically grounded approach that automatically generates behavioral descriptors by embedding the occupancy measures of policies in Markov Decision Processes. Our method, AutoQD, uses random Fourier features to approximate the Maximum Mean Discrepancy (MMD) between policy occupancy measures, creating embeddings whose distances reflect meaningful behavioral differences. A low-dimensional projection of these embeddings, capturing the most behaviorally significant dimensions, then serves as the behavioral descriptor for off-the-shelf QD methods. We prove that our embeddings converge to the true MMD distances between occupancy measures as the number of sampled trajectories and the embedding dimension increase. Through experiments in multiple continuous control tasks, we demonstrate AutoQD's ability to discover diverse policies without predefined behavioral descriptors, presenting a well-motivated alternative to prior methods in unsupervised reinforcement learning and QD optimization. Our approach opens new possibilities for open-ended learning and automated behavior discovery in sequential decision-making settings without requiring domain-specific knowledge.

Method

💡 Key Insight

Occupancy measures can be embedded to reflect behavioral similarity. We embed occupancy measures into a finite-dimensional space where distances approximate Maximum Mean Discrepancy between policies, providing behavioral descriptors for Quality-Diversity optimization.

Policy Embedding via Random Fourier Features

For a policy $\pi$, its occupancy measure $\rho^{\pi}$ captures the expected discounted visitation frequency of state-action pairs:

$$\rho^{\pi}(\mathbf{s}, \mathbf{a}) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t P(S_t = \mathbf{s}, A_t = \mathbf{a} | \pi)$$

We embed policies into a finite-dimensional space using random Fourier features $\phi(\mathbf{s}, \mathbf{a})$ of the state-action pairs they visit:

$$\phi(\mathbf{s}, \mathbf{a}) = \sqrt{\frac{2}{D}} \left[\cos(\mathbf{w}_1^T[\mathbf{s}; \mathbf{a}] + \mathbf{b}_1), \ldots, \cos(\mathbf{w}_D^T[\mathbf{s}; \mathbf{a}] + \mathbf{b}_D)\right]$$

where $\mathbf{w}_i \sim \mathcal{N}(0, \sigma^{-2}I)$ and $\mathbf{b}_i \sim U(0, 2\pi)$. The policy embedding is computed using $n$ rollouts of the policy as:

$$\psi^{\pi} = \frac{1}{n} \sum_{j=1}^{n} (1-\gamma) \sum_{t=0}^{T} \gamma^t \phi(\mathbf{s}_t^j, \mathbf{a}_t^j)$$
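To make this concrete, the feature map and embedding above fit in a few lines of NumPy. The sketch below is illustrative: the function names, defaults ($D$, $\sigma$, $\gamma$), and rollout data layout are our own choices, not the paper's reference implementation.

```python
import numpy as np

def make_rff(dim_sa, D=1024, sigma=1.0, seed=0):
    """Random Fourier feature map phi(s, a) for an RBF kernel with bandwidth sigma."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / sigma, size=(D, dim_sa))  # w_i ~ N(0, sigma^{-2} I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)           # b_i ~ U(0, 2*pi)

    def phi(sa):
        # sa: (..., dim_sa) concatenated state-action vectors -> (..., D) features
        return np.sqrt(2.0 / D) * np.cos(sa @ W.T + b)

    return phi

def policy_embedding(trajectories, phi, gamma=0.99):
    """psi^pi: average over n rollouts of the discounted sum of features."""
    embeddings = []
    for states, actions in trajectories:            # shapes (T, d_s) and (T, d_a)
        sa = np.concatenate([states, actions], axis=1)
        discounts = gamma ** np.arange(len(sa))     # gamma^t for t = 0..T
        embeddings.append((1 - gamma) * (discounts[:, None] * phi(sa)).sum(axis=0))
    return np.mean(np.stack(embeddings), axis=0)
```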

MMD Distance Approximation

The $\ell_2$ distance between embeddings approximates behavioral differences as measured by Maximum Mean Discrepancy:

$$\|\psi^{\pi_1} - \psi^{\pi_2}\|_2 \approx \text{MMD}(\rho^{\pi_1}, \rho^{\pi_2})$$

Figure 2: Policy embedding via random Fourier features. The $\ell_2$ distance between policy embeddings approximates the Maximum Mean Discrepancy between their occupancy measures, providing a theoretically grounded measure of behavioral similarity.
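
The estimate itself is then a single Euclidean distance between embeddings. A brief usage sketch, reusing `make_rff` and `policy_embedding` from above; the dimensions and the `trajs_pi1`/`trajs_pi2` rollout lists are hypothetical:

```python
# Hypothetical setup: 17-d states and 6-d actions (e.g., a locomotion task);
# trajs_pi1 / trajs_pi2 are lists of (states, actions) rollouts per policy.
phi = make_rff(dim_sa=17 + 6, D=2048, sigma=1.0)

psi1 = policy_embedding(trajs_pi1, phi)
psi2 = policy_embedding(trajs_pi2, phi)

mmd_estimate = np.linalg.norm(psi1 - psi2)  # ~ MMD(rho^{pi_1}, rho^{pi_2})
```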

Behavioral Descriptor Generation

We use calibrated weighted PCA to project high-dimensional embeddings to behavioral descriptors:

$$\text{desc}(\pi) = \mathbf{A}\psi^{\pi} + \mathbf{b}$$

where $\mathbf{A}$ and $\mathbf{b}$ are learned through weighted PCA that emphasizes high-performing policies, followed by calibration to ensure compatibility with QD archives.
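
A minimal sketch of this projection step, assuming fitness-derived weights and a simple min-max calibration to the archive's bounds; both the weighting and the calibration shown here are illustrative stand-ins for the paper's scheme:

```python
import numpy as np

def weighted_pca(Psi, weights, k=2):
    """Project embeddings Psi (n, D) to k dims, weighting rows by `weights`."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mu = w @ Psi                                  # weighted mean embedding
    X = (Psi - mu) * np.sqrt(w)[:, None]          # weighted, centered data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt[:k]                                    # top-k principal directions
    return A, -A @ mu                             # desc(pi) = A @ psi + b

def calibrate(desc, lo=0.0, hi=1.0):
    """Min-max rescale descriptors into the archive's expected range."""
    d_min, d_max = desc.min(axis=0), desc.max(axis=0)
    return lo + (hi - lo) * (desc - d_min) / (d_max - d_min + 1e-12)

# Usage (hypothetical): emphasize high-performing policies via shifted fitness.
# A, b = weighted_pca(Psi, weights=fitness - fitness.min() + 1e-6, k=2)
# descriptors = calibrate(Psi @ A.T + b)
```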

Theoretical Guarantee

Theorem (MMD Approximation): For any two policies $\pi_1, \pi_2$ with embeddings $\psi^{\pi_1}, \psi^{\pi_2}$ estimated from $n$ rollouts each using $D$-dimensional random Fourier features:

$$\Pr\left[\left|\|\psi^{\pi_1} - \psi^{\pi_2}\|_2 - \text{MMD}(\rho^{\pi_1}, \rho^{\pi_2})\right| \geq \frac{3}{4}\epsilon\right] \leq 2e^{-nc\epsilon^2} + O\left(\frac{1}{\epsilon^2}\exp\left(-\frac{D\epsilon^2}{64(d+2)}\right)\right) + 6e^{-\frac{n\epsilon^2}{8}}$$

where $d$ is the dimension of the state-action space and $c$ is a constant.

This theorem guarantees that our embeddings converge to true MMD distances between occupancy measures as the number of samples ($n$) and embedding dimensions ($D$) increase, providing theoretical grounding for our approach.
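
The convergence in $D$ is easy to probe empirically. The toy check below is our own construction (reusing `make_rff` from the earlier sketch): it compares the embedding distance against an exact RBF-kernel MMD between two Gaussian sample sets as $D$ grows.

```python
import numpy as np

def mmd_rbf(X, Y, sigma=1.0):
    """Biased sample estimate of MMD under k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return np.sqrt(gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean())

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(500, 4))   # samples standing in for rho^{pi_1}
Y = rng.normal(0.5, 1.0, size=(500, 4))   # samples standing in for rho^{pi_2}

exact = mmd_rbf(X, Y)
for D in (16, 256, 4096):
    phi = make_rff(4, D=D, seed=2)
    approx = np.linalg.norm(phi(X).mean(axis=0) - phi(Y).mean(axis=0))
    print(f"D={D:5d}  |approx - exact| = {abs(approx - exact):.4f}")
```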

Discovering Diverse Behaviors

We evaluated AutoQD against five baseline methods across six continuous control environments using three metrics: the Ground-Truth QD Score (computed with hand-crafted descriptors), the Quality-Weighted Vendi Score (qVS), and the Vendi Score (VS), with the latter two assessing the diversity of the discovered policies.
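
For reference, the Vendi Score of a collection is the exponentiated Shannon entropy of the eigenvalues of its normalized similarity matrix. The sketch below computes the unweighted score; the RBF similarity over policy embeddings is an assumed choice, and the quality-weighted variant (which additionally accounts for policy fitness) is omitted:

```python
import numpy as np

def vendi_score(K):
    """Effective number of distinct items, given an n x n PSD similarity
    matrix K with ones on the diagonal."""
    lam = np.linalg.eigvalsh(K / K.shape[0])   # eigenvalues sum to 1
    lam = lam[lam > 1e-12]
    return float(np.exp(-(lam * np.log(lam)).sum()))

# Usage (hypothetical): similarity between policies from their embeddings.
# Psi: (n, D) matrix of policy embeddings.
# d2 = ((Psi[:, None] - Psi[None, :]) ** 2).sum(-1)
# diversity = vendi_score(np.exp(-d2 / 2.0))
```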

AutoQD outperforms the baseline methods in most environments, achieving the highest QD score in 5 of the 6 tasks, and successfully discovers diverse, high-performing policies without requiring domain-specific behavioral descriptors.

Additionally, the diverse policy collections discovered by AutoQD show superior adaptability when tested under modified environment dynamics, such as varying friction coefficients and mass scaling, indicating that the automatically discovered behavioral descriptors capture meaningful and robust behavioral variations.

Examples

Bipedal Walker: smooth movement, low crawling, and stiff-leg walking.

Swimmer: smooth motion, tail wiggling, and burst movement.

BibTeX

@article{hedayatian2024autoqd,
  title={AutoQD: Automatic Discovery of Diverse Behaviors with Quality-Diversity Optimization},
  author={Hedayatian, Saeed and Nikolaidis, Stefanos},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}