Predicting Steerability in Generative Models

Abstract

A central goal in controllable generative modeling is to steer a target concept through latent interventions without inadvertently changing other concepts. A natural approach is to train probes, but probe accuracy is not a reliable indicator of steerability. We introduce a post-hoc metric that measures whether a concept's influence is concentrated along a single direction in latent space, and show on dSprites, using a VAE, that it predicts steerability more reliably than probes and sparse probes. These results offer a practical diagnostic for determining whether a concept can be effectively controlled in a generative model.

[Figure: Traversal]

Method

To quantify how concentrated a concept's influence is in the latent space of a generative model, we compute the gradient $\nabla_{\mathbf{x}} c$ of a concept classifier $c$ with respect to the generated image $\mathbf{x} = \mathrm{dec}(\mathbf{z})$ and pull it back through the decoder's Jacobian $J_{\mathbf{z}} = \partial \mathbf{x} / \partial \mathbf{z}$. This yields a latent vector $\mathbf{g}(\mathbf{z}) = J_{\mathbf{z}}^{\top} \nabla_{\mathbf{x}} c$, which points in the direction of steepest change of the concept in the latent space. Collecting these vectors over a set of data points and performing principal component analysis (PCA), we measure how much of the total variance is explained by the first principal component. The fraction $\mathrm{VE}_1 = \lambda_1 / \sum_k \lambda_k$, where $\lambda_1 \ge \lambda_2 \ge \dots$ are the eigenvalues of the covariance of $\{\mathbf{g}(\mathbf{z}_i)\}$, indicates whether the concept is aligned with a single dominant direction, a signature of good steerability.
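
As a concrete illustration, the sketch below shows one way to compute $\mathrm{VE}_1$ in PyTorch. The names `decoder`, `concept`, and the latent batch `z` are hypothetical stand-ins for a trained decoder and concept classifier; autograd's vector-Jacobian product yields $J_{\mathbf{z}}^{\top} \nabla_{\mathbf{x}} c$ without ever materializing the Jacobian.

```python
# Minimal sketch, assuming a PyTorch decoder (z -> x) and a differentiable
# concept classifier (x -> scalar score per sample); both are hypothetical.
import torch

def pullback_gradients(decoder, concept, z):
    """Return g(z_i) = J_{z_i}^T grad_x c for each latent in the batch."""
    z = z.detach().requires_grad_(True)
    x = decoder(z)                     # generated images, differentiable in z
    c = concept(x).sum()               # summing gives per-sample gradients at once
    (g,) = torch.autograd.grad(c, z)   # VJP: J_z^T (dc/dx), no explicit Jacobian
    return g                           # shape: (num_samples, latent_dim)

def ve1(grads):
    """VE_1 = lambda_1 / sum_k lambda_k of the gradients' covariance."""
    g = grads - grads.mean(dim=0, keepdim=True)   # center the gradient vectors
    cov = g.T @ g / (g.shape[0] - 1)              # latent-space covariance
    eigvals = torch.linalg.eigvalsh(cov)          # ascending eigenvalues
    return (eigvals[-1] / eigvals.sum()).item()
```

A $\mathrm{VE}_1$ near one then indicates that the concept gradients collapse onto a single latent direction.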

Experiments

We train a $\beta$-TCVAE (Chen et al., 2019) ($\beta = 5$, $\alpha = \gamma = 1$) with the convolutional architecture of Burgess et al. (2018) on dSprites (Google DeepMind, 2026), a dataset of 737,280 binary $64 \times 64$ images generated from five independent factors: shape, scale, rotation, $x$-position, and $y$-position. We train with three random seeds $\{0, 1, 2\}$ and report mean $\pm$ standard deviation throughout. We compare our metric against probe and sparse-probe baselines.
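
For reference, the $\beta$-TCVAE objective of Chen et al. (2019) decomposes the ELBO's KL term so that $\alpha$, $\beta$, and $\gamma$ weight the index-code mutual information, total correlation, and dimension-wise KL, respectively; our setting $\alpha = \gamma = 1$, $\beta = 5$ up-weights only the total-correlation penalty:

$$\mathcal{L} = \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}\!\left[\log p(\mathbf{x} \mid \mathbf{z})\right] - \alpha\, I_q(\mathbf{x}; \mathbf{z}) - \beta\, \mathrm{KL}\!\left(q(\mathbf{z}) \,\middle\|\, \textstyle\prod_j q(z_j)\right) - \gamma \sum_j \mathrm{KL}\!\left(q(z_j) \,\middle\|\, p(z_j)\right)$$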

References

[1] Chen, Ricky T. Q., Li, Xuechen, Grosse, Roger, and Duvenaud, David. (2019). Isolating Sources of Disentanglement in Variational Autoencoders. http://arxiv.org/abs/1802.04942

[2] Burgess, Christopher P., Higgins, Irina, Pal, Arka, Matthey, Loic, Watters, Nick, Desjardins, Guillaume, and Lerchner, Alexander. (2018). Understanding disentangling in $\beta$-VAE. http://arxiv.org/abs/1804.03599

[3] Google DeepMind. (2026). google-deepmind/dsprites-dataset. https://github.com/google-deepmind/dsprites-dataset