Reusing Computation in Text‑to‑Image Diffusion for Efficient Generation of Image Sets

Dale Decatur1 Thibault Groueix2 Yifan Wang2 Rana Hanocka1 Vova Kim2 Matheus Gadelha2
1University of Chicago, 2Adobe Research

Our approach shares denoising steps across correlated prompts to enable more efficient generation of image sets. We leverage the coarse-to-fine nature of diffusion models, where early denoising steps capture structures shared among similar prompts. We propose a training-free approach that clusters prompts by semantic similarity and shares computation in the early diffusion steps. Experiments show that, for models trained to condition on image embeddings, our approach significantly reduces compute cost while improving image quality.

Method overview

Overview of our shared hierarchical diffusion. Left: our approach relies on a tree structure obtained by running agglomerative clustering on a set of prompt embeddings. Each node in the tree stores the average of its children's embeddings and a heterogeneity score \(c^{\text{score}}\) based on the distance between its two children. To connect the tree hierarchy to the denoising steps, we design a function \(\phi\) that takes the denoising step \(k\) as input and, together with \(c^{\text{score}}\), determines which tree level to use at step \(k\). Right: as a result, early diffusion steps are shared using the averaged embeddings, and the denoising steps gradually diverge to individual prompt embeddings, saving computation while maintaining high image generation quality.
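The tree construction and step-to-level schedule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the pairwise-merge clustering, the linear form of \(\phi\), and all function names here are hypothetical stand-ins for whatever the actual codebase uses.

```python
import numpy as np

def build_tree(embeddings):
    """Agglomerative clustering: repeatedly merge the closest pair of
    clusters. Each internal node stores the mean embedding of its leaves
    and a heterogeneity score (the distance between its two children)."""
    nodes = [{"emb": np.asarray(e, dtype=float), "score": 0.0,
              "leaves": [i], "children": None}
             for i, e in enumerate(embeddings)]
    active = list(range(len(nodes)))
    parent = {}
    while len(active) > 1:
        # Find the closest active pair by embedding distance.
        best = None
        for a in range(len(active)):
            for b in range(a + 1, len(active)):
                i, j = active[a], active[b]
                d = float(np.linalg.norm(nodes[i]["emb"] - nodes[j]["emb"]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        ni, nj = len(nodes[i]["leaves"]), len(nodes[j]["leaves"])
        nodes.append({
            # Weighted mean so the node equals the average of its leaves.
            "emb": (ni * nodes[i]["emb"] + nj * nodes[j]["emb"]) / (ni + nj),
            "score": d,  # heterogeneity of this merge
            "leaves": nodes[i]["leaves"] + nodes[j]["leaves"],
            "children": (i, j),
        })
        new_id = len(nodes) - 1
        parent[i] = parent[j] = new_id
        active = [x for x in active if x not in (i, j)] + [new_id]
    return nodes, parent

def phi(k, num_steps, max_score):
    """Assumed linear schedule: the tolerated heterogeneity shrinks with
    step k, so early steps use coarse shared nodes, late steps use leaves."""
    return max_score * (1.0 - k / (num_steps - 1))

def node_for_prompt(leaf, k, num_steps, nodes, parent, max_score):
    """Highest ancestor whose heterogeneity fits within phi at step k."""
    cur, thresh = leaf, phi(k, num_steps, max_score)
    while cur in parent and nodes[parent[cur]]["score"] <= thresh:
        cur = parent[cur]
    return cur
```

With two tight prompt pairs, all prompts map to the root at step 0 (one denoising call serves every prompt) and to their own leaves at the final step, matching the coarse-to-fine sharing shown in the figure.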

Qualitative comparison

Comparison with standard diffusion inference. We compare our approach (middle) with standard diffusion for a fixed compute budget of \(18\) steps (left) and a full compute budget of \(40\) steps (right). For the fixed compute budget, our method produces higher quality images than the standard approach. Compared to the \(40\)-step compute budget, our approach uses significantly fewer steps while generating images of comparable, if not higher, quality.

Coarse-to-fine generation

Coarse-to-fine generation. We show intermediate latents from the diffusion process for Stable Diffusion 1.5 (top row), Stable UnCLIP (second row), Karlo (third row), and Kandinsky (bottom row). The models trained without a text-to-image prior (Stable Diffusion and Stable UnCLIP) learn structural details and high-frequency features earlier in the diffusion process. In contrast, Karlo and Kandinsky, which were trained with a text-to-image prior, learn structural details later in denoising and can quickly add high-frequency details in a few steps, making them ideal for our compute-sharing approach.

Generation quality vs. compute

Generation quality at fixed compute budgets. We report VQA Score on our Prompt Template Dataset for both our method (orange) and the standard approach (blue) over various diffusion step budgets. Note that at a \(40\)-step budget, our approach is identical to the standard approach. For all other fixed compute budgets, our method achieves higher VQA Scores (a measure of generation quality) than the standard approach.

Compute savings

Compute savings using our approach. We run our method on four diverse datasets ranging from most general (GenAI Bench) to more structured (Style Variations) and compute the VQA Score for both our method and the standard approach. For each dataset, we report the number of images, the percentage of compute saved by our approach relative to standard denoising, and the percentage of images on which our approach achieves a higher VQA Score. The win percentages show that our method consistently produces results of comparable, if not higher, quality than the standard approach, while using significantly less compute (as evidenced by the compute-saved percentages).
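The compute-saved percentage above has a simple interpretation: it compares the number of denoising forward passes actually executed (one per distinct tree node per step) against the naive cost of denoising every prompt independently at every step. A minimal sketch, assuming a per-step assignment of prompts to nodes (the helper name is hypothetical):

```python
def compute_saved(assignments):
    """assignments[k][p] is the tree node whose embedding is denoised for
    prompt p at step k; prompts mapped to the same node share one forward
    pass. Savings = 1 - (actual forward passes / naive forward passes)."""
    num_prompts = len(assignments[0])
    naive = num_prompts * len(assignments)          # one pass per prompt per step
    actual = sum(len(set(step)) for step in assignments)  # distinct nodes per step
    return 1.0 - actual / naive

# Hypothetical 4-prompt, 4-step run: all prompts share one node at step 0,
# split into two groups at step 1, then run individually for steps 2-3.
savings = compute_saved([[0, 0, 0, 0],
                         [1, 1, 2, 2],
                         [3, 4, 5, 6],
                         [3, 4, 5, 6]])
```

Here the run costs 1 + 2 + 4 + 4 = 11 forward passes instead of 16, saving 31.25% of the compute; larger, more redundant prompt sets sustain sharing over more steps, which is where the savings reported in the table come from.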

Applications

Applications. We show several specific applications of our method. Left: efficiently generating style variations on a given input prompt. We show a subset of 100 generated images, saving 74% of the equivalent compute for the standard approach. Middle: efficiently generating subject variations on a given input prompt. We show a subset of 500 generated images, saving 76% of the equivalent compute for the standard approach. Right: virtual try-on. Our method generates a set of only 16 images, all containing the same subject with various accessories, saving 65.3% of the equivalent compute for the standard approach.

Acknowledgments

We thank Richard Zhang for insights on evaluation metrics and more generally, the members of Adobe Research and 3DL for their insightful feedback. This work was supported by Adobe Research and NSF grant 2140001.

BibTeX

@InProceedings{decatur2025reusing,
  title={Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets},
  author={Dale Decatur and Thibault Groueix and Yifan Wang and Rana Hanocka and Vova Kim and Matheus Gadelha},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={},
  year={2025}
}