The latent generator phase (Stage C) transforms the user input into compact 24×24 latents, which are then passed to the latent decoder phase (Stages A and B). The latent decoder phase is used to compress images, similar to the role of the VAE in Stable Diffusion, but achieving a much higher compression ratio.
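To make the compression claim concrete, here is a rough back-of-the-envelope comparison, assuming a 1024×1024 output image (Stable Cascade's default resolution); the exact figures are illustrative, not from the text above:

```python
# Rough spatial-compression comparison (illustrative; assumes a 1024x1024
# output image, Stable Cascade's default resolution).
image_side = 1024          # pixels per side of the generated image
cascade_latent_side = 24   # Stage C latent grid, per the text above
sd_vae_factor = 8          # Stable Diffusion's VAE downsamples 8x per side

cascade_factor = image_side / cascade_latent_side   # ~42.7x per side
ratio = cascade_factor**2 / sd_vae_factor**2        # ~28x fewer latent positions
print(f"Stable Cascade: ~{cascade_factor:.1f}x per side, "
      f"~{ratio:.0f}x fewer latent positions than an 8x VAE")
```

Working in this much smaller latent space is what makes Stage C cheap to train and fine-tune relative to a pixel-space or 8x-latent model.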
By decoupling the text-conditional generation (Stage C) from the decoding to high-resolution pixel space (Stages A and B), additional training and fine-tuning, including ControlNets and LoRAs, can be performed on Stage C alone. This comes with a 16x cost reduction compared to training a similarly sized Stable Diffusion model (as reported in the original paper). Stages A and B can optionally be fine-tuned for additional control, but this would be comparable to fine-tuning the VAE of a Stable Diffusion model. In most cases the additional benefit is minimal, so we recommend training only Stage C and leaving Stages A and B in their original state.
Stage C and Stage B will each be released in two model sizes: 1B and 3.6B parameters for Stage C, and 700M and 1.5B parameters for Stage B. For Stage C, we recommend the 3.6B model, which produces the highest-quality output; the 1B parameter version is available for those prioritizing low hardware requirements. For Stage B, both sizes achieve good results, but the 1.5B parameter version is better at reconstructing fine details. Thanks to Stable Cascade's modular approach, the VRAM required for inference can be kept to around 20 GB, and can be lowered further by using the smaller variants (which, as mentioned above, may also reduce the final output quality).
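As a rough sanity check on these memory figures, the fp16 weight footprint of each variant can be estimated at 2 bytes per parameter. This sketch only counts raw weights; actual inference VRAM also includes activations, the text encoder, and Stage A, so real usage is higher:

```python
# Back-of-the-envelope fp16 weight footprint per variant (2 bytes per
# parameter). Illustrative only: real inference VRAM also covers
# activations, the text encoder, and Stage A.
BYTES_PER_PARAM = 2  # fp16

variants = {
    "Stage C (3.6B)": 3.6e9,
    "Stage C (1B)":   1.0e9,
    "Stage B (1.5B)": 1.5e9,
    "Stage B (700M)": 0.7e9,
}

for name, params in variants.items():
    gib = params * BYTES_PER_PARAM / 1024**3
    print(f"{name}: ~{gib:.1f} GiB of weights in fp16")
```

This is also why pairing the smaller Stage C and Stage B variants lowers the VRAM requirement well below the ~20 GB needed for the largest combination.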
Comparison
Our evaluations show that Stable Cascade performs best in both prompt alignment and aesthetic quality in almost all model comparisons. The figure shows the results of a human evaluation using a mix of Parti prompts and aesthetic prompts.