
VAE & You!

AlexiMonke


If you've been using Stable Diffusion for image generation for really any amount of time, the chances of you running into the term "VAE" are very, very high.
You know it perhaps makes images look less washed out, or can maybe affect how faces are rendered on an SD checkpoint, but what actually is it?

We'll continue to refer to it as "VAE", but the acronym itself stands for "Variational AutoEncoder". Now, there's a frankly insane number of rabbit holes we could fall into when it comes to explaining what a VAE is and what it does. But that's not the purpose of these pages.
I'm just here to get you familiar with some things so you understand the concept.

Extremely reductive metaphor, but if the AI is drawing for you, then the VAE is kind of like the pencil it actually does it with.

The Pipeline goes something like this:
➤ AI is given an image or the specifications of an image (e.g. dimensions)
➤ The AI "encodes" this into a "latent space image" with the VAE (if all you gave it was dimensions, it instead starts from random noise in that latent space)
➤ The AI does its thing (there's a lot of math and computer thinking here—Pretty scary if you ask me)
➤ The AI "decodes" this into a "pixel space image" with the VAE
➤ You are presented with the final result
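
If code is more your speed, here's roughly what that whole pipeline looks like with Hugging Face's diffusers library. This is just a minimal sketch, not the exact code your UI runs; the checkpoint name and prompt are placeholders, so swap in whatever you actually use.

```python
# Rough sketch of the pipeline above using the diffusers library.
# The checkpoint name and prompt are placeholders; your UI does these
# same steps internally, just with more bells and whistles.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # any SD 1.5 checkpoint works here
    torch_dtype=torch.float16,
).to("cuda")

# pipe.vae is the Variational AutoEncoder: it holds both the encoder
# (pixel space -> latent space) and the decoder (latent space -> pixel space).
print(type(pipe.vae).__name__)  # AutoencoderKL

# One call runs the whole chain: start in latent space, sample, decode with the VAE.
image = pipe("a corgi wearing a tiny wizard hat", num_inference_steps=25).images[0]
image.save("corgi.png")
```

Everything the rest of this page talks about is happening inside that one pipe(...) call.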

So... What's actually happening here?


First—"Encoding"
The image we give it first needs to be encoded into a latent space image (and if we only gave it specifications, the AI conjures up a noisy latent canvas of that size instead). This is what allows our AI friend to understand what it's actually looking at. It's got a whole lot of equations and mathematical machinery going on that lets it break down the image we gave it and understand it in its own way. Our AI buddy might know how to think, but it doesn't "think" in the same way you or I do.
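
To make the encoding half a little more concrete, here's a small sketch that feeds an image through the VAE's encoder by hand, reusing the pipe object from the earlier sketch (the image path is made up). The thing to notice is the shape: a 512x512 pixel image gets squeezed down into a 64x64 latent.

```python
# Encoding: pixel space -> latent space. Assumes `pipe` from the earlier
# sketch is already loaded; "my_photo.png" is a placeholder path.
import numpy as np
import torch
from PIL import Image

img = Image.open("my_photo.png").convert("RGB").resize((512, 512))
pixels = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0    # scale pixels to [-1, 1]
pixels = pixels.permute(2, 0, 1).unsqueeze(0)                     # shape (1, 3, 512, 512)
pixels = pixels.to(pipe.device, dtype=pipe.vae.dtype)

with torch.no_grad():
    latent = pipe.vae.encode(pixels).latent_dist.sample()
    latent = latent * pipe.vae.config.scaling_factor              # SD's usual latent scaling

print(latent.shape)  # torch.Size([1, 4, 64, 64]) -- 8x smaller in height and width
```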

Second—"Doing its thing"
This is where the AI is doing stuff in the sampler node. It's looking at your prompts and the image or specifications you gave it, then proceeding to do what I'm fairly certain is a computer's version of haruspicy.
(In slightly less mystical terms: the sampler gradually removes noise from the latent image, step by step, nudged along by your prompt. But my understanding is a little hazy.)
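
For the curious, here's very roughly what "doing its thing" looks like when spelled out with diffusers. It reuses the pipe from the first sketch, skips a bunch of real-world details (classifier-free guidance, negative prompts, and so on), and the exact helper names can differ between diffusers versions, so treat it as a sketch rather than gospel.

```python
# A hand-wavy sketch of the sampling loop, using pieces from the `pipe`
# loaded earlier. Skips classifier-free guidance and other details.
import torch

pipe.scheduler.set_timesteps(25)    # 25 denoising steps

# Turn the text prompt into embeddings the UNet can be guided by
prompt_embeds, _ = pipe.encode_prompt(
    "a corgi wearing a tiny wizard hat",
    device=pipe.device,
    num_images_per_prompt=1,
    do_classifier_free_guidance=False,
)

# txt2img case: start from pure noise in latent space
latents = torch.randn(1, 4, 64, 64, device=pipe.device, dtype=prompt_embeds.dtype)
latents = latents * pipe.scheduler.init_noise_sigma

for t in pipe.scheduler.timesteps:
    model_input = pipe.scheduler.scale_model_input(latents, t)
    noise_pred = pipe.unet(model_input, t, encoder_hidden_states=prompt_embeds).sample
    latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample  # a little less noisy each step

# `latents` is now a finished latent space image, waiting to be decoded by the VAE
```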

Third—"Decoding"
Remember when I said that encoding the image turns it into a bunch of code and equations for the AI to understand it? Well I've got good news and bad news!
Good news is our AI friend just finished generating your image!
Bad news is it's still in those funny esoteric equations that scare me.

So, that's what decoding is for! Whatever the heck the AI divined from its digital haruspicy, it now translates into something in pixel space that you can actually look at!
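
And here's the decoding half spelled out, again assuming the pipe and latent variables from the earlier sketches:

```python
# Decoding: latent space -> pixel space. Reuses `pipe` and `latent` from
# the sketches above.
import torch
from PIL import Image

with torch.no_grad():
    recon = pipe.vae.decode(latent / pipe.vae.config.scaling_factor).sample

# Undo the [-1, 1] scaling and turn the tensor back into a viewable image
recon = (recon / 2 + 0.5).clamp(0, 1)
recon = (recon[0].permute(1, 2, 0).float().cpu().numpy() * 255).round().astype("uint8")
Image.fromarray(recon).save("decoded.png")   # back in pixel space, something you can actually look at
```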


ULTRA tl;dr / Terms Broken Down
tl;dr
VAE is how AI takes what you're feeding it, understands it, then reconstructs it
Terms:
VAE: The tool the AI uses to break down concepts into things it can understand, and to reconstruct them into visual images.
Pixel Space Image: Funny name for "visual images". Just stuff constructed from pixels.
Latent Space Image: How the AI sees the image or criteria you gave it.


"Do I need a VAE?"
Yes. It's literally a requirement; without one, there's no way to decode the latent image into pixels you can actually look at.

"I didn't utilize a VAE on the Automatic WebUI."
You did. All SD Checkpoints have a VAE baked into them.

"If all SD Checkpoints have a VAE baked in, why use a different one?"
Baking in a VAE is needed for the checkpoint to function at all, but the one included is usually a very basic stock VAE.
There are vastly superior VAEs in circulation, and you can load one of those to use in its place.

"Why don't Checkpoint creators just bake in a better VAE?"
Because "better" in this context is subjective.
VAEs are trained on a variety of different materials. A VAE trained to achieve an anime style won't function the same way a VAE trained to make photo-realistic images will.
Thus, it's best if the decision of what VAE to use is left up to the user. That way they can use the VAE that best suits their needs.
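
If you're using diffusers directly, swapping in a VAE of your choice looks something like the sketch below (the repo names are just common examples, not recommendations). In the Automatic WebUI or ComfyUI, the equivalent is picking a VAE from the settings dropdown or wiring up a VAE loader node.

```python
# Loading a standalone VAE and using it in place of the checkpoint's baked-in one.
# Repo names are examples; point these at whatever checkpoint/VAE you prefer.
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

better_vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse",           # a commonly used drop-in VAE for SD 1.5
    torch_dtype=torch.float16,
)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    vae=better_vae,                         # overrides the baked-in VAE
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a corgi wearing a tiny wizard hat").images[0]
image.save("corgi_better_vae.png")
```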