FamilyDev

SD 1.5 and text

Added 2024-04-05 08:50:13 +0000 UTC

Today I'm going to talk about what SD can't do.

Stable Diffusion 1.5 was trained on a dataset with 5.85 billion images.

What is this huge dataset?

URL: the image url, millions of domains are covered
TEXT: captions, in english for en, other languages for multi and nolang
WIDTH: picture width
HEIGHT: picture height
LANGUAGE: the language of the sample, only for laion2B-multi, computed using cld3
similarity: cosine between text and image ViT-B/32 embeddings, clip for en, mclip for multi and nolang
pwatermark: probability of being a watermarked image, computed using our watermark detector
punsafe: probability of being an unsafe image, computed using our clip based detector

Important: SD was trained only at a resolution of 512 x 512, so not all of the dataset was used. And because of this, the most mathematically correct is the generation of images with a resolution of 512 x 512 pixels.

Since the creators had other tasks, they did not teach SD text. SD saw text - there was text on many images in the dataset, but they didn't explain to SD what it was. What happens because of this?

Imagine that you had looked through thousands of random texts but did not know what logic was behind their construction. There were many languages and they had different letters, so you did not understand what they meant and how to put them together. And now you were being asked to create a word based on the information you had. You would draw random letters you had often seen together, but the chance that they would make sense was very small.

Let's try this. The usual prompt is "clothes store sign, close-up".

We see SD trying to make a sign for a clothing store, but it doesn't make sense.

What should we do now, sad?

I really wanted to open a coffee shop with a beautiful sign.

ControlNet comes into play!

ControlNet is a neural network structure for controlling diffusion models by adding additional conditions. In our case, we will draw lines that the SD will be required to take into account.

Example: go into Photoshop, write the best name on the planet. In my case, it was "Amy"

Make a mask out of it so that our text has volume. Black is mean that the neural network can draws anything there, while white is the designation for the shape of any structure. In our example, the best name in the world.

Write ANY prompt without mentioning that we need to write any text.

It worked! Now, she has generated a picture based on our mask. Now, I can make text anywhere, and it will be perfect. Or not?

Get the mask from our dysfunctional coffee bar.

Writing a sign using Photoshop.

Generating it…

Not bad… It is definitely better than before, but not perfect. Far from perfect.

But why did SD write so poorly? I made a mask...

The problem is that the sign is too small. I repeat, the resolution is 512 x 512 pixels, so sign is 280 x 50 pixels and it is very difficult to create something beautiful at such a low resolution. It's time to use Inpaint, which I have already talked about. We select only the sign itself and, at a resolution of 800 x 800 using the same mask with a power of 0.5, generate the sign.

Voila! What a beauty! Now a lot of visitors will definitely come to my coffee bar!

Bottom line: While SD may not be able to generate text on its own, with a little effort, you can achieve anything (except fingers).

P.S. Please let me know what topics or subjects you would like to see me cover more frequently. I will try my best to include them in my future posts!