What follows is a high-level overview, with anecdotes and musings, of my research into generating images with Artificial Intelligence (AI). Enjoy!
Background #
By January 2024, I had hit a boredom slump. My energy was almost nonexistent from my treatment. After months of shows and video games, I was burned out. I needed a palate cleanser, something else to occupy me. That is when I started investigating this AI stuff that all the kids are talking about today.
There are plenty of ready-to-go Software as a Service (SaaS) options out there, such as OpenAI's DALL-E 3, Midjourney, and the like. However, my feeling is that I donate enough of my data to Google, Facebook, and all the other advertising platforms, so I started investigating offline options. This was a wondrous rabbit-hole to fall into.
Stable Diffusion #
User Interface #
Image-generating AI was the first thing that really grabbed my attention, so I dove into Stable Diffusion. Stable Diffusion itself is just one part of getting text-to-image generation working. You'll also need a user interface for it, such as AUTOMATIC1111, Fooocus, or ComfyUI. AUTOMATIC1111 is the most popular UI. After trying them all out, I stuck with ComfyUI. While it can get pretty complicated, I think the flexibility makes learning with it far more interesting.
Models #
Once you have the UI working, there is another component you’ll need to start generating images: a checkpoint model. Diving into models for Stable Diffusion is like drinking from a fire hose, as there are so many.
Huggingface is the place for AI models, and it currently boasts over 13,000 Stable Diffusion models. With that many to wade through, Huggingface isn't a great way to discover what a given Stable Diffusion model is actually capable of.
WARNING: There is a lot of NSFW stuff on the following sites.
My preference is Civitai. This can be a rabbit-hole to get lost in. 😄 PixAI.art is another site I would use for finding models, but it’s not as good as Civitai, in my opinion.
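If you'd rather see what a checkpoint actually does in code rather than through a UI, here's a minimal sketch using Hugging Face's diffusers library. I used ComfyUI rather than Python for my own images, and the checkpoint ID below is just an assumed example; any Stable Diffusion checkpoint you download will do.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a checkpoint model (weights download on first run).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint ID; swap in your own
    torch_dtype=torch.float16,
).to("cuda")

# Text-to-image: the checkpoint turns the prompt into pixels.
image = pipe("a wizard casting a spell, fantasy illustration").images[0]
image.save("wizard.png")
```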
When looking for models, along with “Checkpoint” you’ll see the term “LoRA”, which stands for Low-Rank Adaptation. These are much smaller models that are used to refine the output of the checkpoint models.
For example, say you write the prompt “Create a landscape of a beach. A German Shepherd with <googly eyes> runs down the beach.” The checkpoint will create the landscape, and the Googly Eyes LoRA will see its keyword and modify the dog to have googly eyes.
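In code terms, a LoRA simply gets layered on top of the base checkpoint. Here's a rough diffusers sketch; the LoRA file name and trigger word are made up for illustration.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed base checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# Layer a LoRA over the base checkpoint.
# "googly-eyes.safetensors" is a made-up file name; point this at a real LoRA.
pipe.load_lora_weights("loras", weight_name="googly-eyes.safetensors")

# The LoRA's trigger word in the prompt activates its effect.
image = pipe(
    "a landscape of a beach, a German Shepherd with googly eyes running down the beach"
).images[0]
image.save("beach_dog.png")
```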
More on prompts further down.
Equipment #
Initially, I tried running Stable Diffusion on large CPU cloud machines. It took five-plus minutes per image. It was a fun exercise to get up and running, but the long creation times made it hard to iterate on ideas. Reading online, I saw someone mention it took them seven seconds to create an image. My ears perked up at this. I immediately started shopping for an NVIDIA GPU.
At the time, NVIDIA GPUs were the only way to get quick iteration. The selection of GPUs you can use has expanded since then.
When my GeForce RTX 4060 Ti was up and running, I was off to the races. It was generating 150x150 images in seconds. With generation times that short, I was able to start iterating on my ideas in Stable Diffusion.
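If you want to confirm your GPU is actually being picked up, and how much memory it has, before blaming your setup for slow generations, a quick PyTorch check like the following does the trick. This is a generic sketch, not part of any particular UI.

```python
import torch

# Quick check that PyTorch can see the GPU and how much VRAM it has.
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Found {name} with {vram_gb:.1f} GB of VRAM")
else:
    print("No CUDA GPU found; generation will fall back to the CPU and be slow")
```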
Ideas #
Wizards #
I am a Dungeons and Dragons nerd. From way back. First Edition in the 1980s way back. With that said, the first thing I did was start generating images of me as a wizard.
It was awesome!
Here are two images with me as a wizard.
You can see how, from the first to the second, the iterations improve the image. If I recall correctly, the second image used a “refiner”, which took the first image and “refined” it with more detail.
Of course, I generated tons of me as a wizard. Here is another I rather enjoyed. 😆
Warhammer 40K #
With the wizard images created, my mind was blown by the options in front of me. I mentioned I was a nerd, right? While I never played Warhammer 40,000, as a youngster I read all the manuals and loved the concept. So, of course, I needed me in Power Armour.
Wondrous #
At this point, I really let loose. Here is where things became wonderful, wild and weird…
I spent too much time on this one. It was a lark, a passing idea, but I really like the way it turned out.

I don’t love this one. It doesn’t really look like me and the Sasquatch looks a bit cartoony. The thing is, I picked this one because it was the most “realistic” Sasquatch I could create. I kept iterating but it wasn’t improving, so I kind of lost interest. I miss hiking in the Pacific Northwet. Erm, Northwest.

This one is my favourite so far. The expression on the monkey’s face, along with my expression, make it for me. It is also a great example of one of the shortcomings of Stable Diffusion: details.
Spotting Generated Images #
If you look at the horses’ hooves, something seems off. They sink too deep into the sand for the horses to be moving at speed.
Oh, and of course there’s the phantom rider’s leg on the third horse. 😆 I suppose I could manually edit it out, but that felt like cheating.
At the time I was working with this, hands were the biggest issue to generate. I know this has improved, so caveat emptor.
If you go back up and look at the wizard images, you can see things wrong with the hands. The first two images are missing an index finger on the right hand, kind of. It’s implied that the finger is behind the lightning, but it doesn’t seem right. Speaking of seeming right, the whole right hand just looks a little off.
What’s in the hand of the red wizard? 🤷
This is a big clue when images are generated. If it seems just a little off, it’s likely the image was machine generated.
A Few Technical Bits #
Masks #
To show a little more of what goes on behind the curtain when making these kinds of images, this is the mask I used to create it. Each colour was assigned a reference image as the input to generate the above monkey image. You can figure out which is which.
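My actual workflow here was built as a ComfyUI node graph, which doesn't translate neatly into a code listing. As a rough analogue, here's what mask-driven generation looks like with diffusers' inpainting pipeline; the file names and checkpoint ID are assumptions, not my real setup.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # assumed inpainting checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# base.png is the starting picture; mask.png is white wherever the model may repaint.
base = Image.open("base.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("RGB").resize((512, 512))

image = pipe(
    prompt="a capuchin monkey sitting on a man's shoulder, photorealistic",
    image=base,
    mask_image=mask,
).images[0]
image.save("composited.png")
```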
Prompts #
Along with the masking and images, positive and negative word prompts were used. This is called prompt engineering, and it’s a whole topic on its own. Unfortunately, I didn’t record the prompts for the above images, so I can’t share them.
Prompt engineering is where I spent the most time iterating. It’s a very strange process: you put your idea into words and see how the AI interprets them. Then you dissect what the AI generated from your input and rewrite your idea into what you think the AI needs.
Ad nauseam.
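For a concrete picture of positive versus negative prompts, here's a small diffusers sketch. The prompts are invented examples, not the ones I used for the images above (those, sadly, are lost).

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint ID
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    # Positive prompt: what you want to see in the image.
    prompt="portrait of a red-robed wizard, detailed face, dramatic lighting",
    # Negative prompt: what you want the model steered away from.
    negative_prompt="blurry, low quality, extra fingers, deformed hands",
).images[0]
image.save("red_wizard.png")
```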
I should also mention that I used a face-swapping tool to put my face on the man running. Whew!
Summary #
After spending time on Stable Diffusion, I drifted into working with Large Language Models (LLMs). That is a whole topic in itself, so I will save that for another day.
I will probably return to working with generative AI in the near future. The field is making great strides in generating video content, so I’m curious to see where that will go.
If you are interested in jumping in, here are a few takeaways.
- This is a high-level overview, so there is a lot I’ve glossed over, such as “Samplers”, “Seed” data, “Steps”, etc. These are all things you’ll figure out quickly once you get going (see the sketch after this list).
- Everything in the AI space changes quickly. Week to week, new things are cropping up. Read blogs/news sites and watch videos to keep up. Discord is where most discussion seems to happen.
- If you don’t care about privacy, go with the SaaS AI options. They are much, much easier to use. Otherwise, get a GPU, one with as much memory as your budget will allow.
- For every single image above, I generated dozens of attempts. It takes a long time.
- Start with AUTOMATIC1111 or Fooocus before moving to ComfyUI.
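Here's the sketch mentioned above, showing where “Samplers”, “Seeds”, and “Steps” plug in once you strip the UI away. Again, this uses diffusers with an assumed checkpoint ID; the specific sampler and values are just examples.

```python
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint ID
    torch_dtype=torch.float16,
).to("cuda")

# "Sampler": the scheduler that decides how noise is removed at each step.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

# "Seed": fixing it makes the same prompt reproduce the same image.
generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    "a wizard in power armour",
    num_inference_steps=30,  # "Steps": more steps means more refinement, but slower
    guidance_scale=7.5,      # how strictly the model follows the prompt
    generator=generator,
).images[0]
image.save("seeded_wizard.png")
```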
That’s it for now! Good luck!