Guide to Stable Diffusion

This is a secret DRAFT guide to Stable Diffusion.

It’s secret because it’s not yet in a shareable format. It’s mainly a dump of links that must be accompanied by introductory text or that must be digested and summarized.
Also, this guide has to be significantly improved from a structure and design standpoint.

Most of the content below is based on version 1.5 of the Stable Diffusion model. As the community experiments with the new version 2.1 of the Stable Diffusion model, this guide will be updated.
If you are using a different model, some of this guidance might be inapplicable.

Living Document – Last update: Jan 30, 2023

Table of Contents

Latent diffusion models

You don’t need this to learn how to use Stable Diffusion, but this section will give you a much better understanding of why Stable Diffusion works in the way it does.

What are they

What happens behind the scenes

Origins of Stable Diffusion

Stable Diffusion is an AI model developed by Patrick Esser from Runway and Robin Rombach from LMU Munich. The research and code behind Stable Diffusion were open-sourced in 2022. The model was released under the CreativeML Open RAIL-M license.

The full story:
Original paper:

The LAION and LAION-Aesthetics datasets

User Interfaces to the Stable Diffusion model

Below I list software that can be installed locally on any platform (including Apple M-series systems) or that is available via a SaaS model. Software that is available only for Windows, or only for Windows and Linux, is not included in the list.

Dream Studio by Stability.AI (beta)

InvokeAI by Lincoln Stein & team

Stable Diffusion CLI by Stability.AI

Stable Diffusion WebUI by AUTOMATIC1111 & team

Applications of the Stable Diffusion model

Stable Diffusion is famous for generating images from a sentence written in natural language (mainly plain English at the time of writing). This process is called text to image (txt2img). However, the Stable Diffusion model is capable of much more. These are its main applications:

Text to Image (txt2img)

The term txt2img can refer to either a process or a model.
A txt2img process is the act of generating an image starting from a text description (called a prompt).
A txt2img model is a latent diffusion model optimized for the txt2img process.

Image to Image (img2img)

The term img2img only refers to a process.
An img2img process is the act of generating an image starting from a preexisting picture.
At the time of writing, there is no specialized img2img model, however:

Depth-Conditional Stable Diffusion (depth2img)

Alongside the launch of the Stable Diffusion 2.0 model, Stability AI has released a special variant called the depth2img model.

The depth2img model is much better than the standard txt2img model at conditioning the image generation in an img2img process.
The depth2img model retains the structure and shape of the starting picture:


depth2img model vs. txt2img model:

depth2img model vs. depthmap2mask script:

depth2img model vs. inpainting conditioning mask strength:

Best practices

To get as close as possible to the original image, use low denoising: set the “Denoising Strength” to a value between 0.1 and 0.3, and use the Euler sampler.
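A rough way to see why a low denoising strength stays close to the starting picture: in many img2img implementations the strength decides how much of the sampling schedule is actually run. The sketch below assumes the common convention that only steps × strength steps are executed (rounded down). Other interfaces may behave differently, and the function name is hypothetical.

```python
def effective_img2img_steps(num_steps: int, denoising_strength: float) -> int:
    """Approximate number of denoising steps actually run in img2img.

    Assumption: the sampler skips the first (1 - strength) share of the
    schedule, so only int(num_steps * strength) steps are executed.
    """
    if not 0.0 <= denoising_strength <= 1.0:
        raise ValueError("denoising_strength must be between 0 and 1")
    return min(int(num_steps * denoising_strength), num_steps)


# Lower strength means fewer steps, so less of the original is repainted.
for strength in (0.25, 0.5, 1.0):
    print(strength, effective_img2img_steps(40, strength))
```

Under this assumption, a strength of 0.1 to 0.3 repaints only a small share of the schedule, which is consistent with the advice above.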

Rather than replicating an existing picture, a creative way to use the img2img capability is to compose an original image with the help of multiple tools:

Using img2img to cartoonize photos:

Using LEGO to stage complex scenes:


Inpainting

The term inpainting can refer to either a process or a model.
An inpainting process is the act of replacing a portion of a pre-existing image by generating a new version of that portion.
An inpainting model is a latent diffusion model optimized for the inpainting process.
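As a conceptual illustration of "replacing a portion of an image", the sketch below shows only the final compositing idea: a mask selects which pixels come from the newly generated content and which are kept from the original. This is a simplification under my own assumptions; real inpainting models work in latent space and also receive the mask as conditioning. The function name and the toy images are hypothetical.

```python
def composite(original, generated, mask):
    """Blend two same-sized grayscale images given as nested lists.

    A mask value of 1 means "take the generated pixel",
    0 means "keep the original pixel".
    """
    return [
        [g if m else o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


original = [[10, 10, 10], [10, 10, 10]]
generated = [[99, 99, 99], [99, 99, 99]]
mask = [[0, 1, 0], [0, 1, 1]]  # repaint only the masked region

result = composite(original, generated, mask)
print(result)
```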

While it is possible to use a txt2img model for the inpainting process, it’s highly recommended to use one of the following inpainting models:

How to use an inpainting model:

Best practices

Inpainting and high resolutions:

Turn any model into an inpainting model:


A very different way to do inpainting.

Sketch Inpainting


Outpainting

The term outpainting only refers to a process.
An outpainting process is the act of extending a pre-existing image by generating additional portions of the image on a larger canvas.
At the time of writing, there is no specialized outpainting model; however, you should use an inpainting model for the outpainting process:
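One way to see why an inpainting model suits outpainting: extending the canvas can be framed as inpainting the newly added, empty area. The sketch below is a toy illustration under that assumption. It widens an image and builds the matching mask; the function name and the data are hypothetical.

```python
def extend_canvas(image, pad_right, fill=0):
    """Return (canvas, mask) for an image widened by pad_right columns.

    The mask is 1 where new content must be generated and 0 over the
    original pixels, which is the shape of input an inpainting step expects.
    """
    canvas = [row + [fill] * pad_right for row in image]
    mask = [[0] * len(row) + [1] * pad_right for row in image]
    return canvas, mask


image = [[5, 5], [5, 5]]
canvas, mask = extend_canvas(image, pad_right=2)
print(canvas)
print(mask)
```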

Outpainting with AUTOMATIC1111 WebUI:

Other techniques:

Text to Video (txt2vid)

Latent binding method:

Steps Animation extensions:

Text to 3D (txt23D)

Prompt Engineering for Stable Diffusion

Prompt tokenization and its implications


Modifiers Studies

Artistic styles and artists for paintings/illustrations/drawings

Artists Studies

Camera focus and shot types for photos

DOF Simulator

Shot types

Prompt Structure

The prompt structure that has produced the best results for me:
Medium + Context + Subject + Modifiers + Style/Artist

If you are trying to generate an image mimicking the style of a specific artist, the Stable Diffusion model is particularly sensitive to the expression “by artist Artist Name Artist Surname”. The keyword “by artist” seems to have a significant impact on the generation of the image.
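The structure above can be sketched as a small helper that assembles the parts in order. This is only an illustration: the function name, the comma separator, and the example values are my assumptions, not something prescribed by Stable Diffusion.

```python
def build_prompt(medium, context, subject, modifiers=(), artist=None):
    """Assemble a prompt as Medium + Context + Subject + Modifiers + Style/Artist."""
    parts = [medium, context, subject, *modifiers]
    if artist:
        # The "by artist" keyword is the expression discussed in the text.
        parts.append(f"by artist {artist}")
    # Skip empty parts and join the rest with commas (an assumed convention).
    return ", ".join(p.strip() for p in parts if p and p.strip())


prompt = build_prompt(
    medium="oil painting",
    context="in a misty forest at dawn",
    subject="a red fox",
    modifiers=["highly detailed", "soft lighting"],
    artist="Name Surname",  # placeholder, not a real artist
)
print(prompt)
```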

Attention/Emphasis Control (Weighted Prompt)

Negative Prompt

Composed Prompt

Prompt Editing

Prompt Alternating

Prompt Interpolation

Cross Attention Control

Other Tips

Useful browser extensions

Where to find prompts to learn from

Synthetic media search engines


DiffusionDB is the first large-scale dataset containing 14 million Stable Diffusion images and their text prompts and hyperparameters.

The Open Prompts project

CLIP Interrogator

Online versions:

Magic Prompt (aka prompt randomizer)

Parameters of the Stable Diffusion model

Diffusion steps

Classifier-Free Guidance (CFG) scale

While the default setting is 7-7.5 in most user interfaces for Stable Diffusion, and many studies show optimal results at a CFG scale of 14-18, for some reason I get the best results at a CFG scale of 4. Anything above or below that is unsatisfactory most of the time.
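For intuition, classifier-free guidance is usually written as a simple extrapolation: the model predicts noise once without the prompt and once with it, and the scale amplifies the difference between the two. The sketch below applies that formula to toy numbers. The values are made up, and the function name is hypothetical.

```python
def apply_cfg(uncond, cond, scale):
    """Classifier-free guidance on two noise predictions.

    guided = uncond + scale * (cond - uncond)
    A scale of 1 returns the prompt-conditioned prediction unchanged;
    larger values exaggerate whatever the prompt changed.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0, 2.0]  # toy prediction without the prompt
cond = [1.0, 1.0, 0.0]    # toy prediction with the prompt

for scale in (1, 4, 7.5, 14):
    print(scale, apply_cfg(uncond, cond, scale))
```

Because the difference is multiplied by the scale, large values push the prediction far outside the range the model saw in training, which is a plausible reason for the artifacts seen at high settings.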

A high CFG scale results in over-saturated images:

CLIP layers

CLIP (Contrastive Language–Image Pre-training) is an AI model originally developed by OpenAI to learn how text and images relate to each other. Every generative AI tool/product/service designed to use the Stable Diffusion 1.x models uses CLIP to encode your prompt into a numerical representation, which then guides the diffusion model toward the result you want.

The technical paper describing how CLIP works is here:
You don’t have to read it, but it’s useful to understand what comes next.

AUTOMATIC1111 WebUI gives you some control over how CLIP works:

This setting is especially useful when you use diffusion models like the NovelAI one.

With the launch of the Stable Diffusion 2.0 model, Stability AI released a completely open version of the OpenAI CLIP model called OpenCLIP:

Whenever you use a Stable Diffusion 2.x model in a tool/product/service, the program will switch from the OpenAI CLIP to OpenCLIP.

At the time of writing, the CLIP layers setting in AUTOMATIC1111 WebUI has no effect on OpenCLIP.

Samplers of the Stable Diffusion model

Main samplers

Karras variants

  • LMS Karras
  • DPM2 Karras
  • DPM2 a Karras
  • DPM++ 2S a Karras
  • DPM++ 2M Karras
  • DPM++ SDE Karras

Which one is best for what?


Some people use Euler a for a fast test, then switch to DDIM, Heun or DPM2 Karras.

Recommended steps for each sampler:

| Sampler | Recommended Steps |
| ----------------- | ----------------- |
| Euler a | 20-40 |
| Euler | |
| LMS | 50 |
| Heun | |
| DPM2 | |
| DPM2 a | |
| DPM++ 2S a | |
| DPM++ 2M | |
| DPM++ SDE | |
| DPM fast | |
| DPM adaptive | |
| DDIM | min 70 |
| PLMS | |
| LMS Karras | |
| DPM2 Karras | |
| DPM2 a Karras | |
| DPM++ 2S a Karras | max 30 |
| DPM++ 2M Karras | |
| DPM++ SDE Karras | |

Seeds and Determinism

Every time you generate a new image with the Stable Diffusion model, your computer also generates a random number called a seed.

In theory, knowing the seed of a generated image, plus the original prompt that generated it (both positive and negative), plus the sampler and the hyperparameters used to configure it, plus the specific model used, you could recreate the same identical image.
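The "in theory" part can be illustrated with a toy example: the seed initializes a random number generator, and the same seed fed to the same generator always yields the same starting noise. In this sketch Python's standard `random` module stands in for the generator a real pipeline would use, and the function name is hypothetical.

```python
import random


def initial_noise(seed, size=8):
    """Stand-in for the initial latent noise: `size` Gaussian samples."""
    rng = random.Random(seed)  # the seed fully determines the sequence
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


same_a = initial_noise(2193700537)
same_b = initial_noise(2193700537)
other = initial_noise(2193700538)

print(same_a == same_b)  # same seed, same generator
print(same_a == other)   # a different seed gives different noise
```

The guarantee holds only for one generator implementation. A different library or device may turn the same seed into a different sequence.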

In practice, the outcome is sensitive to other aspects of the Stable Diffusion environment, like the GPU used to generate the image. For this reason, even if you know the seed and all the other details associated with an image generated on a Windows system with an NVIDIA GPU, you won’t be able to reproduce the same image on a macOS system with an Apple M-series chip.

This is partially due to the differences between PyTorch and Core ML:

Q: Are the Core ML and PyTorch generated images going to be identical?
A: If desired, the generated images across PyTorch and Core ML can be made approximately identical. However, it is not guaranteed by default. There are several factors that might lead to different images across PyTorch and Core ML:

      Random Number Generator Behavior
      The main source of potentially different results across PyTorch and Core ML is the Random Number Generator (RNG) behavior. PyTorch and Numpy have different sources of randomness. python_coreml_stable_diffusion generally relies on Numpy for RNG (e.g. latents initialization) and StableDiffusion Swift library reproduces this RNG behavior. However, PyTorch-based pipelines such as Hugging Face diffusers relies on PyTorch’s RNG behavior.
      “Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. Furthermore, results may not be reproducible between CPU and GPU executions, even when using identical seeds.” (source).
      Model Function Drift During Conversion
      The difference in outputs across corresponding PyTorch and Core ML models is a potential cause. The signal integrity is tested during the conversion process (enabled via --check-output-correctness argument to python_coreml_stable_diffusion.torch2coreml) and it is verified to be above a minimum PSNR value as tested on random inputs. Note that this is simply a sanity check and does not guarantee this minimum PSNR across all possible inputs. Furthermore, the results are not guaranteed to be identical when executing the same Core ML models across different compute units. This is not expected to be a major source of difference as the sample visual results indicate in this section.
      Weights and Activations Data Type
      When quantizing models from float32 to lower-precision data types such as float16, the generated images are known to vary slightly in semantics even when using the same PyTorch model. Core ML models generated by coremltools have float16 weights and activations by default unless explicitly overridden. This is not expected to be a major source of difference.
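The data-type point in the quote can be made concrete with a small sketch: storing a value as float16 rounds it, so computations that start from the rounded value drift slightly from the full-precision result. The example uses only Python's `struct` module to emulate half precision. The helper name and the sample values are hypothetical.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


weights = [0.1, 0.2, 0.3]
full_precision = sum(weights)
half_precision = to_float16(sum(to_float16(w) for w in weights))

print(full_precision)
print(half_precision)
print(abs(full_precision - half_precision))  # small but non-zero drift
```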



Generating studies and variants

Study generators

X/Y Plot

X/Y/Z Plot

Prompt Matrix

Generating variants

Variation strength and seed

Shifting attention

Alternative img2img test

Fine-tuning of the Stable Diffusion model

During its training phase, a Stable Diffusion model learns about many different concepts of the world. However, given that the amount of training time is not unlimited, the model does not learn every concept about the world.
To teach the model about new concepts, we can further train it, or fine-tune it, about a specific person, object, style, mood, etc.

Unlike the original training phase, fine-tuning doesn’t require exceptional computational resources and can be done on consumer computers.

At the time of writing this guide, these are the most common approaches to fine-tune Stable Diffusion:

  • DreamBooth
  • Every Dream
  • Hypernetworks
  • Aesthetic Gradients
  • Embeddings (via Textual Inversion)
  • LoRA


DreamBooth

A Stable Diffusion model can be fine-tuned locally, via the DreamBooth extension for AUTOMATIC1111 WebUI, or online, via a Jupyter notebook.

The most popular Jupyter notebooks for DreamBooth are:

Best practices

Nitrosocke guide:

General recommendations by HuggingFace:

Specific recommendations for the ShivamShrirao method:

Software Engineering Courses’ Guide:

Dushyant M guide for Stable Diffusion 2.0:

Terrariyum studies:

Dr.Derp’s guide to training:

More guides:

Every Dream


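Hypernetworks are small auxiliary neural networks that adjust the cross-attention layers of the Stable Diffusion model; they are not Recurrent Neural Networks (RNNs).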

Aesthetic Gradients

Embeddings (via Textual Inversion)

How to train new characters with Textual Inversion:

Embeddings with gradient accumulation:



Stable Diffusion models fine-tuned by the community

Thanks to the DreamBooth technique, the community has fine-tuned the Stable Diffusion model on a wide range of styles and concepts.
These are some of the most popular models:


Classic Animation Diffusion

Comic Diffusion

Elden Ring Diffusion

FFXIV Diffusion

Future Diffusion

Ghibli Diffusion

Inkpunk Diffusion

Modern Disney Diffusion

Redshift Diffusion

Robo Diffusion

Waifu Diffusion


AI models assisting Stable Diffusion in other tasks

For detail improvements

Variational Auto Encoder (VAE)

For Face Restoration

I think the faces in those images are too small to be detected if you aren’t upscaling them first. I would recommend upscaling and then restoring faces, in that order.
Send the image to Extras, upscale it, set the visibility of GFPGAN or CodeFormer to max, and tick the box at the bottom that says upscale before restoring faces.



For Enlarging the image

Given that the Stable Diffusion 1.5 model was trained on low-resolution images at 512×512 pixels, any attempt to generate larger images via the height and width parameters produces poor-quality results or results featuring the same subject repeated multiple times to fill the space.
We can use a number of assistive AI models to either extend the dimensions of an image in one or more directions, or to upscale the image as is without losing quality.

High-Resolution Fixing


The full list of upscalers is here:

Locke_Moghan comparison:

Latent Mirroring

-> find pose, capture seed into “variation seed” with the resolution the pose was found at, denoise set to 0.5
-> used the Latent Mirroring extension to help push resolution up (got to 1024×1024 before I got tired of waiting for renders to happen) (set “alternate steps” “vertical” and leave it at 0.25)
-> threw to img2img to try to nudge up the resolution before upscaling (did not succeed this time, partially because video card began complaining about memory)
-> did SD upscale using UniversalUpscalerV2 (the very sharp one, also thanks to whoever mentioned it in the other thread yesterday)



Upscale to huge sizes and add detail with SD Upscale:


Baseline image generated from txt2img is a 1024×1024 image generated using highres-fix (settings in PNG file). No touching up or inpainting was done. [3]
Use the baseline (or generate it yourself) in img2img
Do SD upscale with upscaler using 5×5 (basically 512×512 tilesize, 64 padding) [1]
Send to extras, and upscale (scale 4) with upscaler
8192×8192 image saved as
The upscalers used here are:
Next I generated the 4 combos of and . Then made (192×1080) crops for hair, eyes and lips of the 8192 images (also face but over 4k). [2]
From an objective point of view, it would seem that using for SD upscale and then (aka ) would produce the best native images. However, if you downscale the 8192 image to half size, it appears the was the better result.
Unless you are going to downscale the final result, do not use a sharp upscaler (i.e. LDSR, ESRGAN, ) as the final step.
[1] The SD upscale used conservative settings (low CFG scale, low denoising, 20 steps). I don’t recall whether I used Euler or LMS during this step, but it should not matter.
[2] The eyes aren’t perfect, but that is not important now; it is the skin and hair around them you need to focus on. Likewise for the hair, look for straight lines. For the lips, you want the texture to come out. For face shots, it is about the skin.
[3] Finding a good baseline image probably takes more computing time than anything else (actual 2 step upscale workflow is less than 3 minutes on my 3080). Sometimes you just get one that pops out like this one. I suggest using Euler/LMS at 16 steps with the max batch size your GPU can handle (my case 6 (because nice grid, 8 could work probably too), resulting in about an image a second at 512×512 when you do big batches). This will also be relatively fast if doing highres-fix.
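The sizes in this workflow can be checked with a little arithmetic. The sketch below assumes that the SD upscale step doubled the 1024×1024 baseline to 2048×2048 (the post does not state the factor, but 2048 × 4 matches the 8192×8192 result) and that tiles are laid out with a fixed overlap equal to the padding. The tiling formula is my reading of how such overlapping grids are typically computed, not a documented API, and the function name is hypothetical.

```python
import math


def tile_grid(width, height, tile=512, overlap=64):
    """Number of (columns, rows) of overlapping tiles needed to cover an image."""
    step = tile - overlap  # each additional tile adds this many new pixels
    cols = math.ceil((width - overlap) / step)
    rows = math.ceil((height - overlap) / step)
    return cols, rows


base = 1024              # baseline image size from the post
sd_upscaled = base * 2   # assumption: SD upscale doubled the size
final = sd_upscaled * 4  # "scale 4" in the Extras step

print(tile_grid(sd_upscaled, sd_upscaled))
print(final)
```

Under these assumptions, the grid comes out as 5×5 tiles and the final side length as 8192 pixels, which agrees with the numbers quoted in the post.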

For Tiling

  1. Generate a batch of 512×512’s
  2. Find the one I like and enter its seed into the seed box
  3. Change the resolution to the desired size
  4. Select hi-res fix and set the firstpass width and height to 512

(As a tip, your output resolution doesn’t have to be a square. It will just crop the firstpass image.)





  • Model: Stable Diffusion 2.0
  • Positive prompt:
    {character}, by {artist}, studio lighting, High quality, professional, dramatic, cinematic movie still, very detailed, character art, concept art, subsurface scatter, focused, lens flare, digital art
  • Negative prompt:
    ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, (((body out of frame))), blurry, bad art, bad anatomy, blurred, text, watermark, grainy
  • Sampler: N/A
  • Steps: N/A
  • CFG Scale: N/A
  • Restore Faces: N/A
  • Upscaling: N/A
  • CLIP layers: N/A
  • Example seed : N/A

  • Model: Stable Diffusion 1.5
  • Positive prompt:
    (ultra realistic:1.3) (photorealistic:1.15)
  • Negative prompt:
  • Sampler: N/A
  • Steps: N/A
  • CFG Scale: N/A
  • Restore Faces: N/A
  • Upscaling: N/A
  • CLIP layers: N/A
  • Example seed : N/A

Underwater shots

  • Model: F222 (0.7) + Anything v3 (0.3)
  • Positive prompt:
    portrait of photo realistic a young woman underwater in a swimming pool, summer fashion spaghetti strap dress, dreamy and ethereal, expressive pose, big black eyes, jewel-like eyes, exciting expression, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by guy denning, full body, bokeh, 8k photography, Cannon 85mm, air bubbles
  • Negative prompt:
    (black and white),(frame),(white edge), (text in the lower left corner), (text in the lower right corner), ((((visible hand)))), ((((ugly)))), (((duplicate))), ((morbid)), ((mutilated)), [out of frame], extra fingers, mutated hands, ((poorly drawn hands)), ((poorly drawn face)), (((mutation))), (((deformed))), ((ugly)), blurry, ((bad anatomy)), (((bad proportions))), ((extra limbs)), cloned face, (((disfigured))), out of frame, ugly, extra limbs, (bad anatomy), gross proportions, (malformed limbs), ((missing arms)), ((missing legs)), (((extra arms))), (((extra legs))), mutated hands, (fused fingers), (too many fingers), (((long neck)))
  • Sampler: DPM2
  • Steps: 25
  • CFG Scale: 11
  • Restore Faces: N/A
  • Upscaling: N/A
  • CLIP layers: -2
  • Example seed : 2193700537
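The nested parentheses and brackets in the negative prompt above look like the emphasis syntax of AUTOMATIC1111 WebUI. Assuming that syntax, each pair of parentheses multiplies a term's attention by 1.1, each pair of square brackets divides it by 1.1, and the form (text:1.3) sets the weight explicitly. The sketch below computes the implied weight for a single term. The function name is hypothetical, and it does not reproduce the WebUI's full prompt parser.

```python
def emphasis_weight(token: str) -> float:
    """Attention weight implied by the emphasis markers around one prompt term."""
    token = token.strip()
    # Explicit form "(text:1.3)": the number after the last colon is the weight.
    if token.startswith("(") and token.endswith(")") and ":" in token:
        return float(token[1:-1].rsplit(":", 1)[1])
    weight = 1.0
    # Peel matching outer pairs: "(...)" boosts, "[...]" attenuates.
    while len(token) >= 2 and token[0] + token[-1] in ("()", "[]"):
        weight = weight * 1.1 if token[0] == "(" else weight / 1.1
        token = token[1:-1]
    return round(weight, 4)


for term in ("ugly", "((ugly))", "((((ugly))))", "[out of frame]"):
    print(term, emphasis_weight(term))
```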

Digital Art


At the time of writing, nobody in the community has created a fine-tuned version of the Stable Diffusion model that achieves a consistent Pixar-ification of characters. However, it’s possible to obtain exceptional results with a prompt like this one:

  • Model: Stable Diffusion 1.5
  • Positive prompt:
    Pixar style XYZ, 4k, 8k, unreal engine, octane render photorealistic by cosmicwonder, hdr, photography by cosmicwonder, high definition, symmetrical face, volumetric lighting, dusty haze, photo, high octane render, 24mm, 4k, 24mm, DSLR, high quality, 60 fps, ultra realistic
  • Negative prompt:
  • Sampler: N/A
  • Steps: 50
  • CFG Scale: 7
  • Restore Faces: GFPGAN 10
  • Upscaling: RealESRGAN_x4plus
  • CLIP layers: N/A
  • Example seed : 3018


Pen drawings