Why everyone should pay attention to Stable Diffusion
Many of the people in my circles hadn’t heard of Stable Diffusion until I told them, and I was already two days late. Heralds of new technologies have a tendency to play up every new thing, however incremental, as the dawn of a new revolution – but in this case, the wolf may be real for once.
Stable Diffusion is an AI tool produced by Stability.ai with help from researchers at the Ludwig Maximilian University of Munich and the Large-scale AI Open Network (LAION). It accepts text or image prompts and converts them into artwork based on, though not necessarily an understanding of, what it ‘sees’ in the input. It created the image below with my prompt “desk in the middle of the ocean vaporwave”. You can create your own here.
But it strayed into gross territory with a different prompt: “beautiful person floating through a colourful nebula”.
Stable Diffusion is like OpenAI’s DALL-E 1/2 and Google’s Imagen and Parti, but with two crucial differences: it’s capable of image-to-image (img2img) generation as well, and it’s open source.
The img2img feature is particularly mind-blowing because it allows users to describe the scene using text and then guide the Stable Diffusion AI with a little bit of their own art. Even a drawing in MS Paint with a few colours will do. And while OpenAI and Google hold their cards very close to their chests, with the latter refusing to release Imagen or Parti even in private betas, Stability.ai has – in keeping with its vision to democratise AI – opened Stable Diffusion for tinkering and augmentation by developers en masse. Even the ways in which Stable Diffusion has been released are important: trained developers can work directly with the code while untrained users can access the model in their browsers, without any code, and start producing images. In fact, you can download and run the underlying model on your own system, though that requires somewhat higher-end hardware. Users have already created ways to plug it into photo-editing software like Photoshop.
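If you’d rather run it locally, the snippet below is a minimal sketch of what that looks like using Hugging Face’s diffusers library, one popular (but not the only) way to load the released weights. The model identifier, argument names and hardware assumptions here are mine and may differ across library versions; downloading the weights also requires accepting the model’s license on Hugging Face.

```python
# A minimal sketch of running Stable Diffusion locally with the diffusers library.
# Assumes a CUDA GPU and that you have accepted the model's license on Hugging Face;
# argument names (e.g. init_image vs image for img2img) vary between versions.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

model_id = "CompVis/stable-diffusion-v1-4"

# Text-to-image: describe the scene and let the model de-noise its way to a picture
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
image = pipe("desk in the middle of the ocean vaporwave").images[0]
image.save("vaporwave_desk.png")

# Image-to-image: a rough sketch (even an MS Paint doodle) guides the output
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
doodle = Image.open("my_doodle.png").convert("RGB").resize((512, 512))
guided = img2img(prompt="desk in the middle of the ocean vaporwave",
                 image=doodle,          # called init_image in older diffusers releases
                 strength=0.75).images[0]
guided.save("guided_desk.png")
```

The strength argument controls how far the model may stray from your doodle: lower values stay closer to what you drew.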
Stable Diffusion uses a diffusion model: a filter (essentially an algorithm) that takes noisy data and progressively de-noises it. In incredibly simple terms, researchers take an image and in a step-wise process add more and more noise to it. Next they feed this noisy image to the filter, which then removes the noise from the image in a similar step-wise process. You can think of the image as a signal, like the images you see on your TV, which receives broadcast signals from a transmitter located somewhere else. These broadcast signals are basically bundles of electromagnetic waves with information encoded into the waves’ properties, like their frequency, amplitude and phase. Sometimes the visuals aren’t clear because some other undesirable signal has become mixed up with the broadcast signal, leading to grainy images on your TV screen. This undesirable information is called noise.
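To make the step-wise noising concrete, here is a toy numpy sketch of the forward half of the process. It is my own illustration, not Stable Diffusion’s code: Gaussian noise is mixed in a little at a time until the ‘signal’ is drowned out, and the model’s training task is to learn to reverse each of those steps.

```python
# Toy illustration of the forward (noising) half of a diffusion process.
# Not Stable Diffusion's code; just numpy, to show the step-wise mixing of noise.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))           # stand-in for a real greyscale image
betas = np.linspace(1e-4, 0.02, 1000)  # how much noise to add at each step

x = image.copy()
for beta in betas:
    noise = rng.normal(size=x.shape)                      # Gaussian noise
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise   # one noising step

# After all the steps, x barely correlates with the original image any more;
# the de-noising filter is trained to undo these steps one at a time.
print(f"correlation with original: {np.corrcoef(image.ravel(), x.ravel())[0, 1]:.3f}")
```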
When the noise waveform resembles that of a bell curve, a.k.a. a Gaussian function, it’s called Gaussian noise. Now, if we know the manner in which noise has been added to the image in each step, we can figure out what the filter needs to do to de-noise the image. Every Gaussian function can be characterised by two parameters, the mean and the variance. Put another way, you can generate different bell-curve-shaped signals by changing the mean and the variance in each case. So the filter effectively only needs to figure out what the mean and the variance of the noise in the input image are, and once it does, it can start de-noising. That is, Stable Diffusion is (partly) the filter here. Its starting point is a noisy image; its output is a de-noised one. So when you supply a text prompt and/or an accompanying ‘seed’ image, Stable Diffusion is essentially showing off how well it has learnt to de-noise, starting from noise and guided by your inputs.
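Written out, the two ideas in this paragraph look like this. This is the standard textbook denoising-diffusion formulation, slightly simplified, rather than anything specific to Stable Diffusion’s release.

```latex
% A Gaussian is fully described by its mean \mu and variance \sigma^2:
\[
  \mathcal{N}(x;\, \mu, \sigma^2) \;=\; \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
\]
% Each noising step mixes in a little Gaussian noise...
\[
  q(x_t \mid x_{t-1}) \;=\; \mathcal{N}\!\left(x_t;\; \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t \mathbf{I}\right)
\]
% ...and each de-noising step is itself a Gaussian whose mean (and variance)
% the trained network learns to predict:
\[
  p_\theta(x_{t-1} \mid x_t) \;=\; \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right)
\]
```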
Obviously, when millions of people use Stable Diffusion, the filter is going to be confronted with too many mean-variance combinations for it to be able to directly predict them. This is where an artificial neural network (ANN) helps. ANNs are data-processing systems set up to loosely mimic the way neurons work in our brain: they combine different pieces of information and manipulate them in light of what they have learnt from older information. The team that built Stable Diffusion trained its model on a dataset of 5.8 billion image-text pairs found around the internet. An ANN is then trained on this dataset to learn how texts and images correlate, as well as how images correlate with other images.
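As a concrete, and hedged, illustration of “learning how texts and images correlate”: models like CLIP, a version of whose text encoder Stable Diffusion uses to read prompts, are trained on such pairs to score how well a caption matches an image. The checkpoint and code below are one public example of this kind of scoring, not Stable Diffusion’s own training code.

```python
# Scoring text-image correlation with a publicly available CLIP model.
# Illustrative only: this is not how Stable Diffusion was trained, but it shows
# what "learning how texts and images correlate" buys you.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), color=(20, 40, 160))   # stand-in for a real photo
texts = ["a deep blue colour field", "a desk in the middle of the ocean"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# Higher scores mean the model thinks the caption and the image "go together" better.
print(outputs.logits_per_image.softmax(dim=1))
```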
To keep this exercise from getting out of hand, each image and text input is broken down into certain components, and the machine is instructed to learn correlations only between these components. Further, the researchers used an ANN model called an autoencoder. Here, the ANN encodes the input in its own representation, using only the information that it has been taught to consider important. This intermediate is called the bottleneck layer. The network then decodes only the information present in this layer to produce the de-noised output. This way, the network also learns which parts of the input matter most. Finally, researchers also guide the ANN by attaching weights to different pieces of information: that is, the system is told that some pieces are to be emphasised more than others, so that it acquires a ‘sense’ of what is less and more desirable.
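Here is a toy autoencoder, only to make the ‘bottleneck layer’ idea tangible. It works on flat vectors rather than images, and it is in no way Stable Diffusion’s actual architecture, which pairs an image autoencoder with a de-noising network; the sizes and names below are arbitrary.

```python
# A toy autoencoder to illustrate the "bottleneck" idea: everything the decoder
# produces has to be rebuilt from the small representation in the middle.
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, input_dim=784, bottleneck_dim=32):
        super().__init__()
        # Encoder: compresses the input into a much smaller representation
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, bottleneck_dim))
        # Decoder: reconstructs the input from the bottleneck alone
        self.decoder = nn.Sequential(nn.Linear(bottleneck_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        z = self.encoder(x)      # the bottleneck layer: only the "important" information survives
        return self.decoder(z)   # the output is rebuilt entirely from z

model = TinyAutoencoder()
x = torch.rand(8, 784)                       # a batch of fake flattened images
loss = nn.functional.mse_loss(model(x), x)   # training would minimise reconstruction error
print(loss.item())
```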
By snacking on all those text-image pairs, the ANN effectively acquires its own basis for deciding, when it’s presented with a new bit of text and/or image, what the mean and the variance might be. Combine this with the filter and you get Stable Diffusion. (I should point out again that this is a very simple explanation and that parts of it may well be simplistic.)
Stable Diffusion also comes with a built-in NSFW filter, a component called the Safety Classifier, which stops the model from producing an output it deems harmful in some way. Will it suffice? Probably not, given the ingenuity of trolls, goblins and other bad-faith actors on the internet. More importantly, it can be turned off, meaning Stable Diffusion can be run without the Safety Classifier to produce deepfakes that are disturbing to various degrees.
Recommended here: Deepfakes for all: Uncensored AI art model prompts ethics questions.
But the problems with Stable Diffusion don’t lie only in the future, immediate or otherwise. As I mentioned earlier, to create the model, Stability.ai & co. fed their machine 5.8 billion text-image pairs scraped from the internet – without the consent of the people who created those texts and images. Because Stability.ai released Stable Diffusion in toto to the public, it has been experimented with by tens of thousands of people, at least, and developers have plugged it into a rapidly growing number of applications. This is to say that even if Stability.ai is forced to pull the software because it didn’t have the license to those text-image pairs, the cat is out of the bag. There’s no going back. A blog post by LAION only says that the pairs were publicly available and that models built on the dataset should thus be restricted to research. Do you think the creeps on 4chan care? Worse yet, the jobs of the very people who created those text-image pairs are now threatened by Stable Diffusion, which can – with some practice to get your prompts right – produce exactly what you need, no illustrator or photographer required.
Recommended here: Stable Diffusion is a really big deal.
The third interesting thing about Stable Diffusion, after its img2img feature + “deepfakes for all” promise and the questionable legality of its input data, is the license under which Stability.ai has released it. AI analyst Alberto Romero wrote that “a state-of-the-art AI model” like Stable Diffusion “available for everyone through a safety-centric open-source license is unheard of”. This is the CreativeML Open RAIL-M license. Its preamble says, “We believe in the intersection between open and responsible AI development; thus, this License aims to strike a balance between both in order to enable responsible open-science in the field of AI.” Attachment A of the license spells out the restrictions – that is, what you can’t do if you agree to use Stable Diffusion according to the terms of the license (quoted verbatim):
“You agree not to use the Model or Derivatives of the Model:
- In any way that violates any applicable national, federal, state, local or international law or regulation;
- For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
- To generate or disseminate verifiably false information and/or content with the purpose of harming others;
- To generate or disseminate personal identifiable information that can be used to harm an individual;
- To defame, disparage or otherwise harass others;
- For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation;
- For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics;
- To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
- For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories;
- To provide medical advice and medical results interpretation;
- To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment (e.g. by text profiling, drawing causal relationships between assertions made in documents, indiscriminate and arbitrarily-targeted use).”
These restrictions place a heavy burden on law-enforcement agencies around the world, and I don’t think Stability.ai took the corresponding stakeholders into confidence before releasing Stable Diffusion. It should also go without saying that because the license chooses to colour within the lines of each country’s laws, a country that doesn’t recognise X as a crime will also fail to recognise the harm in the harassment of victims of X – now with the help of Stable Diffusion. And the vast majority of these victims are women and children, already disempowered by economic, social and political inequities. Is Stability.ai going to deal with these people and their problems? I think not. But as I said, the cat’s already out of the bag.