Black Forest Labs on FLUX
A16Z did a podcast with Black Forest Labs on their new image generation AI model: FLUX.
Possibly, their first. Highly interesting because it’s full of insights about the decisions behind FLUX, how it’s different from Stable Diffusion, and where they are focused next.
Transcription
Welcome to the A16z AI Podcast. I’m Derrick Harris.
This week, we have a very interesting discussion between A16z general partner Anjney Midha and the co-founders of a new generative AI model startup called Black Forest Labs, which they recorded live and in person in, as the company’s name might suggest, Germany.
The founding team, Robin Rombach, Patrick Esser, and Andreas Blattmann, drove the research behind the Stable Diffusion models, and recently started Black Forest Labs to push the envelope of image and video models, and to help keep the open research torch lit.
In addition to discussing their new family of models, called FLUX, Robin, Andreas, Patrick and Anjney also went into the transition from research to product, and then from building products to starting a company.
In addition, they address the benefits of open research in AI and why it’s important to learn from the greater community, rather than develop behind closed doors.
But before we get started, here are some brief introductions from each of them to help you associate their voices with their names. First, Robin.
I’m Robin, co-founder of Black Forest Labs. We’re focusing on making image and video models as widely available as possible.
Then Patrick.
Patrick Esser. I’m one of the co-founders of Black Forest Labs. I’ve been working in this area for a while, started at the university.
I got excited when I saw the possibility that we can actually teach computers to create the images.
Finally, Andreas.
Hi, my name is Andreas.
I’m amongst the co-founders of Black Forest Labs. A couple of years ago, I started with those two guys working on image and then later video generation.
So you guys along with Dominik were four of the co-authors on Stable Diffusion. Why don’t we go back all the way to the origin story? Where did you guys all meet?
Yeah, we met at the University of Heidelberg, where we all did our PhDs or tried to get our PhDs. And I actually met Andreas there, who’s from the Black Forest, just like me, from the village next door, basically. We didn’t know each other before, but then we met in Heidelberg during the PhD.
And it was a really nice time. We did a bunch of, I would say, pretty impactful works together. We started with normalizing flows, actually.
Tried to make them as good as possible, which was hard, which would probably still be hard. Then switched to autoregressive models, did this work that’s called VQGAN. And then later on, after this DDPM paper, which really showed that diffusion models can generate nice images, we also looked into that and applied the same mechanism, the same formalism that we were working on before.
This is just latent generative modeling technology, where the basic assumption is that when you want to generate media like images or videos, there’s a lot of redundancy in the data. That’s something that you can basically compress away: you map the data into a lower-dimensional latent space, and then actually train the generative model, which can be a normalizing flow, an autoregressive model or a diffusion model, on that latent space, which is computationally much more efficient. We did that then basically with latent diffusion.
We did a bunch of tweaks to the architecture, introduced a text-conditional UNet, and were amongst the first to do text-to-image generation with diffusion models.
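To make the efficiency argument concrete, here is a small back-of-the-envelope sketch. The shapes are assumptions for illustration (a 512x512 RGB image and an autoencoder that downsamples each side by 8 into 4 latent channels, as in the first Stable Diffusion releases); the podcast itself quotes no numbers, and the helper names are hypothetical.

```python
def num_values(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels


def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the latent produced by an autoencoder (assumed factors)."""
    return height // downsample, width // downsample, latent_channels


pixel_values = num_values(512, 512, 3)
lat_h, lat_w, lat_c = latent_shape(512, 512)
latent_values = num_values(lat_h, lat_w, lat_c)

# The generative model only has to model the latent, not the pixels.
compression = pixel_values / latent_values
print(f"pixels: {pixel_values}, latent: {latent_values}, ratio: {compression}")
```

Under these assumptions the generative model works on a tensor 48 times smaller than the image, which is where the compute savings come from.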
If you guys think back to that moment in time, when it wasn’t obvious maybe that diffusion models would be so good at various kinds of modalities, image generation, video generation, audio generation. That’s more clear today, but was it as clear back then?
What were the biggest debates you guys were having as a group back then?
I don’t know. Refusing comments from RPI back then.
Oh, interesting. It wasn’t clear to more senior academics at the time. I don’t think so.
But that might be my personal perception.
Why is that, do you think? Why was the general reaction from more established researchers and academics so unaccepting?
I think it has something to do with what Patrick mentioned earlier, just the fact that if you look at it from very far away, you are just training an autoencoder and then training your generative model in that latent space. That’s a very simplified view of it, because that’s not the entire story. The reason why the stuff is really working and producing crisp images is that when you train the autoencoder, we had to introduce this adversarial component to it, which makes it look really crisp, like natural images and not blurry like the stuff was before.
This shares a very similar motivation to why diffusion models in the DDPM paper originally worked. You focus on the perceptually relevant stuff when you want to generate, but you discard certain perceptually irrelevant features. I think we also had to develop this mindset or theory around our intuition while we actually worked on it.
This wasn’t the motivation from the beginning, but now in retrospect, I think it just makes a lot of sense. I think that might be one of the reasons why it was constantly challenged: why would you work on this? Why would you do another latent approach again?
Now with the diffusion model, I think we had to debate ourselves.
Yeah, I was worried if we can do another one of those.
That’s where you see where the limits of research are. You have to propose something novel. If it just works better and it’s not clear to everyone that it’s novel, then it will be questioned in some form.
But as opposed to that, if you’re building a business, you just focus on what works. That kind of novelty is not as important anymore. You just use what works.
That’s why starting a business is actually also a really nice experience.
Even before you guys got to starting a business, if you just think about the difference between research and product and just building tools that people can use outside of a paper, what may have seemed not novel to you while you were in the research community was actually extraordinarily novel to creators and developers around the world. And it wasn’t really until you guys put out Stable Diffusion a few years later that that may have become clear to the research community.
Is that right or is that the wrong framework?
No, I think that’s exactly right. There is a nice intermediate step between doing research and doing business, which is working on models that are being used in an open source context, because then everybody will use your models.
Right.
And we had that experience pretty early, because we were just used to making all models available all the time. One of the first ones that got pretty popular was this VQGAN decoder. Because it actually achieved pretty realistic image textures, it was used in combination with this text-to-image optimization procedure, where people used it together with CLIP and optimized the image to match the text prompt. And because we had put out the model and it was used in this context by lots of people, that was one of these moments where we realized, okay, you actually have to make something that works in general.
And then I think it’s a nice intermediate step, because if you want your models to be used in this wide context, then you just have to make sure that they work in a lot of edge cases.
Let’s spend a little bit of time on that moment in history, because it really has had an incredible impact on lots of communities, well beyond academia and research. So we’re going to play a quick guessing game. August 2022, you guys put out Stable Diffusion v1.4.
Just to give people a sense of the scale of Stable Diffusion and the impact that it had: can you guys guess how many downloads the model had a month after launch?
I said before, I hate guessing.
Yeah, why don’t you go first?
120k?
120,000. Patrick?
A million, but we don’t know how the downloads are counted.
Oh, fair enough. These are estimates from the Hugging Face repo. So 120k, a million, Robin?
2 million.
2 million, okay. In its first month, Stable Diffusion v1.4 was downloaded 10 million times.
Holy shit.
Today, Stable Diffusion has had more than 330 million downloads since you guys put it out in the summer of 2022. Stable Diffusion basically changed the world. It’s now one of the three most used AI systems in history, which is incredible to think about.
It’s also incredible to think about the fact that you guys are just getting started. So why don’t we talk about, that was the past. Now let’s talk a little bit about the present.
So you put out Stable Diffusion, you see the incredible reception from the community, the sheer scale of usage, the kinds of usage, the things people are doing with it. What would you say the top three things are that have surprised each of you?
One thing that comes to my mind is just, in general, this massive exploration that you get by having so many people use it. And one of the first things that surprised me, I think, was the use of negative prompting. It again goes back to CFG, classifier-free guidance, but it’s a slight variation of that,
which I think we never really explored. And then you saw that people actually got really improved results with it. It was like, oh nice, such a nice quick find that we might never have discovered on our own.
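As a rough illustration of why negative prompting is only a slight variation of classifier-free guidance, here is a minimal sketch. The three toy vectors stand in for a model's predictions and are made up for the example; `guided_prediction` is a hypothetical helper, not code from Stable Diffusion.

```python
def guided_prediction(pred_cond, pred_base, scale):
    """Classifier-free guidance: start at the base prediction and
    extrapolate towards the prompt-conditioned one."""
    return [b + scale * (c - b) for c, b in zip(pred_cond, pred_base)]


# Toy stand-ins for model outputs (assumed values, not real predictions).
pred_prompt = [1.0, 0.0, 0.5]    # conditioned on the user's prompt
pred_empty = [0.5, 0.5, 0.5]     # conditioned on the empty prompt
pred_negative = [0.0, 1.0, 0.5]  # conditioned on the negative prompt

# Standard CFG pushes away from the empty-prompt prediction.
standard_cfg = guided_prediction(pred_prompt, pred_empty, scale=3.0)

# Negative prompting swaps the empty prompt for the negative prompt,
# so the same formula now pushes away from what the user does not want.
negative_cfg = guided_prediction(pred_prompt, pred_negative, scale=3.0)

print(standard_cfg)
print(negative_cfg)
```

The only change between the two calls is which prediction plays the role of the baseline, which is why the trick was easy for the community to discover once the weights were available.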
Right.
Yeah. I remember how like after the release, I went on vacation for two weeks in Sweden. And I had this, there were a bunch of papers, where I was like super curious to try it out.
And after I came back, it was all implemented already.
Do you think that was primarily because you guys chose to release it as open source?
Exactly, because it was available, because the base quality was sufficiently good to explore all of these downstream applications.
So let’s spend a minute on that, because in the language world, arguably as the impact of language models has become more and more clear, the visibility and the transparency with which researchers in language talk about their research breakthroughs has decreased. The vast majority of leading labs today don’t publish their insights until much later.
They don’t really publish their findings.
In contrast, in the generative image and video model community, you guys have chosen to continue open sourcing or publishing your research and transparently talking about it.
Do you think that was a deliberate decision from you guys, or was that just an artifact of something else?
So I think seeing what you get back from the community in terms of ideas, and which you of course can then incorporate into your next iterations, that is so nice and so helpful. So I think it’s definitely a personally important thing for us to keep doing that, to keep giving the community something they can build on. And it’s also of course, as we already said, extremely insightful and fun to just see what they come up with.
On the other hand, especially for the AI space, we’ve of course also seen companies following that approach struggling to make real revenue with it, and just getting into trouble in many ways. So yeah, I think one thing everyone should keep in mind, who is interested in models which are openly available, is that there needs to be a kind of equilibrium between the people who are using the models, the open models, and the people who are putting them out, which would be us. So we have to make sure we are sustaining as a business also, of course.
And so now it’s been a couple of years since that first Stable Diffusion release, where you put out v1.4 and saw the community do a bunch of exploration. That gave you the ability to decide which parts of what the community is working on you wanted to double down on, improve the quality of and so on, and then release the next version. And you’ve done that a few times over a family of models now.
There was SD v1.4, then there was SDXL, then SD3. What would you say is the biggest takeaway for you, having gone through that journey a few times of releasing open weights and seeing what the community does with it?
I think there’s always this possibility that you integrate findings, at least pure research findings, back into your models. But then on the other hand, one thing that we also had to learn is just scaling the infrastructure that we need for training. This is typically something that is really not being talked about that much, and this is really where you can distinguish yourself, I think.
Just training a better base model requires massive commitments in how you design your training pipeline, right? There’s the data pre-processing in all its different forms, data filtering, of course; then the training algorithm itself has to support large clusters; and all these different things which are not directly being done in the community, but which are super important if you want to make a good base model. And right now we’re in this phase where we’re also scaling up our models massively.
Okay, so that brings us to present day. You guys are the co-founders of Black Forest Labs. What is Black Forest Labs?
We are a startup that focuses on image and video generation with latent generative modeling. We are a research team that has been working together for more than one year now. And I think we’re, as Robin already said, really specialized in building very specific training pipelines for these latent generative foundation models.
And I think that is where our team really is unique in terms of capabilities, because we just managed to optimize all parts of our pipeline to an extent, which I think is pretty much outstanding currently.
And what is Black Forest’s mission or North Star?
I think it’s to make the best models available as much as possible. That this really becomes a new way to generate content, that this is available widely for everyone, and that we also figure out how to actually continue this mission of still sharing research findings openly and also the models. But yeah, I think part of our goal is to make that a sustainable thing to keep going.
And yeah, as your first release, you guys put out FLUX, which is Black Forest’s first image model. What is FLUX and what does it do?
FLUX is a diffusion transformer. It’s a latent diffusion model. Actually, it’s a latent flow model since we’ve recently switched to this more general formalism that’s called flow matching.
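For readers unfamiliar with flow matching, the training objective can be sketched in a few lines. This uses the straight-line (rectified flow) path between data and noise, which is one common instance of the formalism; that choice, and all function names, are assumptions for illustration, since the podcast does not say which path or parameterization FLUX uses.

```python
import random


def interpolate(data, noise, t):
    """Point on the straight line from data (t=0) to noise (t=1)."""
    return [(1.0 - t) * d + t * n for d, n in zip(data, noise)]


def target_velocity(data, noise):
    """Velocity of the straight-line path; constant in t."""
    return [n - d for d, n in zip(data, noise)]


def flow_matching_loss(predict_velocity, data, noise, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(data, noise, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(data, noise)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(data)


rng = random.Random(0)
data = [0.2, -0.7, 1.5]                    # stand-in for a latent
noise = [rng.gauss(0.0, 1.0) for _ in data]
t = rng.random()


def oracle(x_t, t):
    # Cheats by knowing the exact (data, noise) pair; reaches zero loss.
    return target_velocity(data, noise)


def zero_model(x_t, t):
    # Untrained stand-in that always predicts zero velocity.
    return [0.0] * len(x_t)


print(flow_matching_loss(oracle, data, noise, t))
print(flow_matching_loss(zero_model, data, noise, t))
```

A real model would be a neural network trained to drive this loss down over many sampled `(data, noise, t)` triples; sampling then integrates the learned velocity from noise back to data.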
And this model, it improves, we think, a bunch of things over previous models. It uses a better form of positional embedding, which contributes to the better structure we have in the generations. It’s called RoPE, pretty popular among language models.
But yeah, we incorporated this into image generation. It uses a more hardware-efficient implementation. We introduced these, we call them fused DiT blocks, also actually motivated by findings in the scaling of transformers.
I think it’s actually from the vision community. I think it was the paper on scaling ViTs to 22 billion parameters that was published by Google. They had that.
So we did that. There’s a bunch of things around scaling, something that we also explored in the SD3 paper actually, which is called QK-normalization, also important for training larger models. Did I forget something about the architecture?
I think we also have an optimized noise schedule or noise sampling during training, which we further improved compared to SD3. I think that would be the main point, Patrick.
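Two of the ingredients named above, rotary positional embeddings (RoPE) and QK-normalization, are easy to show on toy vectors. The sketch below is a one-dimensional, pure-Python illustration under simplifying assumptions: image models typically apply RoPE per spatial axis, and the normalization here is a plain RMS normalization without a learnable scale. It is not taken from the FLUX code.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `position`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # one frequency per pair
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * cos_t - y * sin_t, x * sin_t + y * cos_t]
    return out


def rms_normalize(vec, eps=1e-6):
    """Scale `vec` so its root-mean-square is (almost exactly) 1."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 2.0]
k = [1.1, 0.4, -0.5, 0.9]

# RoPE: the attention score depends only on the relative offset (3 here),
# not on the absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))

# QK-normalization: even if activations blow up, the logit stays bounded.
big_q = [1000.0 * v for v in q]
raw_logit = dot(big_q, k)
normed_logit = dot(rms_normalize(big_q), rms_normalize(k))

print(score_near, score_far)
print(raw_logit, normed_logit)
```

The bounded logit is the reason this kind of normalization helps when training larger models: attention scores can no longer grow without limit as the query and key norms drift.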
Yeah, I think it’s also important to note that we are really, I would say, in the first round of experiments of putting out different variants of the model that come with different licenses. We offer different variants ranging from very permissive licenses to other models that are not completely freely available, which we also want to offer for customers that have more specific needs in the near future, and also customize towards more specialized applications.
Who are you hoping will use those different variants? And what’s the biggest difference between these three?
So they are differentiated in terms of inference efficiency. We think the model that is the most open, the fastest variant, is very developer-friendly, just by the fact that you can generate samples in one to four steps, compared to the 30 to 100 you would usually use. And given that the model is also quite large compared to previous models, we think that’s an important feature.
So just to recap, your most open model is FLUX Schnell, which is a descriptor for how fast it is, and it’s an open-weight model. It’s available to the entire community under a very permissive license. And your hope is that developers will use the fastest model to do what?
To include it in workflows that include image generation, include all different kinds of synthesis. We’ve seen this with existing models in the past, like SDXL, being included. Really, you can look up crazy ComfyUI workflows that have this model in the pipeline.
Of course, we think because the model itself is just fundamentally better than the models before, that you don’t need most of these somewhat complex workflows. But I can very well imagine that because the model is efficient, that you can plug it in and develop nice workflows around it.
Hopefully, we also see a lot of exploration around applications that are popular. Based on that, I guess we can really gather feedback on what is actually holding those applications back further, and then we can specialize and double down on those.
And if I’m an application developer building on top of Black Forest’s models, if I’m choosing the fastest model, what am I trading off?
You’re not necessarily only trading off things. It’s also the advantage of having speed. I guess one of the biggest issues, though, is that all the hosting is on you.
You need to have the hardware to run it, right? Especially if you want to scale it. And that’s something where we also offer solutions, so that this actually doesn’t become a bottleneck to exploring applications.
The other one is a bit in terms of flexibility, because in order to make this very fast, it is a distilled version, which samples in a few steps. But there are some techniques that only become possible because of the nature of a diffusion model, that it samples in multiple steps: you can adjust things along the way in that sampling process, and those are then not necessarily possible directly with the Schnell model.
So maybe one could put it like this, if you want to quickly test something, quickly try something, if things make sense in general for you, use a Schnell model. If you have a more specialized application, which is targeted to a certain goal, you might use one of the slower, but more flexible models. I think you could describe it as different levels of distillation, which has been applied to these models.
The largest model is a purely undistilled base model, which offers all the flexibility that comes from that flow matching training procedure which we’re applying. But of course, you trade that off versus generation speed.
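The speed versus flexibility trade-off can be illustrated with a toy sampler. The sketch below integrates a made-up velocity field with fixed-size Euler steps and exposes a per-step hook; the field, the hook and all names are hypothetical stand-ins, not the FLUX sampler. Distillation, which trains a model so that very few steps are already accurate, is not shown.

```python
import math


def euler_sample(velocity, x_start, num_steps, on_step=None):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = x_start
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
        if on_step is not None:
            # Hook: a place to adjust the sample in the middle of sampling.
            x = on_step(x, t + dt)
    return x


def toy_velocity(x, t):
    # Curved toy trajectory with a known exact solution x(1) = exp(-1).
    return -x


exact = math.exp(-1.0)
few_steps = euler_sample(toy_velocity, 1.0, num_steps=4)
many_steps = euler_sample(toy_velocity, 1.0, num_steps=50)

hook_times = []


def record(x, t):
    hook_times.append(t)
    return x  # a real hook could modify x here


euler_sample(toy_velocity, 1.0, num_steps=4, on_step=record)

print(abs(few_steps - exact), abs(many_steps - exact))
print(hook_times)
```

With an undistilled model, more steps track the trajectory more closely and give more opportunities to intervene; a few-step model gives up most of those intervention points in exchange for speed.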
A pretty controversial decision you’ve made relative to when you compare with how a lot of other labs in the space are putting out state-of-the-art image models is, you guys chose to make one of these models extremely permissive and open weight. Why is that?
Why was it important to you guys to continue putting out models with fully open-weight licenses?
We benefit a lot also from findings from all the research that is being published. Also with other tools: we rely a lot on PyTorch, just as one example. A lot of these things just wouldn’t be possible if we all just completely isolated our findings.
So I think that’s in general really the important part, that we do still share research findings and make it possible to experiment with the new technologies. For the open-weights topic, I think it’s really important that you not only have the research findings written down and published, which is also super helpful, but that you really enable a much wider audience to actually experiment with that technology, and for that it actually has to be available.
And yes, as Patrick has said earlier, we have pretty deep roots in open source and we want to continue to do this. And I do think that there’s this huge debate around safety in the context of deep learning models. I do really think that making weights available makes it ultimately down the line much safer.
So I think that’s just like another aspect to the open sourcing, having this community effort focusing on downsides of the model, on stuff that you need to improve, in contrast to something where you just develop it on your own.
Yeah, one of the things that Stable Diffusion really did for a lot of users was that the base model you guys put out was extremely flexible. It was a fairly honest model. There weren’t too many of your own post-training biases or censorship decisions that you put on it before giving it out to the community.
And you continued that with this release. Why is that important in your mind?
Because I think down the line, it improves the models that everyone is producing. I think it’s this fair exchange of arguments that are based on research that you do with these specific model weights. There were biases in the first versions of Stable Diffusion that were introduced by the training data itself, and I really don’t like them.
So it’s good that there was research around this which could point them out.
And that, by the way, wouldn’t have been possible without putting out that model.
Because without Stable Diffusion, maybe the community would today not know that these biases in the data sets exist.
And by now, we know how to remove that. So we’ve had huge learnings from that, actually. And that is a perfect example of how open models in general are very useful to improve the general space or the general state of the art.
You’re saying when you put out open weight models, that allows other researchers to actually contribute to the transparency of these models, understand the systems more deeply, and then ultimately help improve them by having way more people actually be able to analyze what the models can and can’t do, inherent biases that might not otherwise be as easily discovered if it was a closed source model.
Exactly.
One common perception about open source or open weight models is that the researchers and developers open sourcing these models don’t care a lot about safety or about mitigating some of these risks. Is that true?
Was there anything you guys did before you released Flux as an open source model that you think could address some of these misinformation risks?
Yeah.
We are looking into methods that watermark the content that is being generated, but in a way that you don’t see in the output. But then another algorithm could detect if that image or video was made by our neural network.
Yeah. I think that’s a good point. I think it also goes in the direction of a maybe more healthy approach towards this: for example, tracking this and identifying misinformation becomes possible without limiting the technology for other uses that might actually be beneficial.
Another point, coming to the point of watermarking: this is obviously a really challenging task, because you can apply so many distortions to the generated and watermarked images to just break these watermarking procedures. But also there, if you have an open model, there will be jailbreaks, but there will be ways to mitigate those jailbreaks. This is what we see in many other research fields.
If you think about cryptography or something, there it’s basically similar. You just improve your algorithms, then you have some people who jailbreak it and then you improve further. No one certainly doubts that cryptography is really important for everything like we have on the web and whenever exchanging information.
And there it’s just similar. And no one debates about whether open research is good or not. I’m wondering why that is the case for these AI infrastructure models where it’s like effectively the same, I would say.
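The watermarking idea, and why distortions make it hard, can be shown with a deliberately naive toy. The least-significant-bit scheme below is an assumption chosen for simplicity; the podcast does not say which watermarking method Black Forest Labs is exploring, and practical schemes are designed to be far more robust than this.

```python
def embed(pixels, bits):
    """Hide one bit in the least significant bit of each pixel value."""
    return [(p & ~1) | b for p, b in zip(pixels, bits)]


def detect(pixels, bits):
    """Fraction of pixels whose lowest bit matches the expected pattern."""
    matches = sum((p & 1) == b for p, b in zip(pixels, bits))
    return matches / len(bits)


pixels = [200, 13, 77, 154, 31, 90, 255, 0]  # toy 8-bit grayscale values
key = [1, 0, 1, 1, 0, 0, 1, 0]               # hypothetical secret pattern

marked = embed(pixels, key)

# The mark is invisible in practice: no pixel changes by more than 1.
max_change = max(abs(a - b) for a, b in zip(pixels, marked))

# Coarse quantization, standing in for compression or other distortions,
# wipes out the lowest bits and with them the watermark.
distorted = [(p // 4) * 4 for p in marked]

print(detect(marked, key))
print(detect(distorted, key))
```

After the distortion the detector is back at chance level for this key, which is the cat-and-mouse dynamic the cryptography comparison refers to.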
And you did also share in your launch blog post that you’re working on a video model. And when you were starting to work on this video model, what were some of the most important capabilities that you guys wanted to tackle?
So I think one of the learnings from what we saw with current powerful video models is that although they are really nice, generating really nice and detailed videos, they are still not controllable enough in many respects to be really useful for professionals, for people who want to seriously include that into their professional pipelines.
And when you say controllability, what do you mean?
There are different kinds of challenges, the first of which is the general level of prompt following. Most of these models which you see right now are based on text inputs, but unlike for images, where the prompt adherence is currently much better, for video it’s still unclear how to temporally prompt the model such that it accurately follows your temporal instructions.
So that is one of the main challenges. Another one is consistency of certain objects or characters between different cuts. A movie maker might want to have a cut and still be able to generate the same person, wearing the same clothing, or having the same backgrounds and stuff.
Maybe from another view angle, but still the same setting.
Then we think that this is one of the nice features of this new model, that we can actually control it not only through text, but we can actually say, okay, let’s do a cut here and here. A character or whatever you have in your prompt, it can be a bottle or whatever, remains consistent across these different cuts that are being generated, like within a single generation.
Relative to the last video model that you guys worked on, which I believe was Stable Video Diffusion last fall, would you say the biggest improvement in Black Forest’s first video model is that it’s much more controllable?
Not only that, it’s also much more efficient. The latent space is more or less 16 times more efficient, which is really, I think, good, while keeping the general video quality, visual quality. Also, related to that, we can generate much longer videos. And I think a main issue with Stable Video Diffusion was that it was mainly generating static scenes.
Our current model has a lot of motion, very interesting motion, a very broad range of motions, from slow-mo to fast footage and shaky camera. Yeah, I think the motion distribution that the model is able to generate is heavily improved compared to SVD.
This was a common problem with a whole generation of models that were ostensibly video models, but when you would actually try to run any kind of interesting inference on them, they would often produce static camera pans or just zooms, and they weren’t actually simulating the world that the image or the video was of. What did you have to do to crack that problem?
I think one of the major improvements is around this temporal compression that Andreas mentioned, which is also used by other new models. We think that this is one of the fundamental improvements, and why we see much better video models nowadays than we had nine months ago.
It also comes down to a lot of data filtering and preparation improvements. I think it’s actually nice in general, because there, for example, we used just very, I would say, classical computer vision techniques to filter out the worst parts that introduced this undesirable behavior. I think it’s neat to see that existing techniques can also be really helpful, even though the technique we apply probably has a very high error rate.
But if you do this in the pre-training stage, just getting the rough idea right already improves the base model so much more than one would expect just from the numbers. That seemed also very effective.
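As an example of what a crude, classical filter for static clips could look like, here is a sketch based on frame differencing. This specific technique, the threshold and the function names are assumptions for illustration; the podcast only says that classical computer vision techniques were used, not which ones.

```python
def motion_score(frames):
    """Mean absolute pixel change between consecutive grayscale frames."""
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for p, c in zip(row_prev, row_curr):
                total += abs(c - p)
                count += 1
    return total / count if count else 0.0


def keep_clip(frames, min_motion=1.0):
    """Keep a clip for training only if it moves enough (assumed threshold)."""
    return motion_score(frames) >= min_motion


# Toy 2x2 clips: one frozen, one with a bright pixel moving around.
static_clip = [[[10, 10], [10, 10]]] * 3
moving_clip = [
    [[0, 0], [0, 0]],
    [[8, 0], [0, 0]],
    [[0, 8], [0, 0]],
]

print(motion_score(static_clip), keep_clip(static_clip))
print(motion_score(moving_clip), keep_clip(moving_clip))
```

A filter like this misclassifies plenty of clips, for instance slow but meaningful motion, which matches the point made above: a noisy signal applied at pre-training scale can still shift the data distribution in the right direction.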
So if you had to contrast the data preparation, pre-computing, pre-training, post-training, fine-tuning, and ultimately inference optimization parts of the entire journey of actually building and releasing a model, what’s changed the most in how you approached it this time versus, let’s say, a year ago?
I would say there were tremendous differences in the data pre-processing and in the pre-training stage already, which led to some of the fundamentally different behaviors we see now for our video model compared to previous video models.
Another thing is that we really made time a first-class citizen. Before, I think a lot of people were always using a factorized modeling approach there, where you treat space and time differently. And now, in the new models that are coming, it’s just treating all of those the same and letting the model, the transformer, actually figure out how to deal with the differences.
That is by the way, yeah.
And by the way, the generality of the transformer as an architecture is really helpful here, because the image model, as I already mentioned, is the base for the video model. When doing that transition, we didn’t have to change the architecture at all, because of that very useful generality of the transformer architecture.
We actually have a little placeholder in the image model, like in the positional embeddings that we added before we even started the image model training, that would later incorporate the temporal positional embedding.
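One way to picture such a placeholder is to give every latent token a three-part position, (time, row, column), and keep the time index at zero for images. The layout and the `position_ids` helper below are hypothetical, meant only to show how an image model can reserve room for a temporal axis; the actual FLUX implementation is not described in the podcast.

```python
def position_ids(height, width, num_frames=1):
    """Return one (t, y, x) index triple per latent token."""
    return [
        (t, y, x)
        for t in range(num_frames)
        for y in range(height)
        for x in range(width)
    ]


# An image is treated as a single-frame video: the time slot exists,
# but it is always 0 and therefore acts as a placeholder.
image_ids = position_ids(height=2, width=3)

# A video fills in the same slot without changing the token format.
video_ids = position_ids(height=2, width=3, num_frames=4)

print(image_ids)
print(len(video_ids))
```

Because the token format is identical in both cases, the same transformer can consume image and video positions, which fits the remark that the architecture did not have to change for the video model.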
What gave you the conviction to do that?
Because we already had the plan: the video model was the goal from the beginning, but we knew it’s always a good start to train the image model first. But yeah, into that image model, we then already incorporated design decisions that were informed by the goal of developing the video model.
It sounds like a fundamental assumption you guys made was that a really great image model would be strictly helpful to a great video model. Is that true?
Not 100 percent sure if I would phrase it this way, but image data gives you a different type of diversity and styles that you might not be able to capture with video data.
Artistic things, for instance, which only exist in the image data. If you think about artworks and stuff, of course you can make a video of an artwork, but that might not be a very interesting motion. At least the artwork might not move.
I think we also shouldn’t underestimate the need to think about the development plan in itself because, first of all, it takes different amounts of compute to train the image or video model. There’s also a bit more experience with image models, which makes it a bit safer and a bit faster to get started. And so I think that was actually also part of the decision to do it that way: you don’t want to aim for something where you say this will only be ready in 12 months or something.
I think it’s super important to have a continuous progress, but where in the intermediate steps, you also get something really useful like the image model. And from that, I think it really makes a lot of sense to go that route. There’s not much that you lose.
You can much better overlap different developments: by starting the image training relatively quickly, we can then work in parallel on all the video data work. And yeah, that just comes down again to the overall efficiency of us as a team and our model development strategy.
To add to that, you see that, by the way, in the fact that we started our company four months ago, and we’re already putting our first model out. We had to rebuild everything, but as Robin mentioned a couple of times, we have this really specialized team, which just optimized all the parts of the pipeline, and we combine that with the continuous development of features, image features, for instance, which can then be reused for video. These combinations led to, I think, really good progress, which you’re seeing right now, because we’re putting a really powerful and large model out after four months, which I’m personally very proud of.
I think that what’s not well understood in the research space is just how much intuition still matters, how much taste still matters, and how individual decisions that you make as a team have dramatic impacts on the speed at which you can produce models, the quality that comes out. As an example, I remember talking to you guys a few months ago about how to approach the problem of latency, of slowness in generation.
Video generation is still pretty slow.
It takes a while to prompt something and then see a generation come back. I remember asking you guys, how could we crack that? All three of you immediately said, yeah, we should ask Axel.
Can you say a little bit more about that? Why was it so clear to you that there was a specialized person on the team whose intuition you would go to first versus saying, oh, let’s look that up, it should be common knowledge or let’s look at what the latest conference papers said and so on?
No, I think this is what really makes a difference. You need to have this functioning team, you need to know each other, you need to know that you can trust each other. If Axel tells me we’ll have this model ready in a week, then I trust him.
So I trust Axel as well. But what is it about training these models that’s so difficult that knowledge still is locked up within one or two individuals, where it’s not ubiquitously distributed? And it especially wouldn’t be if someone like you guys weren’t being so transparent about your research.
Yeah, it’s a lot of intuition and experience. The ability to judge where a training run goes, when you look at the early samples in the training, I think that’s super important. And I think this is something a lot of team members actually have.
And yeah, on that intuition, the internal name of the model, when we started the run for the Flux model that we just released, was actually YOLO 12b. And I think there were a few decisions where we said, this might work, let’s see where it goes.
But overall, we had a pretty good feeling about the whole thing. And that’s a nice way of operating, I think.
Yeah. But I think there’s a kind of scarcity in how much experience there can even be. The whole process of training a model just is slow.
I think there’s very little way around that. And then, of course, you try ways around it, like doing it on a smaller scale. But then sometimes you see that it doesn’t really translate to a scaled-up version.
Just going back to the question of why it’s still locked up, or why it’s so important to have a few people with experience: I still think it’s actually somewhat limited how much experience people have. And yeah, it’s about really having that hands-on experience and also being able, like Robin said, to judge continuously during the training whether this is going in the right direction or not, without completely blindly trusting loss curves.
Let’s say you have to predict over the next two or three years, what will be more valuable for a human to do versus the model to do.
I’m sure there are many different opinions, but how I see it is that it’s similar to how we are, to some degree, bottlenecked by the speed with which we can train models and actually get feedback on whether the ideas we put into them are improving things or not. This is like having a slow feedback loop. I think that’s always something that holds people back.
I think that’s very similar for visual media. From what I hear from people in that industry, if you have an idea and you want to put that into action today, I don’t know, you have to actually film this on camera, you have to get the props, you have to get everything set up just to even get a feel for it. Of course, you maybe do storyboarding and stuff before that, but I think this is something where it’s already super helpful that you can get a much faster feedback loop by giving some visuals to the ideas that are in your head.
It doesn’t have to be the end product, especially right now with the quality gaps that we still see. There’s a lot of demand for perfection and that still requires human craft right now. Maybe it will become less, but ultimately I think it’s about getting the ideas out of people’s heads into some form of visualized reality.
I agree. It’s a tool to iterate quickly on ideas. But in the end, you as a user, as a human, you have to make the decision what to use, what you don’t use.
I think there’s also the question around taste and the curation of samples. You can easily generate like 100 samples in no time, but then what do you use for your specific project? That depends on you.
And then there’s also this question around the specific style that kind of emerges in the AI art domain. Is that something we want to keep? I don’t know.
I’m not sure. So I would say we mostly think of it as a tool that can dramatically speed up certain workflows, but it’s not meant to replace these workflows entirely. And it will change workflows as well.
So where can people go find the models?
You can go to GitHub, use our inference code, download the weights from Hugging Face, and of course also use our API. I hope they generate a bunch of really weird stuff. And of course, explore the model, see how it performs in existing workflows, and integrate it into these workflows as well.
Everything is a big upgrade from the latest iteration of available weights, and we’re both excited to see what the community is exploring and, of course, hoping that we see more research around those models.
And especially, as we already talked about earlier, that model is a free model, which is not behind an API. Often, these APIs come with native prompt upsampling or similar. Of course, you can also do this for our model, but we took care to train the model such that it should also react very well to various kinds of prompting techniques, such as single words or short prompts, longer prompts, very detailed prompts.
So I guess that’s among the coolest features of open models: people can just prompt them however they themselves want to and explore what works best for them.