Sora is DALL·E for video—it's even built by the same people.
You type a text prompt, and the AI model does its best to generate a matching video. It makes sense that once AI models could generate decent images, the next challenge would be getting them to generate good video footage—and that's what Sora is getting at.
While Sora is still in testing, the results that OpenAI has demonstrated are impressive—though they do have a surreal video-game-like quality.
You can judge for yourself whether they're truly realistic or not. But first, let's have a look at what Sora is, how it works, and how it could be used going forward.
What is Sora?
Sora is a generative text-to-video AI model developed by OpenAI, the makers of ChatGPT and DALL·E 3. OpenAI claims that it "can create realistic and imaginative scenes." I'd argue that "realistic" might oversell things a touch—and they also lack sound, at least for now—but the videos it generates from written prompts do look great.
In addition to using text prompts, Sora can also take an image and turn it into a video, or take a video clip and extend it forward or backward in time.
Sora can create videos that are up to 60 seconds long with multiple characters, camera motion, and persistent and accurate details. Thanks to its training (which I'll dive into below), it has a deep understanding of how things exist in the real world—if not always how they physically interact.
How does Sora work?
Sora is built on the ideas behind OpenAI's DALL·E and GPT models, along with plenty of innovations of its own.
It was trained on an unspecified amount of video footage that appears to include everything from selfie videos to movies, TV shows, real-world footage, video game recordings, and lots more. All this training footage was captioned, mostly by AI, so that Sora could develop a deep understanding of natural language and how it relates to the physical world.
Tokenizing visual data with patches
In the technical report, the OpenAI researchers explain that they were inspired by how large language models (LLMs) like GPT are able to become incredibly competent at a wide variety of tasks just by being trained on massive quantities of data.
A big part of this is because LLMs model the relationships between individual "tokens" (fragments of meaningful text roughly four characters long) across different domains, including multiple languages, mathematics, and computer code. Feed in billions of web pages, and that token-level structure gives the model a framework it can use to organize and absorb all of it.
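To make that concrete, here's a tiny sketch of tokenization in action using OpenAI's open source tiktoken library (the prompt text here is made up, and the exact token IDs depend on the encoding):

```python
# A quick look at how an LLM tokenizer splits text into tokens.
# Assumes tiktoken is installed (`pip install tiktoken`).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4-era models
tokens = enc.encode("A corgi surfing a wave at golden hour")
print(tokens)                               # a list of integer token IDs
print([enc.decode([t]) for t in tokens])    # the text fragment each token stands for
```

Those little fragments are what the model actually predicts, one after another. Sora's patches, described next, play the same role for video.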
To achieve some of the same benefits with video, OpenAI uses "spacetime patches." In essence, every frame in a video is broken down into a series of smaller segments called patches. How each segment changes through the length of the video is also encoded in the spacetime patch—hence the name, spacetime. Crucially, this allowed Sora to be trained on a wide variety of different visual data, from vertical social media videos to widescreen movies, as each clip didn't have to be cropped or compressed to a specific set of dimensions.
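To give a rough feel for the idea, here's a toy sketch of chopping a clip into spacetime patches. This is my own simplified interpretation of the concept in the technical report, not Sora's actual code, and the patch sizes are arbitrary:

```python
# A toy illustration of "spacetime patches": each patch covers a small spatial
# region across a few consecutive frames, and gets flattened into one token.
import numpy as np

def to_spacetime_patches(video, patch_t=4, patch_h=16, patch_w=16):
    """Split a (frames, height, width, channels) video into spacetime patches."""
    f, h, w, c = video.shape
    # Trim so the clip divides evenly into patches (a real system handles this more gracefully).
    f, h, w = f - f % patch_t, h - h % patch_h, w - w % patch_w
    video = video[:f, :h, :w]
    patches = (
        video.reshape(f // patch_t, patch_t, h // patch_h, patch_h, w // patch_w, patch_w, c)
             .transpose(0, 2, 4, 1, 3, 5, 6)                # put the three patch-grid axes first
             .reshape(-1, patch_t * patch_h * patch_w * c)   # flatten each patch into one vector
    )
    return patches

clip = np.random.rand(16, 256, 256, 3)       # a fake 16-frame RGB clip
print(to_spacetime_patches(clip).shape)      # (1024, 3072) with the defaults above
```

Because every patch ends up as the same kind of flat vector, clips of any length, resolution, or aspect ratio can be turned into a token sequence a transformer knows how to handle.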
It gets really complicated really quickly, so if you want to learn more, check out the technical report or this article from Towards Data Science, or keep reading for a few more details.
Generating patches with a transformer diffusion network
To generate videos, Sora uses the same diffusion method as DALL·E, combined with a transformer architecture similar to GPT, which is what enables it to generate long, detailed clips that can contain multiple shots.
Diffusion starts with a random field of noise, and the AI repeatedly edits it so that it gets closer and closer to matching the target prompt. It sounds wild, and I explain it in more detail in Zapier's roundup of the best AI image generators, but it works really well with modern image models. It's how Stable Diffusion, Midjourney, DALL·E 3, and every other AI art generator are able to create such interesting results.
Sora's biggest development is that it doesn't generate a video frame by frame. Instead, it uses diffusion to generate the entire video all at once. The model has "foresight" of future frames, which allows it to keep generated details consistent throughout the entire clip, even if they move in and out of frame, are obscured by other objects, or the virtual camera moves through 3D space.
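Here's a deliberately stripped-down illustration of that difference, with the actual model replaced by a placeholder function. It isn't Sora's real code; it just shows the shape of a denoising loop that refines every frame of the clip at each step, instead of finishing one frame before starting the next:

```python
# A toy sketch of diffusion over a whole clip: every denoising step operates on
# the entire spacetime volume at once, which helps details stay consistent.
import numpy as np

def predict_noise(noisy_clip, prompt_embedding, step):
    """Placeholder for the diffusion transformer, which would predict the noise
    to subtract at this step, conditioned on the text prompt."""
    return noisy_clip * 0.1  # hypothetical stand-in, not a real model

frames, height, width, channels = 16, 64, 64, 3
clip = np.random.randn(frames, height, width, channels)   # start from pure noise
prompt_embedding = np.random.randn(512)                    # pretend text encoding

for step in range(50):                                     # diffusion typically runs for tens of steps
    clip -= predict_noise(clip, prompt_embedding, step)    # the whole clip is refined together
```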
Check out a few of OpenAI's sample videos here and here, and you'll see all this in action. The clips generally look to have consistent details without too many weirdly generated artifacts.
For a deeper dive into the technology behind AI, here are some resources:
What can Sora be used for?
At its most basic, Sora can be used to generate videos from text prompts. How useful this is in the real world remains to be seen. AI-generated images haven't replaced photographers and other artists, but they're definitely being widely used—especially online.
But if OpenAI's preview is to be believed, Sora can do a whole lot more:
It can convert static images and drawings into videos.
It can add special effects to existing images and videos.
It can extend videos both forward and backward in time.
It can convert any video clip into a seamless loop.
It can interpolate between two unrelated video clips.
It can edit existing videos, replacing the background or subject with something else.
Some of these features, at least, could let people create entirely new kinds of videos without resorting to video editing and special effects programs like Adobe After Effects.
Of course, typical of OpenAI's grandiose/futuristic vision, Sora isn't useful just for creating video. It can apparently simulate artificial processes like video games, and as a result, the researchers feel that the "continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them." If the Metaverse finally takes off, we might have Sora to blame.
Then there's the potential for deepfakes. While existing video editing and AI tools already make them easy to create, text-to-video AI models could supercharge the ability of unscrupulous people to generate them with little to no effort. The video quality isn't quite convincing yet, but that doesn't mean it will stay that way, or that some people won't try to pass off AI videos as real anyway.
To OpenAI's credit, they've generally put strong guardrails in place that make their models hard to misuse and abuse, but the same can't be said of other services built on similar open-source models. The next few years are certainly going to be weird as society comes to terms with fake videos becoming easier and cheaper to produce.
How good is Sora?
OpenAI's Sora demos look great, but there are some big caveats hanging over it all.
According to OpenAI, Sora can struggle to accurately simulate physics in complicated scenes and won't always nail cause and effect. The example they give is that someone might take a bite out of a cookie while leaving it whole. Similarly, in one of the video demos, it fails to model a glass shattering as it tips over. The model can also mix up spatial details like lefts and rights, and may not be able to follow "precise descriptions of events that take place over time, like following a specific camera trajectory."
The biggest question mark, though, is just how cherry-picked OpenAI's examples are. If the video demos are a reasonably accurate representation of what Sora can do with a given prompt, it's going to be fascinating when it gets released to the general public. On the other hand, if the clips are just the best of the best, and there's a lot of bad footage left on the cutting room floor, then Sora will be a little less exciting—at least initially. Once OpenAI gets loads of training from people using it, it's likely to rapidly improve regardless.
When will Sora be available?
Sora is currently available to "red teamers," AI researchers who specialize in finding the weaknesses and vulnerabilities in AI models, and, in particular, figuring out how to make them create all kinds of horrible things. OpenAI will then use the results of their testing to train Sora so that it's more suitable for release to the general public.
While there's no clear timeline for that to happen, you can try out a few other text-to-video AI models today. Runway Gen-2 is the big name, while Google's Lumiere and Meta's Make-A-Video have unofficial PyTorch implementations floating around if you have the technical chops to experiment with them. Or you can check out Zapier's list of the best AI video generators.
Otherwise, I'd recommend just heading to the Sora page to see tons of examples of the tool in action.
Related reading: