AI video isn't patchy anymore. It's almost real.
OpenAI is mogging everyone in the AI space again. Sora, OpenAI's new AI model, spits out videos from simple text prompts, and we're talking minute-long videos that feel insanely real. So dream away, because Sora ("sky" in Japanese) is the limit.
What’s going on here?
OpenAI’s new text-to-video model, Sora, is a major leap for video-generating AI.
Introducing Sora, our text-to-video model.
Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions.
openai.com/sora
— OpenAI (@OpenAI)
6:14 PM • Feb 15, 2024
What does this mean?
Text-to-video models have been improving gradually since early last year. We started with Will Smith choking on spaghetti and moved on to scripted history, dragon worlds, and decent-ish-looking videos. But two problems remain:
The videos are still janky. You can tell it’s AI.
They aren’t very long: 4 seconds, 10 seconds, 20 if you push it.
OpenAI broke that chain of gradual improvement and delivered a whole level upgrade (maybe several levels) with Sora.
Sora’s videos are smooth, dynamic, and consistent, and they run up to a minute. You can get detailed: animation style, mood, camera angles, and more. Imagine specifying “Wes Anderson directs a Pixar short about hamsters." Sora aims to deliver.
Sora is not out for public use yet. Everyone is copy-pasting the demo examples OpenAI released, and I can’t blame them; they are unreal 😉.
But let’s dig a bit deeper into Sora’s technical report and see what OpenAI claims:
Sora can create videos in a wide range of aspect ratios and resolutions, from widescreen 1920x1080 to vertical 1080x1920 and everything in between.
Similar to DALL·E 3, OpenAI uses language models (GPT) to rewrite short user prompts into detailed "power prompts", which leads to higher-quality videos (see the sketch after this list).
Sora can use images and videos as inputs, not just text. That means:
It can animate images.
It can extend videos: backwards and forwards.
It can edit videos, e.g. changing the setting while keeping the characters the same.
It can connect two videos, filling the in-between frames automatically.
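To make that GPT prompt-expansion step concrete, here's a minimal sketch using OpenAI's chat completions API. The model choice and system prompt are assumptions on my part; OpenAI hasn't published the exact rewriting setup, and Sora itself has no public API to send the result to yet.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def expand_prompt(idea: str) -> str:
    """Rewrite a terse idea into a detailed, cinematic video prompt.

    Hypothetical re-creation of the DALL·E 3-style expansion step:
    a language model fills in subject, setting, lighting, camera
    motion, and style before the prompt reaches the video model.
    """
    response = client.chat.completions.create(
        model="gpt-4",  # assumed; OpenAI hasn't said which model Sora uses
        messages=[
            {
                "role": "system",
                "content": (
                    "Expand the user's idea into a detailed video prompt. "
                    "Describe the subject, setting, lighting, camera motion, "
                    "and animation style in a few vivid sentences."
                ),
            },
            {"role": "user", "content": idea},
        ],
    )
    return response.choices[0].message.content


print(expand_prompt("a corgi surfing at sunset"))
```

The same trick works for any generative model that takes text: the richer the prompt, the less the model has to guess.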
The wildest claim OpenAI makes (and we can see hints of it in the demos) is that Sora learns about the world through video. It shows some understanding of 3D motion, how objects behave, and complex interactions (not perfectly, though). That points towards a model that simulates our world, to the best extent we currently can.
But how is OpenAI’s model this good across all of this? In their own words, these capabilities are purely phenomena of scale. The best illustration of that is their demo comparing the same prompt at base compute, 4x compute, and 16x compute.
Why should I care?
AI just got serious about filmmaking and it’s gonna kill the video industry.
Just kidding, we don’t cry wolf around here. But in all honesty, get ready for a wave of AI-generated video hitting the web. Sora (and upcoming models) will not only make long, visually consistent videos, but they’ll also handle complex prompts with characters, emotions, and multiple scene changes.
Seeing is... not always believing. At least not anymore. It'll get harder to tell what's real footage and what's been cooked up by an AI. Limitations exist. Physics can be wacky in these videos, and Sora might misinterpret some directions. Don't throw out your special effects team just yet, but take note.