Why I'm not excited about Sora (OpenAI's Video Generator)
Sora/DallE are impressive but ChatGPT is exciting
OpenAI just announced Sora, a Gen AI that can generate videos from a text description, and it is mind-blowing. Before reading the rest of this article, check out its capabilities in the video below, or here, here, or here.
It is truly impressive.
But, I’m still more interested in the LLMs (ChatGPT, Bard/Gemini, Claude).
Why?
There are two modes of creating content. Let me call them poetry mode and programming mode (or we could call them entertainment mode and business mode). Consider this poem on a difficult topic that I made ChatGPT write: you can’t call it “wrong”. You might not like this poem, but your neighbour might, and in general, whether ChatGPT successfully accomplished what it was told to do is 1) a subjective decision, and 2) one where a large number of people would find the result acceptable.
By contrast, consider this program I asked it to write. This is a fairly easy question, one we ask freshers to write during interviews. (Real-life programming tasks can be much harder than this.) But ChatGPT flubs it badly: if you run the program, you’ll see that it does not produce the correct answer even for the example given in the question. This is not an acceptable response to the task that ChatGPT was given.
I am interested in how Gen AI will affect our day-to-day lives and our work. From that perspective, I don’t find the entertainment mode of Gen AI programs particularly interesting. The business mode is where all the interesting action is.
You’ll notice that most of my past posts here have been primarily about ChatGPT and its cousins (Bard/Gemini, Bing/Copilot, Claude)—the LLMs (large language models). By contrast, I haven’t spent that much time on Dall-E/Midjourney. The reason is simple: As far as I can tell, LLMs are the only ones which have a business mode. Dall-E/Midjourney only have an entertainment mode.
What does that mean? Dall-E can generate a brilliant photograph of a dog using a laptop, but except for a small number of people in entertainment, advertising, and design, this is not a very useful capability.
Try to do any image generation related to actual business/work use cases and you’ll find that DallE/Midjourney etc fail badly. This includes tasks like:
Create a block diagram of the architecture of our offering
Give me a copy of this image with part X circled and labelled
Give me an image of this org chart1
And many others. You’ll notice what all these tasks have in common: accuracy and consistency are important, and in each case, there is a clear “correct” answer. Mistakes are not acceptable, and DallE/Midjourney are almost guaranteed to make those mistakes. In short, DallE/Midjourney don’t have a “programming mode” (so far).
Another problem is this: if, through a lot of effort, you manage to produce an image that is 90% correct, there is no easy way for you to fix the remaining 10%. If you ask it to make corrections, you’ll end up with a completely different image with new errors in the 90% that was correct. LLMs, by contrast, are much better at this. In the programming example I gave earlier, it is possible to point out the errors in the generated program and scold it, and you might be able to get a working program out of it. In the worst case, you as a programmer can fix the program yourself. In the case of the image, however, even fixing it yourself isn’t easy, because the image generated is a PNG, not an Illustrator/Photoshop input file with layers that are easy to manipulate.
Which brings me to Sora and the videos. Great entertainment, but limited value in business mode. Consistency is not a strong point. If you look closely at the Sora video of the woman walking on a Tokyo street, you’ll notice a bunch of inconsistencies (the front of her jacket changes, her legs flip); in general, consistency is not maintained for more than a few seconds. Acceptable for entertainment mode, not useful for business mode.
Of course, things can change rapidly, so I’ll keep an eye out for what new capabilities the image, video and other multimedia Gen AIs are acquiring. But for now, I’m more interested in the LLMs and excited to try out Gemini 1.5—less than a week after I complained that I wasn’t impressed with Gemini Ultra, Google has announced the next version of Gemini, with some exciting capabilities.
It is possible to get ChatGPT to produce some of these, for example, an org chart. But the way ChatGPT does that is to write a program to produce the org chart and then give you the output of that program. This is an LLM (ChatGPT/text) capability, not an image generator (DallE/Midjourney) capability.
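To make the distinction concrete, here is a minimal sketch of the kind of program an LLM emits for this task: it writes Graphviz DOT text describing the chart, rather than painting pixels. All names and the reporting structure here are my own hypothetical example, not output from ChatGPT.

```python
# Hypothetical sketch: an org chart expressed as Graphviz DOT source,
# the kind of program an LLM writes instead of drawing an image directly.

def org_chart_dot(reports: dict) -> str:
    """Build DOT source from a manager -> list-of-reports mapping."""
    lines = ["digraph org {", "  node [shape=box];"]
    for manager, team in reports.items():
        for person in team:
            lines.append(f'  "{manager}" -> "{person}";')
    lines.append("}")
    return "\n".join(lines)

# Example with made-up roles; render the result with the `dot` tool.
chart = org_chart_dot({"CEO": ["CTO", "CFO"], "CTO": ["Dev Lead"]})
print(chart)
```

Because the chart is text, fixing the "remaining 10%" is trivial: edit one line of the DOT source and re-render, instead of regenerating a whole new image.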
As one data point on image vs text: I was looking at an image of a graph, and I wanted ChatGPT to understand the graph and label certain portions. It failed badly; it could describe the image, but it could not manipulate it in a way that gave acceptable results. But when I downloaded the underlying data and asked it to write a program to plot the modified graph with the desired labels, it was able to do so. So even for the same task, the text route was better than the image route.
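The text route described in this comment can be sketched roughly as follows: plot the downloaded data with matplotlib and annotate the points of interest. The data, label, and filename here are invented for illustration, not the commenter's actual graph.

```python
# Hypothetical sketch of the "text route": rather than asking an image model
# to edit a chart, re-plot the underlying data yourself with labels.
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

def plot_with_labels(xs, ys, labels, path="labelled.png"):
    """Plot the series and annotate the points in `labels` (index -> text)."""
    fig, ax = plt.subplots()
    ax.plot(xs, ys, marker="o")
    for i, text in labels.items():
        ax.annotate(text, (xs[i], ys[i]),
                    textcoords="offset points", xytext=(5, 5))
    fig.savefig(path)
    return ax

# Example with made-up data: label the peak of a small series.
ax = plot_with_labels([1, 2, 3, 4], [10, 30, 25, 5], {1: "peak"})
```

Since the labels live in code, moving or rewording one is a one-line change, which is exactly the correction loop that fails with a generated PNG.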
I wonder why there is such a difference between image and text capabilities. Is it that the text training data was richer? Did we humans put more effort into producing high-quality text data, or is it more to do with the difference in capabilities between LLMs and diffusion models?
"Try to do any image generation related to actual business/work use cases and you’ll find that DallE/Midjourney etc fail badly." - I had tried creating specific comic strips for my blog using DallE and Midjourney with various prompts, but failed miserably.