Veo 2, Google’s AI video generator announced in December, now
has a price that is attracting attention.
At 50 cents per second of generated video, the cost adds up to $30 per minute. That price tag will likely make video ads and content across the web -- as well as advertising on connected TV (CTV) -- more affordable for smaller companies. Google has posted the pricing page online.
“[A] very important number to keep in mind when considering the future of generative and non-generative media,” Jon Barron, an AI researcher at Google DeepMind, wrote in an X post.
By comparison, Barron noted that the blockbuster "Avengers: Endgame" cost roughly $32,000 per second to produce using traditional methods.
Barron agreed with a comment from @mahaoo_ASI, a follower on X, that the observation wasn’t entirely an apples-to-apples comparison “because you'd probably need to generate hundreds of generations before you get what you want, but it's still a 1000x difference, and in a few years, it might be comparable in terms of capabilities (consistent characters/scenes).”
OpenAI recently made Sora, a video generation model, available to subscribers for $200 a month through a ChatGPT Pro subscription.
There are other uses for video. Meta and others have begun training AI models on publicly available video to help them understand the world.
Meta publicly released the Video Joint Embedding Predictive Architecture (V-JEPA) model, a step toward advancing machine intelligence. An early example of a physical world model, it can detect and understand highly detailed interactions between objects.
It was released under a Creative Commons non-commercial license for researchers to further explore.
V-JEPA was trained on two million public videos. Meta said it achieves strong performance on motion- and appearance-based tasks without fine-tuning and can outperform other methods.
Building it required training a foundation model for object-centric learning on video data. A neural network extracts object-centric representations from video frames, capturing motion and appearance cues. These representations are then refined through contrastive learning to sharpen the object features, and the architecture processes them to model object interactions over time. The framework is trained on a large-scale dataset, optimizing for reconstruction accuracy and for consistency of the objects across frames.
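To make that pipeline concrete, here is a minimal, hypothetical sketch in PyTorch of the steps described above -- a frame encoder producing object-centric "slot" vectors, a contrastive loss refining them, a temporal transformer modeling object interactions, and a training objective combining reconstruction and cross-frame consistency. This is not Meta's released V-JEPA code; every module name, shape, and hyperparameter below is an illustrative assumption.

```python
# Hypothetical sketch of an object-centric video training loop (not V-JEPA itself).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameSlotEncoder(nn.Module):
    """Maps each video frame to a fixed set of object-centric slot vectors."""
    def __init__(self, num_slots=6, slot_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(          # tiny CNN over 64x64 RGB frames
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_slots = nn.Linear(64, num_slots * slot_dim)
        self.num_slots, self.slot_dim = num_slots, slot_dim

    def forward(self, frames):                  # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))
        return self.to_slots(feats).view(B, T, self.num_slots, self.slot_dim)

class TemporalInteractionModel(nn.Module):
    """Transformer over the slot sequence to model object interactions in time."""
    def __init__(self, slot_dim=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=slot_dim, nhead=4,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, slots):                   # slots: (B, T, S, D)
        B, T, S, D = slots.shape
        out = self.transformer(slots.reshape(B, T * S, D))
        return out.view(B, T, S, D)

def contrastive_slot_loss(slots, temperature=0.1):
    """InfoNCE-style loss: each slot should match itself in the next frame."""
    a = F.normalize(slots[:, :-1].reshape(-1, slots.shape[-1]), dim=-1)
    b = F.normalize(slots[:, 1:].reshape(-1, slots.shape[-1]), dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.shape[0])
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    frames = torch.randn(2, 8, 3, 64, 64)        # fake clip: batch of 2, 8 frames
    encoder, dynamics = FrameSlotEncoder(), TemporalInteractionModel()
    decoder = nn.Linear(64, 3 * 64 * 64)         # toy per-frame decoder
    slots = encoder(frames)
    predicted = dynamics(slots)
    recon = decoder(predicted.sum(dim=2)).view_as(frames)
    loss = (F.mse_loss(recon, frames)            # reconstruction term
            + contrastive_slot_loss(slots)       # contrastive slot refinement
            + F.mse_loss(predicted[:, :-1], slots[:, 1:]))  # temporal consistency
    loss.backward()
    print(float(loss))
```

Run as-is, the script builds the toy model on random frames and prints a single combined loss value; a real system of this kind would swap in a much larger encoder, real video data, and many training iterations.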