
Google DeepMind to Combine Gemini AI and Veo

Google DeepMind CEO Demis Hassabis (Image credits: Semafor)

Google is quietly laying the groundwork for a major evolution in artificial intelligence: combining its powerful Gemini AI model with Veo, its video generation system, to build a smarter, more physically aware digital assistant. Speaking on the Possible podcast, hosted by LinkedIn co-founder Reid Hoffman, Google DeepMind CEO Demis Hassabis revealed this long-term vision. He said the ultimate goal is to create a universal AI assistant, one that can truly understand and interact with the physical world.

“We’ve always built Gemini to be multimodal from the start,” Hassabis said. “That’s because we envision an assistant that helps in the real world. Not just in text or online searches, but with real-world tasks and interactions.”

This aligns with a broader trend across the AI industry. The next frontier is "omnimodal" or "any-to-any" models: AI systems capable of understanding and generating content across multiple media formats, including text, audio, video, and images.

Gemini, Google's flagship AI model, can already generate text, audio, and images. By integrating it with Veo, Google's advanced video-generating AI, it could gain deeper insight into the laws of physics, movement, and visual storytelling. That extra layer of understanding is what Hassabis believes will make future assistants far more capable in the physical world.

Where will all this training data come from? Likely YouTube.

According to Hassabis, Veo has been "watching a lot of YouTube videos" to learn how the world works, from how people move to how physical objects behave. While he didn't explicitly confirm the full extent of this training, he hinted strongly that Google's video platform is a key data source.

This isn't surprising, considering that Google has access to one of the largest video libraries in the world. As previously reported, the company updated its terms of service last year, reportedly in part to give itself broader rights to use YouTube content for AI training.

Google had earlier told TechCrunch its models “may be” trained on “some” YouTube content, depending on creator agreements. But as the demand for rich multimodal data continues to grow, the use of YouTube footage could become increasingly central to training more advanced models like Veo and Gemini.

While OpenAI’s ChatGPT can now generate images and even stylized art, and Amazon plans to roll out its own “any-to-any” model later this year, Google seems to be aiming for something deeper—an AI that doesn’t just replicate media, but truly understands it.

By combining Veo's visual comprehension with Gemini's language and reasoning capabilities, Google is positioning itself to lead the charge toward AI assistants that are not only helpful online but useful in the real world, too.
