DeepMind, Google’s AI research arm, has launched Genie 2, an advanced model designed to generate an unlimited array of playable 3D environments.
Building on its predecessor, Genie, released earlier this year, Genie 2 transforms a single image and text description—such as “A cute humanoid robot in the woods”—into interactive, real-time 3D scenes. This innovation places it alongside similar efforts by World Labs, led by Fei-Fei Li, and Israeli startup Decart.
DeepMind highlights Genie 2’s ability to create a diverse range of rich, immersive 3D worlds. These environments support user interactions like jumping and swimming, controlled via mouse or keyboard. Trained on extensive video datasets, the model can simulate realistic object interactions, animations, lighting, physics, reflections, and lifelike behaviors of non-playable characters (NPCs).
Many of Genie 2’s simulations resemble the high-quality graphics of AAA video games, likely due to the inclusion of playthroughs of popular titles in its training data. However, DeepMind, like many AI research organizations, has kept details about its data sourcing largely under wraps, citing competitive or other considerations.
This raises questions about intellectual property. As a subsidiary of Google, DeepMind has extensive access to YouTube, and Google’s terms of service suggest it has permission to use YouTube videos for training purposes, meaning Genie 2 could well have been trained on Let’s Play videos and playthroughs. The next question is whether that is enough to teach an AI model how to create 3D worlds, and, if so, whether we’ll see the kind of artifacting and picture-in-picture elements often found in Let’s Play videos.
DeepMind also claims that Genie 2 can create consistent 3D worlds with varying perspectives, such as first-person and isometric views, for up to a minute. Most simulations, however, typically last between 10 and 20 seconds.
“Genie 2 intelligently responds to keyboard actions, recognizing the appropriate character to move,” DeepMind explained in a blog post. “For instance, the model understands that arrow keys should move a robot rather than trees or clouds.”
Most world models, like Genie 2, can simulate games and 3D environments but often struggle with issues like artifacts, inconsistency, and hallucinations. For instance, Decart’s Minecraft simulator, Oasis, operates at a low resolution and often “forgets” the layout of levels as users progress.
Genie 2, however, stands out for its ability to retain details of a simulated scene, even those not currently in view, and accurately render them when they reappear. (This capability is shared by models from World Labs as well.)
That said, games built using Genie 2 wouldn’t provide much entertainment, as the model tends to reset progress after a minute. For this reason, DeepMind envisions Genie 2 as a research and creative tool—ideal for prototyping “interactive experiences” and testing AI agents.
“Genie 2’s ability to generalize beyond its training data enables it to transform concept art and sketches into fully interactive environments,” DeepMind noted. “By using Genie 2 to rapidly generate diverse and detailed environments, our researchers can create evaluation tasks for AI agents that are entirely new to them.”
This potential has sparked interest but also concern, particularly in the video game industry. As highlighted in a recent Wired investigation, major companies like Activision Blizzard are increasingly adopting AI to boost productivity, streamline development, and mitigate workforce reductions—raising questions about the role of AI in creative fields.
Nonetheless, Google continues to invest heavily in world model research, which is poised to be a transformative area in AI. In October, DeepMind recruited Tim Brooks, who previously led OpenAI’s Sora video generator project, to focus on video generation and world simulators. Two years earlier, the lab brought in Tim Rocktäschel, known for his groundbreaking work on “open-endedness” in games like NetHack, from Meta.