After LLMs: Spatial Intelligence and World Models — Fei-Fei Li & Justin Johnson, World Labs

Fei‑Fei Li and Justin Johnson explain how exploding compute, open‑science datasets, and new 3‑D data structures are powering the next generation of world models and spatial intelligence.

1h 0m · Guests: Fei-Fei Li and Justin Johnson · Host: Alessio

deepu.kalidindi · 🤖 AI & Technology · 🔬 Science & Engineering · added 3 days ago

Compute Scaling as the Engine of Progress

1 / 9

The guests argue that every major leap in AI has been driven by orders-of-magnitude increases in compute. Exponential growth in GPU performance, combined with the ability to train on thousands of devices, unlocks the data-hungry spatial models that were impossible a decade ago.

#capital-scaling7

Compute scaling fuels the leap from ImageNet to world models

The early days of deep learning were defined by moving from CPUs to GPUs, a transition epitomised by AlexNet.
Today a single GPU is roughly a thousand times faster than the GPUs used for AlexNet, allowing far larger models.
This raw compute boost is the primary reason we can now train generative 3‑D world models that were unimaginable in the early 2010s.
Without this scaling, spatial intelligence would remain a research curiosity rather than a deployable product.

Quote

00:00:00

“I think the whole history of deep learning is in some sense the history of scaling up compute.” — Fei-Fei Li

Quote

00:04:13

“If you think about AlexNet required this jump from CPUs to GPUs but even from AlexNet to today we're getting about a thousand times more performance per card.” — Fei-Fei Li

Million‑fold compute increase enables massive spatial models

Modern training clusters can marshal a million‑fold more FLOPs than were available at the start of Fei‑Fei’s PhD.
This scale makes it feasible to train models that ingest and reason over billions of pixels of visual data.
The sheer compute budget also lets researchers experiment with richer loss functions that capture geometry, physics, and semantics simultaneously.
As a result, world‑scale generative models like Marble become practical rather than speculative.
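As a rough back-of-the-envelope reading of the two figures quoted in this section (about 1,000x per card and about 1,000,000x per model), whatever a faster card does not explain has to come from scale-out, such as device count and training time. The split below is an inference from those two numbers, not something the guests state, and the variable names are illustrative.

```python
import math

per_card_speedup = 1_000        # "about a thousand times more performance per card"
total_speedup = 1_000_000       # "about a millionfold more" compute on a single model

# Whatever is not explained by a faster card must come from scale-out
# (more devices, longer runs, better utilisation). This is an inference.
scale_out_factor = total_speedup / per_card_speedup

print(f"scale-out factor: {scale_out_factor:,.0f}x")
print(f"orders of magnitude: per-card {math.log10(per_card_speedup):.0f}, "
      f"scale-out {math.log10(scale_out_factor):.0f}, "
      f"total {math.log10(total_speedup):.0f}")
```

Read this way, the remaining factor is consistent with the section's mention of training on thousands of devices.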

Quote

00:04:28

“The amount of compute that we can marshal today on a single model is about a millionfold more than we could have even at the start of my PhD.” — Fei-Fei Li

Quote

00:13:02

“Performance per watt from Hopper to Blackwell shows scaling limits; we need new architectures.” — Fei-Fei Li

GPU performance per watt is hitting limits, prompting new hardware primitives

Even as transistor counts grow, the guests cite performance per watt from the Hopper to the Blackwell generation as evidence that hardware scaling is hitting limits.
Relying solely on matrix‑multiplication‑centric GPUs will soon become a bottleneck for distributed 3‑D workloads.
The guests advocate exploring primitives beyond dense matrix ops, such as graph‑based or sparsity‑aware kernels that map better to spatial data.
Early research into these alternatives could unlock the next wave of efficient world‑model training.

Quote

00:11:35

“We need wacky ideas… hardware scaling will not be infinite, we need new primitives beyond matrix multiplication.” — Fei-Fei Li

Quote

00:03:32

“If you think about AlexNet, the core pieces of it were obviously ImageNet, it was the move to GPUs and neural networks.” — Fei-Fei Li

Open Science vs. Commercial Incentives

2 / 9

World Labs balances an open-science ethos (publishing datasets and benchmarks) with the commercial realities of a startup. While academia struggles for resources, industry can move faster but often keeps breakthroughs behind closed doors. This tension shapes how world-model research is funded and shared.

#funded-trading3

Open datasets like Behavior keep academic research alive

Stanford’s recent Behavior benchmark provides a simulated environment for robotic learning, lowering the entry barrier for academia.
By releasing both data and evaluation protocols, the community can iterate on world‑model algorithms without needing massive compute clusters.
Open benchmarks also create a common yardstick, making progress measurable across labs.
The initiative demonstrates that open science can thrive alongside commercial products.

Quote

00:05:03

“Open science still important… we announced an open dataset and benchmark called behavior for benchmarking robotic learning in simulated environments.” — Fei-Fei Li

Quote

00:08:05

“We work with the first Trump administration on a bill called National AI Research Resource (NAIRR), which is scoping out a national AI compute cloud as well as a data repository.” — Fei-Fei Li

Industry's focused work often stays private, creating a mixed ecosystem

Many industrial labs develop world‑model components that never appear in public challenges, opting instead for proprietary product roadmaps.
This private progress fuels rapid iteration but limits external validation and community learning.
The ecosystem therefore consists of a public‑open layer (datasets, benchmarks) and a private‑closed layer (productised models).
Understanding this split helps researchers navigate collaborations and funding opportunities.

Quote

00:05:47

“The ecosystem is a mixture… a lot of the very focused work in industry… seeing the daylight in the form of a [product] rather than an open challenge per se.” — Fei-Fei Li

Quote

00:06:09

“It's just a matter of the funding in the business model, you have to see some ROI.” — Fei-Fei Li

Funding pressures push labs toward proprietary challenges, but open science remains vital

Academic groups are severely under‑resourced, limiting their ability to chase large‑scale experiments.
Start‑ups, needing to show ROI to investors, often prioritize closed‑source product development.
Despite this, the founders stress that open datasets and benchmarks are essential to keep the research pipeline healthy.
The balance between commercial incentives and open collaboration will shape the next decade of world‑model progress.

Quote

00:11:10

“Academia is severely under‑resourced… researchers and students do not have enough resources to try these ideas.” — Fei-Fei Li

Quote

00:00:08

“When I graduated from grad school, I really thought the rest of my entire career would be towards solving that single problem…” — Fei-Fei Li

Spatial Intelligence vs. Language Intelligence

3 / 9

The conversation distinguishes spatial reasoning (understanding, moving, and interacting in 3-D space) from linguistic processing. Humans are born with high-bandwidth visual perception, while language is a comparatively low-bandwidth, symbolic channel. Building AI that matches human spatial acuity requires models that go beyond token-by-token prediction.

#spatial-intelligence4 #cognitive-biases2

Spatial intelligence complements linguistic intelligence for 3‑D reasoning

Spatial intelligence lets an agent infer geometry, affordances, and physical constraints directly from visual input.
Linguistic intelligence excels at abstract reasoning, planning, and knowledge retrieval.
Combining both yields systems that can understand a scene (spatial) and explain it (language).
World Labs’ Marble exemplifies this hybrid approach by taking textual prompts and producing editable 3‑D worlds.

Quote

00:43:48

“Spatial intelligence is the capability that allows you to reason, understand, move and interact in space. It is complementary to linguistic intelligence.” — Fei-Fei Li

Quote

00:42:54

“The AI field is inspired by human intelligence, which is multi‑intelligent, including spatial.” — Fei-Fei Li

Human perception is inherently spatial; language is a lossy channel

Humans are born with visual perception and basic spatial abilities, whereas language takes years to acquire.
The visual system processes a massive stream of pixel data, far richer than the roughly 215,000 tokens the guests estimate a person speaks in a day.
Translating rich 3‑D experience into discrete tokens inevitably discards geometry, depth, and physical cues.
Therefore, relying solely on language models limits an AI’s ability to act in the real world.
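To give the bandwidth comparison a sense of scale, here is a minimal sketch that turns the quoted 215,000 spoken tokens per day into a per-second rate and sets it against one hypothetical camera stream. The 1080p, 30 fps camera is an assumption chosen only for illustration; pixels and tokens are not equivalent units, so the ratio is indicative at best.

```python
SECONDS_PER_DAY = 24 * 60 * 60

spoken_tokens_per_day = 215_000          # figure quoted in the episode
tokens_per_second = spoken_tokens_per_day / SECONDS_PER_DAY

# Hypothetical visual stream: one 1080p camera at 30 frames per second.
# These numbers are an assumption chosen for illustration only.
width, height, fps = 1920, 1080, 30
pixels_per_second = width * height * fps

ratio = pixels_per_second / tokens_per_second
print(f"speech: {tokens_per_second:.2f} tokens/s")
print(f"vision: {pixels_per_second:,} pixels/s")
print(f"ratio:  {ratio:,.0f} pixels per spoken token")
```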

Quote

00:47:33

“We are born with visual perception and spatial abilities; language acquisition is harder.” — Fei-Fei Li

Quote

00:45:10

“The token count of speaking all day is about 215,000 tokens, far less than the bandwidth of our world.” — Fei-Fei Li

Moving AI out of the data‑center into the world requires spatial models

The next decade of computer vision will shift from training inside massive data‑centers to deploying models that act directly in physical environments.
This transition demands models that understand geometry, dynamics, and affordances, not just static image classification.
Spatial intelligence therefore becomes the missing piece for embodied agents, robotics, and AR/VR applications.
World Labs’ focus on world models is a direct response to this emerging need.

Quote

00:03:07

“The next decade of computer vision will be about getting AI out of the data centre and out into the world.” — Fei-Fei Li

Marble: A Generative 3-D World Model and Product

4 / 9

Marble is World Labs' flagship system that turns text, images, or multiple images into editable 3-D scenes. It is deliberately built as a usable product today while also serving as a research stepping-stone toward full-scale spatial intelligence. Real-time demos prove the feasibility of streaming 3-D generation over the internet.

#3d-ai13 #marketing-strategy5

Marble ingests multimodal inputs and produces editable 3‑D worlds

Users can feed a single text prompt, a single image, or a set of images and receive a coherent 3‑D scene that matches the inputs.
The system supports interactive edits such as recoloring objects, moving items, or adding new geometry.
This multimodal flexibility bridges the gap between creative designers and technical engineers.
By exposing a simple API, Marble enables downstream applications ranging from game asset creation to rapid prototyping.

Quote

00:00:28

“Marble is a generative model of 3‑D worlds… you can input things like text or image or multiple images and it will generate for you a 3‑D world that kind of matches those inputs.” — Fei-Fei Li

Quote

00:31:45

“Marble is the first glimpse into our model… it is a model of spatial intelligence that is also intentionally designed to be useful today.” — Fei-Fei Li

Product focus balances immediate utility with long‑term world‑model ambition

While Marble showcases cutting‑edge research, the team deliberately built it to solve real problems for creators in gaming, VFX, and film.
This dual‑track approach ensures revenue streams that fund further research into grander world models.
The UI emphasizes precise camera control and instant feedback, differentiating it from frame‑by‑frame video generators.
By shipping a usable product early, World Labs gathers user data that informs the next generation of spatial AI.

Quote

00:00:50

“We are seeing emerging use cases in gaming, in VFX, in film where I think there's a lot of really interesting stuff that Marble can do today as a product.” — Fei-Fei Li

Quote

00:33:23

“Precise camera control requires understanding 3‑D space; Marble enables that unlike other video generative models.” — Fei-Fei Li

Real‑time demos showcase feasibility despite latency, proving concept

A live demo streamed 3‑D generation from a server in California to a conference in Santiago, Chile, and stayed functional at roughly 1 FPS.
The demo proved that even with network latency, a responsive 3‑D generation pipeline is possible.
Such proof‑of‑concepts build confidence for investors and early adopters.
Ongoing UI improvements, like the “advanced mode,” aim to reduce friction and increase interactivity for end‑users.

Quote

00:20:20

“We built a real‑time demo streaming from California to Santiago with 1 FPS; the fact that it worked at all was pretty amazing.” — Fei-Fei Li

Quote

00:59:42

“Marble's advanced mode editing allows detailed changes; UI improvements needed.” — Fei-Fei Li

Data Structures for 3-D Worlds

5 / 9

World Labs experiments with several atomic representations for 3-D content: Gaussian splats, frame-by-frame video tokens, and tokenised 3-D chunks. The choice of primitive determines rendering speed, editability, and scalability. Gaussian splats currently offer the best real-time performance on mobile devices.

Gaussian splats are the atomic rendering unit enabling real‑time on devices

Each splat is a tiny semi‑transparent particle with a 3‑D position and orientation.
Because splats can be rasterised efficiently, they run at interactive frame rates on smartphones and browsers.
The scene is built by compositing millions of splats, allowing fine‑grained geometry without heavy mesh processing.
This representation underpins Marble’s ability to provide instant camera control and live editing.
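The splat description above can be sketched as a small data structure plus front-to-back alpha compositing along a single camera ray. This is a simplified illustration, not Marble's renderer: the `color` and `opacity` fields are assumed attributes, and the per-splat covariance and 2-D projection that real Gaussian splatting uses are left out.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Splat:
    position: Vec3                                   # 3-D centre of the particle
    orientation: Tuple[float, float, float, float]   # unit quaternion (w, x, y, z)
    color: Vec3                                      # RGB in [0, 1] (assumed attribute)
    opacity: float                                   # alpha in [0, 1] (assumed attribute)

def composite_along_ray(splats: List[Splat]) -> Vec3:
    """Blend splats hit by one camera ray, nearest first (smallest z)."""
    r = g = b = 0.0
    transmittance = 1.0                 # fraction of light still passing through
    for s in sorted(splats, key=lambda s: s.position[2]):
        weight = transmittance * s.opacity
        r += weight * s.color[0]
        g += weight * s.color[1]
        b += weight * s.color[2]
        transmittance *= 1.0 - s.opacity
    return (r, g, b)

identity = (1.0, 0.0, 0.0, 0.0)
ray_hits = [
    Splat((0.0, 0.0, 2.0), identity, (0.0, 0.0, 1.0), 1.0),   # opaque blue, far
    Splat((0.0, 0.0, 1.0), identity, (1.0, 0.0, 0.0), 0.5),   # half-transparent red, near
]
pixel = composite_along_ray(ray_hits)
print(pixel)
```

Because each splat contributes independently once sorted by depth, this kind of blending is cheap to rasterise, which is consistent with the real-time claim made in this card.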

Quote

00:34:36

“Gaussian splats are tiny semi‑transparent particles with position and orientation; they can be rendered in real time on any device.” — Fei-Fei Li

Quote

00:38:18

“Splat density is limited by device compute; newer devices allow more splats and higher resolution.” — Fei-Fei Li

Alternative representations like frames or token streams are being explored

The RTFM model generates one frame at a time as the user interacts, offering a video‑like output.
Token‑based 3‑D representations could treat chunks of space as discrete symbols, similar to language tokens.
These alternatives may improve compatibility with existing transformer pipelines but could sacrifice real‑time performance.
Ongoing research evaluates trade‑offs between latency, fidelity, and editability for each primitive.
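One way to picture the idea of treating chunks of space as discrete symbols is to quantise 3-D coordinates onto a grid and give each cell an integer id. The grid resolution, scene size, and function names below are all assumptions made for illustration; the episode does not describe how World Labs tokenises space.

```python
from typing import Tuple

GRID = 16          # assumed: the scene is split into 16 x 16 x 16 chunks
SCENE_SIZE = 8.0   # assumed: scene spans [0, 8) metres on each axis

def chunk_token(x: float, y: float, z: float) -> int:
    """Map a 3-D point inside the scene to the id of the chunk containing it."""
    cell = SCENE_SIZE / GRID
    ix, iy, iz = (min(int(c / cell), GRID - 1) for c in (x, y, z))
    return ix + GRID * (iy + GRID * iz)

def token_to_chunk(token: int) -> Tuple[int, int, int]:
    """Inverse mapping: recover the (ix, iy, iz) grid coordinates."""
    return (token % GRID, (token // GRID) % GRID, token // (GRID * GRID))

tok = chunk_token(1.2, 0.3, 7.9)
print(tok, token_to_chunk(tok))
```

The inverse mapping matters: a flat id on its own hides which chunks are neighbours, so a model consuming these tokens would still need the grid coordinates in some form.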

Quote

00:35:13

“The model's architecture could be frames, tokens, or splats; we are experimenting.” — Fei-Fei Li

Quote

00:35:13

“The model can generate frames one at a time as the user interacts (RTFM model).” — Fei-Fei Li

Choosing the right primitive impacts scalability and editability

Frame‑based outputs are straightforward for video pipelines but make per‑object editing cumbersome.
Token‑based schemes align with language models but require careful design of positional embeddings for 3‑D space.
Gaussian splats provide a sweet spot: they are lightweight enough for mobile rendering while preserving per‑object granularity.
Future work may combine primitives, using splats for geometry and tokens for high‑level semantics.

Quote

00:56:02

“Sequence‑to‑sequence may still be useful but we shouldn't discard what works.” — Fei-Fei Li

Integrating Physics into World Models

6 / 9

Adding dynamics and force reasoning to generative 3-D models is essential for applications like architecture and robotics. The team explores two pathways: attaching physical properties directly to splats and distilling classical physics engine simulations into neural weights. Accurate physics remains a hard problem, especially when models must generalise to unseen forces.

#simulation3 #dynamics2

Attaching mass and spring properties to splats enables physics simulation

Each splat can be enriched with a mass value and linked to neighbouring splats via virtual springs.
This lightweight physics layer allows simple dynamics such as gravity, collisions, and elastic deformation without a full engine.
Because the physics is baked into the same data structure used for rendering, updates remain real‑time.
Early prototypes show plausible object interactions, opening doors for interactive design tools.
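A minimal sketch of the mass-and-spring idea: two splats joined by one virtual spring, integrated with semi-implicit Euler in one dimension. The class name, constants, and damping factor are illustrative assumptions, and collisions are not modelled.

```python
from dataclasses import dataclass

@dataclass
class PhysSplat:
    y: float          # height (1-D for brevity)
    v: float          # vertical velocity
    mass: float       # assumed per-splat mass in kg

def step(a: PhysSplat, b: PhysSplat, rest_len: float, k: float,
         dt: float, g: float = 9.81) -> None:
    """One semi-implicit Euler step; splat `a` is pinned, `b` hangs from it."""
    stretch = (a.y - b.y) - rest_len          # > 0 when the spring is extended
    force_on_b = k * stretch - b.mass * g     # spring pulls up, gravity pulls down
    b.v += (force_on_b / b.mass) * dt         # update velocity first ...
    b.v *= 0.98                               # simple damping (illustrative constant)
    b.y += b.v * dt                           # ... then position

anchor = PhysSplat(y=1.0, v=0.0, mass=1.0)
bob = PhysSplat(y=0.5, v=0.0, mass=0.2)
for _ in range(5000):
    step(anchor, bob, rest_len=0.5, k=50.0, dt=0.001)

# At rest the spring force balances gravity: k * stretch = m * g
expected_stretch = bob.mass * 9.81 / 50.0
print(round((anchor.y - bob.y) - 0.5, 3), round(expected_stretch, 3))
```

The simulated stretch settles at the value where spring force equals weight, which is a quick sanity check that the update rule behaves physically.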

Quote

00:36:25

“Attaching mass and spring properties to splats enables physics simulation.” — Fei-Fei Li

Quote

00:29:08

“We can use classical physics engines to generate data for training, then distill into neural network weights.” — Fei-Fei Li

Hybrid pipelines can distill classical physics engine data into neural weights

Traditional simulators produce high‑fidelity trajectories and force fields that can be rendered as training targets.
By training a neural network on this synthetic data, the model learns an implicit physics prior without explicit engine code.
This approach leverages existing physics research while keeping inference fast and differentiable.
It also sidesteps the need to hand‑craft physical parameters for every new object class.
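The distillation pipeline described above can be sketched in miniature: a toy "engine" generates state transitions, and a student model is fitted to them without ever seeing the engine's equations. Here the student is a two-parameter linear model fitted by closed-form least squares, standing in for the neural network; the simulator, its constants, and all names are assumptions for illustration.

```python
import random

G, DRAG, DT = 9.81, 0.1, 0.01   # constants of the toy "classical engine"

def engine_step(v: float) -> float:
    """Ground-truth simulator: explicit Euler with gravity and linear drag."""
    return v + (-G - DRAG * v) * DT

# 1. Use the engine to generate training pairs (v_t, v_{t+1}).
random.seed(0)
xs = [random.uniform(-20.0, 20.0) for _ in range(200)]
ys = [engine_step(v) for v in xs]

# 2. "Distill" into two weights (w, b) with closed-form least squares.
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
var_x = sum((x - mean_x) ** 2 for x in xs)
w = cov_xy / var_x
b = mean_y - w * mean_x

def learned_step(v: float) -> float:
    """Student model: no engine code, just the fitted weights."""
    return w * v + b

print(f"w = {w:.4f} (engine: {1 - DRAG * DT:.4f})")
print(f"b = {b:.4f} (engine: {-G * DT:.4f})")
```

A linear student recovers this toy engine almost exactly because the engine itself is linear; a real physics engine would need a far more expressive model and much richer state.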

Quote

00:29:08

“We can use classical physics engines to generate data for training, then distill into neural network weights.” — Fei-Fei Li

Quote

00:29:43

“The AI field has a history of using GPUs originally for graphics then repurposed for AI.” — Fei-Fei Li

Physics integration remains a challenge for accurate architectural design

Current Marble generations produce visually plausible structures but do not guarantee structural stability.
Architects need models that respect material strength, load‑bearing constraints, and real‑world construction tolerances.
Embedding a full physics solver inside the generative loop would dramatically increase compute requirements.
Ongoing research aims to balance visual fidelity with physically correct predictions, possibly via multi‑stage pipelines.

Quote

00:25:02

“The model can generate 3‑D worlds that are plausible for architectural design, but may not capture physical forces.” — Fei-Fei Li

Quote

00:24:57

“But does the model actually understand how the arch is actually drawing on the centre kind of stone and the actual physical structure of it?” — Fei-Fei Li

Synthetic Data for Embodied AI and Robotics

7 / 9

Robotics suffers from a lack of high-quality, diverse training data. Marble can synthesize realistic 3-D environments, providing a middle ground between scarce real-world recordings and uncontrolled internet video. By generating controllable scenarios, researchers can train agents that transfer to real robots more effectively.

#automation8 #simulation3

Robotics suffers from data starvation; high‑fidelity synthetic worlds fill the gap

Real‑world robotic datasets are expensive to collect and often lack the diversity needed for robust policies.
Marble’s ability to render photorealistic scenes on demand creates virtually unlimited training material.
Synthetic environments can be annotated automatically for pose, segmentation, and physics properties.
Early collaborations show that agents trained on Marble‑generated data outperform those trained on raw internet video.

Quote

00:39:18

“Robotic training lacks high‑fidelity real‑world data; synthetic worlds from Marble can fill that gap.” — Fei-Fei Li

Quote

00:33:03

“The product can be used for gaming, VFX, film, and eventually robotics.” — Fei-Fei Li

Marble can generate diverse scenarios for training embodied agents

Because Marble accepts arbitrary text or image prompts, it can synthesize a wide range of room layouts, object configurations, and lighting conditions.
The generated scenes are fully controllable, allowing systematic curriculum learning (e.g., gradually increasing clutter).
Users can also retrieve the underlying 3‑D representation for downstream physics simulation.
This flexibility accelerates research on navigation, manipulation, and multi‑modal perception.
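The curriculum idea (gradually increasing clutter) can be sketched as a prompt schedule. This only builds text prompts; how they would be submitted to Marble is not covered in the episode, so no API is assumed, and the function name and prompt wording are hypothetical.

```python
from typing import List

def clutter_curriculum(room: str, objects: List[str], stages: int) -> List[str]:
    """Build text prompts whose object count grows stage by stage."""
    prompts = []
    for stage in range(1, stages + 1):
        count = max(1, round(len(objects) * stage / stages))
        items = ", ".join(objects[:count])
        prompts.append(f"A {room} containing {items}")
    return prompts

prompts = clutter_curriculum(
    room="small kitchen",
    objects=["a table", "two chairs", "a kettle", "stacked boxes"],
    stages=4,
)
for p in prompts:
    print(p)
```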

Quote

00:50:12

“Marble is a multimodal model; it takes language as input.” — Fei-Fei Li

Quote

00:18:30

“We built a dense captioning system that draws boxes around objects and writes short snippets for each region.” — Fei-Fei Li

Balancing realism with controllability is key for effective simulation

Purely photorealistic renderings may hide underlying physical parameters, making it hard to extract forces.
Conversely, overly abstract simulations lack the visual richness needed for perception training.
Marble’s architecture lets developers toggle fidelity levels, choosing higher‑resolution splats for perception and lower‑resolution physics‑ready meshes for dynamics.
This trade‑off strategy yields datasets that are both visually convincing and analytically useful.

Quote

00:41:30

“There are emergent use cases that just fall out of the model without being specifically built.” — Fei-Fei Li

Quote

00:58:31

“The AI community is moving from language models to spatial intelligence as a new frontier.” — Fei-Fei Li

Future Model Architectures Beyond Transformers

8 / 9

Transformers treat inputs as sets of tokens, which works well for language but is sub-optimal for spatial data that lives in 3-D. The discussion highlights the need for new primitives that map better to distributed hardware and for architectures that can capture physical laws implicitly.

#distributed-compute2 #model-architectures2

Transformers model sets of tokens, not sequences, which limits spatial reasoning

The transformer’s attention mechanism is permutation‑equivariant; order is injected only via positional embeddings.
For 3‑D data, a simple linear order (e.g., raster scan) discards spatial relationships.
Treating voxels or splats as an unordered set forces the model to learn geometry from scratch, increasing data requirements.
Re‑thinking the input representation could yield more efficient spatial models.
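The permutation-equivariance claim can be checked directly with a stripped-down self-attention layer. The sketch below uses identity query, key, and value projections and no positional embeddings (both simplifications for brevity): shuffling the input tokens simply shuffles the outputs in the same way.

```python
import math
from typing import List

Matrix = List[List[float]]

def self_attention(x: Matrix) -> Matrix:
    """Single-head attention with identity Q/K/V projections, no positions."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)                              # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
perm = [2, 0, 1]

permuted_then_attended = self_attention([tokens[i] for i in perm])
attended = self_attention(tokens)
attended_then_permuted = [attended[i] for i in perm]

same = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for row_a, row_b in zip(permuted_then_attended, attended_then_permuted)
    for a, b in zip(row_a, row_b)
)
print(same)
```

Since nothing in the layer depends on token order, any notion of "where" a token sits in 3-D space has to be supplied through the input features or positional embeddings.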

Quote

00:56:39

“Transformers are models of sets, not sequences; positional embeddings inject order.” — Fei-Fei Li

Quote

00:57:03

“Transformers are natively permutation equivariant; they treat tokens as a set.” — Fei-Fei Li

New primitives beyond matrix multiplication are needed for distributed hardware

Current GPUs excel at dense matrix ops, but future hardware may favour sparse or graph‑based kernels.
Spatial data naturally forms graphs (e.g., splat neighbourhoods), suggesting graph neural networks or message‑passing schemes as alternatives.
Designing primitives that map directly onto clusters of devices could avoid the monolithic GPU bottleneck.
Early research at World Labs explores such kernels to better exploit the upcoming wave of specialized AI accelerators.
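As an illustration of splat neighbourhoods forming a graph, the following sketch builds a brute-force k-nearest-neighbour graph over splat centres and runs one round of mean-aggregation message passing. It is a generic graph-processing example under assumed names and data, not a description of World Labs' kernels.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]

def knn_graph(points: List[Point], k: int) -> Dict[int, List[int]]:
    """Connect each point to its k nearest neighbours (brute force)."""
    graph: Dict[int, List[int]] = {}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        graph[i] = others[:k]
    return graph

def message_pass(features: List[float], graph: Dict[int, List[int]]) -> List[float]:
    """One round: each node averages its own feature with its neighbours'."""
    return [
        (features[i] + sum(features[j] for j in nbrs)) / (1 + len(nbrs))
        for i, nbrs in sorted(graph.items())
    ]

centres = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
graph = knn_graph(centres, k=2)
smoothed = message_pass([1.0, 0.0, 0.0, 0.0], graph)
print(graph)
print(smoothed)
```

Each node touches only k neighbours, so the work is sparse and local, in contrast to the dense all-pairs products that matrix-multiplication hardware is built around.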

Quote

00:11:35

“We need wacky ideas… hardware scaling will not be infinite, we need new primitives beyond matrix multiplication.” — Fei-Fei Li

Quote

00:13:02

“Performance per watt from Hopper to Blackwell shows scaling limits; we need new architectures.” — Fei-Fei Li

Emergent capabilities require architectural innovation, not just scale

Larger models have shown surprising abilities (e.g., zero‑shot physics reasoning), but these are not guaranteed to appear with scale alone.
Introducing inductive biases—such as explicit geometry or physics modules—may accelerate the emergence of useful behaviours.
The team observes that certain capabilities appear only when the architecture aligns with the data modality.
Therefore, research must pursue both scaling laws and novel model designs to unlock true spatial intelligence.

Quote

00:28:11

“Scaling may bring emergent physics understanding; larger models show emergent capabilities.” — Fei-Fei Li

Quote

00:56:02

“Sequence‑to‑sequence may still be useful but we shouldn't discard what works.” — Fei-Fei Li

Talent and Ecosystem for Spatial AI

9 / 9

World Labs actively recruits researchers, engineers, and product people, emphasizing interdisciplinary expertise in graphics, physics, and AI. The founders view open challenges and a clear product vision as magnets for top talent, while also highlighting the need for public-sector resources to sustain academic breakthroughs.

#ecosystem4 #talent-identification3

World Labs seeks deep researchers, engineers, and product thinkers to build spatial AI

The hiring push targets experts in large‑scale model training, real‑time rendering, and productisation.
Engineers are needed to optimise training pipelines across thousands of GPUs.
Product designers will translate research breakthroughs into usable tools for creators and enterprises.
By advertising both the scientific challenge and immediate product impact, the company hopes to attract a broad talent pool.

Quote

00:58:31

“We are hungry for talent: deep researchers, engineers, product thinkers.” — Fei-Fei Li

Quote

00:58:31

“The AI community is moving from language models to spatial intelligence as a new frontier.” — Fei-Fei Li

Interdisciplinary expertise in physics, graphics, and AI is crucial

Understanding physical forces enables more realistic world generation.
Graphics knowledge ensures efficient rendering pipelines (e.g., splats, shaders).
AI research provides the learning algorithms that tie perception, generation, and interaction together.
World Labs’ staff composition reflects this blend, with hires from academia, industry, and graphics‑focused startups.

Quote

00:36:25

“Attaching mass and spring properties to splats enables physics simulation.” — Fei-Fei Li

Quote

00:29:43

“The AI field has a history of using GPUs originally for graphics then repurposed for AI.” — Fei-Fei Li

Open challenges and product roadmaps attract a diverse talent pool

Publishing datasets like Behavior signals a commitment to open research, which appeals to academic‑oriented engineers.
Simultaneously, the clear product vision (e.g., Marble for VFX, gaming, interior design) offers engineers tangible impact.
The dual focus creates career paths ranging from pure research to product engineering.
This strategy helps World Labs compete with larger AI labs for top talent.

Quote

00:05:03

“Open science still important… we announced an open dataset and benchmark called behavior.” — Fei-Fei Li

Quote

00:59:42

“Marble's advanced mode editing allows detailed changes; UI improvements needed.” — Fei-Fei Li

Agent-readable JSON index

{
  "memcast_version": "0.1",
  "episode":  {
    "id": "60iW8FZ7MJU",
    "title": "After LLMs: Spatial Intelligence and World Models — Fei-Fei Li & Justin Johnson, World Labs",
    "podcast": "Latent Space",
    "guest": "Fei-Fei Li and Justin Johnson",
    "host": "Alessio",
    "source_url": "https://www.youtube.com/watch?v=60iW8FZ7MJU",
    "duration_minutes": 61
  },
  "concepts":  [
    {
      "id": "compute-scaling-as-the-engine-of-progress",
      "title": "Compute Scaling as the Engine of Progress",
      "tags":  [
        "capital-scaling"
      ]
    },
    {
      "id": "open-science-vs-commercial-incentives",
      "title": "Open Science vs. Commercial Incentives",
      "tags":  [
        "funded-trading"
      ]
    },
    {
      "id": "spatial-intelligence-vs-language-intelligence",
      "title": "Spatial Intelligence vs. Language Intelligence",
      "tags":  [
        "spatial-intelligence",
        "cognitive-biases"
      ]
    },
    {
      "id": "marble-a-generative-3-d-world-model-and-product",
      "title": "Marble: A Generative 3-D World Model and Product",
      "tags":  [
        "3d-ai",
        "marketing-strategy"
      ]
    },
    {
      "id": "data-structures-for-3-d-worlds",
      "title": "Data Structures for 3-D Worlds",
      "tags":  []
    },
    {
      "id": "integrating-physics-into-world-models",
      "title": "Integrating Physics into World Models",
      "tags":  [
        "simulation",
        "dynamics"
      ]
    },
    {
      "id": "synthetic-data-for-embodied-ai-and-robotics",
      "title": "Synthetic Data for Embodied AI and Robotics",
      "tags":  [
        "automation",
        "simulation"
      ]
    },
    {
      "id": "future-model-architectures-beyond-transformers",
      "title": "Future Model Architectures Beyond Transformers",
      "tags":  [
        "distributed-compute",
        "model-architectures"
      ]
    },
    {
      "id": "talent-and-ecosystem-for-spatial-ai",
      "title": "Talent and Ecosystem for Spatial AI",
      "tags":  [
        "ecosystem",
        "talent-identification"
      ]
    }
  ]
}
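A minimal sketch of how an agent might consume the index above, using only the Python standard library. A two-concept excerpt is inlined so the example runs on its own; the tag-overlap query is just one illustrative use.

```python
import json
from collections import Counter

# A two-concept excerpt of the index above, inlined so the sketch runs alone.
index_text = """
{
  "memcast_version": "0.1",
  "concepts": [
    {"id": "integrating-physics-into-world-models",
     "title": "Integrating Physics into World Models",
     "tags": ["simulation", "dynamics"]},
    {"id": "synthetic-data-for-embodied-ai-and-robotics",
     "title": "Synthetic Data for Embodied AI and Robotics",
     "tags": ["automation", "simulation"]}
  ]
}
"""

index = json.loads(index_text)
# Count how many concepts carry each tag, then keep tags shared by several.
tag_counts = Counter(tag for c in index["concepts"] for tag in c["tags"])
shared = sorted(tag for tag, n in tag_counts.items() if n > 1)
print(shared)
```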