MemCast
Human perception is inherently spatial; language is a lossy channel
  • Humans are born with visual perception and basic spatial abilities, whereas language takes years to acquire.
  • The visual system takes in a continuous, high-bandwidth stream of visual data, far richer than the roughly 215,000 tokens a person would produce by speaking all day.
  • Translating rich 3‑D experience into discrete tokens inevitably discards geometry, depth, and physical cues.
  • Therefore, relying solely on language models limits an AI’s ability to act in the real world.
Fei-Fei Li, Latent Space, 00:47:33

Supporting quotes

We are born with visual perception and spatial abilities; language acquisition is harder. (Fei-Fei Li)
The token count of speaking all day is about 215,000 tokens, far less than the bandwidth of our world. (Fei-Fei Li)
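
The scale of the gap in the second quote can be illustrated with a back-of-envelope calculation. The sketch below is not from the episode: the speaking rate, tokens-per-word ratio, waking hours, frame size, and frame rate are all assumed values chosen for illustration, and a pixel is not an equivalent unit of information to a token, so the result is only an order-of-magnitude comparison.

```python
# Back-of-envelope comparison of a day's speech with a day's raw visual input.
# Every constant below is an illustrative assumption, not a figure from the episode.

WAKING_HOURS = 16             # assumed hours awake per day
WORDS_PER_MINUTE = 150        # assumed conversational speaking rate
TOKENS_PER_WORD = 1.5         # assumed ratio; real tokenizers vary

PIXELS_PER_FRAME = 1_000_000  # assumed 1-megapixel stand-in for one visual "frame"
FRAMES_PER_SECOND = 30        # assumed frame rate


def speech_tokens_per_day() -> int:
    """Tokens produced by speaking nonstop for all waking hours."""
    minutes = WAKING_HOURS * 60
    return round(minutes * WORDS_PER_MINUTE * TOKENS_PER_WORD)


def pixel_samples_per_day() -> int:
    """Raw pixel samples in a camera-like stream over all waking hours."""
    seconds = WAKING_HOURS * 3600
    return seconds * FRAMES_PER_SECOND * PIXELS_PER_FRAME


if __name__ == "__main__":
    tokens = speech_tokens_per_day()
    pixels = pixel_samples_per_day()
    print(f"speech tokens/day: {tokens:,}")
    print(f"pixel samples/day: {pixels:,}")
    print(f"ratio: {pixels / tokens:,.0f}x")
```

Under these assumptions the speech estimate lands near the quoted figure of about 215,000 tokens, while the raw visual stream is several orders of magnitude larger. Changing any of the assumed constants shifts the ratio but not the overall conclusion.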

From this concept

Spatial Intelligence vs. Language Intelligence

The conversation distinguishes spatial reasoning (understanding, moving, and interacting in 3-D space) from linguistic processing. Humans are born with high-bandwidth visual perception, while language is a comparatively low-bandwidth, symbolic channel. Building AI that matches human spatial acuity requires models that go beyond token-by-token prediction.

