MemCast
Transformers model sets of tokens, not sequences, which limits spatial reasoning
  • The transformer’s attention mechanism is permutation‑equivariant; order is injected only via positional embeddings (see the sketch below).
  • For 3‑D data, a simple linear order (e.g., raster scan) discards spatial relationships.
  • Treating voxels or splats as an unordered set forces the model to learn geometry from scratch, increasing data requirements.
  • Re‑thinking the input representation could yield more efficient spatial models.
Fei-Fei Li · Latent Space · 00:56:39
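
A minimal sketch of the permutation‑equivariance claim (NumPy; the `attention` helper and the Wq/Wk/Wv projection matrices are illustrative, not from the episode): single‑head scaled dot‑product attention without positional embeddings gives the same result whether the tokens are permuted before or after the layer.

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a (n_tokens, d) matrix X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.normal(size=(n, d))                             # 6 tokens, no positional information
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)                               # arbitrary reordering of the tokens
same = np.allclose(attention(X, Wq, Wk, Wv)[perm],      # permute the output...
                   attention(X[perm], Wq, Wk, Wv))      # ...or permute the input first
print(same)                                             # True: token order carries no information
```

Adding positional embeddings to X before the projections is exactly what breaks this symmetry and injects order.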

Supporting quotes

"Transformers are models of sets, not sequences; positional embeddings inject order." (Fei-Fei Li)
"Transformers are natively permutation equivariant; they treat tokens as a set." (Fei-Fei Li)

From this concept

Future Model Architectures Beyond Transformers

Transformers treat inputs as sets of tokens, which works well for language but is sub-optimal for spatial data that lives in 3-D. The discussion highlights the need for new primitives that map better to distributed hardware and for architectures that can capture physical laws implicitly.
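
To make the 3‑D point concrete, here is a small sketch (pure Python; the grid size and the raster_index helper are illustrative assumptions) of why a raster‑scan linearization hides spatial structure: voxels that are immediate neighbors in 3‑D can land dozens or thousands of positions apart in the token sequence.

```python
D = 32                                        # a 32 x 32 x 32 voxel grid (illustrative size)

def raster_index(x, y, z, dim=D):
    """Token position of voxel (x, y, z) under a raster scan: x fastest, then y, then z."""
    return x + dim * (y + dim * z)

a = (5, 5, 5)
for b in [(6, 5, 5), (5, 6, 5), (5, 5, 6)]:   # the three +1 neighbors of voxel `a`
    gap = abs(raster_index(*a) - raster_index(*b))
    print(f"{a} -> {b}: 3-D distance 1, token distance {gap}")
# prints token distances 1, 32, and 1024: only the x-neighbor stays adjacent,
# so adjacency along y and z must be re-learned from data.
```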
