MemCast
Transformers model sets of tokens, not sequences, which limits spatial reasoning
  • The transformer’s attention mechanism is permutation‑equivariant; order is injected only via positional embeddings (see the sketch below).
  • For 3‑D data, a simple linear order (e.g., raster scan) discards spatial relationships.
  • Treating voxels or splats as an unordered set forces the model to learn geometry from scratch, increasing data requirements.
  • Re‑thinking the input representation could yield more efficient spatial models.
Fei-Fei Li · Latent Space · 00:56:39
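
A minimal sketch of the permutation‑equivariance claim (NumPy; the `attention` helper and the Wq/Wk/Wv projection matrices are illustrative, not from the episode): single‑head scaled dot‑product attention without positional embeddings gives the same result whether the tokens are permuted before or after the layer.

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a (n_tokens, d) matrix X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.normal(size=(n, d))                             # 6 tokens, no positional information
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)                               # arbitrary reordering of the tokens
same = np.allclose(attention(X, Wq, Wk, Wv)[perm],      # permute the output...
                   attention(X[perm], Wq, Wk, Wv))      # ...or permute the input first
print(same)                                             # True: token order carries no information
```

Adding positional embeddings to X before the projections is exactly what breaks this symmetry and injects order.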

Supporting quotes

"Transformers are models of sets, not sequences; positional embeddings inject order." (Fei-Fei Li)
"Transformers are natively permutation equivariant; they treat tokens as a set." (Fei-Fei Li)

From this concept

Future Model Architectures Beyond Transformers

Transformers treat inputs as sets of tokens, which works well for language but is sub-optimal for spatial data that lives in 3-D. The discussion highlights the need for new primitives that map better to distributed hardware and for architectures that can capture physical laws implicitly.
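
To make the 3‑D point concrete, here is a small sketch (pure Python; the grid size and the raster_index helper are illustrative assumptions) of why a raster‑scan linearization hides spatial structure: voxels that are immediate neighbors in 3‑D can land dozens or thousands of positions apart in the token sequence.

```python
D = 32                                        # a 32 x 32 x 32 voxel grid (illustrative size)

def raster_index(x, y, z, dim=D):
    """Token position of voxel (x, y, z) under a raster scan: x fastest, then y, then z."""
    return x + dim * (y + dim * z)

a = (5, 5, 5)
for b in [(6, 5, 5), (5, 6, 5), (5, 5, 6)]:   # the three +1 neighbors of voxel `a`
    gap = abs(raster_index(*a) - raster_index(*b))
    print(f"{a} -> {b}: 3-D distance 1, token distance {gap}")
# prints token distances 1, 32, and 1024: only the x-neighbor stays adjacent,
# so adjacency along y and z must be re-learned from data.
```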
