Transformers treat inputs as sets of tokens, which works well for language but is sub-optimal for spatial data that lives in 3-D. The discussion highlights the need for new primitives that map better to distributed hardware and for architectures that can capture physical laws implicitly.