MemCast

How did diffusion LLMs get so fast?

Why diffusion language models break the token‑by‑token bottleneck of traditional LLMs and the training, inference, and system tricks that make them dramatically faster.

22m · Guest: Stefano Ermon · Host: Julia Turc

Sequential Bottleneck vs. Parallel Drafting

1 / 5

Traditional auto‑regressive models are limited by their need to generate tokens one after another, which forces inference latency to grow linearly with output length. Diffusion LLMs sidestep this by producing a full draft in parallel and then iteratively denoising it, collapsing the latency curve to a constant. The speed‑up only appears when the number of refinement steps stays low and each step costs roughly the same as a single auto‑regressive pass.

Auto‑regressive LLMs are fundamentally limited by sequential token generation, causing linear latency growth.
  • Each token is generated after the previous one, so the model must re‑ingest its own output at every step.
  • This creates a lower bound on inference time that scales linearly with the number of output tokens.
  • The sequential dependency prevents full GPU utilization because only one token is processed at a time.
  • As a result, adding more GPUs does not eliminate the bottleneck; the model remains bound by the token‑by‑token pipeline.
Traditional LLMs produce text responses sequentially, one token at a time until the special end-of-sequence token is generated. Stefano Ermon
This sequential dependency puts a lower bound on inference latency, which is linear in the number of output tokens. Stefano Ermon
Diffusion LLMs generate an entire response draft at once, turning inference complexity from linear to constant.
  • Instead of emitting tokens one by one, diffusion models start with a fully masked sequence and produce a complete gibberish draft.
  • Through a fixed number of denoising steps, the draft is progressively cleaned into coherent text.
  • Because the whole draft is processed in parallel, the runtime no longer depends on output length, only on the number of diffusion steps.
  • This leverages the full capacity of modern GPUs, delivering massive speed‑ups over auto‑regressive baselines.
They generate an entire response draft at once and refine it progressively, taking full advantage of the GPU capacity. Julia Turc
So diffusion LLMs bring down inference time complexity from linear in the number of output tokens to a constant. Stefano Ermon
The diffusion speed‑up only materializes when refinement steps are few and each step costs about the same as a single auto‑regressive pass.
  • Two conditions are required: (1) a low number of denoising steps, and (2) each diffusion step must be computationally cheap.
  • If many steps are needed, the constant‑time advantage disappears because the total work grows.
  • Likewise, if a diffusion step is more expensive than a regular AR pass, the parallelism advantage is offset.
  • Therefore, research focuses on training and inference tricks that shrink both the step count and per‑step cost.
But the speed-up promise only materializes if two things are true. The number of refinement steps has to be low, and each diffusion step has to cost roughly the same as a single auto-regressive pass. Stefano Ermon
If you need a lot of diffusion steps, then there is no benefit compared to an auto-regressive model. Stefano Ermon
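The trade-off above can be made concrete with a toy latency model. All numbers here are illustrative assumptions, not measurements from the episode:

```python
def ar_latency_ms(n_tokens: int, pass_ms: int) -> int:
    """Auto-regressive: one forward pass per output token -> linear in length."""
    return n_tokens * pass_ms

def diffusion_latency_ms(n_steps: int, step_ms: int) -> int:
    """Diffusion: one pass per denoising step -> independent of output length."""
    return n_steps * step_ms

# Illustrative cost of a single forward pass: 10 ms.
print(ar_latency_ms(500, 10))         # 500-token answer: 5000 ms
print(diffusion_latency_ms(8, 10))    # 8 denoising steps: 80 ms, at any length
print(diffusion_latency_ms(500, 10))  # too many steps: no faster than the AR model
```

The last line is the failure mode Ermon describes: if the step count approaches the output length, or a step costs more than an AR pass, the constant-time advantage evaporates.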

Training Tricks to Shrink Diffusion Paths

2 / 5

Because diffusion inference cost is tied to the number of denoising steps, researchers have devised training methods that keep long, information‑rich training trajectories while forcing the model to converge in far fewer steps at test time. Self‑distillation and curriculum learning are the two most effective approaches, letting a student model inherit the teacher’s knowledge and gradually learn to denoise from easier noise levels.

Self‑distillation can halve inference steps by training a student to mimic a teacher’s longer diffusion path.
  • A large teacher model is first trained on a full diffusion schedule (e.g., N steps).
  • The teacher’s intermediate states are frozen and used as supervision for a student that is fine‑tuned to skip steps (e.g., N→N/2).
  • The student aligns its states with every second teacher state, learning to achieve the same denoising quality in half the steps.
  • Repeating the process can further compress the path, yielding models that are twice as fast without loss in quality.
The end result is a model that is now twice as fast and just as good. Stefano Ermon
We can repeat the distillation process: duplicate the student, freeze one copy, and fine-tune the other to shrink the paths yet again. Stefano Ermon
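The pairing logic behind step-halving can be sketched with the teacher's trajectory reduced to a list of states. This is a toy illustration of the supervision targets, not the actual fine-tuning recipe:

```python
def halve_path(states):
    """Supervision targets for the student: every second teacher state.

    The student is then fine-tuned so its step i reproduces the teacher's
    state 2*i, covering two teacher steps in one. The list endpoints are
    shared, so the student starts from the same noise and must reach the
    same clean output.
    """
    return states[::2]

teacher = [f"state_{i}" for i in range(9)]  # N = 8 steps -> 9 states
student = halve_path(teacher)               # N/2 = 4 steps -> 5 states
grandstudent = halve_path(student)          # repeat the process: 2 steps
assert student[0] == teacher[0] and student[-1] == teacher[-1]
assert len(grandstudent) == 3
```

Repeating `halve_path` mirrors the duplicate-freeze-fine-tune loop Ermon describes: each round shrinks the inference path again.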
Curriculum learning eases diffusion training by starting with easy noise levels and gradually increasing difficulty, leading to more efficient single‑step denoising.
  • During training, each step randomly selects a noise level; early in training the model sees relatively clean inputs, making the denoising task simple.
  • As training progresses, higher‑noise examples are introduced, forcing the model to learn more powerful denoising operations.
  • This staged difficulty mirrors a school curriculum, improving robustness and allowing the model to achieve more progress per diffusion step.
  • Empirically, models trained with curriculum learning need fewer refinement steps at inference.
The philosophy behind curriculum learning is to ease the model into it, train it on easier tasks first with reasonably clean data. Stefano Ermon
It turns out this makes the model more robust and efficient, able to achieve more in a single step, and that indirectly leads to fewer refinement steps. Stefano Ermon
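A curriculum over noise levels can be sketched as a cap on corruption that grows with training progress. The linear schedule here is an assumption for illustration; the episode does not specify the exact shape:

```python
import random

def sample_noise_level(progress: float) -> float:
    """Curriculum over noise: cap corruption early, open up the range later.

    `progress` in [0, 1] is the fraction of training completed. Early on,
    only mostly-clean inputs are sampled (an easy denoising task); by the
    end, fully corrupted inputs appear too.
    """
    max_noise = 0.1 + 0.9 * progress  # illustrative linear ramp
    return random.uniform(0.0, max_noise)

print(sample_noise_level(0.0))  # early training: at most 10% corruption
print(sample_noise_level(1.0))  # late training: anywhere up to full noise
```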
Forward diffusion creates fine‑grained training signals, but long reverse paths hurt inference; balancing long training paths with short inference paths is key.
  • Forward diffusion gradually corrupts clean text into gibberish, providing many intermediate states that serve as supervision.
  • These intermediate steps make learning feasible because the model solves a series of simpler sub‑problems.
  • However, if the reverse diffusion mirrors the full forward schedule, inference would require the same many steps, negating speed gains.
  • Techniques like distillation and curriculum learning aim to keep the rich training signal while collapsing the inference path.
This takes a clean text and gradually corrupts tokens until it turns into gibberish. Stefano Ermon
Ideally, we'd want long training paths and short inference ones. Stefano Ermon
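The forward corruption process itself is simple to sketch for masked diffusion: each noise level masks a fraction of the clean tokens, and every intermediate level is a training example for the reverse direction:

```python
import random

MASK = "[MASK]"

def forward_diffusion(tokens, noise_level, rng):
    """Corrupt a clean sequence by independently masking each token
    with probability `noise_level`. The model is trained to invert this:
    recover the clean text from the partially masked version."""
    return [MASK if rng.random() < noise_level else t for t in tokens]

rng = random.Random(0)
clean = "the cat sat on the mat".split()
for level in (0.25, 0.5, 1.0):
    print(level, forward_diffusion(clean, level, rng))
# At level 1.0 every token is masked: the "gibberish" starting point
# from which reverse diffusion begins at inference time.
```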

Smart Sampling and Guided Diffusion

3 / 5

Even with fewer diffusion steps, the choice of which tokens to unmask at each iteration dramatically affects speed and quality. Confidence‑based remasking is a simple heuristic, but global consistency requires a higher‑level signal. Guided diffusion introduces a lightweight auto‑regressive verifier that corrects inconsistent tokens, delivering massive speed‑ups while preserving quality.

Confidence‑based remasking selects low‑certainty tokens but cannot resolve global token‑level inconsistencies.
  • Each token prediction includes a probability distribution; flat distributions indicate uncertainty.
  • The sampler remasks tokens with low confidence, allowing the next diffusion step to focus on them.
  • However, this approach treats tokens independently, so it cannot detect when two simultaneously unmasked tokens conflict (e.g., duplicate entity names).
  • Consequently, structural errors like repeated city names persist despite high confidence on individual tokens.
Sharp distributions signal high model confidence and flat distributions are good candidates for remasking. Stefano Ermon
Confidence scores can't fix this problem. They tell you how sure the model is about each token in isolation, but not whether the sentence makes sense as a whole. Stefano Ermon
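One common way to score "flatness" is the entropy of each token's distribution; other scores (e.g., top-1 probability) behave similarly. A minimal sketch of entropy-based remasking:

```python
import math

def entropy(probs):
    """Shannon entropy: low for sharp (confident) distributions,
    high for flat (uncertain) ones."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def remask_least_confident(token_probs, k):
    """Remask the k positions with the flattest distributions so the
    next denoising step focuses on them. Note: each position is scored
    in isolation, which is exactly why this heuristic cannot catch
    cross-token inconsistencies."""
    ranked = sorted(range(len(token_probs)),
                    key=lambda i: entropy(token_probs[i]),
                    reverse=True)
    return set(ranked[:k])

probs = [[0.97, 0.01, 0.01, 0.01],   # sharp: keep
         [0.25, 0.25, 0.25, 0.25],   # flat: remask
         [0.90, 0.05, 0.03, 0.02]]   # sharp: keep
print(remask_least_confident(probs, 1))  # {1}
```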
Guided diffusion uses a lightweight auto‑regressive supervisor to spot and fix inconsistencies, achieving up to 12× speed‑up over vanilla diffusion.
  • The supervisor receives the partially unmasked draft and predicts the next word only for the newly revealed positions.
  • Its predictions are compared to the diffusion model’s output; tokens with large disagreement are remasked.
  • Because the diffusion model already supplies a full draft, the supervisor can evaluate all mask positions in a single forward pass, adding minimal overhead.
  • Empirically, FlashDLM (a guided diffusion system) was 12 times faster than the baseline LLaDA while maintaining quality.
A reasonable way to do this is guided diffusion, proposed by FlashDLM: a lightweight auto-regressive model that supervises the unmasking process. Stefano Ermon
In fact, FlashDLM was 12 times faster than LLaDA. Stefano Ermon
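The disagreement check can be sketched as follows. In FlashDLM the supervisor is a trained lightweight AR model; here its scores are hypothetical inputs, and the threshold is an illustrative assumption:

```python
def guided_remask(verifier_probs, threshold=0.5):
    """Remask draft positions the AR verifier finds implausible.

    `verifier_probs[i]` is the probability the verifier assigns to the
    token the diffusion model placed at position i, given the draft's
    left context. Because the full draft already exists, all positions
    can be scored in one teacher-forcing-style forward pass. Low
    probability = disagreement = remask.
    """
    return [i for i, p in enumerate(verifier_probs) if p < threshold]

draft = ["flew", "to", "Paris", "and", "landed", "in", "Paris"]
# Hypothetical scores: the duplicated city looks fine token-by-token to
# the diffusion model, but the verifier finds it implausible in context.
verifier_probs = [0.9, 0.95, 0.8, 0.9, 0.85, 0.9, 0.05]
print(guided_remask(verifier_probs))  # [6]: remask the second "Paris"
```

This is the global signal that per-token confidence lacks: the verifier conditions on the rest of the sentence, so it can flag a token that is individually plausible but inconsistent with the whole.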
Smart samplers that prioritize high‑uncertainty tokens can drastically cut the number of refinement steps needed.
  • A sampler decides both how many tokens to commit and which ones to unmask at each step.
  • By selecting tokens with the highest uncertainty (flat distributions) early, the model resolves the hardest parts first, reducing the need for later corrections.
  • Random remasking is simple but sub‑optimal because it may waste steps on already‑confident tokens.
  • The combination of confidence scoring and strategic token selection yields fewer diffusion iterations while preserving output quality.
The sampler is the inference time algorithm that decides at each step how many tokens to commit and which ones to uncover. Stefano Ermon
A smart sampler can drastically reduce the number of refinement steps because operating on the right tokens at the right time will require fewer corrections in the future. Stefano Ermon
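A single sampler decision can be sketched as: commit the most confident masked positions, keep the rest masked for the next step. The fixed `k` per step is an illustrative policy; real samplers may adapt it:

```python
def sampler_step(confidences, masked_positions, k):
    """One sampler decision: lock in the k most confident masked
    positions and leave the others masked so the next denoising step
    can still revise them. Returns (commit, keep_masked)."""
    ranked = sorted(masked_positions,
                    key=lambda i: confidences[i], reverse=True)
    return ranked[:k], ranked[k:]

conf = {0: 0.9, 1: 0.3, 2: 0.7, 3: 0.2}
commit, keep = sampler_step(conf, [0, 1, 2, 3], k=2)
print(commit, keep)  # positions 0 and 2 are committed; 1 and 3 stay masked
```

Compared with random remasking, this spends each step's budget where it helps: already-confident tokens are committed once, and uncertain ones get more refinement passes.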

KV Caching Challenges and Block Diffusion

4 / 5

Key‑Value (KV) caching accelerates auto‑regressive inference by reusing past attention results, but diffusion models’ bidirectional attention invalidates caches across steps. Approximate prompt caching mitigates some overhead, while block diffusion introduces a semi‑autoregressive regime that restores exact KV caching per block and enables variable‑length generation.

KV caching works for auto‑regressive models due to causal attention, but diffusion models’ bidirectional attention makes traditional KV caching impossible.
  • In AR models, each token only attends to left‑hand context, so its key/value representations stay constant across future steps.
  • Diffusion models attend bidirectionally, meaning every token’s representation depends on the entire current draft, which changes after each unmasking.
  • Consequently, a cached key/value from a previous step becomes stale as soon as any token is updated, forcing a full recomputation.
  • This fundamental difference prevents direct reuse of existing AR serving engines for diffusion LLMs.
KV caching is feasible because of the causal attention. Stefano Ermon
But that is not the case for diffusion models. Stefano Ermon
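The cache-invalidation argument can be demonstrated with a stand-in for attention (here, a token's representation is just the mean of whatever it may attend to; real attention is weighted, but the dependency structure is the same):

```python
def toy_attend(embs, i, causal):
    """Stand-in for attention: token i's representation is the mean of
    the positions it is allowed to attend to."""
    ctx = embs[: i + 1] if causal else embs
    return sum(ctx) / len(ctx)

seq = [1.0, 2.0, 3.0]
cached = toy_attend(seq, 1, causal=True)

longer = seq + [100.0]  # a later token is added (or updated)
# Causal: token 1 still sees only positions 0..1 -> the cache is valid.
assert toy_attend(longer, 1, causal=True) == cached
# Bidirectional: token 1 now sees the new token -> the cache is stale.
assert toy_attend(longer, 1, causal=False) != toy_attend(seq, 1, causal=False)
```

This is the whole story in miniature: causal attention makes past representations immutable, bidirectional attention makes every representation a function of the entire, changing draft.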
Approximate caching of prompt embeddings reduces recomputation because prompt semantics change little across denoising steps.
  • Studies show that the embeddings of the prompt portion remain almost unchanged throughout the diffusion process, even though the response evolves.
  • By caching these stable prompt embeddings once and reusing them, the model avoids redundant matrix multiplications at each step.
  • Some works (e.g., dLLM‑Cache) refresh the cache periodically (e.g., every 100 steps) to balance staleness and speed.
  • This approximation yields noticeable speed gains without materially affecting output quality.
Empirically, studies have shown that the prompt embeddings change very little across consecutive denoising steps. Stefano Ermon
I'm calling this cache approximate because the values are only an approximation of the fresh ones we're avoiding to recompute. Stefano Ermon
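The refresh policy can be sketched as a cache keyed on the denoising step counter. The 100-step interval follows the example in the text but is still an illustrative setting, and `compute_fn` stands in for the expensive forward pass over the prompt:

```python
class ApproxPromptCache:
    """Sketch of dLLM-Cache-style approximate prompt caching.

    Prompt activations are computed once, reused across denoising steps,
    and recomputed every `refresh_every` steps to limit staleness."""

    def __init__(self, compute_fn, refresh_every=100):
        self.compute_fn = compute_fn
        self.refresh_every = refresh_every
        self.value = None

    def get(self, step, prompt):
        if self.value is None or step % self.refresh_every == 0:
            self.value = self.compute_fn(prompt)  # expensive, rare
        return self.value                         # cheap, common

calls = []
cache = ApproxPromptCache(lambda p: calls.append(p) or len(p), refresh_every=100)
for step in range(250):
    cache.get(step, "the prompt")
print(len(calls))  # 3: recomputed only at steps 0, 100, and 200
```

Between refreshes the cached values are approximations of the fresh ones, which is acceptable precisely because the prompt embeddings drift so little across steps.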
Block diffusion (semi‑autoregressive generation) restores exact KV caching by generating blocks sequentially while diffusing within each block.
  • The context window is split into fixed‑size blocks; each block is processed with diffusion, but blocks themselves are generated left‑to‑right like an AR model.
  • Within a block, tokens attend bidirectionally to each other, but only causally to previous blocks, preserving the validity of cached keys/values for completed blocks.
  • After a block finishes, its final activations are cached and reused for subsequent blocks, achieving exact KV caching without approximation.
  • This hybrid approach also enables variable‑length generation, stopping early once an end‑of‑sequence token appears, further improving efficiency.
The trick is to bring back auto regression. Within a single block, tokens are generated with diffusion, but blocks themselves are generated sequentially from left to right. Stefano Ermon
Once a block is finalized, the activations in its last denoising step can be cached and reused by future blocks. Stefano Ermon
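The attention pattern that makes exact caching possible can be written as a block-causal mask. A minimal sketch:

```python
def block_causal_mask(n_tokens, block_size):
    """mask[i][j] is True if token i may attend to token j.

    Within a block: full bidirectional attention (j in the same block).
    Across blocks: causal (only earlier blocks). A finished block's
    keys/values therefore never change and can be cached exactly."""
    return [[(j // block_size) <= (i // block_size) for j in range(n_tokens)]
            for i in range(n_tokens)]

m = block_causal_mask(6, block_size=3)
assert m[0][2]        # within block 0: may attend forward
assert m[4][5]        # within block 1: bidirectional
assert not m[1][3]    # block 0 never sees block 1 (causal across blocks)
assert m[5][0]        # block 1 sees all of the finished block 0
```

Because no token in block k attends to block k+1, finalizing block k freezes its activations, which is exactly the property KV caching needs.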

Future Outlook: Diffusion as the Dominant Paradigm

5 / 5

The speaker predicts diffusion language models will overtake auto‑regressive architectures for text, code, and other discrete generation tasks. Their superior scaling at inference time, combined with emerging open‑source and commercial offerings, makes serving efficiency the primary competitive edge.

Diffusion LLMs are expected to replace auto‑regressive models as the leading paradigm for text and code generation.
  • The speaker states that diffusion models will become the primary technology not only for images and video but also for discrete objects like text and code.
  • Early benchmarks (e.g., Mercury Coder) already show 5× speed improvements over similarly sized AR models.
  • As research pushes the gap wider, diffusion models will likely dominate production workloads where latency matters.
  • This shift will reshape the AI ecosystem, influencing both open‑source projects and commercial APIs.
We think of them as a replacement for sure. Stefano Ermon
I see a future where diffusion models will become the leading paradigm, not just for image and video generation, but also for discrete objects like text and code. Stefano Ermon
Diffusion models scale better at inference because parallelism reduces per‑token latency, making serving efficiency the key competitive factor.
  • Unlike AR models where latency grows with token count, diffusion inference time is roughly constant, allowing larger outputs without proportional slowdown.
  • This property means that hardware utilization and system‑level optimizations (e.g., block diffusion, approximate caching) become the main levers for performance gains.
  • The speaker emphasizes that “the only thing that matters is how efficiently you can serve these models at test time.”
  • Consequently, companies will compete on inference infrastructure, latency budgets, and cost‑per‑token rather than raw model size alone.
I think they're fundamentally better in terms of the way they scale at inference time. Stefano Ermon
And we're getting to a point where the only thing that matters is really how efficiently you can serve these models at test time, at inference time. Stefano Ermon
Open‑source libraries and commercial APIs are rapidly emerging, lowering the barrier to experiment with diffusion LLMs.
  • The host points listeners to HuggingFace tags like “dllm” or “llada” where over a hundred open‑source diffusion models are available.
  • Inception offers commercial endpoints compatible with OpenAI’s API, as well as specialized services (e.g., fill‑in‑the‑middle for code).
  • This ecosystem democratizes access, allowing developers to test diffusion models without building their own training pipelines.
  • As more tools appear, adoption will accelerate, reinforcing diffusion’s trajectory toward mainstream use.
HuggingFace is a great way to find open source models. You can look for tags like "dllm" or "llada" and find more than a hundred models. Julia Turc
If you're looking for a commercial API, Inception offers multiple endpoints for their Mercury models. Julia Turc
⚙ Agent-readable JSON index
{
  "memcast_version": "0.1",
  "episode":  {
    "id": "-VGeHZqOk_s",
    "title": "How did diffusion LLMs get so fast?",
    "podcast": "Julia Turc",
    "guest": "Stefano Ermon",
    "host": "Julia Turc",
    "source_url": "https://www.youtube.com/watch?v=-VGeHZqOk_s",
    "duration_minutes": 22
  },
  "concepts":  [
    {
      "id": "sequential-bottleneck-vs-parallel-drafting",
      "title": "Sequential Bottleneck vs. Parallel Drafting",
      "tags":  []
    },
    {
      "id": "training-tricks-to-shrink-diffusion-paths",
      "title": "Training Tricks to Shrink Diffusion Paths",
      "tags":  []
    },
    {
      "id": "smart-sampling-and-guided-diffusion",
      "title": "Smart Sampling and Guided Diffusion",
      "tags":  []
    },
    {
      "id": "kv-caching-challenges-and-block-diffusion",
      "title": "KV Caching Challenges and Block Diffusion",
      "tags":  []
    },
    {
      "id": "future-outlook-diffusion-as-the-dominant-paradigm",
      "title": "Future Outlook: Diffusion as the Dominant Paradigm",
      "tags":  []
    }
  ]
}