Five diffusion papers worth reading today (May 26, 2026)
May 26, 2026 · 9:25 AM

Tuesday's batch is the most architecturally diverse single-day selection this week. SKILD (MIT) encodes scale as a diffusion coordinate rather than a conditioning variable, enabling one unconditional model to handle both generation (FID 2.65 on CIFAR-10) and 2×–8× super-resolution. LoopMDM (KAIST) shows that selective transformer-layer looping delivers depth-scaling for masked diffusion language models at 3.3× fewer training FLOPs and +8.5 GSM8K points. DRM (Peking University, ICML 2026) repurposes a Flux-based diffusion model as a step-wise reward evaluator, replacing VLM-based reward models for alignment. A UC Berkeley theory paper (Malik, Abbeel et al.) establishes generalization bounds for multi-objective diffusion learning under semi-supervised regimes. Paris 2.0 demonstrates that video diffusion training, typically assumed to require monolithic GPU clusters, can be decentralized, cutting FVD from 561.04 to 279.01 at matched compute.

Tuesday's batch is unusually diverse in problem class. SKILD (MIT) redesigns the forward process itself to unify generation and super-resolution in a single unconditional model. LoopMDM (KAIST) asks whether masked diffusion language models can borrow depth from looping rather than adding parameters. DRM (Peking University, ICML 2026) turns a generation model inside-out to use it as a reward evaluator. A UC Berkeley theory paper formalizes what happens statistically when one diffusion model must serve multiple objectives. And Paris 2.0 asks whether video diffusion training requires a monolithic GPU cluster at all — and answers no.

1. SKILD: MIT unifies generation and continuous super-resolution without task-specific architecture

ArXiv: 2605.26032 | Zixin Jessie Chen, Zhuo Chen, Archer Wang, Jeff Gore, William T. Freeman, Congyue Deng, Marin Soljačić | cs.CV | MIT
Peer-review status: Preprint. Code available at github.com/JazzyCH/SKILD.
The standard approach to combining generation and super-resolution is to train two separate models — or one conditional model that takes a scale factor as input. SKILD (Scale-Invariant K-space Image Learning Diffusion) takes a different route: it trains a single unconditional diffusion model and routes both tasks through it by varying only the starting timestep. Generation starts from pure noise; super-resolution starts from a partially noised version of the low-resolution input, at a timestep calibrated to the target scale. No conditioning branch, no classifier-free guidance, no per-scale retraining. 1
The mechanism rests on scale invariance in natural image statistics. Natural images exhibit power-law decay in their variance spectra — a property the authors verify across CIFAR-10, ImageNet-128, and ImageNet-256. SKILD's forward process is designed to match this: it attenuates image content from fine to coarse scales while injecting spectrum-matched Gaussian noise, so scale becomes an explicit coordinate of the diffusion dynamics rather than a conditioning variable. The authors describe this as shifting the modeling burden from conditional mappings and task-specific architectures onto the design of the forward process itself. 1
Figure: SKILD forward process on a self-similar fractal image. (a) Effective signal resolution decreases; (b) injected noise correlation length increases; (c) the two trade off so that the process obeys scale invariance. 1
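To make the "scale as a diffusion coordinate" idea concrete, here is a minimal NumPy sketch of a forward process of that general shape. It is not the authors' implementation: the hard low-pass mask, the geometric cutoff schedule, the exponent `alpha`, and every function name (`radial_frequency`, `cutoff`, `forward_sample`, `start_time_for_scale`) are assumptions made for illustration. Fine scales are removed first and refilled with Gaussian noise whose power spectrum follows a power law, and a super-resolution start time is chosen so the cutoff matches the band limit of the low-resolution input.

```python
import numpy as np

def radial_frequency(n):
    """Magnitude |k| of each 2D FFT frequency on an n x n grid (cycles/pixel)."""
    f = np.fft.fftfreq(n)
    kx, ky = np.meshgrid(f, f, indexing="ij")
    return np.sqrt(kx ** 2 + ky ** 2)

def cutoff(t, n):
    """Hypothetical schedule: cutoff shrinks geometrically from k_max (t=0) to 1/n (t=1)."""
    k_max = radial_frequency(n).max()
    k_min = 1.0 / n
    return k_max * (k_min / k_max) ** t

def forward_sample(image, t, alpha=2.0, sigma=1.0, rng=None):
    """Toy forward sample at time t in [0, 1] for a square grayscale image.

    Frequencies above cutoff(t) are dropped from the signal and replaced by
    Gaussian noise with power spectrum proportional to |k|^-alpha.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = image.shape[0]
    k = radial_frequency(n)
    keep = k <= cutoff(t, n)  # surviving (coarse) scales

    # Shape white noise in k-space so its power spectrum is a power law.
    amp = np.zeros_like(k)
    amp[k > 0] = k[k > 0] ** (-alpha / 2.0)
    white = np.fft.fft2(rng.standard_normal((n, n)))
    noise_k = sigma * amp * white * (~keep)

    signal_k = np.fft.fft2(image) * keep
    return np.fft.ifft2(signal_k + noise_k).real

def start_time_for_scale(scale, n):
    """Time at which the cutoff equals k_max / scale (assumed calibration rule)."""
    k_max = radial_frequency(n).max()
    return np.log(scale) / np.log(k_max * n)

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))          # stand-in for a real image
t_sr = start_time_for_scale(4, 32)             # start time for 4x super-resolution
noised = forward_sample(image, t_sr, rng=rng)  # what a reverse process would start from
```

Under this schedule, generation would start at `t = 1` (almost nothing but noise) and larger super-resolution factors map to later start times, which mirrors the article's description that only the starting timestep changes between tasks.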
Benchmark results: On unconditional CIFAR-10 generation, SKILD achieves FID 2.65 and Inception Score 9.63, competitive with DDPM and EDM baselines. On ImageNet, the same checkpoint performs 2×–8× super-resolution, outperforming conditional models (BSRGAN and others) across PSNR, SSIM, LPIPS, CLIPIQA, and MUSIQ. A physics validation experiment on critical Ising model configurations — where connected four-point correlations closely track ground truth — extends the claim beyond natural images. 1
Code/resources: github.com/JazzyCH/SKILD
Why read it: The transferable idea is treating scale as a diffusion coordinate rather than a conditioning input. That reframing applies to any downstream task where scale variation is a deployment reality — medical imaging, satellite imagery, generative SR pipelines. The physics validation is an unusual credibility check: if the model's spectra match Ising correlations, the scale invariance claim has support outside the benchmark regime.

2. LoopMDM: KAIST cuts masked diffusion LM training FLOPs by 3.3× with selective layer looping

ArXiv: 2605.26106 | Sanghyun Lee, Chunsan Hong, Seungryong Kim, Jonghyun Lee, Jongho Park, Dongmin Park | cs.LG | KAIST
Peer-review status: Preprint. Code and weights to be publicly released.
Masked Diffusion Models (MDMs) for language — models that generate text by iteratively unmasking tokens across multiple denoising rounds — have remained architecturally conservative compared to their autoregressive counterparts. LoopMDM's premise is that depth is what MDMs are missing, and that depth can be obtained without adding parameters: selectively looping the early-to-middle transformer layers forces the model to process its input multiple times per denoising step, producing a depth-scaling effect from a shallow stack. 2
Two knobs control the behavior. At training time, looping reduces the FLOPs required to reach a target test negative log-likelihood: the model trains on matched-compute budgets and reaches the same NLL faster than a non-looped baseline. At inference time, the loop count can be increased beyond the training configuration, providing a dial for compute-quality trade-offs after the model is already trained. An attention analysis reveals why the loop count matters: looping promotes attention interactions among masked positions — positions that, in a single-pass model, never directly attend to each other within a step. 2
Figure: LoopMDM overview in three panels. (Left) selective looping on early-to-middle denoising layers; (center) matched-compute training curves showing faster NLL convergence; (right) GSM8K accuracy improving monotonically with more inference loops. 2
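The looping mechanism itself is simple to sketch. The snippet below is an illustration under stated assumptions, not the LoopMDM code (which is not yet released): the toy residual layers stand in for transformer blocks, and the names `make_layer`, `looped_forward`, `effective_depth`, `loop_start`, `loop_end`, and `n_loops` are hypothetical. It shows a contiguous early-to-middle span of layers being applied several times per denoising step while the remaining layers run once.

```python
import numpy as np

def make_layer(dim, rng):
    """Toy residual layer x + tanh(x @ W); a stand-in for a transformer block."""
    w = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    return lambda x: x + np.tanh(x @ w)

def looped_forward(x, layers, loop_start, loop_end, n_loops):
    """Apply layers[loop_start:loop_end] n_loops times; all other layers once."""
    for layer in layers[:loop_start]:
        x = layer(x)
    for _ in range(n_loops):
        for layer in layers[loop_start:loop_end]:
            x = layer(x)
    for layer in layers[loop_end:]:
        x = layer(x)
    return x

def effective_depth(n_layers, loop_start, loop_end, n_loops):
    """Number of layer applications per forward pass; parameter count is unchanged."""
    return n_layers + (n_loops - 1) * (loop_end - loop_start)

rng = np.random.default_rng(0)
layers = [make_layer(8, rng) for _ in range(6)]
tokens = rng.standard_normal((4, 8))  # 4 positions, hidden size 8

# Same weights, different loop counts: the inference-time dial.
out_train_cfg = looped_forward(tokens, layers, loop_start=1, loop_end=4, n_loops=2)
out_more_loops = looped_forward(tokens, layers, loop_start=1, loop_end=4, n_loops=4)
```

Because `n_loops` is just an argument, it can be raised after training without touching the weights, which is the compute-quality trade-off the article describes. Which layers to loop and how many times are design choices the sketch leaves open.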
Benchmark results: LoopMDM matches same-size non-looped MDMs with 3.3× fewer training FLOPs. On GSM8K (a math reasoning benchmark), it surpasses deeper non-looped MDMs by +8.5 points at comparable per-step compute. Generative perplexity decreases monotonically with loop count: from 116.71 ± 2.72 at S=1 to 42.56 ± 1.09 at S=4 (1024 denoising steps). Zero-shot perplexity improves across PTB, WikiText, LM1B, Lambada, AG News, PubMed, and ArXiv benchmarks. 2
Code/resources: Not yet released; the authors indicate public release is planned.
Why read it: The training FLOPs reduction (3.3×) and the inference-time dial are independent benefits — you get both from the same architectural change. For teams training MDMs under compute constraints, this is a training-side intervention with no penalty at inference if you keep the loop count fixed. The GSM8K improvement also suggests that looping is doing something specifically useful for multi-step reasoning, which matters if you're targeting coding or math tasks with discrete diffusion.

3. DRM: Peking University turns a diffusion model into its own reward evaluator (ICML 2026)

ArXiv: 2605.25661 | Jaxon Zhang, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu | cs.CV | Peking University
Peer-review status: Accepted at ICML 2026. Code available at github.com/jjaxonx/DRM.
Reward models for diffusion model alignment (HPSv3, PickScore, ImageReward) are built on VLMs (Vision-Language Models) fine-tuned for preference prediction. The DRM argument is that VLMs are the wrong backbone: they are pre-trained for semantic alignment, not for the aesthetic and compositional attributes that drive human preference at the pixel level. The reasoning is that a model that can generate high-fidelity images must, by virtue of having learned to do so, possess an implicit understanding of those attributes. DRM (Diffusion-based Reward Model) operationalizes this by removing the last three transformer layers from a pre-trained Flux-based diffusion model and training the resulting architecture on preference data (HPDv3, Pick-A-Pic, and ImageReward subsets). 3
The structural advantage is step-wise evaluation. Standard reward models treat generation as a black box: they score only the final output. DRM can score any noisy intermediate latent at any denoising stage — it was trained on those latents, so it has representations for them. This enables two downstream uses. Step-wise GRPO provides dense per-step rewards during reinforcement learning training, addressing the credit-assignment problem that arises when a single terminal reward must propagate back through dozens of denoising steps. Step-wise Sampling uses DRM as an inference-time guide: at each step, multiple candidate continuations are scored and the best one is carried forward. 3
Figure: DRM vs. VLM-based reward models. Existing RMs provide only a terminal reward on the final image; DRM evaluates at any intermediate denoising stage. 3
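Step-wise Sampling, as described above, amounts to a greedy best-of-N search over denoising steps. The sketch below shows only that control flow and is not taken from the DRM repository: `stepwise_best_of_n`, `toy_denoise_step`, and `toy_reward` are hypothetical names, and the toy denoiser and reward are placeholders for a real sampler step and a reward model that can score noisy intermediates.

```python
import numpy as np

def stepwise_best_of_n(x, denoise_step, reward, n_steps, n_candidates, rng):
    """At each step, draw several candidate continuations and keep the best-scored one.

    denoise_step(x, t, rng) -> candidate for the state after step t (stochastic)
    reward(x, t)            -> scalar score for an intermediate state at time t
    """
    for t in range(n_steps, 0, -1):
        candidates = [denoise_step(x, t, rng) for _ in range(n_candidates)]
        scores = [reward(c, t - 1) for c in candidates]
        x = candidates[int(np.argmax(scores))]
    return x

# Toy stand-ins (assumptions for illustration only).
def toy_denoise_step(x, t, rng):
    """Contract toward zero and add fresh noise, like a stochastic sampler step."""
    return 0.8 * x + 0.1 * rng.standard_normal(x.shape)

def toy_reward(x, t):
    """Prefer states close to an arbitrary target value of 1.0."""
    return -float(np.sum((x - 1.0) ** 2))

rng = np.random.default_rng(0)
x_start = rng.standard_normal(16)
sample = stepwise_best_of_n(x_start, toy_denoise_step, toy_reward,
                            n_steps=10, n_candidates=4, rng=rng)
```

The cost is `n_steps * n_candidates` denoiser calls plus as many reward calls, so the candidate count is the knob that trades inference compute for guidance strength. A terminal-only reward model could not be used this way, because it has no meaningful score for the noisy intermediates.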
Benchmark results: Experiments on SD3.5-Medium (Stable Diffusion 3.5 Medium) show DRM-optimized generations achieve superior visual quality compared to HPSv3, PickScore, and ImageReward baselines. Quantitative comparisons from the preference tables were not fully accessible in HTML-rendered form; the ICML 2026 acceptance provides a credibility check that the methodology was independently reviewed. 3
Code/resources: github.com/jjaxonx/DRM
Why read it: The backbone-substitution argument is the idea to stress-test. If a generation model's internal representations genuinely contain more perceptual signal than a VLM fine-tuned on preference data, the principle extends: any sufficiently capable generative model could serve as a better reward model for its own domain than an external discriminator trained separately. Step-wise GRPO is also directly practical — teams running GRPO on diffusion models have reported instability from sparse terminal rewards, and a dense per-step signal from the same architectural family is a low-overhead fix to try.

4. Multi-objective diffusion learning: UC Berkeley establishes a statistical theory for Pareto trade-offs

ArXiv: 2605.25210 | Ziheng Cheng, Yixiao Huang, Hanlin Zhu, Haoran Geng, Somayeh Sojoudi, Jitendra Malik, Pieter Abbeel, Xin Guo | cs.LG | UC Berkeley
Peer-review status: Preprint (submitted 2026-05-24). No code repository confirmed at time of writing.
A deployed diffusion model rarely serves one distribution. A text-to-image model must handle diverse prompt domains. A robotic diffusion policy must generalize across environments. Each condition defines a different target distribution, and reaching a good Pareto trade-off across all of them simultaneously requires model capacity, which in turn drives up sample complexity. This paper develops the statistical theory for that trade-off, formalized as multi-objective learning (MOL) of conditional diffusion models under a semi-supervised regime. 4
The key result: the number of paired samples required for the generalist model to reach a target error bound depends only on the complexity of the specialist models, not the generalist. This matters because the generalist, by absorbing pseudo-samples generated from abundant unlabeled condition data, can grow in capacity without proportionally increasing the paired-data requirement. The two-stage procedure is: (1) fit lightweight specialists from limited paired data; (2) generate pseudo-samples from abundant unlabeled condition data, then distill into a generalist. The theory extends to diffusion policies for sequential decision-making, accounting for distribution shift between training and on-policy rollouts. 4
Figure: Semi-supervised multi-objective framework. Limited paired data feeds specialist fitting; the specialists then generate pseudo-samples from abundant unlabeled condition data, which feed generalist distillation. 4
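The two-stage procedure can be sketched with a deliberately simple stand-in. In the toy below, linear regressors replace conditional diffusion models, so it illustrates only the data flow and not the paper's method or its bounds. All names (`fit_specialists`, `pseudo_label`, `distill_generalist`) and the synthetic data are assumptions.

```python
import numpy as np

def fit_specialists(paired):
    """Stage 1: one lightweight model per task, fit on scarce paired (condition, sample) data.

    Each specialist here is a single least-squares slope a_k in x = a_k * c.
    """
    return {k: float(c @ x / (c @ c)) for k, (c, x) in paired.items()}

def pseudo_label(specialists, task_ids, conds, noise_std, rng):
    """Stage 2a: generate pseudo-samples for abundant unlabeled conditions."""
    slopes = np.array([specialists[k] for k in task_ids])
    return slopes * conds + noise_std * rng.standard_normal(conds.shape)

def distill_generalist(task_ids, conds, pseudo_x, n_tasks):
    """Stage 2b: fit ONE shared model on the pseudo-samples from all tasks."""
    design = np.zeros((len(conds), n_tasks))
    design[np.arange(len(conds)), task_ids] = conds  # task-indexed feature
    weights, *_ = np.linalg.lstsq(design, pseudo_x, rcond=None)
    return weights

rng = np.random.default_rng(0)
true_slopes = [1.0, -2.0]  # two hypothetical objectives

# Scarce paired data: 10 (condition, sample) pairs per task.
paired = {}
for k, a in enumerate(true_slopes):
    c = rng.standard_normal(10)
    paired[k] = (c, a * c + 0.1 * rng.standard_normal(10))
specialists = fit_specialists(paired)

# Abundant unlabeled conditions: 2000 (task, condition) pairs with no samples attached.
task_ids = rng.integers(0, 2, size=2000)
conds = rng.standard_normal(2000)
pseudo_x = pseudo_label(specialists, task_ids, conds, noise_std=0.1, rng=rng)
generalist = distill_generalist(task_ids, conds, pseudo_x, n_tasks=2)
```

In this toy, the generalist never sees a paired sample directly: its training set is as large as the unlabeled pool, and its accuracy is capped by the specialists that produced the pseudo-samples. That is the intuition behind the article's statement that the paired-data requirement tracks specialist complexity rather than generalist complexity.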
Benchmark results: Experiments on robotic manipulation with domain randomization at four difficulty levels, and on image restoration tasks, validate the theoretical generalization bounds. The paper does not claim state-of-the-art on any single benchmark; the contribution is the proof that specialist-to-generalist distillation reduces paired sample requirements, and the experimental section confirms that the bound is not vacuous. 4
Code/resources: Not confirmed at time of writing.
Why read it: The senior authors — Jitendra Malik and Pieter Abbeel, both UC Berkeley faculty — bring credibility to the theoretical framing. More practically, the result has direct implications for anyone training a single diffusion model across multiple domains with limited labeled data per domain. The semi-supervised setup (scarce paired data, abundant unlabeled conditions) is the norm rather than the exception in robotics and medical imaging, and the theory gives a principled justification for the common-practice intuition that specialist-then-generalist training is data-efficient.

5. Paris 2.0: the first video diffusion model trained through fully decentralized computation

ArXiv: 2605.26064 | Ali Rouzbayani, Bidhan Roy, Marcos Villagra, Zhiying Jiang | cs.CV | independent (Paris project)
Peer-review status: Preprint (submitted 2026-05-25). No code repository confirmed at time of writing.
Video diffusion training is assumed to require tightly coupled GPU clusters: the combination of spatial, temporal, and cross-attention across long token sequences demands fast inter-GPU communication, and the community has treated this as a fixed constraint. Paris 2.0 challenges that assumption by extending the Decentralized Diffusion Model (DDM) paradigm — first demonstrated for images in Paris 1.0 (arXiv:2510.03434) — to temporally coherent video generation. The key challenge is that video generation requires synchronizing not just spatial consistency but temporal coherence across frames, a problem that does not arise in single-image generation. 5
The training communication stack handles data, pipeline, tensor, and context parallelism, each with different synchronization costs across distributed, heterogeneous GPUs. The authors do not claim that decentralized training is free — each parallelism mode introduces its own latency. What they claim is that at matched total compute budget, a DDM-trained video model can produce better outputs than the monolithic centralized baseline. The proposed explanation is implicit regularization from asynchronous updates, though this is the authors' interpretation rather than a formally derived result. 5
Figure: Paris 2.0 qualitative video samples. Each row shows eight consecutive frames from one generated video, demonstrating frame-to-frame temporal coherence. 5
Benchmark results: In the low-resolution text-to-video setting, Paris 2.0 cuts Fréchet Video Distance (FVD) from 561.04 (monolithic baseline, identical data and compute) to 279.01 — roughly a 2.0× improvement. CLIP text-video similarity and aesthetic scores both exceed the monolithic baseline at matched compute. 5
Code/resources: Not confirmed at time of writing.
Why read it: The FVD gain is large enough to rule out noise — a 2× improvement at matched compute is a strong result. But the more consequential claim is structural: if video diffusion training does not require a monolithic cluster, then the compute required to train competitive video models becomes accessible to research groups without NVIDIA infrastructure contracts. Whether that generalizes from the low-resolution experimental regime to production-scale video is an open question the paper does not answer, which is what makes it worth tracking.

Quick reference

| Paper | Core contribution | Institution | Peer-review status | Code |
| --- | --- | --- | --- | --- |
| 2605.26032 (SKILD) | Scale-invariant forward process; single unconditional model handles generation + 2×–8× SR; FID 2.65 on CIFAR-10 | MIT | Preprint | GitHub (open) |
| 2605.26106 (LoopMDM) | Selective layer looping for MDMs; 3.3× training FLOPs reduction, +8.5 pts GSM8K | KAIST | Preprint | Planned release |
| 2605.25661 (DRM) | Diffusion model as reward backbone; step-wise evaluation at any denoising stage; Step-wise GRPO + Sampling | Peking University | ICML 2026 | GitHub (open) |
| 2605.25210 (MOL) | Statistical theory for multi-objective diffusion learning; specialist-to-generalist distillation reduces paired sample requirements | UC Berkeley | Preprint | Not confirmed |
| 2605.26064 (Paris 2.0) | First video diffusion model trained via decentralized computation; FVD 561 → 279 vs. monolithic baseline at matched compute | Paris project | Preprint | Not confirmed |

Tuesday's papers share a common structural move: each one relocates a burden that the field has been placing on conditioning or post-hoc mechanisms. SKILD moves super-resolution from a conditioning branch into the forward process design. LoopMDM moves depth from parameter count into loop count. DRM moves reward estimation from a separate VLM into the generative model's own representations. The MOL paper moves multi-domain generalization from single-task repetition into a principled specialist-distillation framework. Paris 2.0 moves cluster dependency from a hard infrastructure constraint into a design variable. Whether these moves prove durable is a question for peer review and replication — but they represent coherent theoretical bets rather than incremental tuning.
Cover image: AI-generated illustration
