Efficient Diffusion Sampling for Robot Actions

Studies how sampling schedules and solver choices affect diffusion-generated robot action quality and inference time.


Description

Robo-Diffusion: Sampling Algorithm Design

Objective

Design a single efficient diffusion sampler for a fixed DQL-style diffusion policy, maximizing D4RL MuJoCo return at low inference NFE (number of function evaluations).

This task is deliberately about inference-time sampler choice, not policy learning, guidance, or trajectory planning. The trained actor / critic, dataset, environment list, seeds, and evaluation loop are fixed.

Background

A diffusion policy's wall-clock inference cost is dominated by the number of reverse-process steps. Different ODE / SDE solvers reach a given sample quality at different NFE budgets:

  • DDPM (Ho, Jain, Abbeel, NeurIPS 2020, arXiv:2006.11239): the original Markovian sampler; high quality but slow.
  • DDIM (Song, Meng, Ermon, ICLR 2021, arXiv:2010.02502): non-Markovian deterministic sampler that hits comparable quality in 10–50× fewer steps.
  • DPM-Solver++ (Lu et al., 2022, arXiv:2211.01095): high-order ODE solver that reaches strong sample quality at ~10–20 steps for guided DPM sampling.
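
To make the NFE accounting concrete, here is a minimal sketch of a deterministic DDIM-style sampler that visits only `sampling_steps` of the training timesteps and counts one network call per visited step. It is a toy under stated assumptions, not CleanDiffuser's implementation: the names `make_alpha_bars`, `ddim_sample`, and `exact_eps`, the linear beta schedule, the 50-step training horizon, and the one-dimensional point-mass "dataset" are all hypothetical choices for illustration.

```python
import math

def make_alpha_bars(diffusion_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule.
    Index 0 holds 1.0 (no noise); index t holds alpha_bar_t."""
    alpha_bars, prod = [1.0], 1.0
    for i in range(diffusion_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(diffusion_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def ddim_sample(eps_model, x_T, alpha_bars, sampling_steps):
    """Deterministic DDIM update (eta = 0) on an evenly spaced timestep subsequence.
    Returns the final sample and the number of eps_model calls (the NFE)."""
    T = len(alpha_bars) - 1
    assert 1 <= sampling_steps <= T
    # Timesteps from T down to 0, e.g. [50, 45, ..., 5, 0] for T=50 and 10 steps.
    ts = [round(T * k / sampling_steps) for k in range(sampling_steps, -1, -1)]
    x, nfe = x_T, 0
    for t, t_prev in zip(ts[:-1], ts[1:]):
        eps = eps_model(x, t)  # one function evaluation per visited step
        nfe += 1
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]
        x0_pred = (x - math.sqrt(1.0 - ab_t) * eps) / math.sqrt(ab_t)
        x = math.sqrt(ab_prev) * x0_pred + math.sqrt(1.0 - ab_prev) * eps
    return x, nfe

# Toy check: if all data sits at MU, the exact noise prediction is known in closed form.
MU = 0.3
ALPHA_BARS = make_alpha_bars(diffusion_steps=50)

def exact_eps(x, t):
    ab = ALPHA_BARS[t]
    return (x - math.sqrt(ab) * MU) / math.sqrt(1.0 - ab)

sample, nfe = ddim_sample(exact_eps, x_T=1.7, alpha_bars=ALPHA_BARS, sampling_steps=10)
print(nfe, sample)
```

With an exact noise predictor the toy recovers `MU` up to floating-point error at any step count; with a learned network the discretization error grows as steps are removed, which is the trade-off the solvers above address.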

The setup builds on CleanDiffuser (Dong et al., NeurIPS 2024, arXiv:2406.09509) and the underlying actor is a DQL-style diffusion policy (Wang et al., ICLR 2023, arXiv:2208.06193) trained on D4RL (Fu et al., 2020, arXiv:2004.07219).

What You Can Modify

  • solver in CleanDiffuser/configs/custom/mujoco/mujoco.yaml
  • sampling_steps in the same YAML file

What Is Fixed

  • The pipeline code, model architecture, critic, and training objective
  • diffusion_steps, training budgets, checkpoint selection, and EMA use
  • D4RL environment names, seeds, and vectorized evaluation

The score's NFE term is read from the same sampling_steps field passed to CleanDiffuser's sampler. Custom pipeline-code samplers are intentionally out of scope here because they would decouple true NFE from the reported score column.
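
As a rough illustration of the rule above, a submission can be checked by diffing its config against the template and confirming that only the two editable keys differ. This is a sketch, not part of the evaluation harness: `changed_keys` and `is_valid_submission` are hypothetical helpers, and the template dict holds only the keys visible in the config excerpt plus the default baseline's 100 sampling steps.

```python
ALLOWED_KEYS = {"solver", "sampling_steps"}
_MISSING = object()  # sentinel so a missing key never compares equal to None

def changed_keys(template, submission):
    """Return the set of top-level keys whose values differ between two configs."""
    keys = set(template) | set(submission)
    return {
        k for k in keys
        if template.get(k, _MISSING) != submission.get(k, _MISSING)
    }

def is_valid_submission(template, submission):
    """True when every difference is confined to the editable keys."""
    return changed_keys(template, submission) <= ALLOWED_KEYS

# Subset of the template config (hypothetical in-memory form of mujoco.yaml).
template = {
    "pipeline_name": "custom_sampling_method",
    "seed": 42,
    "discount": 0.99,
    "solver": "ddpm",
    "sampling_steps": 100,
}

ok = dict(template, solver="ode_dpmsolver++_2M", sampling_steps=10)
bad = dict(template, discount=0.95)

print(is_valid_submission(template, ok))
print(is_valid_submission(template, bad))
```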

Evaluation

Evaluated on three D4RL MuJoCo environments:

  1. hopper-medium-v2
  2. walker2d-medium-v2
  3. halfcheetah-medium-v2

Metrics: normalized_score (D4RL return) and sampling_steps (NFE per inference call).

Score formula

The per-env score multiplies a quality term by an NFE penalty:

score(env) = sigmoid(normalized_score) * penalty_upper(sampling_steps, target=10)

penalty_upper(x, target=10) = exp(-0.015 * (x - 10))   for x > 10
                              1.0                      for x <= 10

NFE penalty cheat-sheet:

sampling_steps   penalty   example
10               1.000     DPM-Solver++ baseline
20               0.861     DDIM baseline
50               0.549
100              0.259     DDPM baseline

Task score is the geometric mean of the three env scores. Submitting at lower NFE is strictly preferred when quality is comparable.
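
The scoring rule can be sketched in a few lines of Python. This is an illustrative reimplementation from the formula as written here, not the harness's own code: the function names other than `penalty_upper` are assumptions, and the example normalized scores are made up (the scale on which `normalized_score` enters the sigmoid is not specified in this description).

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def penalty_upper(x, target=10, rate=0.015):
    """No penalty at or below the target NFE, exponential decay above it."""
    return 1.0 if x <= target else math.exp(-rate * (x - target))

def env_score(normalized_score, sampling_steps):
    """Per-environment score: quality term times NFE penalty."""
    return sigmoid(normalized_score) * penalty_upper(sampling_steps)

def task_score(env_scores):
    """Geometric mean of the per-environment scores (all must be positive)."""
    return math.exp(sum(math.log(s) for s in env_scores) / len(env_scores))

# Reproduce the penalty cheat-sheet.
for steps in (10, 20, 50, 100):
    print(steps, round(penalty_upper(steps), 3))

# Hypothetical normalized scores for the three environments at 10 steps.
scores = [env_score(s, sampling_steps=10) for s in (0.9, 0.8, 0.45)]
print(task_score(scores))
```

Because the task score is a geometric mean, one weak environment pulls the total down more than it would under an arithmetic mean, so a sampler setting has to hold up on all three environments.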

Baselines

default

DDPM sampling with 100 steps — standard but slow. This is the unmodified template baseline (registered as default in the config).

ddim

DDIM sampling with 20 steps — faster deterministic sampling.

dpm_solver

DPM-Solver++ with 10 steps — fast high-quality sampling.
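
The three baselines trade steps for quality, and the score formula fixes the exchange rate. As a worked consequence of that formula (per environment, writing $\sigma$ for the sigmoid and $s_n$ for the normalized score reached with $n$ sampling steps, both notational assumptions), a sampler with $n > 10$ steps outscores a 10-step sampler only if its quality term is larger by the inverse of the penalty:

```latex
\sigma(s_n)\, e^{-0.015\,(n-10)} > \sigma(s_{10})
\quad\Longleftrightarrow\quad
\frac{\sigma(s_n)}{\sigma(s_{10})} > e^{0.015\,(n-10)}

% n = 20  (DDIM baseline):  e^{0.15} \approx 1.162
% n = 100 (DDPM baseline):  e^{1.35} \approx 3.857
```

Since $\sigma(\cdot) < 1$, the 100-step baseline can win on an environment only when the 10-step quality term is below $e^{-1.35} \approx 0.259$.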

Code

custom_sampling_method.py
import os
from copy import deepcopy

import d4rl
import gym
import hydra
import numpy as np
import torch
import torch.nn.functional as F
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader

from cleandiffuser.dataset.d4rl_mujoco_dataset import D4RLMuJoCoTDDataset
from cleandiffuser.dataset.dataset_utils import loop_dataloader
from cleandiffuser.diffusion import DiscreteDiffusionSDE
mujoco.yaml
defaults:
  - _self_
  - task: hopper-medium-v2

pipeline_name: custom_sampling_method
mode: train
seed: 42
device: cuda:0

# Environment
normalize_reward: True
discount: 0.99

# Actor
solver: ddpm

Method Summary

Auto-summarized from each method's code by an LLM reviewer; this is not the model's original output.
Claude Opus 4.6 · Pseudocode · high

DPM-Solver++ 2M (10 NFE) + CGAR

Use a fast 10-step ODE sampler, then apply Critic-Guided Action Refinement: trust-region gradient ascent on Q at inference, no extra NFEs.

1. Set solver = ode_dpmsolver++_2M, sampling_steps = 10
2. At inference: $a_0 \leftarrow$ actor.sample(...) // 10 NFE
3. for r = 1..2 do // CGAR refinement (no diffusion calls)
4.   $g \leftarrow \nabla_a Q_{\min}(s,a)$; $\hat g \leftarrow g/\|g\|$
5.   $a' \leftarrow a + 0.005\,\hat g$
6.   $\delta \leftarrow \mathrm{clip}(a' - a_0,\, \|\delta\| \le 0.05)$ // trust region around original sample
7.   $a \leftarrow \mathrm{clip}(a_0 + \delta,\, [-1,1])$
8. Use weighted Q-softmax over candidates as in baseline
Δ vs. baseline: On top of the DPM-Solver++ baseline (10 steps), adds an inference-only critic-guided refinement: 2 normalized gradient-ascent steps on Q with a small trust region around the original sampled action. Diffusion NFE budget is unchanged. Note this also touches the (officially non-editable) pipeline file.
solver=ode_dpmsolver++_2M, sampling_steps=10, cgar_refine_steps=2, cgar_lr=0.005, cgar_trust_radius=0.05
Recovers DPM-Solver++ 2M baseline when refinement_steps=0
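
The refinement loop in steps 3 to 7 can be sketched without any deep-learning framework. The version below is an interpretation of the summary, not the agent's code: `cgar_refine` and `l2_norm` are hypothetical names, the trust-region clip is read as a projection onto an L2 ball of radius 0.05 around the original sample, and the critic is replaced by a toy quadratic with an analytic gradient. The final Q-softmax selection over candidates (step 8) is omitted.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def cgar_refine(a0, grad_q, refine_steps=2, lr=0.005, trust_radius=0.05, eps=1e-12):
    """Normalized gradient ascent on Q, kept inside a trust region around a0."""
    a = list(a0)
    for _ in range(refine_steps):
        g = grad_q(a)
        norm = l2_norm(g)
        if norm < eps:  # flat critic: nothing to follow
            break
        # Step of fixed length lr along the unit gradient direction.
        a_new = [ai + lr * gi / norm for ai, gi in zip(a, g)]
        # Project the total displacement from a0 onto the trust-region ball.
        delta = [x - y for x, y in zip(a_new, a0)]
        d_norm = l2_norm(delta)
        if d_norm > trust_radius:
            delta = [d * trust_radius / d_norm for d in delta]
        # Keep the action inside the valid range [-1, 1].
        a = [min(1.0, max(-1.0, x0 + d)) for x0, d in zip(a0, delta)]
    return a

# Toy critic (assumption): Q(a) = -||a - TARGET||^2, maximized at TARGET.
TARGET = [0.5, -0.2, 0.1]

def toy_q(a):
    return -sum((ai - ti) ** 2 for ai, ti in zip(a, TARGET))

def toy_grad_q(a):
    return [-2.0 * (ai - ti) for ai, ti in zip(a, TARGET)]

a0 = [0.0, 0.0, 0.0]
refined = cgar_refine(a0, toy_grad_q)
print(toy_q(a0), toy_q(refined))
```

With the listed hyperparameters, two steps of length 0.005 move the action at most 0.01 from the original sample, so the 0.05 trust radius only binds if more steps or a larger step size are used.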

Results