Agent Conversation: anthropic/claude-opus-4.6
Task: cv-diffusion-prediction
Diffusion Prediction Parameterization
Background
In DDPM training, the model predicts a target from noisy input x_t. The noisy sample is constructed as:
x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * epsilon
where alpha_t here denotes the cumulative product alphas_cumprod[t] (often written ᾱ_t).
There are three standard parameterizations for what the model predicts:
- Epsilon prediction (Ho et al., 2020): predict the noise epsilon
- X0 prediction: directly predict the clean image x_0
- V-prediction (Salimans & Ho, 2022): predict the velocity v = sqrt(alpha_t) * epsilon - sqrt(1 - alpha_t) * x_0
These parameterizations are mathematically equivalent (each can be converted into the others), but they induce different loss landscapes and training dynamics, and therefore different FID scores under the same training budget.
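The equivalence can be made concrete with a scalar sketch (plain Python, no torch; the function names are illustrative, not from the script): each target, together with x_t and the schedule value, determines x_0 exactly.

```python
import math

def forward_and_targets(x0, eps, alpha_bar):
    """Build the noisy sample x_t and the three standard prediction
    targets for one scalar 'pixel', given the cumulative alpha_bar."""
    sa = math.sqrt(alpha_bar)
    s1ma = math.sqrt(1.0 - alpha_bar)
    x_t = sa * x0 + s1ma * eps
    targets = {
        "epsilon": eps,                       # Ho et al., 2020
        "x0": x0,                             # direct clean-image prediction
        "v": sa * eps - s1ma * x0,            # Salimans & Ho, 2022
    }
    return x_t, targets

def x0_from(pred_type, pred, x_t, alpha_bar):
    """Invert each parameterization back to the clean sample."""
    sa = math.sqrt(alpha_bar)
    s1ma = math.sqrt(1.0 - alpha_bar)
    if pred_type == "epsilon":
        return (x_t - s1ma * pred) / sa
    if pred_type == "x0":
        return pred
    if pred_type == "v":
        # sa*x_t - s1ma*v = alpha_bar*x0 + (1 - alpha_bar)*x0 = x0
        return sa * x_t - s1ma * pred
    raise ValueError(pred_type)
```

All three inversions recover the same x_0, which is the precise sense in which the parameterizations are interchangeable.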
Research Question
Can we design a prediction parameterization that achieves better FID than the standard epsilon, v-prediction, and x0-prediction baselines?
Task
You are given custom_train.py, a self-contained training script that trains
an unconditional UNet2DModel (google/ddpm-cifar10-32 architecture) on CIFAR-10.
The editable region contains two functions:
- compute_training_target(x_0, noise, timesteps, schedule) — defines what the model should predict during training.
- predict_x0(model_output, x_t, timesteps, schedule) — recovers the predicted clean image from the model's output (used during DDIM sampling).
These two functions must be consistent: the sampling procedure must correctly invert the training parameterization.
The schedule dict provides precomputed noise schedule tensors:
- alphas_cumprod: cumulative product of (1 - beta)
- sqrt_alpha: sqrt(alphas_cumprod)
- sqrt_one_minus_alpha: sqrt(1 - alphas_cumprod)
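A minimal sketch of how such a schedule dict could be built, assuming the DDPM-default linear beta schedule (the script's exact betas and tensor types may differ; plain Python lists stand in for the actual tensors):

```python
import math

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Precompute the three schedule entries described above from a
    linear beta schedule (hypothetical defaults, for illustration)."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas_cumprod = []
    prod = 1.0
    for b in betas:
        prod *= 1.0 - b            # cumulative product of (1 - beta)
        alphas_cumprod.append(prod)
    return {
        "alphas_cumprod": alphas_cumprod,
        "sqrt_alpha": [math.sqrt(a) for a in alphas_cumprod],
        "sqrt_one_minus_alpha": [math.sqrt(1.0 - a) for a in alphas_cumprod],
    }
```

By construction sqrt_alpha[t]**2 + sqrt_one_minus_alpha[t]**2 == 1 at every timestep, which is the identity the parameterization conversions rely on.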
Evaluation
- Dataset: CIFAR-10 (32x32)
- Model: UNet2DModel (diffusers backbone) at three scales:
- Small: block_out_channels=(64,128,128,128), ~9M params, batch 128
- Medium: block_out_channels=(128,256,256,256), ~36M params, batch 128
- Large: block_out_channels=(256,512,512,512), ~140M params, batch 64
- Training: 35000 steps per scale, AdamW lr=2e-4, EMA rate 0.9995, 8-GPU DDP
- Inference: 50-step DDIM sampler
- Metric: FID (lower is better), computed with clean-fid against CIFAR-10 train set (50k samples)
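For reference, a deterministic (eta = 0) DDIM update consumes the predicted clean image from predict_x0 and re-noises it to the previous timestep's level; a scalar sketch of one such step (not the script's actual sampler code):

```python
import math

def ddim_step(x_t, x0_hat, alpha_bar_t, alpha_bar_prev):
    """One eta=0 DDIM update: infer the noise implied by the current
    sample and x0 estimate, then rebuild the sample at the previous
    (less noisy) timestep. Scalar illustration only."""
    eps_hat = (x_t - math.sqrt(alpha_bar_t) * x0_hat) / math.sqrt(1.0 - alpha_bar_t)
    return math.sqrt(alpha_bar_prev) * x0_hat + math.sqrt(1.0 - alpha_bar_prev) * eps_hat
```

If predict_x0 returns the exact clean sample, the step lands exactly on the less-noisy sample with the same noise draw, which is why the two editable functions must invert each other.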
Baselines
Your method is evaluated on train_small and train_medium (train_large is skipped during agent runs). FID (lower is better) for the 3 baselines:
| Name | Target | fid (small) | fid (medium) |
|---|---|---|---|
epsilon | predict noise ε (standard DDPM parameterization) | 18.58 | 8.46 |
vpred | predict velocity v = α_t * ε - σ_t * x_0 (v-prediction, Salimans & Ho 2022) | 21.70 | 11.59 |
x0pred | predict clean data x_0 directly | 25.25 | 13.56 |
The strongest baseline is epsilon, at fid 18.58 (small) / 8.46 (medium); standard noise prediction remains the best single-target parameterization at this scale. Your method must achieve **lower fid than `epsilon`**.
Tool Input
{
  "op": "replace",
  "filename": "diffusers-main/custom_train.py",
  "start_line": 83,
  "end_line": 118,
  "content": "\ndef compute_training_target(x_0, noise, timesteps, schedule):\n \"\"\"SNR-adaptive blended epsilon-v prediction.\n \n Smoothly interpolates between epsilon prediction (at high noise / low SNR)\n and v-prediction (at low noise / high SNR) using a sigmoid gate on log-SNR.\n \n Motivation: epsilon prediction implicitly weights the x0-MSE by SNR(t),\n which overweights easy low-noise timesteps. V-prediction weights by SNR+1,\n which is more uniform but loses epsilon's advantage at high noise.\n By blending with a sigmoid on log-SNR, we get an effective weighting that\n transitions smoothly: ~SNR at high noise (like epsilon) and ~SNR+1 at \n low noise (like v-pred), achieving a soft min-SNR-like truncation effect\n without explicit loss reweighting.\n \"\"\"\n # Hardcoded hyperparameters for the adaptive blend\n gamma = 1.0 # log-SNR crossover point \n tau = 1.0 # sigmoid temperature\n \n sqrt_alpha = schedule[\"sqrt_alpha\"][timesteps].view(-1, 1, 1, 1)\n sqrt_one_minus_alpha = schedule[\"sqrt_one_minus_alpha\"][timesteps].view(-1, 1, 1, 1)\n \n alpha = schedule[\"alphas_cumprod\"][timesteps].view(-1, 1, 1, 1)\n snr = alpha / (1.0 - alpha).clamp(min=1e-8)\n log_snr = torch.log(snr)\n \n # Sigmoid blend weight: w→0 at low SNR (eps), w→1 at high SNR (v-pred)\n w = torch.sigmoid((log_snr - math.log(gamma)) / tau)\n \n # Epsilon target and v-prediction target\n eps_target = noise\n v_target = sqrt_alpha * noise - sqrt_one_minus_alpha * x_0\n \n return (1.0 - w) * eps_target + w * v_target\n\n\ndef predict_x0(model_output, x_t, timesteps, schedule):\n \"\"\"Recover x_0 from the blended prediction.\n \n Derived from: target = (1-w)*eps + w*v, x_t = sqrt(a)*x_0 + sqrt(1-a)*eps\n => x_0 = [(1-w + w*sqrt(a))*x_t - sqrt(1-a)*model_output] / [(1-w)*sqrt(a) + w]\n \"\"\"\n gamma = 1.0\n tau = 1.0\n \n sqrt_alpha = schedule[\"sqrt_alpha\"][timesteps].view(-1, 1, 1, 1)\n sqrt_one_minus_alpha = schedule[\"sqrt_one_minus_alpha\"][timesteps].view(-1, 1, 1, 1)\n \n alpha = schedule[\"alphas_cumprod\"][timesteps].view(-1, 1, 1, 1)\n snr = alpha / (1.0 - alpha).clamp(min=1e-8)\n log_snr = torch.log(snr)\n \n w = torch.sigmoid((log_snr - math.log(gamma)) / tau)\n \n numerator = (1.0 - w + w * sqrt_alpha) * x_t - sqrt_one_minus_alpha * model_output\n denominator = (1.0 - w) * sqrt_alpha + w\n \n return numerator / denominator.clamp(min=1e-8)\n\n"
}
OK: Replaced lines 83..118 with 59 line(s) in diffusers-main/custom_train.py. Editable range: 83–141.
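The x0-recovery formula in the docstring above can be checked numerically. A scalar re-implementation of the blend (plain Python, no torch; gamma = tau = 1 as in the edit) confirms that predict_x0 exactly inverts compute_training_target:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def blended_target(x0, eps, alpha_bar, gamma=1.0, tau=1.0):
    """Scalar version of the blended eps/v target."""
    sa, s1ma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    log_snr = math.log(alpha_bar / (1.0 - alpha_bar))
    w = sigmoid((log_snr - math.log(gamma)) / tau)
    return (1.0 - w) * eps + w * (sa * eps - s1ma * x0)

def blended_x0(target, x_t, alpha_bar, gamma=1.0, tau=1.0):
    """Invert the blend, per the derivation in the docstring:
    x_0 = [((1-w) + w*sa)*x_t - s1ma*target] / ((1-w)*sa + w)."""
    sa, s1ma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    log_snr = math.log(alpha_bar / (1.0 - alpha_bar))
    w = sigmoid((log_snr - math.log(gamma)) / tau)
    return (((1.0 - w) + w * sa) * x_t - s1ma * target) / ((1.0 - w) * sa + w)
```

This kind of round-trip test catches inconsistencies between the two editable functions before spending a 35000-step training run on them.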
[Current file: diffusers-main/custom_train.py | editable: 83–141 | total: 515 lines]
80| # ============================================================================
81| # Prediction parameterization (EDITABLE REGION)
82| # ============================================================================
83
84 def compute_training_target(x_0, noise, timesteps, schedule):
85 """SNR-adaptive blended epsilon-v prediction.
86
87 Smoothly interpolates between epsilon prediction (at high noise / low SNR)
88 and v-prediction (at low noise / high SNR) using a sigmoid gate on log-SNR.
89
90 Motivation: epsilon prediction implicitly weights the x0-MSE by SNR(t),
91 which overweights easy low-noise timesteps. V-prediction weights by SNR+1,
92 which is more uniform but loses epsilon's advantage at high n
Tool Input
1{
2 "op": "replace",
3 "filename": "diffusers-main/custom_train.py",
4 "start_line": 141,
5 "end_line": 141,
6 "content": "\n"
7}OK: Replaced lines 141..141 with 1 line(s) in diffusers-main/custom_train.py. Editable range: 83–141.
[Current file: diffusers-main/custom_train.py | editable: 83–141 | total: 515 lines]
80| # ============================================================================
81| # Prediction parameterization (EDITABLE REGION)
82| # ============================================================================
83
84 def compute_training_target(x_0, noise, timesteps, schedule):
85 """SNR-adaptive blended epsilon-v prediction.
86
87 Smoothly interpolates between epsilon prediction (at high noise / low SNR)
88 and v-prediction (at low noise / high SNR) using a sigmoid gate on log-SNR.
89
90 Motivation: epsilon prediction implicitly weights the x0-MSE by SNR(t),
91 which overweights easy low-noise timesteps. V-prediction weights by SNR+1,
92 which is more uniform but loses epsilon's advantage at high n
Tool Input
{
  "content": "\n    return numerator / denominator.clamp(min=1e-8)\n\n",
  "end_line": 144,
  "filename": "diffusers-main/custom_train.py",
  "op": "replace",
  "start_line": 140
}
ERROR: Lines 140..144 exceed the editable range. You may only edit lines 83–141.
[Current file: diffusers-main/custom_train.py | editable: 83–141 | total: 515 lines]
80| # ============================================================================
81| # Prediction parameterization (EDITABLE REGION)
82| # ============================================================================
83
84 def compute_training_target(x_0, noise, timesteps, schedule):
85 """SNR-adaptive blended epsilon-v prediction.
86
87 Smoothly interpolates between epsilon prediction (at high noise / low SNR)
88 and v-prediction (at low noise / high SNR) using a sigmoid gate on log-SNR.
89
90 Motivation: epsilon prediction implicitly weights the x0-MSE by SNR(t),
91 which overweights easy low-noise timesteps. V-prediction weights by SNR+1,
92 which is more uniform but loses epsilon's advantage at high noise.
93 By
Tool Input
{
  "content": "\ndef compute_training_target(x_0, noise, timesteps, schedule):\n \"\"\"SNR-adaptive blended epsilon-v prediction.\n \n Smoothly interpolates between epsilon prediction (at high noise / low SNR)\n and v-prediction (at low noise / high SNR) using a sigmoid gate on log-SNR.\n \n Motivation: epsilon prediction implicitly weights the x0-MSE by SNR(t),\n which overweights easy low-noise timesteps. V-prediction weights by SNR+1,\n which is more uniform but loses epsilon's advantage at high noise.\n By blending with a sigmoid on log-SNR, we get an effective weighting that\n transitions smoothly: ~SNR at high noise (like epsilon) and ~SNR+1 at \n low noise (like v-pred), achieving a soft min-SNR-like truncation effect\n without explicit loss reweighting.\n \"\"\"\n gamma = 1.0 # log-SNR crossover point\n tau = 1.0 # sigmoid temperature\n sqrt_alpha = schedule[\"sqrt_alpha\"][timesteps].view(-1, 1, 1, 1)\n sqrt_one_minus_alpha = schedule[\"sqrt_one_minus_alpha\"][timesteps].view(-1, 1, 1, 1)\n alpha = schedule[\"alphas_cumprod\"][timesteps].view(-1, 1, 1, 1)\n snr = alpha / (1.0 - alpha).clamp(min=1e-8)\n log_snr = torch.log(snr)\n w = torch.sigmoid((log_snr - math.log(gamma)) / tau)\n eps_target = noise\n v_target = sqrt_alpha * noise - sqrt_one_minus_alpha * x_0\n return (1.0 - w) * eps_target + w * v_target\n\n\ndef predict_x0(model_output, x_t, timesteps, schedule):\n \"\"\"Recover x_0 from the blended prediction.\n Derived from: target = (1-w)*eps + w*v, x_t = sqrt(a)*x_0 + sqrt(1-a)*eps\n => x_0 = [(1-w + w*sqrt(a))*x_t - sqrt(1-a)*output] / [(1-w)*sqrt(a) + w]\n \"\"\"\n gamma = 1.0\n tau = 1.0\n sqrt_alpha = schedule[\"sqrt_alpha\"][timesteps].view(-1, 1, 1, 1)\n sqrt_one_minus_alpha = schedule[\"sqrt_one_minus_alpha\"][timesteps].view(-1, 1, 1, 1)\n alpha = schedule[\"alphas_cumprod\"][timesteps].view(-1, 1, 1, 1)\n snr = alpha / (1.0 - alpha).clamp(min=1e-8)\n log_snr = torch.log(snr)\n w = torch.sigmoid((log_snr - math.log(gamma)) / tau)\n numer = (1.0 - w + w * sqrt_alpha) * x_t - sqrt_one_minus_alpha * model_output\n denom = (1.0 - w) * sqrt_alpha + w\n return numer / denom.clamp(min=1e-8)\n\n",
  "end_line": 141,
  "filename": "diffusers-main/custom_train.py",
  "op": "replace",
  "start_line": 83
}
OK: Replaced lines 83..141 with 45 line(s) in diffusers-main/custom_train.py. Editable range: 83–127.
[Current file: diffusers-main/custom_train.py | editable: 83–127 | total: 501 lines]
80| # ============================================================================
81| # Prediction parameterization (EDITABLE REGION)
82| # ============================================================================
83
84 def compute_training_target(x_0, noise, timesteps, schedule):
85 """SNR-adaptive blended epsilon-v prediction.
86
87 Smoothly interpolates between epsilon prediction (at high noise / low SNR)
88 and v-prediction (at low noise / high SNR) using a sigmoid gate on log-SNR.
89
90 Motivation: epsilon prediction implicitly weights the x0-MSE by SNR(t),
91 which overweights easy low-noise timesteps. V-prediction weights by SNR+1,
92 which is more uniform but loses epsilon's advantage at high n
Tool Input
{
  "content": "\ndef compute_training_target(x_0, noise, timesteps, schedule):\n \"\"\"SNR-adaptive blended epsilon-v prediction.\n \n Smoothly interpolates between epsilon prediction (at high noise / low SNR)\n and v-prediction (at low noise / high SNR) using a sigmoid gate on log-SNR.\n \n Motivation: epsilon prediction implicitly weights the x0-MSE by SNR(t),\n which overweights easy low-noise timesteps. V-prediction weights by SNR+1,\n which is more uniform but loses epsilon's advantage at high noise.\n By blending with a sigmoid on log-SNR, we get an effective weighting that\n transitions smoothly: ~SNR at high noise (like epsilon) and ~SNR+1 at \n low noise (like v-pred), achieving a soft min-SNR-like truncation effect\n without explicit loss reweighting.\n \"\"\"\n gamma = 1.0\n tau = 1.0\n sqrt_alpha = schedule[\"sqrt_alpha\"][timesteps].view(-1, 1, 1, 1)\n sqrt_one_minus_alpha = schedule[\"sqrt_one_minus_alpha\"][timesteps].view(-1, 1, 1, 1)\n alpha = schedule[\"alphas_cumprod\"][timesteps].view(-1, 1, 1, 1)\n snr = alpha / (1.0 - alpha).clamp(min=1e-8)\n log_snr = torch.log(snr)\n w = torch.sigmoid((log_snr - math.log(gamma)) / tau)\n eps_target = noise\n v_target = sqrt_alpha * noise - sqrt_one_minus_alpha * x_0\n return (1.0 - w) * eps_target + w * v_target\n\n\ndef predict_x0(model_output, x_t, timesteps, schedule):\n \"\"\"Recover x_0 from the blended prediction.\n Derived from: target = (1-w)*eps + w*v, x_t = sqrt(a)*x_0 + sqrt(1-a)*eps\n => x_0 = [(1-w + w*sqrt(a))*x_t - sqrt(1-a)*output] / [(1-w)*sqrt(a) + w]\n \"\"\"\n gamma = 1.0\n tau = 1.0\n sqrt_alpha = schedule[\"sqrt_alpha\"][timesteps].view(-1, 1, 1, 1)\n sqrt_one_minus_alpha = schedule[\"sqrt_one_minus_alpha\"][timesteps].view(-1, 1, 1, 1)\n alpha = schedule[\"alphas_cumprod\"][timesteps].view(-1, 1, 1, 1)\n snr = alpha / (1.0 - alpha).clamp(min=1e-8)\n log_snr = torch.log(snr)\n w = torch.sigmoid((log_snr - math.log(gamma)) / tau)\n numer = (1.0 - w + w * sqrt_alpha) * x_t - sqrt_one_minus_alpha * model_output\n denom = (1.0 - w) * sqrt_alpha + w\n return numer / denom.clamp(min=1e-8)\n\n",
  "end_line": 127,
  "filename": "diffusers-main/custom_train.py",
  "op": "replace",
  "start_line": 83
}
OK: Replaced lines 83..127 with 45 line(s) in diffusers-main/custom_train.py. Editable range: 83–127.
[Current file: diffusers-main/custom_train.py | editable: 83–127 | total: 501 lines]
80| # ============================================================================
81| # Prediction parameterization (EDITABLE REGION)
82| # ============================================================================
83
84 def compute_training_target(x_0, noise, timesteps, schedule):
85 """SNR-adaptive blended epsilon-v prediction.
86
87 Smoothly interpolates between epsilon prediction (at high noise / low SNR)
88 and v-prediction (at low noise / high SNR) using a sigmoid gate on log-SNR.
89
90 Motivation: epsilon prediction implicitly weights the x0-MSE by SNR(t),
91 which overweights easy low-noise timesteps. V-prediction weights by SNR+1,
92 which is more uniform but loses epsilon's advantage at high n
Tool Input
{
  "op": "replace",
  "filename": "diffusers-main/custom_train.py",
  "start_line": 83,
  "end_line": 118,
  "content": "\ndef compute_training_target(x_0, noise, timesteps, schedule):\n \"\"\"Min-SNR weighted epsilon prediction.\n\n Achieves the min-SNR loss weighting (Hang et al. 2023) through the\n parameterization itself rather than explicit loss reweighting.\n\n At high noise (low SNR <= gamma): standard epsilon prediction\n At low noise (high SNR > gamma): scale epsilon by sqrt(gamma/SNR)\n\n The effective x0 loss weight becomes min(SNR(t), gamma), which avoids\n over-emphasizing near-clean timesteps (the weakness of pure epsilon pred)\n while retaining its strength at high-noise timesteps.\n \"\"\"\n gamma = 5.0\n alpha = schedule[\"alphas_cumprod\"][timesteps].view(-1, 1, 1, 1)\n snr = alpha / (1.0 - alpha).clamp(min=1e-8)\n # scale = 1 when SNR <= gamma, sqrt(gamma/SNR) < 1 when SNR > gamma\n scale = torch.sqrt(torch.clamp(gamma / snr, max=1.0))\n return scale * noise\n\n\ndef predict_x0(model_output, x_t, timesteps, schedule):\n \"\"\"Recover x_0 from min-SNR scaled epsilon prediction.\n\n model_output = scale(t) * epsilon\n => epsilon = model_output / scale(t)\n => x_0 = (x_t - sqrt(1-alpha) * epsilon) / sqrt(alpha)\n \"\"\"\n gamma = 5.0\n alpha = schedule[\"alphas_cumprod\"][timesteps].view(-1, 1, 1, 1)\n snr = alpha / (1.0 - alpha).clamp(min=1e-8)\n scale = torch.sqrt(torch.clamp(gamma / snr, max=1.0))\n\n # Recover epsilon from scaled prediction\n eps = model_output / scale.clamp(min=1e-5)\n\n # Standard x0 recovery from epsilon\n sqrt_alpha = schedule[\"sqrt_alpha\"][timesteps].view(-1, 1, 1, 1)\n sqrt_one_minus_alpha = schedule[\"sqrt_one_minus_alpha\"][timesteps].view(-1, 1, 1, 1)\n return (x_t - sqrt_one_minus_alpha * eps) / sqrt_alpha.clamp(min=1e-8)\n\n"
}
OK: Replaced lines 83..118 with 42 line(s) in diffusers-main/custom_train.py. Editable range: 83–124.
[Current file: diffusers-main/custom_train.py | editable: 83–124 | total: 498 lines]
80| # ============================================================================
81| # Prediction parameterization (EDITABLE REGION)
82| # ============================================================================
83
84 def compute_training_target(x_0, noise, timesteps, schedule):
85 """Min-SNR weighted epsilon prediction.
86
87 Achieves the min-SNR loss weighting (Hang et al. 2023) through the
88 parameterization itself rather than explicit loss reweighting.
89
90 At high noise (low SNR <= gamma): standard epsilon prediction
91 At low noise (high SNR > gamma): scale epsilon by sqrt(gamma/SNR)
92
93 The effective x0 loss weight becomes min(SNR(t), gamma), which avoids
94 over-emphasizing near-cl
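The min-SNR parameterization admits the same kind of round-trip check in scalar form (plain Python, no torch; gamma = 5 as in the edit), and also lets us verify the claimed effective x0 weight: scale(t)^2 * SNR equals min(SNR, gamma).

```python
import math

def minsnr_target(eps, alpha_bar, gamma=5.0):
    """Scalar version of the min-SNR scaled epsilon target."""
    snr = alpha_bar / (1.0 - alpha_bar)
    scale = math.sqrt(min(gamma / snr, 1.0))   # 1 when SNR <= gamma
    return scale * eps

def minsnr_x0(target, x_t, alpha_bar, gamma=5.0):
    """Undo the scaling, then do standard x0 recovery from epsilon."""
    snr = alpha_bar / (1.0 - alpha_bar)
    scale = math.sqrt(min(gamma / snr, 1.0))
    eps = target / scale
    return (x_t - math.sqrt(1.0 - alpha_bar) * eps) / math.sqrt(alpha_bar)
```

Since an epsilon-MSE carries an implicit x0 weight of SNR(t), multiplying the target by scale(t) multiplies that weight by scale(t)^2, giving min(SNR, gamma) overall.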
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)

### train_small (scripts/train_small.sh)
Training results:
Model parameters: 9.0M | GPUs: 8
step 200/35000 | loss 0.0259 | 17.2s
step 400/35000 | loss 0.0154 | 16.3s
step 600/35000 | loss 0.0191 | 16.6s
step 800/35000 | loss 0.0146 | 16.2s
step 1000/35000 | loss 0.0179 | 16.3s
step 1200/35000 | loss 0.0186 | 16.1s
step 1400/35000 | loss 0.0162 | 16.5s
step 1600/35000 | loss 0.0138 | 16.0s
step 1800/35000 | loss 0.0166 | 15.9s
step 2000/35000 | loss 0.0190 | 16.1s
step 2200/35000 | loss 0.0175 | 16.1s
step 2400/35000 | loss 0.0163 | 16.1s
step 2600/35000 | loss 0.0174 | 16.5s
step 2800/35000 | loss 0.0162 | 15.9s
step 3000/35000 | loss 0.0176 | 16.4s
step 3200/35000 | loss 0.0155 | 16.7s
step 3400/35000 | loss 0.0125 | 16.4s
step 3600/35000 | loss 0.0153 | 16.4s
step 3800/35000 | loss 0.0150 | 16.3s
step 4000/35000 | loss 0.0138 | 16.7s
step 4200/35000 | loss 0.0190 | 16.2s
step 4400/35000 | loss 0.0159 | 16.2s
step
Tool Input
{
  "content": "\ndef compute_training_target(x_0, noise, timesteps, schedule):\n \"\"\"Adaptive blended prediction: smoothly interpolates between epsilon\n prediction (at high noise) and v-prediction (at low noise) based on SNR.\n\n Motivation: Epsilon prediction excels at high-noise timesteps but has\n high variance gradients at low-noise timesteps (high SNR), where the\n signal-to-noise is high and epsilon is hard to extract. V-prediction\n provides more stable gradients at low noise. By blending with a sigmoid\n on log-SNR, we get the best of both parameterizations.\n\n target = (1 - w(t)) * epsilon + w(t) * v\n where w(t) = sigmoid((log_snr - mu) / tau) smoothly transitions\n \"\"\"\n # Hardcoded hyperparameters\n mu = 3.0 # log-SNR threshold for transition (SNR ~ 20)\n tau = 1.0 # temperature of sigmoid transition\n\n sqrt_alpha = schedule[\"sqrt_alpha\"][timesteps].view(-1, 1, 1, 1)\n sqrt_one_minus_alpha = schedule[\"sqrt_one_minus_alpha\"][timesteps].view(-1, 1, 1, 1)\n alpha = schedule[\"alphas_cumprod\"][timesteps].view(-1, 1, 1, 1)\n\n # Compute v-prediction target\n v = sqrt_alpha * noise - sqrt_one_minus_alpha * x_0\n\n # Compute blending weight based on log-SNR\n snr = alpha / (1.0 - alpha).clamp(min=1e-8)\n log_snr = torch.log(snr.clamp(min=1e-8))\n w = torch.sigmoid((log_snr - mu) / tau)\n\n # Blend: mostly epsilon at high noise, mostly v at low noise\n return (1.0 - w) * noise + w * v\n\n\ndef predict_x0(model_output, x_t, timesteps, schedule):\n \"\"\"Recover x_0 from the blended prediction target.\n\n target = (1-w)*eps + w*v = (1-w)*eps + w*(sqrt_alpha*eps - sqrt_1ma*x0)\n = ((1-w) + w*sqrt_alpha)*eps - w*sqrt_1ma*x0\n\n Using x_t = sqrt_alpha*x0 + sqrt_1ma*eps:\n eps = (x_t - sqrt_alpha*x0) / sqrt_1ma\n\n Substituting and solving for x0:\n output = ((1-w) + w*sa) * (x_t - sa*x0)/s1ma - w*s1ma*x0\n output = ((1-w)+w*sa)*x_t/s1ma - x0*[((1-w)+w*sa)*sa/s1ma + w*s1ma]\n x0 = (((1-w)+w*sa)*x_t/s1ma - output) / (((1-w)+w*sa)*sa/s1ma + w*s1ma)\n \"\"\"\n mu = 3.0\n tau = 1.0\n\n sqrt_alpha = schedule[\"sqrt_alpha\"][timesteps].view(-1, 1, 1, 1)\n sqrt_one_minus_alpha = schedule[\"sqrt_one_minus_alpha\"][timesteps].view(-1, 1, 1, 1)\n alpha = schedule[\"alphas_cumprod\"][timesteps].view(-1, 1, 1, 1)\n\n snr = alpha / (1.0 - alpha).clamp(min=1e-8)\n log_snr = torch.log(snr.clamp(min=1e-8))\n w = torch.sigmoid((log_snr - mu) / tau)\n\n sa = sqrt_alpha\n s1ma = sqrt_one_minus_alpha.clamp(min=1e-8)\n\n # Coefficients: target = A * eps + B * x0\n # where A = (1-w) + w*sa, B = -w*s1ma\n A = (1.0 - w) + w * sa\n # Using eps = (x_t - sa*x0)/s1ma:\n # target = A*(x_t - sa*x0)/s1ma + B*x0\n # target = A*x_t/s1ma - x0*(A*sa/s1ma - B)\n # x0 = (A*x_t/s1ma - target) / (A*sa/s1ma + w*s1ma)\n\n coeff_xt = A / s1ma\n coeff_denom = A * sa / s1ma + w * s1ma\n\n return (coeff_xt * x_t - model_output) / coeff_denom.clamp(min=1e-5)\n\n",
  "end_line": 124,
  "filename": "diffusers-main/custom_train.py",
  "op": "replace",
  "start_line": 83
}
OK: Replaced lines 83..124 with 75 line(s) in diffusers-main/custom_train.py. Editable range: 83–157.
[Current file: diffusers-main/custom_train.py | editable: 83–157 | total: 531 lines]
80| # ============================================================================
81| # Prediction parameterization (EDITABLE REGION)
82| # ============================================================================
83
84 def compute_training_target(x_0, noise, timesteps, schedule):
85 """Adaptive blended prediction: smoothly interpolates between epsilon
86 prediction (at high noise) and v-prediction (at low noise) based on SNR.
87
88 Motivation: Epsilon prediction excels at high-noise timesteps but has
89 high variance gradients at low-noise timesteps (high SNR), where the
90 signal-to-noise is high and epsilon is hard to extract. V-prediction
91 provides more stable gradients at low noise. By blending with a sigmoid
[Test #2] (1 test remaining; call submit(n=2) to submit this result)

### train_small (scripts/train_small.sh)
Training results:
Model parameters: 9.0M | GPUs: 8
step 200/35000 | loss 0.0355 | 17.4s
step 400/35000 | loss 0.0328 | 16.5s
step 600/35000 | loss 0.0392 | 16.5s
step 800/35000 | loss 0.0278 | 16.2s
step 1000/35000 | loss 0.0415 | 16.2s
step 1200/35000 | loss 0.0345 | 16.4s
step 1400/35000 | loss 0.0326 | 17.1s
step 1600/35000 | loss 0.0234 | 16.1s
step 1800/35000 | loss 0.0329 | 16.5s
step 2000/35000 | loss 0.0447 | 16.4s
step 2200/35000 | loss 0.0437 | 16.6s
step 2400/35000 | loss 0.0283 | 16.1s
step 2600/35000 | loss 0.0484 | 17.1s
step 2800/35000 | loss 0.0291 | 16.2s
step 3000/35000 | loss 0.0357 | 16.4s
step 3200/35000 | loss 0.0359 | 16.7s
step 3400/35000 | loss 0.0332 | 16.3s
step 3600/35000 | loss 0.0310 | 16.3s
step 3800/35000 | loss 0.0361 | 16.6s
step 4000/35000 | loss 0.0238 | 16.3s
step 4200/35000 | loss 0.0361 | 16.1s
step 4400/35000 | loss 0.0213 | 16.4s
step 4