Agent Conversation: google/gemini-3.1-pro-preview

Model: google/gemini-3.1-pro-preview
Total steps: 9
Messages: 40
UserStep 0

Task: cv-diffusion-cfg

Diffusion Model: Classifier-Free Guidance Optimization

Objective

Improve text-to-image generation quality by designing a better classifier-free guidance formula. Your contribution should be a new way to combine noise_uc and noise_c (and optionally their history / derivatives) into a guided noise prediction — not a new numerical sampler.

The evaluation metric is FID (lower is better). CLIP score is logged for diagnostic purposes but is not part of the evaluation — a solution that improves CLIP at the cost of FID is a regression and will not be accepted.

Background

Classifier-free guidance (CFG) is a fundamental technique in diffusion models for text-guided generation. The standard CFG formula is:

noise_pred = noise_uc + cfg_guidance * (noise_c - noise_uc)

where noise_uc is the unconditional noise prediction and noise_c is the conditional (text-guided) noise prediction.
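A minimal sketch of the standard formula above, written on plain Python lists rather than torch tensors so that it runs without dependencies. The function name `cfg_combine` and the sample values are illustrative assumptions and do not come from the task code.

```python
def cfg_combine(noise_uc, noise_c, cfg_guidance):
    """Standard CFG on flat lists of floats: uc + w * (c - uc)."""
    return [u + cfg_guidance * (c - u) for u, c in zip(noise_uc, noise_c)]


# Toy predictions for one flattened sample (hypothetical values).
noise_uc = [0.0, 1.0, -1.0]
noise_c = [1.0, 1.0, 0.0]

print(cfg_combine(noise_uc, noise_c, 7.5))  # [7.5, 1.0, 6.5]
```

With this formula, a guidance scale of 1.0 returns noise_c and a scale of 0.0 returns noise_uc, which makes a quick sanity check.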

Standard CFG has known limitations:

  • Requires high guidance scales (typically 7.5-12.5)
  • Can cause mode collapse and saturation
  • Results in curved, unnatural sampling trajectories
  • Poor invertibility

Recent work (CFG++, Rescaled CFG, Perp-Neg, Interval CFG, Dynamic CFG, etc.) has shown that rethinking the guidance mechanism — while keeping the underlying sampler fixed — can substantially improve text-image alignment without regressing on sample quality.

Task

Redesign the guidance mechanism inside BaseDDIMCFGpp.sample() (SD v1.5) and BaseDDIMCFGpp.reverse_process() (SDXL). Your contribution must be a new way to compute the guided noise prediction from noise_uc and noise_c, not a new way to integrate the diffusion ODE.

Hard rules — read carefully

  1. The sampler stays DDIM. Every step must be a first-order DDIM update: one predict_noise call → Tweedie's formula → standard DDIM update rule (zt_next = alpha_next.sqrt() * z0t + (1 - alpha_next).sqrt() * noise_for_renoise). You may not:

    • switch to Euler / Heun / RK / DPM-Solver / DPM++ / UniPC / any higher-order or multistep numerical integrator
    • introduce history buffers of past denoised / noise_pred predictions to do polynomial extrapolation
    • add SDE Langevin noise terms
    • use adaptive step sizing

    This is not a sampler efficiency task; see cv-diffusion-efficiency if you want to design solvers.
  2. No image post-processing tricks. You may not apply dynamic thresholding, percentile clipping, amplitude clamping, momentum / EMA smoothing, or any per-sample normalization to z0t / zt. These change the output distribution but do not improve the guidance formula.

  3. Your contribution lives in the noise combination. Legal moves include (but are not limited to):

    • New functional forms for how noise_uc and noise_c are combined (e.g. rescaled CFG, perpendicular projection, std normalization)
    • Timestep-dependent guidance schedules (interval CFG,
AssistantStep 1edit
Tool Input
{
  "end_line": 682,
  "content": "    @torch.autocast(device_type='cuda', dtype=torch.float16)\n    def sample(self,\n               cfg_guidance=7.5,\n               prompt=[\"\",\"\"],\n               callback_fn=None,\n               **kwargs):\n        # NPOG (Norm-Preserving Orthogonal Guidance) natural scale\n        cfg_guidance = 7.5\n\n        # Text embedding\n        uc, c = self.get_text_embed(null_prompt=prompt[0], prompt=prompt[1])\n\n        # Initialize zT\n        zt = self.initialize_latent()\n        zt = zt.requires_grad_()\n\n        # Sampling\n        pbar = tqdm(self.scheduler.timesteps, desc=\"SD\")\n        for step, t in enumerate(pbar):\n            at = self.alpha(t)\n            at_prev = self.alpha(t - self.skip)\n\n            with torch.no_grad():\n                noise_uc, noise_c = self.predict_noise(zt, t, uc, c)\n                \n                dims = list(range(1, noise_c.ndim))\n                \n                # 1. Project noise_uc onto noise_c\n                dot_uc_c = (noise_uc * noise_c).sum(dim=dims, keepdim=True)\n                norm_c2 = (noise_c * noise_c).sum(dim=dims, keepdim=True)\n                \n                uc_proj_c = (dot_uc_c / (norm_c2 + 1e-8)) * noise_c\n                uc_ortho = noise_uc - uc_proj_c\n                \n                # 2. Guidance: subtract orthogonal unconditional component\n                noise_pred = noise_c - cfg_guidance * uc_ortho\n                \n                # 3. Norm preservation\n                norm_pred2 = (noise_pred * noise_pred).sum(dim=dims, keepdim=True)\n                noise_pred = noise_pred * torch.sqrt(norm_c2 / (norm_pred2 + 1e-8))\n\n            # tweedie\n            z0t = (zt - (1-at).sqrt() * noise_pred) / at.sqrt()\n\n            # renoise\n            zt = at_prev.sqrt() * z0t + (1-at_prev).sqrt() * noise_pred\n\n            if callback_fn is not None:\n                callback_kwargs = {'z0t': z0t.detach(),\n                                    'zt': zt.detach(),\n                                    'decode': self.decode}\n                callback_kwargs = callback_fn(step, t, callback_kwargs)\n                z0t = callback_kwargs[\"z0t\"]\n                zt = callback_kwargs[\"zt\"]\n\n        # for the last step, do not add noise\n        img = self.decode(z0t)\n        img = (img / 2 + 0.5).clamp(0, 1)\n        return img.detach().cpu()",
  "start_line": 651,
  "op": "replace",
  "filename": "CFGpp-main/latent_diffusion.py"
}
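The three numbered comments in the edit above (project, subtract the orthogonal component, restore the norm) can be sketched for a single flattened sample as follows. This is a dependency-free illustration that assumes each prediction is one flat list of floats. The function name `npog` and the toy vectors are hypothetical; the actual edit does the same arithmetic per sample on torch tensors.

```python
import math


def npog(noise_uc, noise_c, cfg_guidance, eps=1e-8):
    """Norm-preserving orthogonal guidance on flat lists of floats."""
    # 1. Project noise_uc onto noise_c and keep the orthogonal remainder.
    dot_uc_c = sum(u * c for u, c in zip(noise_uc, noise_c))
    norm_c2 = sum(c * c for c in noise_c)
    scale = dot_uc_c / (norm_c2 + eps)
    uc_ortho = [u - scale * c for u, c in zip(noise_uc, noise_c)]

    # 2. Push away from the orthogonal unconditional component.
    pred = [c - cfg_guidance * o for c, o in zip(noise_c, uc_ortho)]

    # 3. Rescale so the result has (almost exactly) the norm of noise_c.
    norm_pred2 = sum(p * p for p in pred)
    gain = math.sqrt(norm_c2 / (norm_pred2 + eps))
    return [p * gain for p in pred]


# Toy vectors (hypothetical values); noise_c has norm 5.
out = npog([1.0, 0.0], [3.0, 4.0], cfg_guidance=7.5)
print(out, math.sqrt(sum(p * p for p in out)))
```

Because uc_ortho is orthogonal to noise_c, the unscaled prediction is never shorter than noise_c, so the gain in step 3 only shrinks it. When noise_uc is parallel to noise_c, the orthogonal part vanishes and the output is approximately noise_c regardless of the guidance scale.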
Tool ResultStep 1
OK: Replaced lines 651..682 with 59 line(s) in CFGpp-main/latent_diffusion.py. Editable range: 624–709.

[Current file: CFGpp-main/latent_diffusion.py | editable: 624–709 | total: 1048 lines]
   621| @register_solver("ddim_cfg++")
   622| class BaseDDIMCFGpp(StableDiffusion):
   623|     # TODO: Implement your improved method here.
   624      #
   625      # Your goal is to improve the CFG mechanism to achieve better text-image
   626      # alignment (measured by CLIP score) while maintaining or improving sample quality.
   627      #
   628      # Key methods you need to implement:
   629      # - __init__: Initialize the solver
   630      # - sample: Main sampling function that generates images
   631      #
   632      # Available helper methods from parent class:
   633      # - self.get_text_embed(null_prompt, prompt): Get text embeddings
   634      # - self.initialize_latent(): Initialize latent variable zT
   635      # - self.predict_noise(zt, t, uc, c): Predict noise at ti
AssistantStep 2edit
Tool Input
{
  "start_line": 734,
  "op": "replace",
  "content": "    def reverse_process(self,\n                        null_prompt_embeds,\n                        prompt_embeds,\n                        cfg_guidance,\n                        add_cond_kwargs,\n                        shape=(1024, 1024),\n                        callback_fn=None,\n                        **kwargs):\n        # NPOG (Norm-Preserving Orthogonal Guidance) natural scale\n        cfg_guidance = 7.5\n\n        zt = self.initialize_latent(size=(1, 4, shape[1] // self.vae_scale_factor, shape[0] // self.vae_scale_factor))\n\n        pbar = tqdm(self.scheduler.timesteps.int(), desc='SDXL')\n        for step, t in enumerate(pbar):\n            next_t = t - self.skip\n            at = self.scheduler.alphas_cumprod[t]\n            at_next = self.scheduler.alphas_cumprod[next_t]\n\n            with torch.no_grad():\n                noise_uc, noise_c = self.predict_noise(zt, t, null_prompt_embeds, prompt_embeds, add_cond_kwargs)\n                \n                dims = list(range(1, noise_c.ndim))\n                \n                # 1. Project noise_uc onto noise_c\n                dot_uc_c = (noise_uc * noise_c).sum(dim=dims, keepdim=True)\n                norm_c2 = (noise_c * noise_c).sum(dim=dims, keepdim=True)\n                \n                uc_proj_c = (dot_uc_c / (norm_c2 + 1e-8)) * noise_c\n                uc_ortho = noise_uc - uc_proj_c\n                \n                # 2. Guidance: subtract orthogonal unconditional component\n                noise_pred = noise_c - cfg_guidance * uc_ortho\n                \n                # 3. Norm preservation\n                norm_pred2 = (noise_pred * noise_pred).sum(dim=dims, keepdim=True)\n                noise_pred = noise_pred * torch.sqrt(norm_c2 / (norm_pred2 + 1e-8))\n\n            z0t = (zt - (1-at).sqrt() * noise_pred) / at.sqrt()\n\n            # renoise\n            zt = at_next.sqrt() * z0t + (1-at_next).sqrt() * noise_pred\n\n            if callback_fn is not None:\n                callback_kwargs = {'z0t': z0t.detach(),\n                                    'zt': zt.detach(),\n                                    'decode': self.decode}\n                callback_kwargs = callback_fn(step, t, callback_kwargs)\n                z0t = callback_kwargs[\"z0t\"]\n                zt = callback_kwargs[\"zt\"]\n\n        return z0t",
  "end_line": 755,
  "filename": "CFGpp-main/latent_sdxl.py"
}
Tool ResultStep 2
OK: Replaced lines 734..755 with 52 line(s) in CFGpp-main/latent_sdxl.py. Editable range: 713–785.

[Current file: CFGpp-main/latent_sdxl.py | editable: 713–785 | total: 1061 lines]
   710| # CFG++ version
   711| ###########################################
   712| 
   713  @register_solver("ddim_cfg++")
   714  class BaseDDIMCFGpp(SDXL):
   715      # TODO: Implement your improved method here.
   716      #
   717      # Your goal is to improve the CFG mechanism for SDXL to achieve better
   718      # text-image alignment (measured by CLIP score).
   719      #
   720      # Key methods you need to implement:
   721      # - reverse_process: Main sampling function that generates latents
   722      #
   723      # Available helper methods from parent class:
   724      # - self.initialize_latent(size=(1, 4, H//vae_scale, W//vae_scale))
   725      # - self.predict_noise(zt, t, null_prompt_embeds, prompt_embeds, add_cond_kwargs)
   726      # - self.scheduler.alphas_cumprod[t]: Get al
AssistantStep 3test
AssistantStep 1edit
Tool Input
{
  "filename": "CFGpp-main/latent_diffusion.py",
  "op": "replace",
  "end_line": 680,
  "start_line": 657,
  "content": "        cfg_guidance = 7.5\n        rescale_phi = 0.7\n        K = 2\n\n        uc, c = self.get_text_embed(null_prompt=prompt[0], prompt=prompt[1])\n\n        zt = self.initialize_latent()\n        zt = zt.requires_grad_()\n\n        pbar = tqdm(self.scheduler.timesteps, desc=\"SD\")\n        for step, t in enumerate(pbar):\n            at = self.alpha(t)\n            at_prev = self.alpha(t - self.skip)\n\n            with torch.no_grad():\n                noise_uc, noise_c = self.predict_noise(zt, t, uc, c)\n\n                w = 0.0 if step < K else cfg_guidance\n                noise_pred = noise_uc + w * (noise_c - noise_uc)\n\n                if w > 0:\n                    std_c = noise_c.std(dim=list(range(1, noise_c.ndim)), keepdim=True)\n                    std_pred = noise_pred.std(dim=list(range(1, noise_pred.ndim)), keepdim=True)\n                    noise_pred_rescaled = noise_pred * (std_c / std_pred)\n                    noise_pred = rescale_phi * noise_pred_rescaled + (1 - rescale_phi) * noise_pred\n\n            z0t = (zt - (1-at).sqrt() * noise_pred) / at.sqrt()\n            zt = at_prev.sqrt() * z0t + (1-at_prev).sqrt() * noise_pred\n\n            if callback_fn is not None:\n                callback_kwargs = {'z0t': z0t.detach(),\n                                    'zt': zt.detach(),\n                                    'decode': self.decode}\n                callback_kwargs = callback_fn(step, t, callback_kwargs)\n                z0t = callback_kwargs[\"z0t\"]\n                zt = callback_kwargs[\"zt\"]\n\n        img = self.decode(z0t)"
}
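The combination inside the `torch.no_grad()` block of this edit (guidance switched off for the first K steps, then standard CFG followed by a std rescale blended with weight rescale_phi) can be sketched on one flattened sample. This is a rough, dependency-free illustration: the function name `rescaled_cfg` and the parameter name `warmup_steps` (standing in for K) are assumptions, and `statistics.stdev` is used because it applies the same Bessel correction that `torch.Tensor.std` uses by default.

```python
import statistics


def rescaled_cfg(noise_uc, noise_c, step, cfg_guidance=7.5,
                 rescale_phi=0.7, warmup_steps=2):
    """CFG with a guidance-free warmup and std rescaling, on flat lists."""
    # No guidance during the first warmup_steps steps (w = 0 gives noise_uc).
    w = 0.0 if step < warmup_steps else cfg_guidance
    pred = [u + w * (c - u) for u, c in zip(noise_uc, noise_c)]

    if w > 0:
        # Match the std of the guided prediction to that of noise_c,
        # then blend rescaled and unrescaled predictions with rescale_phi.
        ratio = statistics.stdev(noise_c) / statistics.stdev(pred)
        factor = rescale_phi * ratio + (1.0 - rescale_phi)
        pred = [p * factor for p in pred]
    return pred


# Toy predictions for one flattened sample (hypothetical values).
noise_uc = [0.0, 1.0, -1.0]
noise_c = [1.0, 1.0, 0.0]
print(rescaled_cfg(noise_uc, noise_c, step=0))  # warmup: equals noise_uc
print(rescaled_cfg(noise_uc, noise_c, step=5))
```

Since both blended terms are multiples of the same guided prediction, the blend collapses to a single scalar factor, which is what the sketch applies.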
Tool ResultStep 1
OK: Replaced lines 657..680 with 38 line(s) in CFGpp-main/latent_diffusion.py. Editable range: 624–696.

[Current file: CFGpp-main/latent_diffusion.py | editable: 624–696 | total: 1035 lines]
   621| @register_solver("ddim_cfg++")
   622| class BaseDDIMCFGpp(StableDiffusion):
   623|     # TODO: Implement your improved method here.
   624      #
   625      # Your goal is to improve the CFG mechanism to achieve better text-image
   626      # alignment (measured by CLIP score) while maintaining or improving sample quality.
   627      #
   628      # Key methods you need to implement:
   629      # - __init__: Initialize the solver
   630      # - sample: Main sampling function that generates images
   631      #
   632      # Available helper methods from parent class:
   633      # - self.get_text_embed(null_prompt, prompt): Get text embeddings
   634      # - self.initialize_latent(): Initialize latent variable zT
   635      # - self.predict_noise(zt, t, uc, c): Predict noise at ti
AssistantStep 2edit
Tool Input
{
  "filename": "CFGpp-main/latent_sdxl.py",
  "start_line": 742,
  "end_line": 755,
  "content": "        cfg_guidance = 7.5\n        rescale_phi = 0.7\n        K = 2\n\n        zt = self.initialize_latent(size=(1, 4, shape[1] // self.vae_scale_factor, shape[0] // self.vae_scale_factor))\n\n        pbar = tqdm(self.scheduler.timesteps.int(), desc='SDXL')\n        for step, t in enumerate(pbar):\n            next_t = t - self.skip\n            at = self.scheduler.alphas_cumprod[t]\n            at_next = self.scheduler.alphas_cumprod[next_t]\n\n            with torch.no_grad():\n                noise_uc, noise_c = self.predict_noise(zt, t, null_prompt_embeds, prompt_embeds, add_cond_kwargs)\n                \n                w = 0.0 if step < K else cfg_guidance\n                noise_pred = noise_uc + w * (noise_c - noise_uc)\n\n                if w > 0:\n                    std_c = noise_c.std(dim=list(range(1, noise_c.ndim)), keepdim=True)\n                    std_pred = noise_pred.std(dim=list(range(1, noise_pred.ndim)), keepdim=True)\n                    noise_pred_rescaled = noise_pred * (std_c / std_pred)\n                    noise_pred = rescale_phi * noise_pred_rescaled + (1 - rescale_phi) * noise_pred\n\n            z0t = (zt - (1-at).sqrt() * noise_pred) / at.sqrt()\n            zt = at_next.sqrt() * z0t + (1-at_next).sqrt() * noise_pred\n\n            if callback_fn is not None:\n                callback_kwargs = {'z0t': z0t.detach(), 'zt': zt.detach(), 'decode': self.decode}\n                callback_kwargs = callback_fn(step, t, callback_kwargs)\n                z0t = callback_kwargs[\"z0t\"]\n                zt = callback_kwargs[\"zt\"]\n\n        return z0t",
  "op": "replace"
}
Tool ResultStep 2
OK: Replaced lines 742..755 with 34 line(s) in CFGpp-main/latent_sdxl.py. Editable range: 713–775.

[Current file: CFGpp-main/latent_sdxl.py | editable: 713–775 | total: 1051 lines]
   710| # CFG++ version
   711| ###########################################
   712| 
   713  @register_solver("ddim_cfg++")
   714  class BaseDDIMCFGpp(SDXL):
   715      # TODO: Implement your improved method here.
   716      #
   717      # Your goal is to improve the CFG mechanism for SDXL to achieve better
   718      # text-image alignment (measured by CLIP score).
   719      #
   720      # Key methods you need to implement:
   721      # - reverse_process: Main sampling function that generates latents
   722      #
   723      # Available helper methods from parent class:
   724      # - self.initialize_latent(size=(1, 4, H//vae_scale, W//vae_scale))
   725      # - self.predict_noise(zt, t, null_prompt_embeds, prompt_embeds, add_cond_kwargs)
   726      # - self.scheduler.alphas_cumprod[t]: Get al
AssistantStep 3test
Tool ResultStep 3
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)

### sd15 (scripts/eval_sd15.sh)
Generation results:
GENERATION_METRICS model=sd15 method=ddim_cfg++ cfg_guidance=7.5 NFE=50 seed=42 fid=22.7413
  sd15: FID=22.7413
Average FID: 22.7413

### sd20 (scripts/eval_sd20.sh)
Generation results:
GENERATION_METRICS model=sd20 method=ddim_cfg++ cfg_guidance=7.5 NFE=50 seed=42 fid=23.3490
  sd20: FID=23.3490
Average FID: 23.3490

[Leaderboard] Results saved: {'fid_sd15': 22.7413, 'fid': 23.349, 'elapsed_sd15': 930, 'fid_sd20': 23.349, 'elapsed_sd20': 909}
AssistantStep 4edit
Tool Input
{
  "op": "replace",
  "start_line": 683,
  "filename": "CFGpp-main/latent_diffusion.py",
  "content": "            z0t = (zt - (1-at).sqrt() * noise_pred) / at.sqrt()\n            zt = at_prev.sqrt() * z0t + (1-at_prev).sqrt() * noise_c",
  "end_line": 684
}
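The two replaced lines are one first-order DDIM update: Tweedie's formula with noise_pred, followed by a renoise step that here uses noise_c instead of noise_pred. A minimal sketch on flat Python lists, assuming scalar alpha values; the function name `ddim_step` and the toy numbers are hypothetical, and the argument name `noise_for_renoise` follows the update rule quoted in the task statement.

```python
import math


def ddim_step(zt, noise_pred, noise_for_renoise, at, at_next):
    """One first-order DDIM update on flat lists of floats."""
    # Tweedie's formula: estimate the clean latent from the guided noise.
    z0t = [(z - math.sqrt(1.0 - at) * e) / math.sqrt(at)
           for z, e in zip(zt, noise_pred)]
    # Renoise to the next timestep with a possibly different noise estimate.
    zt_next = [math.sqrt(at_next) * z0 + math.sqrt(1.0 - at_next) * e
               for z0, e in zip(z0t, noise_for_renoise)]
    return z0t, zt_next


# Toy latent and predictions (hypothetical values).
zt = [0.5, -1.0]
noise_pred = [0.1, 0.2]   # guided prediction
noise_c = [0.3, -0.1]     # conditional prediction

# Renoise with the guided prediction (the version before this edit).
_, zt_same = ddim_step(zt, noise_pred, noise_pred, at=0.5, at_next=0.7)
# Renoise with noise_c (the version this edit introduces).
_, zt_cond = ddim_step(zt, noise_pred, noise_c, at=0.5, at_next=0.7)
print(zt_same, zt_cond)
```

Keeping the two noise arguments separate makes it easy to compare renoising choices without touching Tweedie's formula. In this log, the run evaluated after switching the renoise term to noise_c (Test #2) reports a much higher FID than Test #1.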
Tool ResultStep 4
OK: Replaced lines 683..684 with 2 line(s) in CFGpp-main/latent_diffusion.py. Editable range: 624–696.

[Current file: CFGpp-main/latent_diffusion.py | editable: 624–696 | total: 1035 lines]
   621| @register_solver("ddim_cfg++")
   622| class BaseDDIMCFGpp(StableDiffusion):
   623|     # TODO: Implement your improved method here.
   624      #
   625      # Your goal is to improve the CFG mechanism to achieve better text-image
   626      # alignment (measured by CLIP score) while maintaining or improving sample quality.
   627      #
   628      # Key methods you need to implement:
   629      # - __init__: Initialize the solver
   630      # - sample: Main sampling function that generates images
   631      #
   632      # Available helper methods from parent class:
   633      # - self.get_text_embed(null_prompt, prompt): Get text embeddings
   634      # - self.initialize_latent(): Initialize latent variable zT
   635      # - self.predict_noise(zt, t, uc, c): Predict noise at tim
AssistantStep 5edit
Tool Input
{
  "op": "replace",
  "start_line": 766,
  "end_line": 767,
  "filename": "CFGpp-main/latent_sdxl.py",
  "content": "            z0t = (zt - (1-at).sqrt() * noise_pred) / at.sqrt()\n            zt = at_next.sqrt() * z0t + (1-at_next).sqrt() * noise_c"
}
Tool ResultStep 5
OK: Replaced lines 766..767 with 2 line(s) in CFGpp-main/latent_sdxl.py. Editable range: 713–775.

[Current file: CFGpp-main/latent_sdxl.py | editable: 713–775 | total: 1051 lines]
   710| # CFG++ version
   711| ###########################################
   712| 
   713  @register_solver("ddim_cfg++")
   714  class BaseDDIMCFGpp(SDXL):
   715      # TODO: Implement your improved method here.
   716      #
   717      # Your goal is to improve the CFG mechanism for SDXL to achieve better
   718      # text-image alignment (measured by CLIP score).
   719      #
   720      # Key methods you need to implement:
   721      # - reverse_process: Main sampling function that generates latents
   722      #
   723      # Available helper methods from parent class:
   724      # - self.initialize_latent(size=(1, 4, H//vae_scale, W//vae_scale))
   725      # - self.predict_noise(zt, t, null_prompt_embeds, prompt_embeds, add_cond_kwargs)
   726      # - self.scheduler.alphas_cumprod[t]: Get alp
AssistantStep 6test
Tool ResultStep 6
[Test #2] (1 test remaining; call submit(n=2) to submit this result)

### sd15 (scripts/eval_sd15.sh)
Generation results:
GENERATION_METRICS model=sd15 method=ddim_cfg++ cfg_guidance=7.5 NFE=50 seed=42 fid=53.2451
  sd15: FID=53.2451
Average FID: 53.2451

### sd20 (scripts/eval_sd20.sh)
Generation results:
GENERATION_METRICS model=sd20 method=ddim_cfg++ cfg_guidance=7.5 NFE=50 seed=42 fid=50.6744
  sd20: FID=50.6744
Average FID: 50.6744

[Leaderboard] Results saved: {'fid_sd15': 53.2451, 'fid': 50.6744, 'elapsed_sd15': 929, 'fid_sd20': 50.6744, 'elapsed_sd20': 925}
AssistantStep 7undo
Tool Input
{
  "n": 2
}
Tool ResultStep 7
Undo complete:
Restored: CFGpp-main/latent_sdxl.py
Restored: CFGpp-main/latent_diffusion.py