rl-offpolicy-sample-efficiency
Description
Off-Policy RL Sample Efficiency: Algorithm Design for Humanoid Locomotion
Objective
Design a more sample-efficient off-policy reinforcement learning algorithm that achieves higher performance than FastTD3, FastSAC, and PPO on humanoid locomotion tasks within the same training budget.
Background
FastTD3 is a high-performance off-policy RL algorithm that combines TD3 with distributional value estimation (categorical DQN with 101 atoms), parallel environments (128 envs), observation normalization, mixed-precision training, and torch.compile for speed. It uses a deterministic actor with Gaussian exploration noise and twin distributional critics with clipped double Q-learning.
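The core of the distributional critic described above is the categorical (C51-style) projection: the bootstrapped return distribution r + γz is mapped back onto the fixed atom support before computing the cross-entropy loss. A minimal NumPy sketch, where the support bounds v_min/v_max are illustrative defaults rather than FastTD3's actual settings:

```python
import numpy as np

def project_categorical_target(next_probs, rewards, dones, gamma,
                               v_min=-10.0, v_max=10.0, num_atoms=101):
    """Project the bootstrapped distribution r + gamma * z onto the fixed
    categorical support, as in categorical DQN (C51)."""
    z = np.linspace(v_min, v_max, num_atoms)        # fixed atom support
    delta = (v_max - v_min) / (num_atoms - 1)       # spacing between atoms
    # Bellman-shifted atom positions, clipped into the support range
    tz = np.clip(rewards[:, None] + gamma * (1.0 - dones[:, None]) * z,
                 v_min, v_max)
    b = np.clip((tz - v_min) / delta, 0, num_atoms - 1)  # fractional index
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)
    batch = np.arange(len(rewards))[:, None]
    target = np.zeros_like(next_probs)
    # Split each atom's probability mass between its two neighboring atoms;
    # np.add.at accumulates correctly even when indices repeat after clipping.
    np.add.at(target, (batch, lower), next_probs * (upper - b))
    np.add.at(target, (batch, upper), next_probs * (b - lower))
    # When b lands exactly on an atom, lower == upper and both terms above
    # vanish, so deposit the full mass there explicitly.
    np.add.at(target, (batch, lower), next_probs * (lower == upper))
    return target
```

The critic loss is then the cross-entropy between this projected target and the online critic's predicted distribution.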
Your task is to design an improved off-policy algorithm by modifying the Actor, Critic, and/or update functions. The training infrastructure (environment, replay buffer, evaluation) is fixed.
What You Can Modify
The editable section contains:
- Actor: Network architecture, forward pass, exploration strategy
- Critic: Q-network architecture, distributional parameters, ensemble design
- build_algorithm(): Component construction, optimizers, schedulers, auxiliary modules
- update_critic(): Critic loss computation, target calculation, auxiliary objectives
- update_actor(): Policy gradient objective, entropy regularization, etc.
- soft_update(): Target network update strategy
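As a concrete example of the last item, the conventional soft_update is Polyak averaging of the target parameters toward the online ones. A framework-agnostic sketch over a dict of parameter arrays (the real function would operate on network tensors in place):

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target.
    tau=1.0 recovers a hard target-network copy."""
    for name, online in online_params.items():
        target_params[name] = tau * online + (1.0 - tau) * target_params[name]
```

Alternatives worth exploring in this slot include less frequent hard updates or a tau schedule.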
Key Design Dimensions
- Architecture: LayerNorm, spectral norm, residual connections, different activations
- Exploration: Noise schedule, parameter-space noise, curiosity, optimistic exploration
- Value estimation: Distributional RL (atoms, quantiles), ensemble methods, uncertainty
- Policy optimization: Entropy regularization, policy constraints, advantage weighting
- Sample reuse: Update-to-data ratio, replay prioritization, n-step returns
- Representation: Feature normalization, auxiliary losses, self-predictive representations
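Of these dimensions, n-step returns are one of the simplest levers for sample reuse. A sketch of truncated n-step bootstrapped returns for a single trajectory, assuming `values` holds V(s_0)..V(s_T) (one entry more than `rewards`) and that episodes cut the bootstrap at `done`:

```python
import numpy as np

def n_step_return(rewards, dones, values, gamma=0.99, n=3):
    """G_t = sum_{k<n} gamma^k r_{t+k} + gamma^n V(s_{t+n}),
    truncated at episode ends and at the end of the trajectory."""
    T = len(rewards)
    returns = np.zeros(T)
    for t in range(T):
        g, disc, end = 0.0, 1.0, t
        for k in range(n):
            if t + k >= T:
                break
            g += disc * rewards[t + k]
            disc *= gamma
            end = t + k + 1
            if dones[t + k]:
                disc = 0.0      # terminal: no bootstrap beyond this step
                break
        g += disc * values[end]  # bootstrap from the state after the window
        returns[t] = g
    return returns
```

Larger n propagates reward information faster but increases the bias from off-policy action mismatch inside the window.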
Constraints
- The algorithm must work with continuous action spaces (actions clipped to [-1, 1])
- Must use the provided replay buffer and environment interface
- Total training budget: 100,000 gradient steps with 128 parallel environments
- Must produce deterministic actions at evaluation time via actor(obs)
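The first and last constraints together imply TD3-style action selection: the actor's output is used directly at evaluation, exploration noise is added only during training, and the result is always clipped to [-1, 1]. A sketch with illustrative names (actor_fn, noise_std are not fixed by the harness):

```python
import numpy as np

def select_action(actor_fn, obs, train=True, noise_std=0.1, rng=None):
    """Deterministic action at eval; Gaussian exploration noise in training.
    Actions are always clipped to the [-1, 1] range the environment expects."""
    action = np.asarray(actor_fn(obs), dtype=float)
    if train:
        rng = np.random.default_rng() if rng is None else rng
        action = action + rng.normal(0.0, noise_std, size=action.shape)
    return np.clip(action, -1.0, 1.0)
```

Any replacement exploration scheme (parameter-space noise, optimistic bonuses) must still reduce to the plain clipped actor output when train=False.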
Evaluation
The algorithm is evaluated on three HumanoidBench locomotion tasks:
- h1hand-stand-v0: Humanoid standing balance
- h1hand-walk-v0: Humanoid walking
- h1hand-run-v0: Humanoid running
Performance is measured as mean episode return over 3 evaluation rollouts at the end of training. Higher is better.
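The metric amounts to averaging undiscounted episode returns over deterministic rollouts. A sketch against a hypothetical reset_fn/env_step_fn interface (the real environment interface is fixed by the harness and may differ):

```python
import numpy as np

def evaluate(env_step_fn, reset_fn, actor_fn, num_rollouts=3):
    """Mean undiscounted episode return over deterministic rollouts."""
    totals = []
    for _ in range(num_rollouts):
        obs, done, total = reset_fn(), False, 0.0
        while not done:
            # Deterministic policy output, clipped to the action range
            obs, reward, done = env_step_fn(np.clip(actor_fn(obs), -1.0, 1.0))
            total += reward
        totals.append(total)
    return float(np.mean(totals))
```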
Code
1"""Custom off-policy RL algorithm for HumanoidBench locomotion tasks.23This script is adapted from FastTD3's training pipeline. The EDITABLE section4contains the full algorithm: Actor, Critic, update functions, and exploration5strategy. The FIXED sections handle environment setup, evaluation, replay buffer6infrastructure, and metric printing.78The agent should design a sample-efficient off-policy (or hybrid) RL algorithm9that outperforms FastTD3, FastSAC, and PPO on humanoid locomotion tasks.10"""1112import os13import sys1415os.environ["TORCHDYNAMO_INLINE_INBUILT_NN_MODULES"] = "1"
Results
No results available yet.