robo-humanoid-sim2real-reward
Description
Humanoid Robot Sim2Real: Reward Function Design
Objective
Design novel reward functions for humanoid robot locomotion that achieve robust sim-to-real transfer. You will implement custom reward functions in humanoid_env_custom.py that encourage natural, stable gaits capable of following diverse velocity commands.
Research Question
What reward function implementations lead to policies that can successfully execute diverse locomotion commands in sim2sim transfer (Isaac Gym → MuJoCo)?
The key challenge: reward functions that work well for a single command often fail when the robot needs to follow varied commands (different speeds, directions, turning rates). Your reward functions should encourage:
- Robust tracking of velocity commands (vx, vy, dyaw)
- Natural, stable gaits across different speeds
- Smooth transitions between commands
- Energy-efficient movement
- Sim-to-sim transfer robustness
Background
The humanoid locomotion task requires the robot to track 3D velocity commands:
- vx: forward/backward velocity (m/s)
- vy: lateral velocity (m/s)
- dyaw: yaw angular velocity (rad/s)
Existing reward functions (in humanoid_env.py) include:
- _reward_tracking_lin_vel(): Track linear velocity commands
- _reward_tracking_ang_vel(): Track angular velocity commands
- _reward_feet_clearance(): Encourage foot lifting during swing
- _reward_foot_slip(): Penalize foot slipping
- _reward_orientation(): Keep torso upright
- _reward_base_height(): Maintain target height
- _reward_action_smoothness(): Encourage smooth actions
- ... and 15+ more reward terms
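The tracking terms in this family of codebases typically score velocity error through a squared-error exponential kernel. The exact implementation in humanoid_env.py may differ; a minimal sketch of the pattern is:

```python
import torch

def reward_tracking_lin_vel(commands, base_lin_vel, tracking_sigma=0.25):
    """Exponential kernel on the xy linear-velocity tracking error.

    commands:     (num_envs, 3) tensor of [vx, vy, dyaw] commands
    base_lin_vel: (num_envs, 3) tensor of base linear velocity in body frame
    tracking_sigma acts as a temperature: smaller values sharpen the reward.
    """
    lin_vel_error = torch.sum(
        torch.square(commands[:, :2] - base_lin_vel[:, :2]), dim=1
    )
    # Reward is 1.0 at perfect tracking and decays smoothly with error
    return torch.exp(-lin_vel_error / tracking_sigma)
```

The smooth decay is what makes this kernel dense: even poor tracking yields a gradient toward the command, unlike a thresholded success reward.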
The problem: These reward functions are designed and tuned for specific command ranges. When tested on diverse commands, policies often:
- Fail to track commands outside the training distribution
- Exhibit unstable gaits at certain speeds
- Have poor transitions between different commands
- Fail to transfer from Isaac Gym to MuJoCo
Task
Implement custom reward functions in the EDITABLE SECTIONS:
1. Environment file: humanoid_env_custom.py (lines 76-540)
In __init__ method (lines 76-81):
- Initialize custom tracking buffers (e.g., self.my_custom_buffer = torch.zeros(...))
- Track additional data needed for your reward functions
- Example: self.feet_height, self.last_feet_z for clearance rewards
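A sketch of the buffer-initialization pattern, assuming the usual humanoid-gym conventions (a torso plus two feet, one environment per row); the class and attribute names here are illustrative, not the real environment:

```python
import torch

class HumanoidEnvCustomSketch:
    """Illustrative only: shows the __init__ buffer pattern, not the real env."""

    def __init__(self, num_envs=4096, num_feet=2, device="cpu"):
        self.num_envs = num_envs
        self.device = device
        # Per-foot swing height, updated each step from rigid-body states
        self.feet_height = torch.zeros(num_envs, num_feet, device=device)
        # Foot z-position from the previous step, for clearance/air-time rewards
        self.last_feet_z = torch.zeros(num_envs, num_feet, device=device)
        # Previous command, for rewarding smooth responses to command changes
        self.last_commands = torch.zeros(num_envs, 3, device=device)
```

Allocating every buffer once in __init__ (rather than inside the reward functions) matters in Isaac Gym, where per-step allocations on 4096 parallel environments are costly.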
In reward functions (lines 272-540):
- Modify existing reward function implementations
- Add new reward functions
- Change reward computation logic
- Design adaptive rewards based on command magnitude
- Use any available state tensors: self.dof_pos, self.base_lin_vel, self.contact_forces, etc.
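One way to make a reward adaptive to command magnitude, as suggested above, is to widen the tracking kernel for fast commands and tighten it near zero command. This is a hypothetical design, not code from the repository:

```python
import torch

def reward_tracking_lin_vel_adaptive(commands, base_lin_vel,
                                     base_sigma=0.25, widen=0.5):
    """Hypothetical command-adaptive tracking reward.

    For large commands the kernel widens, so fast gaits are not
    over-penalized for small absolute errors; near zero command the
    kernel stays tight, so the robot learns to stand still precisely.
    """
    cmd_mag = torch.norm(commands[:, :2], dim=1)       # (num_envs,)
    sigma = base_sigma * (1.0 + widen * cmd_mag)       # larger cmd -> wider kernel
    error = torch.sum(
        torch.square(commands[:, :2] - base_lin_vel[:, :2]), dim=1
    )
    return torch.exp(-error / sigma)
```

The same absolute error therefore costs less reward at high speed than at standstill, which is one way to keep a single policy stable across the full command range.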
2. Config file: humanoid_config_custom.py (lines 174-216)
In rewards class:
- Adjust reward parameters (e.g., base_height_target, cycle_time, tracking_sigma)
- Modify reward scales in the scales subclass to match your reward functions
- Add new scale entries for new reward functions
- Remove scales for unused reward functions
- Balance reward contributions (e.g., tracking_lin_vel = 1.2, foot_slip = -0.05)
Important: The reward scales in the config must match the reward functions in the environment. If you add a new reward function _reward_my_custom(), you must add a corresponding scale my_custom = <value> in the config.
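The naming convention is: drop the _reward_ prefix in the config. A hypothetical fragment of humanoid_config_custom.py following humanoid-gym's nested-class style (the specific values are illustrative, not tuned):

```python
# Hypothetical fragment of the rewards class in humanoid_config_custom.py.
class rewards:
    base_height_target = 0.89   # target torso height (m); robot-specific
    tracking_sigma = 0.25       # temperature of the tracking kernels

    class scales:
        # Existing terms, rebalanced (example values)
        tracking_lin_vel = 1.2
        tracking_ang_vel = 1.1
        foot_slip = -0.05       # negative scale turns the term into a penalty
        # A new entry must match _reward_my_custom() in the environment:
        my_custom = 0.5
```

If a scale entry has no matching _reward_* method (or vice versa), the term is silently dropped or the run fails at startup, depending on the framework version, so keeping the two files in sync is worth checking first.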
Fixed components:
- Network architecture: 3-layer MLP
- Training: PPO with 10000 iterations
- Observation/action spaces
Evaluation
Trained in Isaac Gym (4096 parallel envs, 10000 iterations), then evaluated on 100 diverse random commands in MuJoCo (sim2sim transfer):
Test procedure:
- Sample 100 random commands from ranges:
- vx: [-0.5, 1.0] m/s
- vy: [-0.4, 0.4] m/s
- dyaw: [-0.5, 0.5] rad/s
- For each command, run 10-second episode in MuJoCo
- Success criteria:
- Robot doesn't fall (base height > 0.3m, tilt < 0.5 rad)
- Velocity tracking error < 0.3 (combined linear + angular)
- Stable for at least 80% of steps
Metrics:
- success_rate: Percentage of commands successfully executed (primary metric)
- avg_vel_error: Average velocity tracking error across all trials
- fall_rate: Percentage of trials where robot fell
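Putting the success criteria and metrics together, the aggregation can be sketched as follows; the official scoring script may differ in detail, and the function name and threshold defaults are assumptions mirroring the criteria above:

```python
import torch

def evaluate_trials(vel_errors, fell, stable_frac,
                    err_thresh=0.3, stable_thresh=0.8):
    """Aggregate per-trial results into the three reported metrics.

    vel_errors:  (num_trials,) combined linear + angular tracking error
    fell:        (num_trials,) bool, True if the robot fell
    stable_frac: (num_trials,) fraction of steps the robot was stable
    """
    # A trial succeeds only if all three criteria hold simultaneously
    success = (~fell) & (vel_errors < err_thresh) & (stable_frac >= stable_thresh)
    return {
        "success_rate": success.float().mean().item() * 100.0,
        "avg_vel_error": vel_errors.mean().item(),
        "fall_rate": fell.float().mean().item() * 100.0,
    }
```

Note that avg_vel_error is averaged over all trials, including falls, so a policy that falls early on hard commands can still show a deceptively low tracking error; success_rate is the primary metric for exactly this reason.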
Success criteria: Good reward functions should achieve:
- High success rate (>70%) across diverse commands
- Low tracking error (<0.3 combined)
- Low fall rate (<20%)
- Robust sim2sim transfer (Isaac Gym → MuJoCo)
Reference Implementation
One baseline is provided:
default: Official humanoid-gym reward functions (unmodified)
- Current best practice from the paper
- Baseline for comparison
- Uses standard reward implementations with custom buffer tracking
Hints
- Custom buffers: Initialize tracking buffers in __init__ for rewards that need history (e.g., air time, clearance)
- Command-adaptive rewards: Scale rewards based on command magnitude
- Balance: Trade off tracking accuracy against stability
- Smooth rewards: Dense, smooth rewards generally work better than sparse rewards
- Sim2sim robustness: Rewards that encourage natural gaits transfer better
- Consider adding rewards for:
- Smooth command following (penalize oscillations)
- Gait consistency across speeds
- Robustness to command changes
- Natural foot placement patterns
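One common shape for the oscillation penalty hinted at above combines first and second action differences; this requires the last two actions to be buffered in __init__, and the function name here is hypothetical:

```python
import torch

def reward_action_smoothness(actions, last_actions, last_last_actions):
    """Penalty on first and second differences of the action sequence.

    All arguments are (num_envs, num_actions) tensors. Returns a
    non-negative value; use it with a negative scale in the config.
    """
    # First difference: penalizes fast action changes
    term_1 = torch.sum(torch.square(actions - last_actions), dim=1)
    # Second difference: specifically penalizes sign-flipping oscillations,
    # which exploit simulator dynamics and transfer poorly to MuJoCo
    term_2 = torch.sum(
        torch.square(actions - 2.0 * last_actions + last_last_actions), dim=1
    )
    return term_1 + term_2
```

A constant action sequence scores zero under both terms, while a high-frequency oscillation is punished by the second-difference term even when its step-to-step change is modest.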
Code
Results
No results available yet.