robo-humanoid-sim2real-reward

Tags: Other · humanoid-gym · rigorous codebase

Description

Humanoid Robot Sim2Real: Reward Function Design

Objective

Design novel reward functions for humanoid robot locomotion that achieve robust sim-to-real transfer. You will implement custom reward functions in humanoid_env_custom.py that encourage natural, stable gaits capable of following diverse velocity commands.

Research Question

What reward function implementations lead to policies that can successfully execute diverse locomotion commands in sim2sim transfer (Isaac Gym → MuJoCo)?

The key challenge: reward functions that work well for a single command often fail when the robot needs to follow varied commands (different speeds, directions, turning rates). Your reward functions should encourage:

  • Robust tracking of velocity commands (vx, vy, dyaw)
  • Natural, stable gaits across different speeds
  • Smooth transitions between commands
  • Energy-efficient movement
  • Sim-to-sim transfer robustness

Background

The humanoid locomotion task requires the robot to track 3D velocity commands:

  • vx: forward/backward velocity (m/s)
  • vy: lateral velocity (m/s)
  • dyaw: yaw angular velocity (rad/s)

Existing reward functions (in humanoid_env.py) include:

  • _reward_tracking_lin_vel(): Track linear velocity commands
  • _reward_tracking_ang_vel(): Track angular velocity commands
  • _reward_feet_clearance(): Encourage foot lifting during swing
  • _reward_foot_slip(): Penalize foot slipping
  • _reward_orientation(): Keep torso upright
  • _reward_base_height(): Maintain target height
  • _reward_action_smoothness(): Encourage smooth actions
  • ... and 15+ more reward terms
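Most of these tracking terms follow the same pattern: an exponential kernel over squared tracking error, so the reward peaks at 1.0 for perfect tracking and decays smoothly with error. The sketch below illustrates that pattern; the function name, argument shapes, and sigma convention are assumptions for illustration, not the exact humanoid-gym implementation.

```python
import torch

def tracking_lin_vel_reward(commands_xy: torch.Tensor,
                            base_lin_vel_xy: torch.Tensor,
                            tracking_sigma: float = 5.0) -> torch.Tensor:
    """Exponential-kernel tracking reward, one value per environment.

    commands_xy, base_lin_vel_xy: (num_envs, 2) tensors of commanded and
    measured planar velocity. Reward is 1.0 at zero error and decays with
    the sigma-scaled squared error (scaling convention is illustrative).
    """
    lin_vel_error = torch.sum(torch.square(commands_xy - base_lin_vel_xy), dim=1)
    return torch.exp(-lin_vel_error * tracking_sigma)

cmd = torch.tensor([[0.5, 0.0], [0.5, 0.0]])
vel = torch.tensor([[0.5, 0.0], [0.0, 0.0]])
r = tracking_lin_vel_reward(cmd, vel)  # env 0 tracks perfectly, env 1 does not
```

Because the kernel is bounded in (0, 1], badly mistracking environments still receive gradient signal instead of a hard penalty cliff.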

The problem: These reward functions are designed and tuned for specific command ranges. When tested on diverse commands, policies often:

  • Fail to track commands outside the training distribution
  • Exhibit unstable gaits at certain speeds
  • Have poor transitions between different commands
  • Fail to transfer from Isaac Gym to MuJoCo

Task

Implement custom reward functions in the EDITABLE SECTIONS:

1. Environment file: humanoid_env_custom.py (lines 76-540)

In __init__ method (lines 76-81):

  • Initialize custom tracking buffers (e.g., self.my_custom_buffer = torch.zeros(...))
  • Track additional data needed for your reward functions
  • Example: self.feet_height, self.last_feet_z for clearance rewards
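A minimal sketch of that kind of buffer setup, written as a free function so it runs standalone; in the environment these would be assigned as `self.feet_height = ...` etc. inside `__init__`. The buffer names and shapes are illustrative assumptions.

```python
import torch

def init_custom_buffers(num_envs: int, num_feet: int = 2,
                        device: str = "cpu") -> dict:
    """Allocate per-environment history buffers for rewards that need
    state across steps (foot clearance, air time, command changes)."""
    return {
        # last measured foot heights, for clearance-style rewards
        "feet_height": torch.zeros(num_envs, num_feet, device=device),
        "last_feet_z": torch.zeros(num_envs, num_feet, device=device),
        # time each foot has spent in the air, for gait-timing rewards
        "feet_air_time": torch.zeros(num_envs, num_feet, device=device),
        # previous command, for smooth-transition rewards (vx, vy, dyaw)
        "last_commands": torch.zeros(num_envs, 3, device=device),
    }

bufs = init_custom_buffers(num_envs=4)
```

Remember to reset the relevant slices of these buffers in the environment's reset path, or stale history from a previous episode will leak into the first rewards of the next one.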

In reward functions (lines 272-540):

  • Modify existing reward function implementations
  • Add new reward functions
  • Change reward computation logic
  • Design adaptive rewards based on command magnitude
  • Use any available state tensors: self.dof_pos, self.base_lin_vel, self.contact_forces, etc.
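As one concrete pattern for using those state tensors, here is a hedged sketch of a foot-slip penalty built from contact forces and foot velocities. The signature and thresholds are assumptions; the actual env exposes these quantities through tensors like `self.contact_forces`.

```python
import torch

def foot_slip_penalty(contact_forces_z: torch.Tensor,
                      feet_vel_xy: torch.Tensor,
                      contact_thresh: float = 1.0) -> torch.Tensor:
    """Penalize tangential foot motion while the foot is loaded.

    contact_forces_z: (num_envs, num_feet) vertical contact force.
    feet_vel_xy:      (num_envs, num_feet, 2) planar foot velocity.
    Returns a non-negative penalty per env; pair it with a negative
    scale in the config (e.g. foot_slip = -0.05).
    """
    in_contact = contact_forces_z > contact_thresh       # (num_envs, num_feet)
    slip_speed = torch.norm(feet_vel_xy, dim=-1)         # (num_envs, num_feet)
    return torch.sum(slip_speed * in_contact.float(), dim=1)

cf = torch.tensor([[10.0, 0.0]])                  # foot 0 loaded, foot 1 in swing
fv = torch.tensor([[[0.3, 0.4], [1.0, 0.0]]])     # foot 0 slips at 0.5 m/s
p = foot_slip_penalty(cf, fv)
```

Gating the penalty on contact is what makes it a slip term rather than a blanket foot-velocity penalty: swing-phase motion stays free.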

2. Config file: humanoid_config_custom.py (lines 174-216)

In rewards class:

  • Adjust reward parameters (e.g., base_height_target, cycle_time, tracking_sigma)
  • Modify reward scales in the scales subclass to match your reward functions
  • Add new scale entries for new reward functions
  • Remove scales for unused reward functions
  • Balance reward contributions (e.g., tracking_lin_vel = 1.2, foot_slip = -0.05)

Important: The reward scales in the config must match the reward functions in the environment. If you add a new reward function _reward_my_custom(), you must add a corresponding scale my_custom = <value> in the config.
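The required pairing looks roughly like the fragment below. The class name and every numeric value are illustrative placeholders, not tuned settings from the benchmark config.

```python
class HumanoidCfgCustomRewards:
    """Sketch of the rewards section of humanoid_config_custom.py
    (names mirror the task description; values are not tuned)."""
    base_height_target = 0.89
    cycle_time = 0.64
    tracking_sigma = 5.0

    class scales:
        tracking_lin_vel = 1.2
        tracking_ang_vel = 1.0
        foot_slip = -0.05
        # a new env method _reward_my_custom() requires this matching entry;
        # remove entries whose reward functions you delete
        my_custom = 0.5
```

The framework multiplies each `_reward_<name>()` output by `scales.<name>`, so a missing or misspelled scale silently drops that term from training.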

Fixed components:

  • Network architecture: 3-layer MLP
  • Training: PPO with 10000 iterations
  • Observation/action spaces

Evaluation

Policies are trained in Isaac Gym (4096 parallel envs, 10000 iterations), then evaluated on 100 diverse random commands in MuJoCo (sim2sim transfer):

Test procedure:

  1. Sample 100 random commands from ranges:
    • vx: [-0.5, 1.0] m/s
    • vy: [-0.4, 0.4] m/s
    • dyaw: [-0.5, 0.5] rad/s
  2. For each command, run 10-second episode in MuJoCo
  3. Success criteria:
    • Robot doesn't fall (base height > 0.3m, tilt < 0.5 rad)
    • Velocity tracking error < 0.3 (combined linear + angular)
    • Stable for at least 80% of steps

Metrics:

  • success_rate: Percentage of commands successfully executed (primary metric)
  • avg_vel_error: Average velocity tracking error across all trials
  • fall_rate: Percentage of trials where robot fell

Success criteria: Good reward functions should achieve:

  • High success rate (>70%) across diverse commands
  • Low tracking error (<0.3 combined)
  • Low fall rate (<20%)
  • Robust sim2sim transfer (Isaac Gym → MuJoCo)

Reference Implementation

One baseline is provided:

default: Official humanoid-gym reward functions (unmodified)

  • Current best practice from the paper
  • Baseline for comparison
  • Uses standard reward implementations with custom buffer tracking

Hints

  • Custom buffers: Initialize tracking buffers in __init__ for rewards that need history (e.g., air time, clearance)
  • Command-adaptive rewards: Scale rewards based on command magnitude
  • Balance: Trade-off between tracking accuracy and stability
  • Smooth rewards: Dense, smoothly shaped rewards generally optimize more reliably than sparse rewards
  • Sim2sim robustness: Rewards that encourage natural gaits transfer better
  • Consider adding rewards for:
    • Smooth command following (penalize oscillations)
    • Gait consistency across speeds
    • Robustness to command changes
    • Natural foot placement patterns
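For the first of these, penalizing oscillations, one common shaping trick is a second-difference penalty on actions: it is near zero for steadily changing actions but large for sign-flipping ones. A hedged sketch, assuming the env keeps `last_actions` and `last_last_actions` buffers (names are illustrative):

```python
import torch

def action_oscillation_penalty(actions: torch.Tensor,
                               last_actions: torch.Tensor,
                               last_last_actions: torch.Tensor) -> torch.Tensor:
    """Second finite difference of the action sequence, summed over joints.

    Zero for constant or linearly ramping actions; large when actions
    flip back and forth between steps. Pair with a small negative scale.
    """
    second_diff = actions - 2.0 * last_actions + last_last_actions
    return torch.sum(torch.square(second_diff), dim=1)

smooth = action_oscillation_penalty(torch.tensor([[2.0, 2.0]]),
                                    torch.tensor([[1.0, 1.0]]),
                                    torch.tensor([[0.0, 0.0]]))
osc = action_oscillation_penalty(torch.tensor([[1.0, 1.0]]),
                                 torch.tensor([[-1.0, -1.0]]),
                                 torch.tensor([[1.0, 1.0]]))
```

Unlike a first-difference (action-rate) penalty, this term does not discourage the steady action changes a gait needs, only the jitter.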

Code

Results

No results available yet.