robo-humanoid-sim2real-reward
Description
Humanoid Robot Sim2Real: Reward Function Design
Objective
Design novel reward functions for humanoid robot locomotion that achieve robust sim-to-real transfer. You will implement custom reward functions in humanoid_env_custom.py that encourage natural, stable gaits capable of following diverse velocity commands.
Research Question
What reward function implementations lead to policies that can successfully execute diverse locomotion commands in sim2sim transfer (Isaac Gym → MuJoCo)?
The key challenge: reward functions that work well for a single command often fail when the robot needs to follow varied commands (different speeds, directions, turning rates). Your reward functions should encourage:
- Robust tracking of velocity commands (vx, vy, dyaw)
- Natural, stable gaits across different speeds
- Smooth transitions between commands
- Energy-efficient movement
- Sim-to-sim transfer robustness
Background
The humanoid locomotion task requires the robot to track 3D velocity commands:
- vx: forward/backward velocity (m/s)
- vy: lateral velocity (m/s)
- dyaw: yaw angular velocity (rad/s)
Existing reward functions (in humanoid_env.py) include:
- _reward_tracking_lin_vel(): Track linear velocity commands
- _reward_tracking_ang_vel(): Track angular velocity commands
- _reward_feet_clearance(): Encourage foot lifting during swing
- _reward_foot_slip(): Penalize foot slipping
- _reward_orientation(): Keep torso upright
- _reward_base_height(): Maintain target height
- _reward_action_smoothness(): Encourage smooth actions
- ... and 15+ more reward terms
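The tracking terms in this family of codebases typically score velocity error through a squared-error exponential kernel. The exact implementation in humanoid_env.py may differ; a minimal sketch of the pattern is:

```python
import torch

def reward_tracking_lin_vel(commands, base_lin_vel, tracking_sigma=0.25):
    """Exponential kernel on the xy linear-velocity tracking error.

    commands:     (num_envs, 3) tensor of [vx, vy, dyaw] commands
    base_lin_vel: (num_envs, 3) tensor of base linear velocity in body frame
    tracking_sigma acts as a temperature: smaller values sharpen the reward.
    """
    lin_vel_error = torch.sum(
        torch.square(commands[:, :2] - base_lin_vel[:, :2]), dim=1
    )
    # Reward is 1.0 at perfect tracking and decays smoothly with error
    return torch.exp(-lin_vel_error / tracking_sigma)
```

The smooth decay is what makes this kernel dense: even poor tracking yields a gradient toward the command, unlike a thresholded success reward.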
The problem: These reward functions are designed and tuned for specific command ranges. When tested on diverse commands, policies often:
- Fail to track commands outside the training distribution
- Exhibit unstable gaits at certain speeds
- Have poor transitions between different commands
- Fail to transfer from Isaac Gym to MuJoCo
Task
Implement custom reward functions in the EDITABLE SECTIONS:
1. Environment file: humanoid_env_custom.py (lines 76-540)
In __init__ method (lines 76-81):
- Initialize custom tracking buffers (e.g., self.my_custom_buffer = torch.zeros(...))
- Track additional data needed for your reward functions
- Example: self.feet_height, self.last_feet_z for clearance rewards
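A sketch of the buffer-initialization pattern, assuming the usual humanoid-gym conventions (a torso plus two feet, one environment per row); the class and attribute names here are illustrative, not the real environment:

```python
import torch

class HumanoidEnvCustomSketch:
    """Illustrative only: shows the __init__ buffer pattern, not the real env."""

    def __init__(self, num_envs=4096, num_feet=2, device="cpu"):
        self.num_envs = num_envs
        self.device = device
        # Per-foot swing height, updated each step from rigid-body states
        self.feet_height = torch.zeros(num_envs, num_feet, device=device)
        # Foot z-position from the previous step, for clearance/air-time rewards
        self.last_feet_z = torch.zeros(num_envs, num_feet, device=device)
        # Previous command, for rewarding smooth responses to command changes
        self.last_commands = torch.zeros(num_envs, 3, device=device)
```

Allocating every buffer once in __init__ (rather than inside the reward functions) matters in Isaac Gym, where per-step allocations on 4096 parallel environments are costly.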
In reward functions (lines 272-540):
- Modify existing reward function implementations
- Add new reward functions
- Change reward computation logic
- Design adaptive rewards based on command magnitude
- Use any available state tensors: self.dof_pos, self.base_lin_vel, self.contact_forces, etc.
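One way to make a reward adaptive to command magnitude, as suggested above, is to widen the tracking kernel for fast commands and tighten it near zero command. This is a hypothetical design, not code from the repository:

```python
import torch

def reward_tracking_lin_vel_adaptive(commands, base_lin_vel,
                                     base_sigma=0.25, widen=0.5):
    """Hypothetical command-adaptive tracking reward.

    For large commands the kernel widens, so fast gaits are not
    over-penalized for small absolute errors; near zero command the
    kernel stays tight, so the robot learns to stand still precisely.
    """
    cmd_mag = torch.norm(commands[:, :2], dim=1)       # (num_envs,)
    sigma = base_sigma * (1.0 + widen * cmd_mag)       # larger cmd -> wider kernel
    error = torch.sum(
        torch.square(commands[:, :2] - base_lin_vel[:, :2]), dim=1
    )
    return torch.exp(-error / sigma)
```

The same absolute error therefore costs less reward at high speed than at standstill, which is one way to keep a single policy stable across the full command range.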
2. Config file: humanoid_config_custom.py (lines 174-216)
In rewards class:
- Adjust reward parameters (e.g., base_height_target, cycle_time, tracking_sigma)
- Modify reward scales in the scales subclass to match your reward functions
- Add new scale entries for new reward functions
- Remove scales for unused reward functions
- Balance reward contributions (e.g., tracking_lin_vel = 1.2, foot_slip = -0.05)
Important: The reward scales in the config must match the reward functions in the environment. If you add a new reward function _reward_my_custom(), you must add a corresponding scale my_custom = <value> in the config.
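The naming convention is: drop the _reward_ prefix in the config. A hypothetical fragment of humanoid_config_custom.py following humanoid-gym's nested-class style (the specific values are illustrative, not tuned):

```python
# Hypothetical fragment of the rewards class in humanoid_config_custom.py.
class rewards:
    base_height_target = 0.89   # target torso height (m); robot-specific
    tracking_sigma = 0.25       # temperature of the tracking kernels

    class scales:
        # Existing terms, rebalanced (example values)
        tracking_lin_vel = 1.2
        tracking_ang_vel = 1.1
        foot_slip = -0.05       # negative scale turns the term into a penalty
        # A new entry must match _reward_my_custom() in the environment:
        my_custom = 0.5
```

If a scale entry has no matching _reward_* method (or vice versa), the term is silently dropped or the run fails at startup, depending on the framework version, so keeping the two files in sync is worth checking first.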
Fixed components:
- Network architecture: 3-layer MLP
- Training: PPO with 10000 iterations
- Observation/action spaces
Evaluation
Trained in Isaac Gym (4096 parallel envs, 10000 iterations), then evaluated on 100 diverse random commands in MuJoCo (sim2sim transfer):
Test procedure:
- Sample 100 random commands from ranges:
- vx: [-0.5, 1.0] m/s
- vy: [-0.4, 0.4] m/s
- dyaw: [-0.5, 0.5] rad/s
- For each command, run 10-second episode in MuJoCo
- Success criteria:
- Robot doesn't fall (base height > 0.3m, tilt < 0.5 rad)
- Velocity tracking error < 0.3 (combined linear + angular)
- Stable for at least 80% of steps
Metrics:
- success_rate: Percentage of commands successfully executed (primary metric)
- avg_vel_error: Average velocity tracking error across all trials
- fall_rate: Percentage of trials where robot fell
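Putting the success criteria and metrics together, the aggregation can be sketched as follows; the official scoring script may differ in detail, and the function name and threshold defaults are assumptions mirroring the criteria above:

```python
import torch

def evaluate_trials(vel_errors, fell, stable_frac,
                    err_thresh=0.3, stable_thresh=0.8):
    """Aggregate per-trial results into the three reported metrics.

    vel_errors:  (num_trials,) combined linear + angular tracking error
    fell:        (num_trials,) bool, True if the robot fell
    stable_frac: (num_trials,) fraction of steps the robot was stable
    """
    # A trial succeeds only if all three criteria hold simultaneously
    success = (~fell) & (vel_errors < err_thresh) & (stable_frac >= stable_thresh)
    return {
        "success_rate": success.float().mean().item() * 100.0,
        "avg_vel_error": vel_errors.mean().item(),
        "fall_rate": fell.float().mean().item() * 100.0,
    }
```

Note that avg_vel_error is averaged over all trials, including falls, so a policy that falls early on hard commands can still show a deceptively low tracking error; success_rate is the primary metric for exactly this reason.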
Success criteria: Good reward functions should achieve:
- High success rate (>70%) across diverse commands
- Low tracking error (<0.3 combined)
- Low fall rate (<20%)
- Robust sim2sim transfer (Isaac Gym → MuJoCo)
Reference Implementation
One baseline is provided:
default: Official humanoid-gym reward functions (unmodified)
- Current best practice from the paper
- Baseline for comparison
- Uses standard reward implementations with custom buffer tracking
Hints
- Custom buffers: Initialize tracking buffers in __init__ for rewards that need history (e.g., air time, clearance)
- Command-adaptive rewards: Scale rewards based on command magnitude
- Balance: Trade off tracking accuracy against stability
- Smooth rewards: Dense, smooth rewards generally work better than sparse rewards
- Sim2sim robustness: Rewards that encourage natural gaits transfer better
- Consider adding rewards for:
- Smooth command following (penalize oscillations)
- Gait consistency across speeds
- Robustness to command changes
- Natural foot placement patterns
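One common shape for the oscillation penalty hinted at above combines first and second action differences; this requires the last two actions to be buffered in __init__, and the function name here is hypothetical:

```python
import torch

def reward_action_smoothness(actions, last_actions, last_last_actions):
    """Penalty on first and second differences of the action sequence.

    All arguments are (num_envs, num_actions) tensors. Returns a
    non-negative value; use it with a negative scale in the config.
    """
    # First difference: penalizes fast action changes
    term_1 = torch.sum(torch.square(actions - last_actions), dim=1)
    # Second difference: specifically penalizes sign-flipping oscillations,
    # which exploit simulator dynamics and transfer poorly to MuJoCo
    term_2 = torch.sum(
        torch.square(actions - 2.0 * last_actions + last_last_actions), dim=1
    )
    return term_1 + term_2
```

A constant action sequence scores zero under both terms, while a high-frequency oscillation is punished by the second-difference term even when its step-to-step change is modest.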
Code
Results
No results available yet.