robo-humanoid-sim2real-algo
Description
Humanoid Robot Sim2Real: Algorithm Design
Objective
Design novel reinforcement learning algorithms for humanoid robot locomotion that achieve robust sim-to-real transfer. You will implement custom algorithm components in the PPO (Proximal Policy Optimization) framework that enable policies to follow diverse velocity commands with natural, stable gaits.
Research Question
What algorithm implementations (network architecture, policy optimization, rollout storage) lead to policies that can successfully execute diverse locomotion commands in sim2sim transfer (Isaac Gym → MuJoCo)?
The key challenge: standard PPO implementations often struggle with diverse command following and sim-to-real transfer. Your algorithm modifications should improve:
- Policy robustness across varied commands (different speeds, directions, turning rates)
- Sample efficiency during training
- Generalization from simulation to simulation (Isaac Gym → MuJoCo)
- Natural, energy-efficient gaits
- Stable transitions between different commands
Background
The humanoid locomotion task requires the robot to track 3D velocity commands:
- vx: forward/backward velocity (m/s)
- vy: lateral velocity (m/s)
- dyaw: yaw angular velocity (rad/s)
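Command tracking in the legged_gym/humanoid-gym family is typically rewarded with an exponential kernel over the tracking error. The sketch below illustrates that shape only; the sigma value is an assumption, and the actual reward functions in this task are fixed and not editable.

```python
import math

def tracking_reward(cmd_vx, cmd_vy, actual_vx, actual_vy, sigma=0.25):
    """Exponential velocity-tracking reward: 1.0 at perfect tracking,
    decaying smoothly with squared error (sigma is an assumed scale)."""
    err_sq = (cmd_vx - actual_vx) ** 2 + (cmd_vy - actual_vy) ** 2
    return math.exp(-err_sq / sigma)
```

The smooth decay matters: unlike a hard threshold, it gives the policy a gradient toward the commanded velocity even when tracking is poor.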
Standard PPO algorithm consists of three main components:
- Actor-Critic Network (actor_critic.py): Neural network architecture with separate actor (policy) and critic (value function) heads
- PPO Optimizer (ppo.py): Policy optimization using clipped surrogate objective, value function loss, and entropy regularization
- Rollout Storage (rollout_storage.py): Experience buffer for collecting and processing trajectories
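The three components cooperate in a collect-then-update loop. The toy sketch below shows that flow with simplified stand-ins; class and method names mirror the rsl_rl-style API that humanoid-gym builds on, but everything here is a stub, not the real interface.

```python
class ToyStorage:
    """Stand-in for RolloutStorage: collects transitions, then turns
    rewards into returns (plain discounted sum, in place of GAE)."""
    def __init__(self):
        self.transitions, self.returns = [], []

    def add_transitions(self, obs, action, reward):
        self.transitions.append((obs, action, reward))

    def compute_returns(self, gamma=0.99):
        g = 0.0
        for *_, r in reversed(self.transitions):
            g = r + gamma * g
            self.returns.insert(0, g)

def rollout(policy, storage, num_steps):
    """Collect a rollout, then post-process it; in the real loop,
    PPO.update(storage) would follow with clipped-objective epochs."""
    obs = 0.0
    for _ in range(num_steps):
        action = policy(obs)   # actor_critic.act(obs) in the real code
        reward = 1.0           # env.step(action) in the real code
        storage.add_transitions(obs, action, reward)
    storage.compute_returns()

storage = ToyStorage()
rollout(lambda obs: 0.0, storage, num_steps=24)
```

Because PPO is on-policy, the buffer is filled fresh every iteration and cleared after the update, unlike an off-policy replay buffer.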
The problem: Standard implementations often:
- Struggle with diverse command distributions
- Have poor sample efficiency on complex locomotion tasks
- Fail to transfer between simulators (Isaac Gym → MuJoCo)
- Produce unnatural or unstable gaits
- Require extensive hyperparameter tuning
Task
Implement custom algorithm components in the EDITABLE SECTIONS:
1. Actor-Critic Network: actor_critic_custom.py (lines 36-128)
In ActorCritic.__init__ method:
- Design custom network architecture (layer sizes, activation functions)
- Add normalization layers (LayerNorm, BatchNorm, custom)
- Implement custom initialization schemes
- Add auxiliary heads or features
In ActorCritic.act method:
- Modify action sampling strategy
- Add custom exploration mechanisms
- Implement action post-processing
In ActorCritic.evaluate_actions method:
- Customize value function computation
- Modify action log probability calculation
- Add auxiliary losses or regularization
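As one concrete instance of the architecture ideas above, here is a sketch of an MLP actor trunk with LayerNorm inserted after each hidden layer. The [512, 256, 128] sizes and ELU activation mirror the stated baseline; the observation/action dimensions are placeholders, and LayerNorm is the added ingredient, not part of the default.

```python
import torch
import torch.nn as nn

def make_actor(num_obs, num_actions, hidden=(512, 256, 128)):
    """Build an MLP that maps observations to action means, with
    LayerNorm after each hidden Linear to stabilize activations."""
    layers, in_dim = [], num_obs
    for h in hidden:
        layers += [nn.Linear(in_dim, h), nn.LayerNorm(h), nn.ELU()]
        in_dim = h
    layers.append(nn.Linear(in_dim, num_actions))
    return nn.Sequential(*layers)

# Placeholder dimensions, not the task's actual observation/action sizes.
actor = make_actor(num_obs=47, num_actions=12)
mean_action = actor(torch.zeros(1, 47))
```

Normalizing hidden activations this way can make training less sensitive to observation scale, which is one plausible reason normalized networks transfer more robustly between simulators.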
2. PPO Optimizer: ppo_custom.py (lines 39-185)
In PPO.__init__ method:
- Configure optimizer settings
- Set up learning rate schedules
- Initialize custom training components
In PPO.update method:
- Modify policy loss computation (clipping strategy, advantage normalization)
- Customize value function loss (Huber loss, clipping, multi-step returns)
- Adjust entropy regularization
- Implement custom gradient clipping or normalization
- Add auxiliary losses (e.g., behavioral cloning, imitation)
3. Rollout Storage: rollout_storage_custom.py (lines 32-182)
In RolloutStorage.__init__ method:
- Design custom buffer structure
- Add additional tracking tensors
In RolloutStorage.add_transitions method:
- Customize how experiences are stored
- Add data augmentation or preprocessing
In RolloutStorage.compute_returns method:
- Modify advantage estimation (GAE parameters, normalization)
- Implement custom return computation (n-step, λ-returns)
- Add reward shaping or preprocessing
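The default advantage estimator referenced above is GAE(λ). A simplified single-environment version of what compute_returns does, using the baseline's γ=0.99 and λ=0.95, might look like this (the batched implementation additionally masks per-environment resets):

```python
import numpy as np

def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.
    Returns (advantages, bootstrapped returns)."""
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # TD residual: one-step bootstrapped error at step t
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future residuals
        gae = delta + gamma * lam * not_done * gae
        adv[t] = gae
        next_value = values[t]
    returns = adv + values  # value-function regression targets
    return adv, returns
```

λ interpolates between low-variance one-step TD (λ=0) and high-variance Monte Carlo returns (λ=1), which is one lever the task invites you to tune.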
Fixed components:
- Environment and reward functions
- Training: 4096 parallel environments, 10000 iterations
- Observation/action spaces
Evaluation
Trained in Isaac Gym (4096 parallel envs, 10000 iterations), then evaluated on 100 diverse random commands in MuJoCo (sim2sim transfer):
Test procedure:
- Sample 100 random commands from ranges:
- vx: [-0.5, 1.0] m/s
- vy: [-0.4, 0.4] m/s
- dyaw: [-0.5, 0.5] rad/s
- For each command, run 10-second episode in MuJoCo
- Success criteria:
- Robot doesn't fall (base height > 0.3m, tilt < 0.5 rad)
- Velocity tracking error < 0.3 (combined linear + angular)
- Stable for at least 80% of steps
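The test procedure above can be sketched as code. The thresholds are taken directly from this document; the per-trial record format (minimum base height, maximum tilt, velocity error, stable-step fraction) is an assumption about how the real harness summarizes each episode.

```python
import random

def sample_eval_command(rng):
    """One random command from the evaluation ranges."""
    return (rng.uniform(-0.5, 1.0),   # vx (m/s)
            rng.uniform(-0.4, 0.4),   # vy (m/s)
            rng.uniform(-0.5, 0.5))   # dyaw (rad/s)

def trial_success(min_height, max_tilt, vel_error, stable_fraction):
    """Apply the three success criteria to one 10-second trial."""
    no_fall = min_height > 0.3 and max_tilt < 0.5
    return no_fall and vel_error < 0.3 and stable_fraction >= 0.8

rng = random.Random(0)
commands = [sample_eval_command(rng) for _ in range(100)]
```

Note that a trial can fail without a fall: drifting off the commanded velocity is enough, so success_rate is stricter than (1 − fall_rate).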
Metrics:
- success_rate: Percentage of commands successfully executed (primary metric)
- avg_vel_error: Average velocity tracking error across all trials
- fall_rate: Percentage of trials where robot fell
Success criteria: Good algorithms should achieve:
- High success rate (>70%) across diverse commands
- Low tracking error (<0.3 combined)
- Low fall rate (<20%)
- Robust sim2sim transfer (Isaac Gym → MuJoCo)
Reference Implementation
One baseline is provided:
default: Standard PPO implementation from humanoid-gym
- 3-layer MLP with [512, 256, 128] hidden units
- Standard PPO loss with clipping (ε=0.2)
- GAE with λ=0.95, γ=0.99
- Baseline for comparison
Hints
- Network architecture: Deeper networks, normalization layers, or residual connections may improve learning
- Advantage normalization: Proper normalization can stabilize training
- Value function: Accurate value estimates improve policy learning
- Exploration: Entropy regularization or noise injection can help exploration
- Sample efficiency: Better advantage estimation or multi-step returns can improve efficiency
- Sim2sim transfer: Algorithms that learn robust features transfer better
- Consider:
- Adaptive learning rates or schedules
- Custom loss weightings (policy vs value vs entropy)
- Gradient clipping strategies
- Observation/action normalization
- Auxiliary tasks or losses
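One concrete instance of "adaptive learning rates" is the KL-based rule used in rsl_rl-style PPO (schedule='adaptive'): shrink the learning rate when the measured KL divergence between old and new policies overshoots a target, and grow it when updates are too timid. The desired_kl=0.01 target and the 1.5x factors follow that convention but are assumptions here.

```python
def adapt_learning_rate(lr, kl, desired_kl=0.01, lr_min=1e-5, lr_max=1e-2):
    """KL-adaptive learning-rate rule for PPO updates."""
    if kl > desired_kl * 2.0:        # policy moved too far: slow down
        lr = max(lr_min, lr / 1.5)
    elif kl < desired_kl / 2.0:      # policy barely moved: speed up
        lr = min(lr_max, lr * 1.5)
    return lr
```

This couples the step size to the actual policy change rather than a fixed schedule, which is often cited as reducing the hyperparameter tuning burden the problem statement mentions.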
Code
# SPDX-FileCopyrightText: Copyright (c) 2021 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2021 ETH Zurich, Nikita Rudin
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
#    list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
#    this list of conditions and the following disclaimer in the documentation
#    and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
Additional context files (read-only):
- humanoid-gym/humanoid/algo/ppo/actor_critic.py
- humanoid-gym/humanoid/algo/ppo/ppo.py
- humanoid-gym/humanoid/algo/ppo/rollout_storage.py
Results
No results yet.