# meta-rl-algorithm: Meta-RL Algorithm Design
## Objective
Design a complete meta-reinforcement learning algorithm for fast adaptation to new tasks from limited interaction data. You must implement both the agent (how to encode context and condition the policy) and the training algorithm (how to meta-train the agent across tasks).
## Background
Meta-RL algorithms learn to learn: they train across a distribution of tasks so that at test time, the agent can quickly adapt to a new, unseen task from just a few interactions. The key challenge is designing:
- Task inference: How to encode past experience (context) into a compact task representation
- Policy conditioning: How to condition the policy on this task representation
- Meta-training: How to optimize the agent across tasks so it generalizes to new ones
Different approaches exist: PEARL uses a probabilistic encoder with product-of-Gaussians aggregation; FOCAL uses contrastive learning for task embeddings; VariBAD uses a recurrent encoder with reward prediction.
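PEARL's product-of-Gaussians aggregation, for example, combines per-transition Gaussian posterior factors into a single task posterior in closed form: precisions add, and the mean is precision-weighted. A minimal sketch (function name and tensor shapes are illustrative, not part of the template's API):

```python
import torch

def product_of_gaussians(mus, sigmas_sq, eps=1e-7):
    """Combine N Gaussian factors N(mu_i, sigma_i^2) into one Gaussian.

    mus, sigmas_sq: tensors of shape (N, latent_dim), one row per
    context transition. Returns the mean and variance of the product.
    """
    sigmas_sq = torch.clamp(sigmas_sq, min=eps)     # avoid division by zero
    precisions = 1.0 / sigmas_sq
    sigma_sq = 1.0 / precisions.sum(dim=0)          # combined variance
    mu = sigma_sq * (mus * precisions).sum(dim=0)   # precision-weighted mean
    return mu, sigma_sq
```

Because the product is order-independent, this aggregation is permutation-invariant in the context, which is one reason PEARL pairs it with an MLP encoder rather than an RNN.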
## Your Task
Modify the `CustomMetaRLAgent` and `CustomMetaRLAlgorithm` classes in `custom_meta_rl.py`. The template provides fixed infrastructure (environment setup, evaluation, replay buffers, network building blocks) — you design the algorithm.
### Agent Interface (`CustomMetaRLAgent`)
Your agent must implement:
- `get_action(obs, deterministic=False) -> (action_np, agent_info)` — sample an action conditioned on the task belief
- `update_context(transition_tuple) -> None` — accumulate online experience (called during rollout)
- `adapt() -> None` — perform task inference from collected context (called after exploration)
- `clear_context(num_tasks=1) -> None` — reset context and task belief
- `infer_posterior(context_tensor) -> None` — encode context from the replay buffer (for training)
- `context` property — return collected context
- `z` attribute — latent task variable tensor
- `networks` property — list of `nn.Module`s for GPU transfer
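As a reference point, here is a hedged sketch of a class satisfying this interface, using an MLP encoder with mean pooling and a deterministic `z`. The class name, the `(obs, action, reward)` context layout, and the exploration-noise scale are illustrative assumptions, not the template's actual implementation:

```python
import numpy as np
import torch
import torch.nn as nn

class SketchMetaRLAgent(nn.Module):
    """Minimal sketch of the required agent interface (illustrative only)."""

    def __init__(self, obs_dim, action_dim, latent_dim=5, hidden=64):
        super().__init__()
        ctx_dim = obs_dim + action_dim + 1  # one (s, a, r) row per transition
        self.encoder = nn.Sequential(
            nn.Linear(ctx_dim, hidden), nn.ReLU(), nn.Linear(hidden, latent_dim))
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())
        self.latent_dim = latent_dim
        self.clear_context()

    def clear_context(self, num_tasks=1):
        self._context = []                                # raw transitions
        self.z = torch.zeros(num_tasks, self.latent_dim)  # task belief

    def update_context(self, transition):
        obs, action, reward = transition  # assumed tuple layout
        self._context.append(np.concatenate([obs, action, [reward]]))

    def adapt(self):
        if self._context:
            ctx = torch.as_tensor(np.array(self._context), dtype=torch.float32)
            self.infer_posterior(ctx.unsqueeze(0))  # add a task dimension

    def infer_posterior(self, context_tensor):
        # (num_tasks, N, ctx_dim) -> permutation-invariant mean pooling
        self.z = self.encoder(context_tensor).mean(dim=1)

    def get_action(self, obs, deterministic=False):
        obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        action = self.policy(torch.cat([obs_t, self.z[:1]], dim=-1))
        if not deterministic:
            action = action + 0.1 * torch.randn_like(action)  # crude exploration
        return action.squeeze(0).detach().numpy(), {}

    @property
    def context(self):
        return self._context

    @property
    def networks(self):
        return [self.encoder, self.policy]
```

A real submission would replace the Tanh network with the template's `TanhGaussianPolicy` and make `z` probabilistic, but the method signatures and call order match the interface above.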
### Algorithm Interface (`CustomMetaRLAlgorithm`)
Your algorithm must implement:
- `collect_initial_data()` — gather initial exploration data for all training tasks
- `train_iteration(iteration_idx) -> dict` — one meta-training iteration (data collection + gradient updates)
- `networks` property — all networks for GPU transfer
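The core of `train_iteration` is usually: sample a context batch and an RL batch per task, infer `z` from the context, and backpropagate a joint loss through policy and encoder. A runnable sketch of that shape, where the sampler callables and the squared-action loss are hypothetical stand-ins for the template's utilities and the actual RL objective:

```python
import torch

def train_iteration_sketch(encoder, policy, optimizer,
                           sample_context, sample_rl_batch,
                           task_indices, num_updates=1):
    """One illustrative meta-training iteration over a set of tasks.

    encoder: maps (tasks, N, ctx_dim) context to per-transition embeddings.
    sample_context / sample_rl_batch: stand-ins for the template's
    sample_context_from_buffer / sample_sac_batch utilities.
    """
    stats = {}
    for _ in range(num_updates):
        context = sample_context(task_indices)      # (tasks, N, ctx_dim)
        z = encoder(context).mean(dim=1)            # pooled task embeddings
        obs = sample_rl_batch(task_indices)         # (tasks, B, obs_dim)
        z_exp = z.unsqueeze(1).expand(-1, obs.shape[1], -1)
        actions = policy(torch.cat([obs, z_exp], dim=-1))
        loss = (actions ** 2).mean()  # placeholder for the real RL loss
        optimizer.zero_grad()
        loss.backward()               # gradients reach encoder through z
        optimizer.step()
        stats["loss"] = loss.item()
    return stats
```

Because `z` stays in the graph, the encoder is trained by whatever loss conditions on it — the design choice is which loss that is (critic backprop as in PEARL, contrastive as in FOCAL, or reward prediction as in VariBAD).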
### Available Utilities
The template provides these fixed utilities you can use:
- `build_mlp(input_dim, output_dim, hidden_dim, n_layers)` — simple MLP
- `build_policy(obs_dim, action_dim, latent_dim, net_size)` — TanhGaussianPolicy
- `build_qf(obs_dim, action_dim, latent_dim, net_size)` — Q-function
- `build_vf(obs_dim, latent_dim, net_size)` — V-function
- `create_replay_buffers(env, tasks)` — replay buffer pair
- `sample_context_from_buffer(enc_replay_buffer, indices, batch_size, ...)` — sample context
- `sample_sac_batch(replay_buffer, indices, batch_size)` — sample RL batch
- `collect_data(agent, env, sampler, replay_buffer, enc_replay_buffer, ...)` — collect trajectories
- `InPlacePathSampler` from rlkit — trajectory sampler
## Environments
Three MuJoCo environments with different challenges:
- **Half-Cheetah Velocity** (`cheetah-vel`): 30 train / 10 test tasks. Target velocities in [0, 3] m/s. Obs dim 20, action dim 6. Dense reward (velocity matching). High-dimensional observations require strong encoding.
- **Sparse Point Robot** (`sparse-point-robot`): 40 train / 10 test tasks. Goals on a half-circle, sparse reward (+1 near goal, 0 otherwise). Obs dim 2, action dim 2. Sparse reward makes task inference especially challenging.
- **Point Robot** (`point-robot`): 40 train / 10 test tasks. Goals in [-1, 1]^2. Dense reward (negative L2 distance). Obs dim 2, action dim 2. Tests basic meta-learning quality.
## Evaluation
Performance is measured by `meta_test_return` on each environment: the average return on held-out test tasks after meta-training. The evaluation protocol collects exploration trajectories, calls `agent.adapt()`, then evaluates with a deterministic policy.
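The stated protocol can be sketched as the loop below. This is an illustration of the call order your agent must support, not the template's evaluator; it assumes a classic Gym-style `env.step` returning `(obs, reward, done, info)` and the episode counts are arbitrary:

```python
def meta_test_sketch(agent, env, num_explore_episodes=2, max_steps=100):
    """Explore to gather context, adapt, then score a deterministic rollout."""
    # Phase 1: exploration rollouts, accumulating context online.
    agent.clear_context()
    for _ in range(num_explore_episodes):
        obs, done, steps = env.reset(), False, 0
        while not done and steps < max_steps:
            action, _ = agent.get_action(obs)
            next_obs, reward, done, _ = env.step(action)
            agent.update_context((obs, action, reward))
            obs, steps = next_obs, steps + 1

    # Phase 2: task inference from the collected context.
    agent.adapt()

    # Phase 3: deterministic evaluation rollout.
    obs, done, total, steps = env.reset(), False, 0.0, 0
    while not done and steps < max_steps:
        action, _ = agent.get_action(obs, deterministic=True)
        obs, reward, done, _ = env.step(action)
        total, steps = total + reward, steps + 1
    return total
```

Note that on the sparse environments the exploration rollouts must actually find reward for `adapt()` to have any signal, which is why exploration behavior is itself a design decision.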
## Key Design Dimensions
- Context encoding: Permutation-invariant (MLP + aggregation) vs. sequential (RNN/GRU) vs. attention
- Task variable: Probabilistic (information bottleneck) vs. deterministic
- Encoder loss: KL divergence, contrastive, reward prediction, or reconstruction
- RL algorithm: SAC variants, policy gradient, or other
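If you choose a probabilistic task variable, the information-bottleneck term is a KL divergence from the inferred posterior to a prior (typically a standard normal), and `z` is drawn with the reparameterization trick so encoder gradients flow. A sketch of both pieces, with the log-variance parameterization as an assumption:

```python
import torch

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, I) ), summed over latent dims.

    This is the bottleneck penalty used by PEARL/VariBAD-style encoders
    to keep z from memorizing the context.
    """
    return 0.5 * (log_var.exp() + mu ** 2 - 1.0 - log_var).sum(dim=-1)

def sample_z(mu, log_var):
    """Reparameterized sample: gradients reach mu and log_var."""
    return mu + torch.randn_like(mu) * (0.5 * log_var).exp()
```

The KL weight trades off adaptation speed against generalization: too large and `z` ignores the context, too small and the policy overfits to training-task embeddings.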
## Code
```python
"""Custom meta-RL algorithm template for meta-rl-algorithm task.

FIXED infrastructure (not editable): environment setup, network building blocks,
replay buffers, sampler, evaluation protocol, and outer training loop.
EDITABLE region: CustomMetaRLAgent and CustomMetaRLAlgorithm classes.
"""
import os
import sys
import copy
import argparse
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
```
Additional context files (read-only):
- `oyster/rlkit/torch/networks.py`
- `oyster/rlkit/torch/sac/policies.py`
- `oyster/configs/default.py`
## Results
| Model | Type | meta test return point robot ↑ | meta test return cheetah vel ↑ | meta test return sparse point robot ↑ |
|---|---|---|---|---|
| focal | baseline | -12.862 | -91.923 | 0.233 |
| pearl | baseline | -15.468 | -64.634 | 5.491 |
| varibad | baseline | -12.494 | -69.431 | 0.000 |
| anthropic/claude-opus-4.6 | vanilla | -11.160 | -74.851 | 0.256 |
| anthropic/claude-opus-4.6 | vanilla | -14.165 | -64.144 | 4.850 |
| anthropic/claude-opus-4.6 | vanilla | -15.411 | -67.187 | 4.876 |
| deepseek-reasoner | vanilla | -22.428 | -277.317 | 0.000 |
| openai/gpt-5.4-pro | vanilla | -15.459 | -52.637 | 0.000 |
| openai/gpt-5.4-pro | vanilla | -11.774 | -74.712 | 0.000 |
| openai/gpt-5.4-pro | vanilla | -9.803 | -87.606 | 0.000 |
| anthropic/claude-opus-4.6 | agent | -14.056 | -56.088 | 1.098 |
| anthropic/claude-opus-4.6 | agent | -15.326 | -57.070 | 0.000 |
| anthropic/claude-opus-4.6 | agent | -13.009 | -55.973 | 4.140 |
| deepseek-reasoner | agent | -22.430 | -276.766 | 0.000 |
| deepseek-reasoner | agent | -22.428 | -277.418 | 0.000 |
| deepseek-reasoner | agent | -22.429 | -277.236 | 0.000 |
| openai/gpt-5.4-pro | agent | -11.934 | -80.094 | 1.045 |