Maximum Entropy RL for High-Dimensional Humanoid Control

FastDSAC

Unlocking stochastic policies in high-dimensional humanoid control with dimension-wise entropy modulation and a continuous distributional critic.


High-dimensional HumanoidBench control

35 benchmark tasks

~880 / ~700 Basketball / Balance Hard returns

Overview

Fast stochastic control without giving up high-throughput training.

Modern humanoid RL is dominated by deterministic policy gradients because they remain stable in massively parallel simulators. FastDSAC shows that maximum entropy RL can compete in this regime when exploration and value estimation are redesigned for large action spaces.

The method keeps the efficient FastTD3-style training recipe, but replaces uniform action noise with a learned exploration budget and models returns with a continuous Gaussian critic.

Basketball: precision whole-body coordination.

Balance Hard: stability under high-dimensional control.

Method

Two algorithmic components let the method scale.

Dimension-wise Entropy Modulation

DEM learns an action-wise variance budget, suppressing noise on precision-critical joints while preserving exploration capacity on redundant or task-irrelevant dimensions.
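To make the idea of a per-dimension variance budget concrete, here is a minimal sketch of one way it could be realized. It assumes a diagonal Gaussian policy and one temperature per action dimension, updated in the style of the standard SAC temperature rule. The function names, the per-dimension targets, and the update rule are illustrative assumptions, not the paper's exact formulation, and tanh squashing is ignored.

```python
import math

def gaussian_entropy_per_dim(log_std):
    """Differential entropy of each independent Gaussian action dimension."""
    const = 0.5 * math.log(2.0 * math.pi * math.e)
    return [const + ls for ls in log_std]

def update_log_alpha(log_alpha, entropies, targets, lr=0.1):
    """One SAC-style temperature step per dimension (assumed rule).

    alpha_i rises when dimension i is below its entropy target and
    falls when it is above, so each joint gets its own noise budget.
    """
    return [la - lr * (h - t) for la, h, t in zip(log_alpha, entropies, targets)]

def entropy_bonus(log_alpha, entropies):
    """Entropy term added to the actor objective, weighted per dimension."""
    return sum(math.exp(la) * h for la, h in zip(log_alpha, entropies))

# Hypothetical two-joint example: joint 0 is noisy, joint 1 is already precise.
log_std = [0.0, -2.0]
targets = [2.0, -2.0]      # assumed per-dimension entropy targets
log_alpha = [0.0, 0.0]

entropies = gaussian_entropy_per_dim(log_std)
new_log_alpha = update_log_alpha(log_alpha, entropies, targets)
bonus = entropy_bonus(new_log_alpha, entropies)
```

In this toy setup the temperature for joint 0 increases because its entropy is below target, while the temperature for joint 1 decreases. A single shared temperature cannot express that distinction.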

Continuous Distributional Critic

A Gaussian return critic avoids fixed C51 supports and quantization artifacts, improving value fidelity across tasks with different reward scales.
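As a rough illustration of what a continuous Gaussian critic optimizes, the sketch below scores a sampled TD target under a predicted mean and log standard deviation using the Gaussian negative log-likelihood. Because the distribution has no fixed support, no value range needs to be chosen per task. The helper names and the plain TD target are assumptions for illustration. The paper's actual loss may add details, such as variance bounds or target clipping, that are not shown here.

```python
import math

def td_target(reward, next_return_sample, done, gamma=0.99):
    """One-step target built from a return sample drawn at the next state."""
    return reward + gamma * (1.0 - done) * next_return_sample

def gaussian_nll(mean, log_std, target):
    """Negative log-likelihood of `target` under N(mean, exp(log_std)^2)."""
    var = math.exp(2.0 * log_std)
    return 0.5 * math.log(2.0 * math.pi) + log_std + 0.5 * (target - mean) ** 2 / var

# Hypothetical transition: reward 1.0, sampled next-state return 10.0.
target = td_target(reward=1.0, next_return_sample=10.0, done=0.0)

# A critic whose mean matches the target incurs a lower loss than one that is far off.
loss_good = gaussian_nll(mean=target, log_std=0.0, target=target)
loss_bad = gaussian_nll(mean=0.0, log_std=0.0, target=target)
```

The same loss applies whether returns are in the tens or the thousands, whereas a categorical critic needs its support bounds retuned for each reward scale.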

FastDSAC architecture.

Results

Returns of roughly 880 on Basketball and 700 on Balance Hard.

The main benchmark panels come first. Multi-seed interquartile mean (IQM) curves with 95% confidence intervals then provide statistical evidence on the hardest tasks: Basketball, Balance Hard, and Window.
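For readers unfamiliar with the metric, the following sketch shows how an IQM and a percentile-bootstrap 95% confidence interval can be computed from per-seed scores. The scores are placeholders, not results from the paper, and the simple resampling scheme is an assumption. The paper's evaluation may use a different bootstrap, for example one stratified across tasks.

```python
import random

def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))  # number of values trimmed from each end
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)

def bootstrap_ci(scores, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the IQM, resampling seeds with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    stats = sorted(
        iqm([rng.choice(scores) for _ in range(n)]) for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    low = stats[int(tail * n_boot)]
    high = stats[int((1.0 - tail) * n_boot) - 1]
    return low, high

# Placeholder per-seed scores for illustration only.
seed_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
point = iqm(seed_scores)            # mean of [3, 4, 5, 6]
low, high = bootstrap_ci(seed_scores)
```

Trimming the top and bottom quarter makes the IQM less sensitive to a single outlier seed than the mean, while using more of the data than the median.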

Main aggregate benchmark comparison from the paper.

Basketball and Balance Hard: multi-seed IQM with 95% confidence intervals.

Window: multi-seed IQM with 95% confidence intervals.

Full HumanoidBench benchmark results.

IsaacLab and MuJoCo Playground full results.

Wall-clock comparison against FastTD3.

Demos

HumanoidBench comparisons, MuJoCo Playground, and real G1 transfer.

HumanoidBench: FastDSAC vs FastTD3

Basketball · FastDSAC

Basketball · FastTD3

Balance Hard · FastDSAC

Balance Hard · FastTD3

MuJoCo Playground

G1 Flat

G1 Rough

T1 Flat

T1 Rough

Unitree G1 zero-shot deployment

Forward / backward locomotion

Motion tracking

Robustness test

Citation

FastDSAC preprint

@article{xue2026fastdsac,
  title   = {FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control},
  author  = {Xue, Jun and Wang, Junze and Wang, Shanze and Zhang, Xinming and Chen, Yanjun and Zhang, Wei},
  journal = {arXiv preprint arXiv:2603.12612},
  year    = {2026},
  doi     = {10.48550/arXiv.2603.12612},
  url     = {https://arxiv.org/abs/2603.12612}
}