Maximum Entropy RL for High-Dimensional Humanoid Control

FastDSAC

Unlocking stochastic policies in high-dimensional humanoid control with dimension-wise entropy modulation and a continuous distributional critic.

High-dimensional HumanoidBench control
35 benchmark tasks
~880 / ~700 returns on Basketball / Balance Hard

Fast stochastic control without giving up high-throughput training.

Modern humanoid RL is dominated by deterministic policy gradients because they remain stable in massively parallel simulators. FastDSAC shows that maximum entropy RL can compete in this regime when exploration and value estimation are redesigned for large action spaces.

The method keeps the efficient FastTD3-style training recipe, but replaces uniform action noise with a learned exploration budget and models returns with a continuous Gaussian critic.

Basketball: precision whole-body coordination.
Balance Hard: stability under high-dimensional control.

Two algorithmic pieces carry the scaling.

01

Dimension-wise Entropy Modulation

DEM learns an action-wise variance budget, suppressing noise on precision-critical joints while preserving exploration capacity on redundant or task-irrelevant dimensions.
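One way to picture an action-wise variance budget is as a fixed amount of policy entropy that gets shared unevenly across action dimensions. The sketch below is an illustration under stated assumptions, not the paper's DEM update: it assumes a diagonal Gaussian policy, ignores tanh squashing, and uses hypothetical per-dimension `scores` and a hypothetical `target_entropy` to stand in for whatever the method actually learns.

```python
import math

LOG_2PI_E = math.log(2.0 * math.pi * math.e)


def gaussian_entropy(log_stds):
    """Entropy (in nats) of a diagonal Gaussian with the given per-dimension log-stds."""
    return sum(0.5 * LOG_2PI_E + ls for ls in log_stds)


def allocate_log_stds(scores, target_entropy):
    """Spread a fixed total entropy across action dimensions.

    `scores` are hypothetical learned per-dimension preferences: a higher
    score gives that dimension a larger share of the exploration noise.
    """
    dim = len(scores)
    mean_score = sum(scores) / dim
    # Average log-std that makes the total entropy equal the target.
    base = target_entropy / dim - 0.5 * LOG_2PI_E
    # Zero-mean offsets move noise between dimensions without changing the total.
    return [base + (s - mean_score) for s in scores]


# Hypothetical 3-dimensional action: dimension 0 is precision-critical,
# dimension 2 is redundant and can absorb more noise.
scores = [-2.0, 0.0, 2.0]
log_stds = allocate_log_stds(scores, target_entropy=-3.0)
stds = [math.exp(ls) for ls in log_stds]
```

Because the offsets sum to zero, the total entropy stays at the target while the standard deviation shrinks on the low-score dimension and grows on the high-score one. A uniform noise scale cannot express that trade-off.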

02

Continuous Distributional Critic

A Gaussian return critic avoids fixed C51 supports and quantization artifacts, improving value fidelity across tasks with different reward scales.
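To make the contrast with a fixed categorical support concrete, the following minimal sketch compares a Gaussian negative log-likelihood on a scalar return target with the clipping that a bounded set of atoms forces. The function names, the support bounds, and the atom count are all assumptions chosen for illustration; the paper's actual critic loss may differ.

```python
import math


def gaussian_nll(mean, log_std, target):
    """Negative log-likelihood of a scalar target under N(mean, exp(log_std)^2)."""
    var = math.exp(2.0 * log_std)
    return 0.5 * math.log(2.0 * math.pi) + log_std + (target - mean) ** 2 / (2.0 * var)


def clip_to_support(target, v_min, v_max):
    """A fixed categorical support can only represent targets inside [v_min, v_max]."""
    return min(max(target, v_min), v_max)


def atom_spacing(v_min, v_max, n_atoms):
    """Resolution of an evenly spaced categorical support."""
    return (v_max - v_min) / (n_atoms - 1)


# Hypothetical numbers: a return target of 880 against a support of [-250, 250].
target = 880.0
clipped = clip_to_support(target, v_min=-250.0, v_max=250.0)  # saturates at v_max
spacing = atom_spacing(-250.0, 250.0, n_atoms=101)            # gap between atoms

# The Gaussian critic has no support to outgrow: the loss stays finite and
# informative at any reward scale.
loss = gaussian_nll(mean=850.0, log_std=math.log(20.0), target=target)
```

With a categorical critic the support bounds and atom count have to be retuned whenever the reward scale changes, whereas the Gaussian parameterization only needs a mean and a scale.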

FastDSAC architecture.

Basketball at ~880 and Balance Hard at ~700.

Main benchmark panels come first. Multi-seed IQM curves with 95% confidence intervals then provide statistical evidence for the hardest standout tasks.
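For readers unfamiliar with the metric, the interquartile mean (IQM) averages the middle 50% of scores, and a percentile bootstrap is one common way to attach a 95% confidence interval to it. The sketch below is a plain-Python illustration with toy numbers. It is not the evaluation code behind the figures, and the resampling scheme is an assumption.

```python
import random


def iqm(scores):
    """Interquartile mean: drop the lowest and highest 25% of scores, average the rest."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the IQM."""
    rng = random.Random(seed)
    # Resample the scores with replacement and recompute the IQM each time.
    stats = sorted(
        iqm([rng.choice(scores) for _ in scores]) for _ in range(n_resamples)
    )
    low = stats[int((alpha / 2) * n_resamples)]
    high = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Toy per-seed scores, not results from the paper.
toy_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
point = iqm(toy_scores)  # averages 3, 4, 5 and 6
low, high = bootstrap_ci(toy_scores)
```

The IQM is less sensitive to a single outlier seed than the mean and uses more of the data than the median, which is why it suits comparisons built from a handful of seeds.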

Main benchmark comparison from the paper (latest aggregate results).
Basketball and Balance Hard: multi-seed IQM with 95% confidence intervals.
Window: multi-seed IQM with 95% confidence intervals.
Full HumanoidBench benchmark results.
IsaacLab and MuJoCo Playground full results.
Wall-clock comparison against FastTD3.

HumanoidBench comparisons, MuJoCo Playground, and real G1 transfer.

HumanoidBench: FastDSAC vs FastTD3

Basketball · FastDSAC

Basketball · FastTD3

Balance Hard · FastDSAC

Balance Hard · FastTD3

MuJoCo Playground

G1 Flat

G1 Rough

T1 Flat

T1 Rough

Unitree G1 zero-shot deployment

Forward / backward locomotion

Motion tracking

Robustness test

FastDSAC preprint

@article{xue2026fastdsac,
  title   = {FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control},
  author  = {Xue, Jun and Wang, Junze and Wang, Shanze and Zhang, Xinming and Chen, Yanjun and Zhang, Wei},
  journal = {arXiv preprint arXiv:2603.12612},
  year    = {2026},
  doi     = {10.48550/arXiv.2603.12612},
  url     = {https://arxiv.org/abs/2603.12612}
}