- Entropy-TRPO Model Weights
- Repository layout
- Notation
- Variant definitions
- Environments
- Variants and paper sources
- Usage
- Citation
- Training progress
- Humanoid-v5 (1M benchmark)
- HumanoidStandup-v5 (1M benchmark)
- CartPole-v1 (CARTPOLE_100K benchmark)
- CartPole-v1 (CARTPOLE_50K benchmark)
- CartPole-v1 (CARTPOLE_XU benchmark)
- Available checkpoints
Entropy-TRPO Model Weights
PyTorch checkpoints for the comparative study in A Review of Entropy-Based Extensions to Trust Region Policy Optimization.
Repository layout
Each checkpoint directory contains:
| File | Description |
|---|---|
| policy.pt | Policy network state dict |
| value.pt | Value network state dict |
| config.json | Training hyperparameters |
| metadata.json | Paper source, variant flags, final metrics |
Hub path: {env_id}/{variant}/latest/ (e.g. CartPole-v1/entrpo/latest/).
The repo README is updated automatically during training with a Training progress table
(epoch n/N, eval return, best return, KL) from results/summary.md, plus a JSON index of
available checkpoints.
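As a small illustration of the layout above, the following sketch builds the Hub-relative paths for one checkpoint directory. The helper name `checkpoint_paths` is hypothetical and not part of the repository; only the four file names and the `{env_id}/{variant}/latest/` pattern come from this README.

```python
from posixpath import join
from typing import Dict

# File names listed in the repository layout table.
CHECKPOINT_FILES = ("policy.pt", "value.pt", "config.json", "metadata.json")


def checkpoint_paths(env_id: str, variant: str) -> Dict[str, str]:
    """Map each checkpoint file name to its Hub-relative path.

    Hypothetical helper: it only follows the {env_id}/{variant}/latest/
    pattern and does not contact the Hub.
    """
    base = join(env_id, variant, "latest")
    return {name: join(base, name) for name in CHECKPOINT_FILES}


paths = checkpoint_paths("CartPole-v1", "entrpo")
print(paths["policy.pt"])  # CartPole-v1/entrpo/latest/policy.pt
```

Since the `.pt` files are described as state dicts, they would presumably be read with `torch.load` and applied with `load_state_dict` to networks built from `config.json`; the network classes themselves live in the training code.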
Notation
- $\rho_t(\theta)=\pi_\theta(a_t|s_t)/\pi_{\theta_{\text{old}}}(a_t|s_t)$, GAE advantages $\hat{A}_t$, trust-region radius $\delta$
- $\alpha$: entropy weight. In the Roostaie (EnTRPO) rows it scales the entropy added to the advantage; in the Xu ERO row it scales the entropy term in the objective. The roles are distinct, but each paper uses the same symbol.
- $\beta$: Xu ERC constraint coefficient (Xu Eq. 49)
- $c_{\mathrm{ent}}$: PPO entropy bonus coefficient (Schulman et al., 2017; config field entropy_coef)
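To make the notation concrete, here is a minimal pure-Python sketch of the probability ratio $\rho_t$ and of GAE advantages $\hat{A}_t$. It is not taken from the training code: the function names, the default discount `gamma` and GAE parameter `lam`, and the assumption that the trajectory segment contains no episode boundary are all illustrative choices.

```python
import math
from typing import List, Sequence


def probability_ratio(logp_new: float, logp_old: float) -> float:
    """rho_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), from log-probabilities."""
    return math.exp(logp_new - logp_old)


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    gamma: float = 0.99,  # assumed discount, not read from config.json
    lam: float = 0.95,    # assumed GAE lambda, not read from config.json
) -> List[float]:
    """Generalized advantage estimates for one segment without terminations.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value  # bootstrap value for the state after the segment
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# With gamma = lam = 1 and a zero value function, A_t is the reward-to-go.
print(gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=1.0))
# [3.0, 2.0, 1.0]
```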
Variant definitions
| Key | Paper name | Surrogate / constraint |
|---|---|---|
| trpo | TRPO | $\mathbb{E}[\rho_t \hat{A}_t]$; $\bar{D}_{\mathrm{KL}} \le \delta$ |
| entrpo_entropy | EnTRPO-Entropy | $\mathbb{E}[\rho_t \tilde{A}_t]$, $\tilde{A}_t=\hat{A}_t+\alpha\,\mathcal{H}(\pi_{\theta_{\text{old}}}(\cdot \mid s_t))$ (fixed during step); $\bar{D}_{\mathrm{KL}} \le \delta$ |
| ero_trpo | ERO-TRPO | $\mathbb{E}[\rho_t \hat{A}_t]+\alpha\,\mathbb{E}[\mathcal{H}(\pi_\theta)]$; $\bar{D}_{\mathrm{KL}} \le \delta$ |
| erc_trpo | ERC-TRPO | $\mathbb{E}[\rho_t \hat{A}_t]$; $\bar{D}_{\mathrm{KL}} \le \delta+\beta\,\mathbb{E}[\mathcal{H}(\pi_\theta)]$ (Xu Eq. 49) |
| entrpo_buffer | EnTRPO-Buffer | $\mathbb{E}[\rho_t \hat{A}_t]$ with Roostaie on-policy replay |
| entrpo | EnTRPO | $\mathbb{E}[\rho_t \tilde{A}_t]$ + Roostaie buffer |
| ppo | PPO | $\mathbb{E}[\min(\rho_t \hat{A}_t,\mathrm{clip}(\rho_t)\hat{A}_t)]+c_{\mathrm{ent}}\,\mathbb{E}[\mathcal{H}(\pi_\theta)]$ |
$\mathcal{H}$ in EnTRPO rows is evaluated at the behavior policy $\pi_{\theta_{\text{old}}}$; in ERO/ERC/PPO rows at the candidate policy $\pi_\theta$.
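The surrogates in the table can be written down directly for a categorical policy. The sketch below is an illustration under stated assumptions rather than the repository's implementation: expectations are replaced by batch means, each distribution is a plain list of action probabilities, all function names are hypothetical, and the PPO clip range `clip_eps=0.2` is an assumed value that does not appear in this README.

```python
import math
from typing import Iterable, List, Sequence

Probs = Sequence[float]  # one categorical action distribution


def entropy(probs: Probs) -> float:
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean(values: Iterable[float]) -> float:
    values = list(values)
    return sum(values) / len(values)


def trpo_surrogate(ratios: Sequence[float], advantages: Sequence[float]) -> float:
    """Batch estimate of E[rho_t * A_t]."""
    return mean(r * a for r, a in zip(ratios, advantages))


def entrpo_advantages(
    advantages: Sequence[float], old_probs: Sequence[Probs], alpha: float
) -> List[float]:
    """A~_t = A^_t + alpha * H(pi_old(.|s_t)); uses the behavior policy only."""
    return [a + alpha * entropy(p) for a, p in zip(advantages, old_probs)]


def ero_surrogate(ratios, advantages, new_probs: Sequence[Probs], alpha: float) -> float:
    """E[rho_t * A_t] + alpha * E[H(pi_theta)]; entropy of the candidate policy."""
    return trpo_surrogate(ratios, advantages) + alpha * mean(entropy(p) for p in new_probs)


def ppo_surrogate(ratios, advantages, new_probs, c_ent: float, clip_eps: float = 0.2) -> float:
    """Clipped surrogate plus entropy bonus; clip_eps is an assumed value."""
    clipped = (
        min(r * a, min(max(r, 1.0 - clip_eps), 1.0 + clip_eps) * a)
        for r, a in zip(ratios, advantages)
    )
    return mean(clipped) + c_ent * mean(entropy(p) for p in new_probs)


uniform = [0.5, 0.5]
print(entrpo_advantages([1.0], [uniform], alpha=0.1))
print(ero_surrogate([1.0, 1.0], [1.0, -1.0], [uniform, uniform], alpha=0.5))
```

The only structural difference between `entrpo_advantages` and `ero_surrogate` is where the entropy enters: the former adds a constant to each advantage before the step, while the latter adds a term that depends on the candidate policy and therefore changes during the update.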
ERC-TRPO implementation: follows Xu Table 1 (two CG solves, $\eta\mathbf{u}+\beta\mathbf{v}$ step scaling) and the Eq. 49 line-search acceptance test $\bar{D}_{\mathrm{KL}}\le\delta+\beta\,\mathbb{E}[\mathcal{H}]$.
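The acceptance test alone is simple enough to sketch for a categorical policy. The code below covers only the inequality $\bar{D}_{\mathrm{KL}}\le\delta+\beta\,\mathbb{E}[\mathcal{H}]$, not the two CG solves or the step scaling. It assumes the usual TRPO direction $D_{\mathrm{KL}}(\pi_{\theta_{\text{old}}}\,\|\,\pi_\theta)$, batch means for the expectations, and strictly positive candidate probabilities; the function names are hypothetical.

```python
import math
from typing import Sequence

Probs = Sequence[float]  # one categorical action distribution


def entropy(probs: Probs) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_categorical(p_old: Probs, p_new: Probs) -> float:
    """KL(p_old || p_new); assumes p_new > 0 wherever p_old > 0."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)


def erc_accepts(
    old_probs: Sequence[Probs],
    new_probs: Sequence[Probs],
    delta: float,
    beta: float,
) -> bool:
    """Return True if mean KL <= delta + beta * mean entropy of the candidate."""
    n = len(old_probs)
    mean_kl = sum(kl_categorical(po, pn) for po, pn in zip(old_probs, new_probs)) / n
    mean_entropy = sum(entropy(pn) for pn in new_probs) / n
    return mean_kl <= delta + beta * mean_entropy


old = [[0.5, 0.5]]
new = [[0.9, 0.1]]
print(erc_accepts(old, new, delta=0.01, beta=0.0))  # False: plain TRPO radius
print(erc_accepts(old, new, delta=0.01, beta=2.0))  # True: entropy widens the radius
```

With `beta=0` the test reduces to the plain TRPO constraint, which is one way to sanity-check an implementation against the trpo variant.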
Older Hub folders (trpo_entropy, trpo_buffer, …) remain valid; training resumes from them automatically.
Environments
| Environment | Obs / action | Training budget | Hyperparameter source |
|---|---|---|---|
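| CartPole-v1 | Gymnasium classic control | 50k steps (CARTPOLE_50K; the CARTPOLE_100K and CARTPOLE_XU runs under Training progress report 99,840 and 250,000 timesteps) | Roostaie + Xu Table 4.4 directly |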
| Humanoid-v5 | 348 / 17 | $10^6$ steps | PPO/baselines backbone; Xu ERO/ERC proxied from Walker2d |
| HumanoidStandup-v5 | 348 / 17 | $10^6$ steps | Same backbone; Xu ERO/ERC proxied from BipedalWalker |
See HYPERPARAMETERS.md for per-field provenance and paper/results/annex_hyperparameters.tex for tables.
Variants and paper sources
| Variant | Paper |
|---|---|
| trpo | Schulman et al. (2015), Trust Region Policy Optimization, ICML |
| entrpo_entropy | Roostaie & Ebadzadeh (2021), EnTRPO: entropy-in-advantage ablation |
| entrpo_buffer | Roostaie & Ebadzadeh (2021), EnTRPO: replay-buffer ablation |
| entrpo | Roostaie & Ebadzadeh (2021), EnTRPO: full method |
| ero_trpo | Xu et al. (2024), ERO-TRPO |
| erc_trpo | Xu et al. (2024), ERC-TRPO |
| ppo | Schulman et al. (2017), Proximal Policy Optimization |
See metadata.json in each folder for full author names and URLs.
Usage
Training and evaluation code lives in the entropy-trpo repository on GitHub (update URL when published).
git clone https://github.com/pre63/entropy-trpo.git
cd entropy-trpo
make setup # install deps + create .env
# edit .env with HF_TOKEN and HF_REPO_ID
make download-weights
make eval-checkpoints
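Before running `make download-weights`, it can help to confirm that `.env` actually defines both variables. The sketch below is a stand-alone check, not part of the repository; it assumes the file uses plain `KEY=VALUE` lines, which is the common convention but is not specified in this README, and the example values are placeholders.

```python
from typing import Dict, List

REQUIRED_KEYS = ("HF_TOKEN", "HF_REPO_ID")


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and '#' comments (assumed format)."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str) -> List[str]:
    """Return the required keys that are absent or empty."""
    env = parse_env(text)
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# Placeholder contents; a real check would read the file with open(".env").
example = "# credentials\nHF_TOKEN=hf_placeholder\n"
print(missing_keys(example))  # ['HF_REPO_ID']
```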
Citation
@article{entropytrporeview2026,
title = {A Review of Entropy-Based Extensions to Trust Region Policy Optimization},
author = {Green, Simon},
journal = {IEEE Transactions},
year = {2026}
}
@article{roostaie2021entrpo,
title = {EnTRPO: Trust Region Policy Optimization Method with Entropy Regularization},
author = {Roostaie, Sahar and Ebadzadeh, Mohammad Mehdi},
journal = {arXiv:2110.13373},
year = {2021}
}
@article{xu2024trpo,
title = {Trust region policy optimization via entropy regularization for {Kullback--Leibler} divergence constraint},
author = {Xu, Haotian and Xuan, Junyu and Zhang, Guangquan and Lu, Jie},
journal = {Neurocomputing},
volume = {589},
pages = {127716},
year = {2024}
}
Training progress
Last updated: 2026-06-26 13:13:05 UTC
- Device: cpu
- Config: configs/cpu.yaml
- Jobs complete: 48/63
- Running: 1
Humanoid-v5 (1M benchmark)
| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |
|---|---|---|---|---|---|---|
| TRPO (s0) | pending | 0/976 | 0 | — | — | — |
| TRPO (s1) | pending | 0/976 | 0 | — | — | — |
| TRPO (s2) | pending | 0/976 | 0 | — | — | — |
| EnTRPO-Entropy (s0) | done | 488/488 | 999,424 | 256.2 ± 58.3 | 312.4 | -0.0000 |
| EnTRPO-Entropy (s1) | done | 488/488 | 999,424 | 273.6 ± 55.0 | 325.6 | 0.0072 |
| EnTRPO-Entropy (s2) | done | 488/488 | 999,424 | 256.1 ± 72.0 | 333.6 | -0.0000 |
| ERO-TRPO (s0) | done | 488/488 | 999,424 | 250.4 ± 54.5 | 342.4 | 0.0000 |
| ERO-TRPO (s1) | done | 488/488 | 999,424 | 250.6 ± 23.7 | 315.4 | 0.0071 |
| ERO-TRPO (s2) | done | 488/488 | 999,424 | 261.1 ± 65.2 | 329.1 | 0.0053 |
| ERC-TRPO (s0) | done | 488/488 | 1,001,472 | 108.9 ± 27.7 | 130.1 | -0.0000 |
| ERC-TRPO (s1) | done | 488/488 | 999,424 | 254.5 ± 43.4 | 258.7 | -0.0000 |
| ERC-TRPO (s2) | done | 488/488 | 999,424 | 217.7 ± 79.6 | 240.5 | 0.0000 |
| EnTRPO-Buffer (s0) | done | 488/488 | 999,424 | 267.4 ± 72.3 | 326.5 | 0.0000 |
| EnTRPO-Buffer (s1) | done | 488/488 | 999,424 | 252.4 ± 22.5 | 327.7 | 0.0000 |
| EnTRPO-Buffer (s2) | done | 488/488 | 999,424 | 249.7 ± 90.1 | 321.0 | 0.0055 |
| EnTRPO (s0) | done | 488/488 | 999,424 | 245.7 ± 32.2 | 332.4 | 0.0043 |
| EnTRPO (s1) | done | 488/488 | 999,424 | 289.4 ± 72.0 | 325.4 | 0.0074 |
| EnTRPO (s2) | done | 488/488 | 999,424 | 280.4 ± 83.2 | 316.8 | 0.0023 |
| PPO (s0) | pending | 0/1953 | 0 | — | — | — |
| PPO (s1) | pending | 0/1953 | 0 | — | — | — |
| PPO (s2) | pending | 0/1953 | 0 | — | — | — |
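The Epoch and Timesteps columns above are consistent with a fixed number of environment steps per epoch and a floor division of the $10^6$ budget. The per-epoch size used below is inferred from the completed rows (999,424 / 488), not read from the configs, so treat it as an assumption; the helper name is hypothetical.

```python
def epochs_for_budget(budget_steps: int, steps_per_epoch: int) -> int:
    """Number of full epochs that fit in the budget (floor division)."""
    return budget_steps // steps_per_epoch


BUDGET = 10**6
STEPS_PER_EPOCH = 999_424 // 488  # inferred from the completed rows: 2048

epochs = epochs_for_budget(BUDGET, STEPS_PER_EPOCH)
print(epochs, epochs * STEPS_PER_EPOCH)  # 488 999424
```

By the same arithmetic, the pending 0/976 and 0/1953 rows would match 1024 and 512 steps per epoch respectively. The ERC-TRPO (s0) row reports 1,001,472 timesteps, which equals 489 batches of 2048; the README does not say why that run differs.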
HumanoidStandup-v5 (1M benchmark)
| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |
|---|---|---|---|---|---|---|
| TRPO (s0) | pending | 0/976 | 0 | — | — | — |
| TRPO (s1) | pending | 0/976 | 0 | — | — | — |
| TRPO (s2) | pending | 0/976 | 0 | — | — | — |
| EnTRPO-Entropy (s0) | done | 488/488 | 999,424 | 63970.3 ± 12674.7 | 69688.4 | 0.0054 |
| EnTRPO-Entropy (s1) | done | 488/488 | 999,424 | 74092.3 ± 6436.0 | 84192.4 | -0.0000 |
| EnTRPO-Entropy (s2) | done | 488/488 | 999,424 | 66211.8 ± 11884.0 | 73910.5 | -0.0000 |
| ERO-TRPO (s0) | done | 488/488 | 999,424 | 46220.5 ± 2917.4 | 47845.4 | -0.0000 |
| ERO-TRPO (s1) | done | 488/488 | 999,424 | 47466.5 ± 3243.4 | 52235.4 | 0.0010 |
| ERO-TRPO (s2) | done | 488/488 | 999,424 | 47797.9 ± 3607.9 | 50942.6 | 0.0000 |
| ERC-TRPO (s0) | done | 488/488 | 999,424 | 38592.2 ± 4701.3 | 40352.0 | 0.0000 |
| ERC-TRPO (s1) | done | 488/488 | 999,424 | 47285.0 ± 3076.0 | 51595.2 | 0.0003 |
| ERC-TRPO (s2) | done | 488/488 | 999,424 | 48414.2 ± 3424.0 | 51013.4 | 0.0009 |
| EnTRPO-Buffer (s0) | done | 488/488 | 999,424 | 37442.2 ± 2643.0 | 39323.6 | 0.0086 |
| EnTRPO-Buffer (s1) | done | 488/488 | 999,424 | 41544.7 ± 3903.9 | 45498.8 | -0.0001 |
| EnTRPO-Buffer (s2) | done | 488/488 | 999,424 | 37246.1 ± 1735.1 | 40451.8 | 0.0000 |
| EnTRPO (s0) | done | 488/488 | 999,424 | 35912.7 ± 2063.5 | 39687.9 | 0.0000 |
| EnTRPO (s1) | done | 488/488 | 999,424 | 43605.8 ± 3502.3 | 46649.5 | 0.0042 |
| EnTRPO (s2) | done | 488/488 | 999,424 | 37992.1 ± 4942.4 | 40168.8 | 0.0064 |
| PPO (s0) | pending | 0/1953 | 0 | — | — | — |
| PPO (s1) | pending | 0/1953 | 0 | — | — | — |
| PPO (s2) | pending | 0/1953 | 0 | — | — | — |
CartPole-v1 (CARTPOLE_100K benchmark)
| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |
|---|---|---|---|---|---|---|
| TRPO (s0) | done | 195/195 | 99,840 | 226.1 ± 31.8 | 472.6 | 0.0034 |
| TRPO (s1) | done | 195/195 | 99,840 | 208.5 ± 218.4 | 500.0 | 0.0064 |
| TRPO (s2) | done | 195/195 | 99,840 | 266.5 ± 48.4 | 500.0 | 0.0017 |
| PPO (s0) | running | 133/3125 | 4,256 | 30.1 ± 12.6 | 69.1 | 0.0756 |
| PPO (s1) | pending | 0/3125 | 0 | — | — | — |
| PPO (s2) | pending | 0/3125 | 0 | — | — | — |
CartPole-v1 (CARTPOLE_50K benchmark)
| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |
|---|---|---|---|---|---|---|
| EnTRPO-Entropy (s0) | done | 10/10 | 50,000 | 267.6 ± 56.1 | 324.0 | 0.0056 |
| EnTRPO-Entropy (s1) | done | 10/10 | 50,000 | 277.1 ± 87.7 | 297.8 | 0.0056 |
| EnTRPO-Entropy (s2) | done | 10/10 | 50,000 | 373.8 ± 92.4 | 373.8 | 0.0027 |
| EnTRPO-Buffer (s0) | done | 10/10 | 50,000 | 166.1 ± 65.5 | 176.6 | 0.0076 |
| EnTRPO-Buffer (s1) | done | 10/10 | 50,000 | 321.5 ± 106.3 | 340.6 | 0.0049 |
| EnTRPO-Buffer (s2) | done | 10/10 | 50,000 | 80.5 ± 33.3 | 165.7 | 0.0086 |
| EnTRPO (s0) | done | 10/10 | 50,000 | 212.3 ± 86.3 | 212.3 | 0.0037 |
| EnTRPO (s1) | done | 10/10 | 50,000 | 224.6 ± 65.8 | 224.6 | 0.0083 |
| EnTRPO (s2) | done | 10/10 | 50,000 | 186.2 ± 56.1 | 243.5 | 0.0036 |
CartPole-v1 (CARTPOLE_XU benchmark)
| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |
|---|---|---|---|---|---|---|
| ERO-TRPO (s0) | done | 500/500 | 250,000 | 34.6 ± 20.2 | 43.8 | 0.0000 |
| ERO-TRPO (s1) | done | 500/500 | 250,000 | 42.5 ± 24.8 | 67.6 | 0.0000 |
| ERO-TRPO (s2) | done | 500/500 | 250,000 | 28.3 ± 12.4 | 60.4 | 0.0000 |
| ERC-TRPO (s0) | done | 500/500 | 250,000 | 19.9 ± 10.0 | 31.0 | 0.0000 |
| ERC-TRPO (s1) | done | 500/500 | 250,000 | 24.2 ± 12.8 | 43.5 | -0.0000 |
| ERC-TRPO (s2) | done | 500/500 | 250,000 | 19.7 ± 6.4 | 36.0 | 0.0000 |
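For readers who want to work with these tables programmatically, the rows can be parsed with a few lines of Python. The sketch below averages the Eval return means over seeds for each variant; this simple mean is only an illustration and is not necessarily how the paper aggregates results. The function names are hypothetical, and the sample rows are copied from the Humanoid-v5 table.

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Optional, Tuple

# Row labels look like "EnTRPO (s0)": variant name, then the seed index.
LABEL = re.compile(r"(?P<variant>.+) \(s(?P<seed>\d+)\)")


def parse_row(line: str) -> Optional[Tuple[str, int, float]]:
    """Return (variant, seed, eval mean) for a finished run, else None."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    if len(cells) != 7:
        return None
    label, status, _epoch, _steps, eval_return, _best, _kl = cells
    match = LABEL.fullmatch(label)
    if match is None or status != "done":
        return None  # header, separator, pending, or running row
    eval_mean = float(eval_return.split("±")[0])
    return match["variant"], int(match["seed"]), eval_mean


def mean_eval_by_variant(lines: Iterable[str]) -> Dict[str, float]:
    """Average the per-seed eval means for each variant."""
    per_variant = defaultdict(list)
    for line in lines:
        row = parse_row(line)
        if row is not None:
            per_variant[row[0]].append(row[2])
    return {variant: mean(values) for variant, values in per_variant.items()}


ROWS = [
    "| Variant | Status | Epoch | Timesteps | Eval return | Best | KL |",
    "|---|---|---|---|---|---|---|",
    "| EnTRPO (s0) | done | 488/488 | 999,424 | 245.7 ± 32.2 | 332.4 | 0.0043 |",
    "| EnTRPO (s1) | done | 488/488 | 999,424 | 289.4 ± 72.0 | 325.4 | 0.0074 |",
    "| EnTRPO (s2) | done | 488/488 | 999,424 | 280.4 ± 83.2 | 316.8 | 0.0023 |",
]
print(mean_eval_by_variant(ROWS))
```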
Available checkpoints
{
"erc_trpo": [
"seed_0",
"seed_1",
"seed_2"
],
"ero_trpo": [
"seed_0",
"seed_1",
"seed_2"
],
"random_erc_trpo": [
"seed_0",
"seed_1",
"seed_2"
]
}
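The index above is plain JSON, so it can be flattened into (variant, seed) pairs with the standard library. This is a minimal sketch with a hypothetical helper name; it embeds the index as a string for the example and makes no assumption about how a seed maps to a Hub path.

```python
import json
from typing import Dict, List, Tuple

# The index as published above, embedded here so the example is self-contained.
INDEX_JSON = """
{
  "erc_trpo": ["seed_0", "seed_1", "seed_2"],
  "ero_trpo": ["seed_0", "seed_1", "seed_2"],
  "random_erc_trpo": ["seed_0", "seed_1", "seed_2"]
}
"""


def list_checkpoints(index: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Flatten {variant: [seed, ...]} into sorted (variant, seed) pairs."""
    return [
        (variant, seed)
        for variant in sorted(index)
        for seed in sorted(index[variant])
    ]


pairs = list_checkpoints(json.loads(INDEX_JSON))
print(len(pairs))  # 9
print(pairs[0])    # ('erc_trpo', 'seed_0')
```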