Table 3 Hyperparameters of PPO

From: Comparative study of algorithms for optimized control of industrial energy supply systems

Hyperparameter | Meaning                                                  | Chosen value
-------------- | -------------------------------------------------------- | -------------------------------
γ              | discount factor                                          | 0.99
N_envs         | number of environments run in parallel                   | 4 (during training)
N_steps        | number of steps per environment before each update       | 256 (System A) / 512 (System B)
max            | maximum value for gradient clipping                      | 0.5
α              | learning rate                                            | 2 × 10⁻⁴
c_1            | loss coefficient for the value function                  | 0.5
c_2            | loss coefficient for the entropy term                    | 0.01
λ              | factor for the bias/variance trade-off                   | 0.95
N_mb           | number of mini-batches per update                        | 4
N_epochs       | number of epochs per surrogate update                    | 4
ε_clip         | clipping limit for the ratio between new and old policy  | 0.2
net_arch       | neural network topology, neurons per layer               | MLP [500, 400, 300] (ReLU)
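
The values in Table 3 can be collected into a small configuration object to make the derived batch sizes explicit. The following is a minimal sketch, not the authors' implementation: the class name PPOConfig, its field names, and the helper clipped_surrogate are hypothetical, and it assumes that each update uses a rollout of N_envs × N_steps samples split evenly into N_mb mini-batches. The clipping function is the standard PPO clipped objective for a single sample, included only to show where ε_clip enters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    """Hypothetical container for the Table 3 values (field names are assumptions)."""
    gamma: float = 0.99               # discount factor
    n_envs: int = 4                   # parallel environments during training
    n_steps: int = 256                # steps before update (System A default)
    max_grad_norm: float = 0.5        # gradient clipping limit
    learning_rate: float = 2e-4       # alpha
    vf_coef: float = 0.5              # c_1, value-function loss coefficient
    ent_coef: float = 0.01            # c_2, entropy loss coefficient
    gae_lambda: float = 0.95          # lambda, bias/variance factor
    n_minibatches: int = 4            # N_mb
    n_epochs: int = 4                 # N_epochs
    clip_range: float = 0.2           # epsilon_clip
    net_arch: tuple = (500, 400, 300)  # neurons per hidden layer
    activation: str = "relu"

    @property
    def rollout_size(self) -> int:
        # Assumption: one update consumes n_envs * n_steps samples.
        return self.n_envs * self.n_steps

    @property
    def minibatch_size(self) -> int:
        # Assumption: the rollout is split evenly into n_minibatches parts.
        return self.rollout_size // self.n_minibatches

    @property
    def gradient_steps_per_update(self) -> int:
        # Every epoch passes once over all mini-batches.
        return self.n_minibatches * self.n_epochs


SYSTEM_A = PPOConfig(n_steps=256)
SYSTEM_B = PPOConfig(n_steps=512)


def clipped_surrogate(ratio: float, advantage: float, clip_range: float) -> float:
    """Standard PPO clipped objective for one sample."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    print(SYSTEM_A.rollout_size, SYSTEM_A.minibatch_size)  # 1024 256
    print(SYSTEM_B.rollout_size, SYSTEM_B.minibatch_size)  # 2048 512
```

Under these assumptions, System A performs each update on 1024 samples in mini-batches of 256, System B on 2048 samples in mini-batches of 512, and both take 16 gradient steps per update.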