Comparative study of algorithms for optimized control of industrial energy supply systems

Energy Informatics

Table 3 Hyperparameters of PPO

Hyperparameter	Meaning	Chosen Value
γ	discount factor	0.99
N^envs	number of environments run in parallel	4 (during training)
N^steps	number of steps before update	256 (System A) / 512 (System B)
∇^max	max. value for gradient clipping	0.5
α	learning rate	2 × 10⁻⁴
c¹	loss coefficient for the value function	0.5
c²	loss coefficient for the entropy term	0.01
λ	GAE factor for the bias/variance trade-off	0.95
N^mb	number of mini-batches per update	4
N^epochs	number of epochs per surrogate update	4
ε^clip	clipping limit for the probability ratio between new and old policy	0.2
net^arch	neural network topology, neurons per layer	MLP [500, 400, 300] (ReLU)
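
The values in Table 3 imply a few derived quantities that are useful when reproducing the setup. The following is a minimal sketch, not the authors' implementation: the class `PPOConfig` and all of its field names are hypothetical. It also assumes that N^steps counts steps per environment, so that one update uses N^envs × N^steps samples, which the table does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    """Hyperparameters from Table 3 (class and field names are illustrative)."""

    n_steps: int                        # N^steps: 256 (System A) or 512 (System B)
    gamma: float = 0.99                 # discount factor
    n_envs: int = 4                     # parallel environments during training
    max_grad_norm: float = 0.5          # gradient clipping limit
    learning_rate: float = 2e-4         # alpha
    vf_coef: float = 0.5                # c^1, value-function loss coefficient
    ent_coef: float = 0.01              # c^2, entropy loss coefficient
    gae_lambda: float = 0.95            # lambda
    n_minibatches: int = 4              # N^mb
    n_epochs: int = 4                   # N^epochs
    clip_range: float = 0.2             # epsilon^clip
    net_arch: tuple = (500, 400, 300)   # MLP neurons per hidden layer
    activation: str = "relu"

    @property
    def rollout_size(self) -> int:
        # Assumption: N^steps is counted per environment.
        return self.n_envs * self.n_steps

    @property
    def minibatch_size(self) -> int:
        return self.rollout_size // self.n_minibatches

    @property
    def gradient_steps_per_update(self) -> int:
        # Each epoch iterates once over all mini-batches.
        return self.n_minibatches * self.n_epochs


SYSTEM_A = PPOConfig(n_steps=256)
SYSTEM_B = PPOConfig(n_steps=512)

for name, cfg in (("System A", SYSTEM_A), ("System B", SYSTEM_B)):
    print(name, cfg.rollout_size, cfg.minibatch_size, cfg.gradient_steps_per_update)
```

Under these assumptions, System A collects 1024 samples per update (mini-batches of 256) and System B collects 2048 (mini-batches of 512), with 16 gradient steps per update in both cases.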