Our objective is to regulate the AEV charging power so as to minimize the charging time and voltage-limit violations at the charging node. The power flow equations describe the relationship between power and voltage in an electrical distribution network. For simplicity, we do not consider reactive power control in our use case. However, it is important to note that the European LV grid benchmark has R/X ratios of 0.7–11.0 (Ayaz et al. 2018), which are relatively high; at such high R/X ratios, active power has the most significant influence on voltage (Blažič and Papič 2008).

The mathematical form of the objective function is given by Eq. 3. In Eq. 3, \(P_{max}\) is the maximum charging load (maximum charging power of a charging point times the number of charging points at the node), \(\alpha ^t = P^t_c/P_{max}\) is the ratio between the charging load at time *t* and \(P_{max}\), \(\mathcal {N}\) is the set of nodes in the LV grid, \(V_m\) is the voltage magnitude at the charging node, and \(V_{lb}\) is the statutory voltage limit. \(G_{ij}\) and \(B_{ij}\) are the real and imaginary parts of the bus admittance matrix corresponding to the \((i, j)^{th}\) element. \(\delta _{ij}\) is the voltage angle difference between the \(i^{th}\) and \(j^{th}\) buses. \(P_i\) and \(Q_i\) are the real and reactive power injections at node *i*.

$$\begin{aligned} \begin{aligned} \max _{\alpha ^{t}} \quad&{{\,\mathrm{\mathbb {E}}\,}}_{t \in \mathcal {T}} \biggl (\mathbbm {1}^{|V_m^t - V_{lb} |\le \zeta } + \mathbbm {1}^{V_m^t > V_{lb} + \zeta }\alpha ^{t}\biggr )\\ \text {s.t.} \quad&P_i^t = \left| {V_i^t}\right| \sum _{j \in \mathcal {N}} \left| {V_j^t}\right| \bigl ({G_{ij} \cos \delta _{ij}^t + B_{ij} \sin \delta _{ij}^t } \bigr ) \\&Q_i^t = \left| {V_i^t}\right| \sum _{j \in \mathcal {N}} \left| {V_j^t}\right| \bigl ({G_{ij} \sin \delta _{ij}^t - B_{ij} \cos \delta _{ij}^t } \bigr )\\&P^t_c = \alpha ^{t} P_{max} \end{aligned} \end{aligned}$$

(3)

Equation 3 concisely combines the charging-power and voltage objectives. The statutory voltage limit is imposed as a soft constraint with a small allowable margin of error \(\zeta\). In the case study, we set \(V_{lb}\) and \(\zeta\) to 0.95 and 0.01, respectively.
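The power-flow constraints of Eq. 3 can be evaluated directly from the bus admittance matrix. The sketch below, which is an illustration rather than the authors' implementation, computes the injections \(P_i\) and \(Q_i\) for all buses given voltage magnitudes, angles, and the matrices \(G\) and \(B\); the function and variable names are our own.

```python
import numpy as np

def bus_injections(V, delta, G, B):
    """Evaluate the power-flow equations in the constraints of Eq. 3.

    V     : (n,) voltage magnitudes in p.u.
    delta : (n,) voltage angles in radians
    G, B  : (n, n) real and imaginary parts of the bus admittance matrix
    Returns (P, Q), the real and reactive power injections at each bus.
    """
    d = delta[:, None] - delta[None, :]            # delta_ij = delta_i - delta_j
    P = V * ((G * np.cos(d) + B * np.sin(d)) @ V)  # P_i = |V_i| sum_j |V_j|(...)
    Q = V * ((G * np.sin(d) - B * np.cos(d)) @ V)  # Q_i = |V_i| sum_j |V_j|(...)
    return P, Q
```

As a sanity check, a flat voltage profile (all magnitudes 1 p.u., all angles zero) over a lossless two-bus line yields zero injections at both buses.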

To solve the optimization problem with DRL, we need to define the states, actions, and reward function of the RL agent. Moreover, we employ the stochastic policy gradient approach that enables us to create an RL agent that learns the optimal stochastic policy directly from observations.

*States* We define two state vectors, one for the critic and another for the actor. The state vector of the actor is a local subset of the state vector available to the critic, which imposes the partial observability condition.

The critic’s state, denoted by \(S_c\), is a discretized vector of voltage magnitudes (in p.u.) at each load- and generator-connected bus. Given *m* load-connected buses, *n* generator-connected buses, and *l* discretization bins, the critic state at time *t* is a vector of shape \((1, m + n, l)\). We impose partial observability by limiting the actor’s state to the *b* load or generator buses nearest to the charging node, including the charging node itself. Therefore, the actor state is a vector of shape \((1, b, l)\) (Footnote 1).

Transforming the state from a continuous to a finite discrete domain is a simple but powerful state abstraction that reduces the size of the state space, improves convergence, and improves the generalization of the model to unseen data. We recommend Kirk et al. (2021) for more information on the generalization of DRL models.
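The discretization above can be sketched as a per-bus one-hot binning of voltage magnitudes. The bin range below (0.9–1.1 p.u.) is an illustrative assumption, not a value stated in the text.

```python
import numpy as np

def discretize_voltages(v, l=10, v_min=0.9, v_max=1.1):
    """One-hot encode per-bus voltage magnitudes into l bins.

    v : (n_buses,) voltage magnitudes in p.u.
    Returns a (1, n_buses, l) binary tensor, matching the state shape
    described in the text (bin edges here are an assumption).
    """
    edges = np.linspace(v_min, v_max, l + 1)
    # Map each voltage to a bin index, clipping out-of-range values.
    idx = np.clip(np.digitize(v, edges) - 1, 0, l - 1)
    state = np.zeros((1, v.size, l))
    state[0, np.arange(v.size), idx] = 1.0
    return state
```

For the critic, `v` would hold all \(m + n\) monitored buses; for the actor, only the *b* nearest buses.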

*Actions* The agent's policy yields an action at each time-step *t* that regulates the charging power. Therefore, we define the action of the stochastic charging agent as \(\alpha ^{t} = \pi (s^t)\). Clearly, \(\alpha ^{t}\) is a real value in the range [0, 1] that can be represented as a random realization of a beta policy, i.e., \(\alpha ^{t} \sim Beta(a, b)\). In other words, we can write the optimal stochastic policy \(\pi ^\star = Beta(a^\star , b^\star )\) where \(a^\star , b^\star\) are the optimal parameter values of the beta policy.
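Sampling the action from the beta policy is straightforward; the following minimal sketch draws \(\alpha^t \sim Beta(a, b)\) with numpy (the fixed seed and parameter values are illustrative only).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(a, b):
    """Draw a charging ratio alpha^t ~ Beta(a, b); the result lies in [0, 1]."""
    return rng.beta(a, b)

# The mean of Beta(a, b) is a / (a + b); e.g. a = b = 2 gives a mean of 0.5.
alpha = sample_action(2.0, 2.0)
```

The sampled `alpha` then scales the maximum charging load via \(P^t_c = \alpha^t P_{max}\).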

*Reward function* Reward functions require careful engineering. An efficient reward function helps guide the RL agent toward the optimal policy by avoiding local optima and improving convergence speed (Dorokhova et al. 2021). Our problem has multiple objectives: simultaneously minimizing charging time and expected voltage violations. Therefore, following Eq. 3, we define the reward function as:

$$\begin{aligned} R(s^t, a^t) = \mathbbm {1}^{|V_m^t - V_{lb} |\le \zeta } + \alpha ^{t}\bigl (\mathbbm {1}^{V_m^t > V_{lb} + \zeta }\bigr ) \end{aligned}$$

(4)
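Since the two indicator events in Eq. 4 are mutually exclusive, the reward can be written as a simple piecewise function. The sketch below uses the case-study values \(V_{lb} = 0.95\) and \(\zeta = 0.01\) as defaults; the function name is our own.

```python
def reward(v_m, alpha, v_lb=0.95, zeta=0.01):
    """Reward of Eq. 4: 1 inside the soft-constraint margin,
    alpha when the voltage is comfortably above the limit,
    and 0 when the statutory limit is violated."""
    if abs(v_m - v_lb) <= zeta:
        return 1.0
    if v_m > v_lb + zeta:
        return alpha
    return 0.0
```

Note how the structure rewards charging at full power (large \(\alpha^t\)) only while the voltage stays safely above the limit, and gives no reward once the limit is breached.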

*Model architecture* The policy network parameterizes the stochastic charging policy \(\pi _\theta\) that returns policy parameters *a* and *b* of a beta distribution. Beta distribution is a bounded distribution between 0 and 1; therefore, it is well-suited for representing the stochastic charging action \(\alpha ^{t} = \pi _\theta (s^t)\) of the agent. We encourage the reader to refer to the motivating examples (Chou et al. 2017; Petrazzini and Antonelo 2022) that describe the use of beta policy for solving policy gradient problems with bounded action spaces.

The value network (the critic) is updated based on the mean-squared error (MSE) between the critic’s prediction and the immediate true reward. In other words, each of the agent’s interactions with the environment is an episode consisting of a single step. Moreover, we implement a replay buffer to improve the sample efficiency of the training process.
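Because every interaction is a single-step episode, the stored transitions need no next-state or done flag. A minimal replay-buffer sketch under that assumption (not the authors' implementation) could look like this:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of one-step (state, action, reward) transitions.
    Single-step episodes mean no next-state or done flag is stored."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # old transitions drop off

    def push(self, state, action, reward):
        self.buffer.append((state, action, reward))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch for a critic/actor update."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

During training, mini-batches drawn from the buffer would feed the MSE update of the critic described above.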

The architectures of the deep neural networks that implement the actors and the centralized critic are depicted in Figure 1. The actor network has two heads corresponding to the two parameters of the beta distribution of the stochastic policy that we need to estimate. The number of layers, layer dimensions, and layer activation functions are design choices based on hyper-parameter tuning.
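A two-headed actor can be sketched as a shared hidden layer feeding two positive-valued heads for \(a\) and \(b\). The layer sizes, the tanh/softplus choices, and the +1 offset (a common trick that keeps the beta distribution unimodal) are our illustrative assumptions, not the tuned architecture of Figure 1.

```python
import numpy as np

def softplus(x):
    """Smooth positive activation: log(1 + exp(x))."""
    return np.log1p(np.exp(x))

class BetaActor:
    """Sketch of a two-headed actor producing beta parameters (a, b)."""

    def __init__(self, state_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (state_dim, hidden))  # shared layer
        self.Wa = rng.normal(0.0, 0.1, (hidden, 1))          # head for a
        self.Wb = rng.normal(0.0, 0.1, (hidden, 1))          # head for b

    def forward(self, s):
        h = np.tanh(s @ self.W1)
        a = float(softplus(h @ self.Wa)) + 1.0  # a > 1
        b = float(softplus(h @ self.Wb)) + 1.0  # b > 1
        return a, b
```

In practice this would be a trainable network (e.g. in an autodiff framework) updated by the policy gradient; the sketch only shows the two-headed forward pass.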

*Charging power assignment* So far, we have designed a mathematical formulation that enables us to optimally control the total charging power at a node while minimizing expected voltage violations and charging time. The assignment problem that we discuss now addresses the equitable allocation of the total charging power among the multiple vehicles that require charging simultaneously. We define equity as minimizing the sum of instantaneously evaluated charging times for all vehicles. This definition allows us to prioritize more depleted AEVs and charge them faster. Consequently, we expect more AEVs to be available for users, leading to better mobility services. The non-linear optimal power assignment problem can be written as in Eq. 5, where \(K^\prime\) is the set of active charging points. Furthermore, \(\alpha ^{k^\prime , t}\) is the charge rate of charging point \(k^\prime\) at time *t*, and it is a real value in the range [\(\epsilon\), 1]. The lower bound \(\epsilon\) is a very small real value introduced for numerical stability.

$$\begin{aligned} \min _{\alpha ^{k^\prime , t}} \quad&\sum _{k^{\prime } \in \mathcal {K^\prime }} \frac{1 - SOC^{k^\prime , t}}{\alpha ^{k^\prime , t} + \epsilon } \\ \text {s.t.} \quad&0 \le \alpha ^t P_{max} - \sum _{k^{\prime } \in \mathcal {K^\prime }}\alpha ^{k^\prime , t}P^{k^\prime }_{max} \\&\alpha ^{k^\prime , t} \le \epsilon \quad \text {if}\; SOC^{k^\prime , t} = 1 \\&\epsilon \le \alpha ^{k^\prime , t} \le 1 \end{aligned}$$

(5)

Figure 2 shows the combined optimization problem that we solve iteratively at each time step of the simulation.