A stochastic deep reinforcement learning agent for grid-friendly electric vehicle charging management

Electrification of the transportation sector provides several advantages in favor of climate protection and a shared economy. At the same time, the rapid growth of electric vehicles also demands innovative solutions to mitigate risks to the low-voltage network due to unpredictable charging patterns of electric vehicles. This article conceptualizes a stochastic reinforcement learning agent that learns the optimal policy for regulating the charging power. The optimization objective intends to reduce charging time, thus charging faster while minimizing the expected voltage violations in the distribution network. The problem is formulated as a two-stage optimization routine where the stochastic policy gradient agent predicts the boundary condition of the inner non-linear optimization problem. The results confirm the performance of the proposed architecture to control the charging power as intended. The article also provides extensive theoretical background and directions for future research in this discipline.

Algorithms such as rule-based approaches (e.g., Rauf and Salam 2018), heuristics (e.g., Alonso et al. 2014), and central optimization methods (e.g., Richardson et al. 2011; Sun et al. 2018) have been tested to achieve effective EV charging management. However, in the presence of high stochasticity and without perfect foresight, these methods cannot converge to the optimal charging behavior (Abdullah et al. 2021). As a result, there has been increasing interest in more flexible data-driven approaches to model and manage the EV charging process. Data-driven approaches can be used without assumptions regarding the underlying model. They are also capable of representing the inherent stochasticities in the environment and can consequently suggest probabilistic strategies that perform better than deterministic strategies over long time horizons, even in adversarial settings (Wang et al. 2016).
This article presents a method and a case study that demonstrate the application of Deep reinforcement learning (DRL) to control the charging power at an AEV charging node. We demonstrate the capability of DRL to learn the optimal charging policy in a highly stochastic environment with multiple charging objectives. Moreover, we derive the state vectors based on the observations that are readily available through standard metering infrastructure in a Low-voltage (LV) network and perform online learning via policy gradient update. The main contributions of the article are as follows.
• We present a DRL solution based on the actor-critic architecture to regulate charging power at an AEV charging node, considering both the minimization of charging time and of voltage limit violations.
• The proposed solution makes use of voltage magnitude measurements from standard metering infrastructure and learns a stochastic policy that is optimal in the limit as time → ∞.
• To improve the scalability of the method to much larger use-cases, we impose partial observability in the form of a local actor with a global critic.
• We present a case study and, based on our results, discuss the broader implications of AEV charging and the potential for future research work.

State of the art
Application of Reinforcement learning (RL) in the electro-mobility domain has attracted a lot of interest recently, leading to several published use-cases such as charging load forecasting (e.g., Zhang et al. 2021; Zhu et al. 2019), fleet assignment (e.g., Shi et al. 2020), charging station recommendation (e.g., Blum et al. 2021), and charging management (e.g., Chang et al. 2019; Wan et al. 2019; Ding et al. 2020; Dorokhova et al. 2021). Table 1 summarizes some exemplary studies using RL for AEV charging management. We see that the temporal resolution of the previous RL studies related to electro-mobility in Table 1 is in the hourly range. Indeed, the choice of temporal resolution depends mainly on the modeling objective. However, Bucher et al. (2013) studied the effect of temporal averaging in the context of LV power systems and recommend a one-minute resolution for studies that make use of steady-state voltages and power flows.
Previous authors have applied both value-based [e.g., Q-learning, Deep Q-network (DQN), Deep double Q-network (DDQN)] and policy-based [e.g., Deep deterministic policy-gradient (DDPG)] RL methods to solve for an optimal charging strategy. Abdullah et al. (2021) present a review of RL-based charging management strategies available in the published literature.
In a nutshell, in value-based methods, the agent learns an approximate Q-value function through continuous interaction with the environment. Often the Q-value function is represented as a kernel function or a parameterized function approximator like a neural network. As such, the learning process converges when we iteratively update the parameters to minimize the error between the predicted and target Q-values. Policy-based methods (synonymously referred to as policy-gradient methods), by contrast, directly learn the optimal policy (denoted by $\pi^\star$). Policy gradient methods have shown better convergence properties compared to value-based methods (Sutton et al. 1996); they are often capable of handling imperfect state information and able to learn stochastic policies (Peters and Bagnell 2016; Sutton et al. 1996), which are more robust than deterministic policies. One key drawback of policy-gradient methods is their sample complexity (Peters and Bagnell 2016). However, Wang et al. (2016) show that experience replay, first introduced during the early stages of RL, can significantly improve the sample efficiency of policy-gradient methods as well.
Actor-critic is a family of policy-gradient algorithms where two function approximators (the critic and the actor) are used simultaneously to learn the value function and the optimal policy. The actor network is a parameterized representation of the agent's current policy $\pi$. At each iteration, the agent takes an action based on the state of the environment and its current policy, i.e., $a_t = \pi(s_t)$. The critic evaluates the value of the action at the given state and updates the value function's parameters using a temporal difference update. Finally, the actor updates the policy in the policy-gradient direction, calculated using the critic's value estimate.
There are a variety of policy gradient algorithms published in the literature. The algorithm used in our case study is Proximal policy optimization (PPO), which was first published in 2017 (Schulman et al. 2017). The main advantages of PPO are its simplicity and general applicability. Moreover, although PPO is an on-policy algorithm, it reuses each batch of collected samples for several gradient updates, which makes it comparatively sample efficient. The original implementation of the PPO algorithm demonstrated superior performance in solving tasks with high-dimensional continuous action spaces such as half-cheetah and running humanoid robot (Schulman et al. 2017). We provide a brief mathematical introduction to the PPO algorithm for the benefit of the reader in the paragraph below.

Proximal policy optimization
According to the policy gradient theorem, the gradient of a stochastic policy objective $J$ with respect to the policy parameter $\theta$ is given by $\nabla J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\right]$ (Sutton and Barto 2018). A standard way to reduce the variance of the policy gradient is to subtract a baseline function that does not depend on the action, so that no bias is introduced. A common baseline function is the value function $V^{\pi}(s)$, and we can then rewrite the gradient as $\nabla J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi}(s, a)\right]$, where $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$ is the advantage function.
The convergence stability of policy gradient algorithms depends on the iterative gradient updates of the policy parameters. PPO is a trust-region method that uses a clipped surrogate objective that penalizes excessively large policy parameter updates (Schulman et al. 2017).

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \tag{1}$$

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right)\hat{A}_t\right)\right] \tag{2}$$

$r_t(\theta)$ in Eq. 1 is the probability ratio between the new policy and the old policy. The clipped surrogate objective (Eq. 2) clips the probability ratio outside the interval $[1 - \epsilon, 1 + \epsilon]$, where $\epsilon$ is a hyper-parameter and $\hat{A}_t$ is an estimate of the advantage at time step $t$ (Schulman et al. 2017).
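The clipping behavior of Eq. 2 can be illustrated with a few lines of Python. This is a minimal sketch and not the implementation used in the case study: the function name `clipped_surrogate`, the choice of log-probabilities as inputs, and the default ε = 0.2 (a commonly used value) are assumptions made for illustration.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized).

    logp_new / logp_old are log-probabilities of the taken actions under the
    new and old policy; eps is the clipping hyper-parameter (assumed 0.2).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                 # r_t(theta), Eq. 1
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)   # clip to [1-eps, 1+eps]
        total += min(ratio * adv, clipped * adv)          # pessimistic bound, Eq. 2
    return total / len(advantages)
```

With a positive advantage, a ratio above 1 + ε earns no additional objective value, so the update has no incentive to move the policy further in that direction; with a negative advantage, the unclipped term is kept whenever it is the smaller of the two.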

Problem formulation
Our objective is to regulate the AEV charging power to minimize the charging time and voltage limit violations at the charging node. The power flow equations describe the relationship between power and voltage in an electrical distribution network. For simplicity, we do not consider reactive power control in our use case. However, it is important to note that the European LV grid benchmark has R/X ratios of 0.7-11.0 (Ayaz et al. 2018), which are relatively high, and at high R/X ratios, active power has the most significant influence on voltage (Blažič and Papič 2008).
The mathematical form of the objective function is given by Eq. 3. In Eq. 3, $P_{\max}$ is the maximum charging load (maximum charging power of a charging point times the number of charging points at the node), $\alpha_t = P_t^c / P_{\max}$ is the ratio between the charging load at time $t$ and $P_{\max}$, $N$ is the set of nodes in the LV grid, $V_m$ is the voltage magnitude at the charging node, and $V_{lb}$ is the statutory voltage limit. $G_{ij}$ and $B_{ij}$ are the real and imaginary parts of the $(i, j)$-th element of the bus admittance matrix. $\delta_{ij}$ is the voltage angle difference between the $i$-th and $j$-th buses. $P_i$ and $Q_i$ are the real and reactive power injections at node $i$.
Equation 3 is a concise way to combine both the charging power and voltage objectives. The statutory voltage limit is imposed as a soft constraint with a small allowable margin of error $\zeta$. In the case study, we set $V_{lb}$ and $\zeta$ to 0.95 and 0.01, respectively.
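For reference, the power flow relations between the quantities defined above are commonly written in polar form as shown below. This is the standard textbook form, stated under the assumptions that $\delta_{ij} = \delta_i - \delta_j$ and that $|V_i|$ denotes the voltage magnitude at bus $i$; the exact arrangement of the constraints in Eq. 3 may differ.

```latex
\begin{aligned}
P_i &= |V_i| \sum_{j \in N} |V_j| \left( G_{ij} \cos\delta_{ij} + B_{ij} \sin\delta_{ij} \right), \\
Q_i &= |V_i| \sum_{j \in N} |V_j| \left( G_{ij} \sin\delta_{ij} - B_{ij} \cos\delta_{ij} \right),
\qquad \forall\, i \in N .
\end{aligned}
```

Because only active power is controlled in the use case, the charging ratio $\alpha_t$ enters these relations through the real power injection at the charging node.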
To solve the optimization problem with DRL, we need to define the states, actions, and reward function of the RL agent. Moreover, we employ the stochastic policy gradient approach that enables us to create an RL agent that learns the optimal stochastic policy directly from observations.
States
We define two state vectors, one for the critic and another for the actor. The state vector of the actor is a local subset of the state vector available to the critic, which imposes the partial observability condition.
The critic's state, denoted by $S_c$, is a discretized vector of voltage magnitudes (in p.u.) at each load- and generator-connected bus. Given that m is the number of load-connected buses, n is the number of generator-connected buses, and l is the number of bins, the critic state at time t is a vector of shape (1, m + n, l). We impose partial observability by limiting the actor's state to the b nearest load or generator buses from the charging node, including the charging node itself. Therefore, the actor state is a vector of shape (1, b, l).
Transformation from a continuous to a finite discrete domain is a simple but powerful state abstraction that reduces the size of the state space, improves convergence, and improves the generalization of the model to unseen data. We recommend Kirk et al. (2021) for more information related to the generalization of DRL models.
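The discretization described above can be sketched as follows. The sketch assumes that each voltage magnitude is encoded as a one-hot row over l bins, which is one plausible reading of the shape (1, m + n, l); the bin edges, the example voltages, the indices of the nearest buses, and the function name `one_hot_voltage_state` are hypothetical.

```python
from bisect import bisect_right


def one_hot_voltage_state(voltages_pu, bin_edges):
    """Encode each bus voltage (p.u.) as a one-hot row over the voltage bins.

    bin_edges must be sorted; it defines len(bin_edges) + 1 bins.
    Returns a nested list of shape (1, number of buses, number of bins).
    """
    n_bins = len(bin_edges) + 1
    rows = []
    for v in voltages_pu:
        idx = bisect_right(bin_edges, v)  # index of the bin that contains v
        row = [0] * n_bins
        row[idx] = 1
        rows.append(row)
    return [rows]


# Hypothetical example: 7 load buses (m = 7, n = 0) and l = 6 bins.
bin_edges = [0.94, 0.95, 0.96, 1.00, 1.05]
voltages = [1.01, 0.99, 0.97, 0.955, 0.96, 0.98, 1.00]
critic_state = one_hot_voltage_state(voltages, bin_edges)

# Actor state: the rows of the b buses nearest to the charging node
# (indices assumed here for illustration, b = 3).
nearest = [3, 4, 5]
actor_state = [[critic_state[0][i] for i in nearest]]
```

The actor state is built by selecting rows of the critic state, which reflects the statement that the actor observes a local subset of what the critic observes.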
Actions
The agent's policy yields an action at each time step t that regulates the charging power. Therefore, we define the action of the stochastic charging agent as $\alpha_t = \pi(s_t)$. Clearly, $\alpha_t$ is a real value in the range [0, 1] that can be represented as a random realization of a beta policy, i.e., $\alpha_t \sim \mathrm{Beta}(a, b)$. In other words, we can write the optimal stochastic policy as $\pi^\star = \mathrm{Beta}(a^\star, b^\star)$, where $a^\star$ and $b^\star$ are the optimal parameter values of the beta policy.
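A beta policy of this kind can be sketched with the Python standard library alone. The helper names below are hypothetical, and the parameter values a and b would in practice come from the policy network; the log-density is included because policy-gradient updates require the log-probability of the sampled action.

```python
import math
import random


def sample_action(a, b, rng=random):
    """Draw a charging ratio alpha_t in [0, 1] from Beta(a, b)."""
    return rng.betavariate(a, b)


def beta_log_prob(alpha, a, b):
    """Log-density of Beta(a, b) evaluated at alpha, for 0 < alpha < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(alpha) + (b - 1.0) * math.log(1.0 - alpha)


def beta_mean(a, b):
    """Expected charging ratio under Beta(a, b)."""
    return a / (a + b)
```

Because the support of the beta distribution is bounded, every sampled action is a valid charging ratio and no clipping of the action is needed.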
Reward function
Reward functions require careful engineering. Efficient reward functions help guide the RL agent to find the optimal policy by avoiding local optima and improving the convergence speed (Dorokhova et al. 2021). Our problem has multiple objectives: the agent should simultaneously minimize charging time and expected voltage violations. Therefore, following Eq. 3, we define the reward function as

$$R(s_t, a_t) = \mathbb{1}_{\{|V_m - V_{lb}| \le \zeta\}} + \alpha_t\, \mathbb{1}_{\{V_m > V_{lb} + \zeta\}} \tag{4}$$

Model architecture
The policy network parameterizes the stochastic charging policy $\pi_\theta$ and returns the parameters a and b of a beta distribution. The beta distribution is bounded between 0 and 1; therefore, it is well suited for representing the stochastic charging action $\alpha_t = \pi_\theta(s_t)$ of the agent. We encourage the reader to refer to the motivating examples (Chou et al. 2017; Petrazzini and Antonelo 2022) that describe the use of a beta policy for solving policy gradient problems with bounded action spaces. The value network (the critic) is updated based on the mean-squared error (MSE) between the critic's prediction and the immediate true reward. In other words, each interaction of the agent with the environment at a time step is an episode consisting of only one step. Moreover, we implement a replay buffer to improve the sample efficiency of the training process.
The architectures of the deep neural networks that implement the actors and the centralized critic are depicted in Figure 1. The actor network has two heads corresponding to the two parameters of the beta distribution of the stochastic policy that we need to estimate. The number of layers, layer dimensions, and layer activation functions are design choices based on hyper-parameter tuning.

Charging power assignment
So far, we have designed a mathematical formulation that enables us to optimally control the total charging power at a node, minimizing expected voltage violations and charging time. The assignment problem that we discuss now answers the question of the equitable allocation of the total charging power between the multiple vehicles that require charging simultaneously. We define equity as minimizing the sum of instantaneously evaluated charging times for all vehicles. This definition allows us to prioritize more depleted AEVs and charge them faster. Consequently, we expect more AEVs to be available for users, leading to better mobility services. The nonlinear optimal power assignment problem can be written as in Eq. 5, where $K'$ is the set of active charging points. Furthermore, $\alpha_{k',t}$ is the charge rate of the charging point $k'$ at time t, and it is a real value in the range $[\epsilon, 1]$. The lower bound $\epsilon$ is a very small real value introduced for numerical stability. Figure 2 shows the combined optimization problem that we solve iteratively for each time step of the simulation.
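As a concrete reading of the reward in Eq. 4, which drives the outer optimization of the combined problem, the following sketch evaluates the two indicator terms for a single time step. The function name `reward` is hypothetical; the default values of V_lb and ζ are the ones stated for the case study.

```python
def reward(v_m, alpha_t, v_lb=0.95, zeta=0.01):
    """Reward of Eq. 4 for voltage v_m (p.u.) and charging ratio alpha_t.

    Returns 1 inside the dead-band around v_lb, alpha_t when the voltage is
    above the dead-band, and 0 when the voltage is below the dead-band.
    """
    in_band = abs(v_m - v_lb) <= zeta
    above_band = v_m > v_lb + zeta
    return float(in_band) + alpha_t * float(above_band)
```

Under this reading, the agent is rewarded for charging faster as long as the voltage stays above the dead-band, receives the full reward when the voltage sits within the dead-band around the statutory limit, and receives nothing once the voltage drops below it.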

Case study
To demonstrate the concept and methodology described earlier in the context of a shared taxi fleet, we set up a synthetic example using both real and synthetic data.
The case study consists of 216 trips within the Swiss municipality of Lugano within a day. The travel data is synthetically generated using MATSim (http://www.matsim.org), an agent-based micro-simulation framework for mobility systems simulations (Horni et al. 2016). The road network, extracted from OpenStreetMap as a graph, contains all roads and links in Lugano with an importance level of residential or higher. The metadata includes the distance and maximum travel speed for each edge of the graph. The resulting network has 1122 nodes and 3602 edges.
To simulate the power system impacts, we use a modified CIGRE LV benchmark grid (Fig. 3) with representative residential load profiles. The environment consists of one charging station with 11 charging points connected to the charging node (L19 in the CIGRE benchmark grid). Each charging point has a maximum charging power of 11 kW. The aggregate residential load profiles are obtained by simulating typical household appliances and devices (heat pumps and boilers, rooftop photovoltaic (PV) generation, and non-dispatchable demand). The medium-voltage side of the transformer is connected to a constant slack bus. Given that we want to observe the effect of the charging controller in isolation, we deactivate the transformer tap changer in our simulations.

Fig. 2 Flow diagram that shows the interconnection of outer and inner optimization problems at a given time step. We run this process iteratively for each time step of the simulation. p1 ... pk in the figure are the charging powers at each charging point for a given time step, which can also be written as $p_k = \alpha_{k,\,t=t_s} P_k^{\max}$
The simulated appliances and the corresponding modeling methods are as follows:
• Heat pumps and boilers: To obtain a representative dataset for Switzerland, we used the STASCH6 standard (Afjei et al. 2002) and its variants as a reference for the heating system and the control logic. The STASCH6 standard comprises three main components: a heat pump, a water tank used as an energy buffer, and a heating element delivering heat to the building. The heat-pump control logic is based on two temperature sensors placed at different heights of the water tank, while the circulation pump connecting the tank with the building's heating element is controlled by a hysteresis on the temperature measured by a sensor placed inside the house. More details on the hydronic system modeling can be found in Nespoli (2019). The models' parameters, such as the households' equivalent thermal resistance and capacitance, were tuned using data from a local pilot project, the Lugaggia Innovation Community (LIC).
• Rooftop-mounted PV power plants: These were modeled using the Sandia National Laboratories' PV Collaborative Toolbox (Stein 2012), using typical inverter data. Data for the type of panels, inclinations, and nominal power were taken from LIC.
• Non-dispatchable demand: Non-dispatchable demand was modeled using the Load Profile Generator tool, which uses a full behavioural modeling approach to generate residential load profiles. As an input to the tool, we used the same typical meteorological year that was used to generate the PV power plant profiles and as an input to the households' thermal models.
Note that the inputs to the simulation model are aggregate profiles. Consequently, the power flow model of the LV grid consists of only load (PQ) buses, and we set the parameters m and n introduced in the section Problem formulation to seven and zero, respectively.
A discrete-time simulation environment with one-minute time resolution based on SimPy (Matloff 2008) is developed to simulate the fleet of shared AEVs servicing the travel requests. The fleet consists of 11 AEVs, which are randomly located at the start of the simulation. A Python generator pops a travel request when the environment time reaches the start time of a trip. A free AEV can accept that request and initiate a series of processes to service the request by (1) routing to the pickup location, (2) picking up the customer, and (3) routing to the destination. En route, an AEV can decide to charge its batteries if it senses a chance of battery depletion. Similarly, an AEV can leave the charging station during the charging process when it senses sufficient State of charge (SOC) to serve an incoming travel request. The routing is based on the shortest path algorithm, weighted by the travel time. "Go to charge" is a binomial decision based on the current SOC.
The training dataset consists of 20 days of residential load profiles covering all four seasons of the year. We add small Gaussian noise to each residential load profile during model training so that the stochastic charging agent learns from similar but not identical observations at each iteration. The customer travel demand profile is identical on each day. The validation dataset consists of 10 days of residential load profiles (without added Gaussian noise) and a customer demand profile identical to the one in the training data. The model training is performed in batches of 64 randomly sampled observations from the replay buffer.
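The two data-handling steps described above, adding Gaussian noise to the load profiles and sampling batches of 64 observations from the replay buffer, can be sketched as follows. The class and function names, the buffer capacity, and the noise standard deviation are assumptions for illustration; the article does not state these values.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer that returns uniformly sampled batches."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)  # oldest entries are dropped first
        self._rng = random.Random(seed)

    def add(self, transition):
        self._buffer.append(transition)

    def sample(self, batch_size=64):
        # Sample without replacement; return fewer items if the buffer is small.
        k = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), k)

    def __len__(self):
        return len(self._buffer)


def add_gaussian_noise(profile, sigma, rng):
    """Return a copy of a load profile with zero-mean Gaussian noise added."""
    return [p + rng.gauss(0.0, sigma) for p in profile]
```

Since every interaction is a one-step episode, each stored transition only needs the state, the action, and the immediate reward.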
Table 2 lists the hyper-parameters used in the PPO model.

Results
In this section, we present the results of the simulations we carried out and compare the performance of the stochastic RL charge controller with a simple benchmark controller. The benchmark controller regulates charging power based on a droop strategy given by the function below. ζ is set to 0.01 in our case study.

Table 2 Hyper-parameters of the PPO model

Hyper-parameter           Value
Layers and layer dims.    Figure 1
Activation functions      Figure 1
Learning rate             Actor: 1 × 10^−6; Critic: 1 × 10^−5

After running the training loop over 15 epochs, we observe a relatively smooth convergence of the stochastic charging agent, as shown in Fig. 4. The variance bounds indicate the variability of the expected reward, which is high at the start of the training and then stabilizes at roughly 0.1 after 15 epochs. Note that we stopped agent training after 15 epochs, although longer training could have resulted in tighter variance bounds. The stochastic charging agent predicts a charging power upper bound with a mean of approximately 68% of the maximum charging power of the station (Fig. 5a).
The peak shaving effect takes place only at specific times of the day when the charging power demand exceeds the upper bound forecast of the stochastic charging agent, as shown in Fig. 6a. We also observe that, in comparison to the benchmark strategy, the stochastic charging agent enforces higher charging rates when possible (Fig. 5b). The voltage impact of peak shaving is depicted in Fig. 6b. Over the 10-day validation period, the stochastic control strategy results in 17 instances of voltage dead-band violations (0.1% of the total observed time steps), whereas the benchmark strategy results in zero violations. However, the proposed strategy provides a 7.4% higher charging rate during the same period, on average. Furthermore, during the peak charging times (time steps 400-1000 of each day), the proposed strategy provides an additional 39.07% average charging rate compared to the benchmark strategy.
Figure 7a is a graphical depiction of how the SOC, charging rates, and $\alpha_t$ are related to each other. Firstly, we observe that the charging rates increase when the SOC is lower, which is the expected behavior of the inner optimization. However, the sensitivity of this relationship is governed by $\alpha_t$. If the constraint is strict (low $\alpha_t$), the charging rate becomes more sensitive to changes in SOC. Conversely, if the charging power constraint is lenient, the sensitivity of the charging rate to SOC is lower.
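The reported share of voltage dead-band violations can be cross-checked with a short calculation. The sketch assumes that every one-minute step of the 10-day validation period counts as an observed time step, which is not stated explicitly in the text.

```python
MINUTES_PER_DAY = 24 * 60
validation_days = 10
violations = 17

# Assumption: one observation per minute over the whole validation period.
total_steps = validation_days * MINUTES_PER_DAY
violation_share = violations / total_steps

print(f"{total_steps} steps, {100 * violation_share:.2f}% with violations")
# prints "14400 steps, 0.12% with violations"
```

Under this assumption, the result rounds to the 0.1% quoted for the stochastic control strategy.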
The charging trajectories (profiles) describe the change of SOC of a vehicle over time (Fig. 7b). Due to the negative dependency of the charging rates on SOC, the charging profiles of the AEVs are, by default, non-linear. Charging trajectories can progress linearly only when the total charging power requirement is less than the constraint set by the stochastic charging agent. As the SOC increases towards 100%, the charging rate slows down. The non-linearity of the charging profiles is exacerbated when the charging power constraint is more stringent, for example, between time steps 400-600. Figures 6a and 7b jointly enable us to visualize that when charging power demand is higher, there is more non-linearity in charging trajectories.

Fig. 6 a The peak shaving effect of the stochastic charging agent, b The voltage magnitudes at the charging node with and without charging control over the 10-day validation period, sorted in ascending order

Fig. 7 a Relationship between the SOC and charging rates, b Charging trajectories of the EVs. Observe that there is a reduction of the charging rate as the SOC of the vehicle increases, particularly when the charging demand is high (e.g., between time steps 400 and 600)

As a result, as the SOC of a vehicle increases beyond a certain threshold, it may become unproductive for an AEV to remain connected to the charging point, given the diminishing charging rates. This behavior provides an additional degree of freedom for intelligent decision-making and optimization. For example, we can argue that in a sharing economy, it is much better to have two vehicles at 70% SOC than to have one vehicle fully charged and the other one at, say, 40%. The additional degree of freedom encourages faster turnover of vehicles and can improve the use of limited charging resources. While we do not address this question in the current article, we would like to present it to the research community as a promising area to investigate.

Conclusion
This article presents a policy gradient RL based strategy to solve the optimal electric-vehicle charging problem considering both charging rates and voltage violations. We formulate the problem as an optimization problem with two levels. To solve the outer-level optimization problem, we train a stochastic agent using PPO. The inner level is a non-linear optimization problem, subject to the boundary condition evaluated by the PPO agent. The case study presented in the article serves as a proof of concept for the applicability of stochastic RL controllers for AEV charging management in a smart-grid.
Comparison against the benchmark controller with a droop strategy illustrates that both control schemes can shave the peak demand and manage statutory voltage limit violations. In addition, the stochastic RL controller also optimizes the charging rate, reducing the total charging time. However, we observe some instances (0.1% of the entire time duration) when the statutory voltage limit is violated under the stochastic RL control scheme. This observation highlights the critical detail that, due to the probabilistic nature of decision making, there is a non-zero chance for a stochastic RL agent to make a decision that leads to an undesirable state. Since our case study is not safety-critical, we can allow a small number of instances when the voltage constraint is violated. However, this is an essential consideration for integrating stochastic RL controllers in weaker grids and requires further investigation.
There is a multitude of open research avenues extending from our work. One apparent future step is to investigate the impacts of stochastic charging control under different circumstances, such as fast charging and more complex grid topologies. Moreover, estimating the benefits to the upstream network, especially under different formulations of the control objective, is also a promising avenue for future research. Such problems are challenging for the learning process of the PPO agent, which may call for better feature extraction and state-space representations.
From an algorithmic and architectural viewpoint, understanding the benefits and drawbacks of different RL model architectures in high-resolution and partially observable environments has many practical advantages. Most current work focuses on prediction problems at low temporal resolutions. However, applying RL for real-time control problems in the smart-grid domain requires robust models that handle highly stochastic time series data.
Optimal control of AEV charging has broader consequences. If appropriately designed, optimal charge controllers can be used to improve energy security, quality of mobility services, economic efficiency, and social equity, as pointed out in the case study. However, as of now, the energy, social, and economic nexus of AEV management and control is a largely untouched topic.
We believe that such research directions have tremendous value because while the smart-grid future is at our doorstep, we often need to build solutions with technical, economic, and social relevance based on partial data.