The methodology proposed in this paper aims to serve as decision support for participants in local energy markets, regarding the amount of energy to be transacted, the price bid/offered for that transaction, and the use of flexibility to counter possible extra costs related to participation in the P2P market. In this methodology, the players' representation agents can use reinforcement learning-based training to improve their participation in P2P markets using simulation environments. For this, two DRL algorithms are used separately, i.e., multi-agent versions of DDPG and TD3, allowing agents to choose which one to take advantage of. In order to facilitate the use of this methodology in real contexts, it was integrated into the Agent-based ecosystem for Smart Grid modelling (A4SG), from which agents can request training and use the resulting policies in their real-time participation. A4SG is a multi-agent system framework developed by the authors to digitalize smart grid operation models. It was therefore used to integrate the proposed methodology due to its ability to provide agents with useful mechanisms that facilitate their active participation in the smart grid, as described in the following subsection.
A4SG
A4SG, conceived and developed by the authors, whose architecture for integrating the proposed methodology is depicted in Fig. 2, combines the concepts of multi-agent systems (MAS) and agent communities (ACOM) to produce an ecosystem in which multiple agent-based systems can coexist and interact. ACOMs are smaller groupings of agents that can represent aggregation entities, such as energy communities. The use of several groups of agents allows a distributed and intelligent decision-making process, with the integration of different services in the groups, considering their objectives. Furthermore, the ecosystem takes advantage of two novel mechanisms, i.e., branching and mobility, to improve the agents' context and performance. The A4SG ecosystem is built on top of the Python-based Agent Communities Ecosystem (PEAK) (https://www.gecad.isep.ipp.pt/peak) and the Smart Python Agent Development Environment (SPADE) (Palanca et al. 2020), which enable agent communication and distributed execution. Besides that, it uses the Citizen Energy Communities Operator System (CECOS) (Pereira et al. 2021) as graphical interface, which enables access to useful services, such as tariff management and demand response simulation.
As agents may pursue different objectives simultaneously, which can involve engaging in multiple ACOMs concurrently or even subscribing to various services, the branching mechanism was developed to offer this capability to the agents of the ecosystem. Branching is the technique that permits the deployment of a new branch agent that acts as an extension of the representation agent to achieve a specific objective. There are two types of branch agents: goal-oriented and service-oriented agents. A goal-oriented agent tries to achieve an objective, which might be, for instance, the subscription of a service or the participation in an ACOM. Service-oriented agents, on the other hand, provide services to other agents. In the context of this work, branching is important since it allows an agent to keep its representation in an energy community and simultaneously deploy a goal-oriented agent to perform the RL training, whose only objective is to return the trained policy to the agent that later participates in the P2P market.
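As an illustration of this pattern, the following minimal Python sketch shows a representation agent spawning a goal-oriented branch agent whose only task is to run the RL training and hand the resulting policy back. The class and method names (`RepresentationAgent`, `TrainingBranchAgent`, `run_training`) are hypothetical and do not correspond to the actual A4SG/PEAK API.

```python
# Hypothetical sketch of the branching pattern described above; the class and
# method names do not correspond to the real A4SG/PEAK API.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrainedPolicy:
    """Container for the policy returned by the training branch agent."""
    algorithm: str                       # e.g., "DDPG" or "TD3"
    weights: dict = field(default_factory=dict)


class TrainingBranchAgent:
    """Goal-oriented branch agent whose only goal is to train and return a policy."""

    def __init__(self, parent: "RepresentationAgent", algorithm: str):
        self.parent = parent
        self.algorithm = algorithm

    def run_training(self, iterations: int) -> TrainedPolicy:
        # Placeholder for the RL training loop (DDPG or TD3) in the simulation ACOM.
        policy = TrainedPolicy(algorithm=self.algorithm)
        # Once training finishes, the policy is handed back to the parent agent.
        self.parent.receive_policy(policy)
        return policy


class RepresentationAgent:
    """Agent that keeps representing the player in its energy community."""

    def __init__(self, name: str):
        self.name = name
        self.policy: Optional[TrainedPolicy] = None

    def branch_for_training(self, algorithm: str, iterations: int = 100) -> None:
        # Deploy a goal-oriented branch agent dedicated to RL training.
        branch = TrainingBranchAgent(parent=self, algorithm=algorithm)
        branch.run_training(iterations)

    def receive_policy(self, policy: TrainedPolicy) -> None:
        # The trained policy is later used for real-time P2P participation.
        self.policy = policy


if __name__ == "__main__":
    agent = RepresentationAgent("player_1")
    agent.branch_for_training("TD3")
    print(agent.policy.algorithm)
```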
The agents' mobility in A4SG is divided into two types: physical mobility and virtual mobility, supported by the Computation Load Balancing Mechanism and the Virtual Mobility Mechanism, respectively. Physical mobility within the ecosystem enables agents to move to a different physical location, for instance a different host, such as a computer or a server. From an individual point of view, the primary advantage is the convenience that mobility may provide to the entity represented by the agent. From the ecosystem standpoint, the Computation Load Balancing Mechanism makes use of physical mobility in A4SG to balance the computational load across the available hosts in the ecosystem. In this type of mobility, the destination host's main agent is responsible for confirming the move, enabling the consideration of existing constraints, such as communication or the physical resources available. The Virtual Mobility Mechanism, more important in the context of this methodology, enables agents to move to other agent communities (e.g., energy retailers) deployed on the same physical host, in order to make use of their services, interact with other agents, or get access to shared resources at the destination entity (e.g., a citizen energy community). Agents that make use of this type of mobility can engage in aggregation entities that are a good fit for their profiles, bringing them closer to realizing the full potential of their energy resources. The primary distinction between virtual and physical mobility is that virtual mobility occurs within the same physical host, eliminating the need for the agent to restart its execution. In the context of this work, an agent can, for instance, enter both RL training ACOMs, perform a training with a few iterations to understand which algorithm best suits its profile, and from there move to the ACOM that will bring it better results.
Reinforcement learning training
The reinforcement learning training in the proposed methodology focuses on two main blocks: the environment and the agents. The environment incorporates the P2P market model used and provides each agent with customized observations. The agents receive the observations from the environment, compute the action to take, determined by the policy or by the exploration mechanism, and then execute the after-market phase. The architecture of the methodology is shown in Fig. 3. As can be seen, although there are several agents, each deciding its own actions, these actions are centralized when entering the environment and the P2P market, in order to guarantee the integrity of the environment. After training, only the policies developed by each agent are returned to the A4SG agents.
Regarding RL, both algorithms, i.e., TD3 and DDPG, are used under the same conditions, that is, with the same types of observations and actions, and with rewards calculated in the same way. The observation of an agent's state includes several factors that are important for the players' decisions when participating in the market. The observation for player \(p\) in period \(t\) is given by:
$${o}_{t}^{p}=({Forecast}_{t}^{p}, {Flexibility}_{t}^{p}, {Transactions}_{t-1}^{p}, {PeriodTime}_{t}, {ToU}_{t}, {FiT}_{t})$$
(2)
where \({Forecast}_{t}^{p}\) is the demand forecast of player \(p\) for period \(t\), in kWh, \({Flexibility}_{t}^{p}\) is the forecasted flexibility of player \(p\) for period \(t\), also in kWh, \({Transactions}_{t-1}^{p}\) is the list of transactions made by player \(p\) in period \(t-1\) in the P2P market, including information about the prices and quantities of energy transacted, \({PeriodTime}_{t}\) provides information about the period of the day represented by period \(t\), and \({ToU}_{t}\) and \({FiT}_{t}\) are, respectively, the time-of-use price for buying energy from the grid and the feed-in tariff for selling energy to the grid in period \(t\), in EUR/kWh.
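For illustration, the observation of Eq. (2) can be assembled as a flat vector before being fed to the policy network. The sketch below is only an assumption about how such a vector could be built; the field names and the fixed number of transaction entries are illustrative, not the authors' implementation.

```python
import numpy as np


def build_observation(forecast_kwh: float,
                      flexibility_kwh: float,
                      last_transactions: list[tuple[float, float]],
                      period_of_day: int,
                      tou_price: float,
                      fit_price: float,
                      max_transactions: int = 5) -> np.ndarray:
    """Flatten the observation of Eq. (2) into a fixed-length vector.

    last_transactions: list of (price_eur_per_kwh, quantity_kwh) pairs from
    period t-1, padded/truncated to max_transactions entries.
    """
    # Pad or truncate the transaction list so the observation has a fixed size.
    padded = (last_transactions + [(0.0, 0.0)] * max_transactions)[:max_transactions]
    transaction_features = [value for pair in padded for value in pair]

    return np.array([forecast_kwh, flexibility_kwh, *transaction_features,
                     period_of_day, tou_price, fit_price], dtype=np.float32)


# Example: 2.4 kWh forecast, 0.5 kWh flexibility, one previous transaction.
obs = build_observation(2.4, 0.5, [(0.12, 1.0)], period_of_day=18,
                        tou_price=0.20, fit_price=0.05)
```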
The agents' actions are related to the strategy that each one of them develops to participate in the P2P market. Thus, each agent generates two different actions, one regarding the price and the other regarding the amount of energy to transact, both in the interval [0, 1], representing a percentage value. The actions of each agent are given by:
$${a}_{t}^{p}=({aPrice}_{t}^{p}, {aQuantity}_{t}^{p})$$
(3)
where \({aPrice}_{t}^{p}\) represents the action relative to the price to pay for energy and \({aQuantity}_{t}^{p}\) is the action that indicates the amount of energy to trade in the P2P market, both expressed in percentage terms, regarding period \(t\) and player \(p\).
Regarding the proposed exploration mechanism, two types of exploration are used in order to create a greater range of actions considered. The mechanism is triggered by a uniformly random value generated in the interval [0, 1]. If the value is lower than 0.8, the actions chosen according to the policy are applied without any change. If the value is equal to or higher than 0.8 and lower than 0.9, exploration with Gaussian noise is activated. Finally, if the value is equal to or higher than 0.9, completely random values are used for all actions. Noise exploration explores action values relatively close to those the policy considers ideal, whereas random exploration explores any value within the considered ranges.
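A minimal sketch of this exploration scheme is given below; the 0.8/0.9 thresholds follow the description above, while the noise standard deviation is an assumed value.

```python
from typing import Optional

import numpy as np


def explore(policy_action: np.ndarray,
            noise_std: float = 0.1,
            rng: Optional[np.random.Generator] = None) -> np.ndarray:
    """Apply the two-level exploration mechanism to a policy action in [0, 1]^2."""
    rng = rng or np.random.default_rng()
    draw = rng.uniform(0.0, 1.0)

    if draw < 0.8:
        # Exploit: use the policy action unchanged.
        action = policy_action
    elif draw < 0.9:
        # Explore near the policy action with Gaussian noise (assumed std).
        action = policy_action + rng.normal(0.0, noise_std, size=policy_action.shape)
    else:
        # Fully random exploration over the whole action range.
        action = rng.uniform(0.0, 1.0, size=policy_action.shape)

    # Keep both actions (price and quantity) within the valid [0, 1] interval.
    return np.clip(action, 0.0, 1.0)


action = explore(np.array([0.55, 0.30]))  # [aPrice, aQuantity]
```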
The actions generated by the policy or by the agents' exploration mechanism have, by themselves, no direct meaning. Agents must therefore frame these actions in their context to generate the offers and the strategy to participate in the market. Regarding the price, so that both buyers and sellers feel motivated to participate, the prices offered/asked are limited to the purchase and sale prices of energy on the grid. Regarding the amount of energy to transact, the agents consider the forecast error, so the first step is to determine the potential error in a period. This error is computed using the evaluation metric of the forecasting algorithm at testing time. If the metric is the Mean Absolute Percentage Error (MAPE), it must be multiplied by the forecasted value for the period in question; if it is the Mean Absolute Error (MAE), it is used directly as the error value. After calculating the error, the value of \({aQuantity}_{t}^{p}\) is applied within the forecast's possible range. As such, the price to pay and the amount of energy to be transacted are given by the following equations:
$${BidPrice}_{t}^{p}={aPrice}_{t}^{p}*\left({ToU}_{t}-{FiT}_{t}\right)+ {FiT}_{t}$$
(4)
$${Error}_{t}^{p}=\left\{\begin{array}{c}MAPE*{Forecast}_{t}^{p}, if Metric=MAPE\\ MAE, if Metric=MAE\end{array}\right.$$
(5)
$${BidQuantity}_{t}^{p}={aQuantity}_{t}^{p}*(\left({Forecast}_{t}^{p}+{Error}_{t}^{p}\right)-\left({Forecast}_{t}^{p}- {Error}_{t}^{p}\right))+ \left({Forecast}_{t}^{p}- {Error}_{t}^{p}\right)$$
(6)
where \({BidPrice}_{t}^{p}\) represents the price to pay in the P2P market, in EUR/kWh, \({Error}_{t}^{p}\) is the mean error of the forecast of the player, in kWh, and \({BidQuantity}_{t}^{p}\) is the amount of energy to transact in the P2P market, in kWh, all regarding player \(p\) in period \(t\).
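To make the mapping of Eqs. (4)-(6) concrete, the sketch below converts the normalized actions into an actual bid. It is a direct transcription of the equations; the input values in the example are illustrative only.

```python
def build_bid(a_price: float, a_quantity: float,
              forecast_kwh: float, tou: float, fit: float,
              metric: str, metric_value: float) -> tuple[float, float]:
    """Map normalized actions in [0, 1] to a P2P bid (Eqs. 4-6)."""
    # Eq. (4): bid price bounded by the grid sell (FiT) and buy (ToU) prices.
    bid_price = a_price * (tou - fit) + fit

    # Eq. (5): forecast error derived from the forecasting model's evaluation metric.
    error = metric_value * forecast_kwh if metric == "MAPE" else metric_value

    # Eq. (6): bid quantity inside the interval [forecast - error, forecast + error].
    lower, upper = forecast_kwh - error, forecast_kwh + error
    bid_quantity = a_quantity * (upper - lower) + lower

    return bid_price, bid_quantity


# Example: MAPE of 10% on a 2.0 kWh forecast, ToU = 0.20, FiT = 0.05 EUR/kWh.
price, quantity = build_bid(0.5, 0.5, forecast_kwh=2.0, tou=0.20, fit=0.05,
                            metric="MAPE", metric_value=0.10)
# price = 0.125 EUR/kWh, quantity = 2.0 kWh
```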
Bearing in mind that this methodology uses the hour-ahead market, energy forecast models, and not real values, are needed to carry out the market. Therefore, the true impact can only be measured in the period after the transactions are carried out, when the real values of consumption and generation are known. Thus, as shown in Fig. 4, in period \(t\) the data related to the transactions carried out are stored, while in period \(t+1\) the interactions with the grid to buy or sell energy are carried out and the reward for period \(t\) is calculated.
The calculation of the reward is directly linked to the savings achieved by the player through participation in the P2P market. The first step is to calculate the cost or profit of buying or selling the energy to the grid (i.e., \({CostGrid}_{t}^{p}\)), where the actual demand of the player is multiplied by the corresponding market price, as represented in Eq. (7). The next step is to calculate the money transacted in the P2P market (i.e., \({CostMarket}_{t}^{p}\)), given by the summation of the price multiplied by the energy transacted in each deal of the market, as represented in Eq. (8).
$${CostGrid}_{t}^{p}={Demand}_{t}^{p}*\left\{\begin{array}{c}{Price}_{t}^{Buy}, if {Role}_{t}^{p}=Buyer\\ {Price}_{t}^{Sell}, if {Role}_{t}^{p}=Seller\end{array}\right.$$
(7)
$${CostMarket}_{t}^{p}= \sum_{i=0}^{N}({TransactedEnergy}_{i}* {Price}_{i})$$
(8)
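The two reference quantities of Eqs. (7) and (8) can be computed as in the sketch below, which is a direct transcription of the equations; the variable names and example values are illustrative.

```python
def cost_grid(demand_kwh: float, role: str,
              price_buy: float, price_sell: float) -> float:
    """Eq. (7): cost/profit of trading the whole real demand directly with the grid."""
    price = price_buy if role == "Buyer" else price_sell
    return demand_kwh * price


def cost_market(transactions: list[tuple[float, float]]) -> float:
    """Eq. (8): money transacted in the P2P market.

    transactions: list of (transacted_energy_kwh, price_eur_per_kwh) deals.
    """
    return sum(energy * price for energy, price in transactions)


# Example: a buyer with 2.1 kWh of real demand and two P2P deals.
grid = cost_grid(2.1, "Buyer", price_buy=0.20, price_sell=0.05)   # 0.42 EUR
market = cost_market([(1.0, 0.12), (0.8, 0.10)])                  # 0.20 EUR
```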
Even with market transactions, the interaction with the grid to buy/sell energy is almost inevitable. This is because, using the forecast as the basis for the amount of energy to be transacted, there will always be errors, even if small, that make this interaction mandatory. The amount of energy to buy/sell from/to the grid (i.e., \({EnExtra}_{t}^{p}\)) is given by Eq. (9) and is the difference between the amount of energy traded in the market and the real demand. In order to reduce the cost of interacting with the grid when it is necessary to buy, that is, when a buyer does not transact enough energy in the market or when a seller transacts more energy than it has, flexibility is used to reduce costs. Equations (10) and (11) describe the process of calculating how much flexibility is needed and the cost associated with this interaction in the after-market phase. In Eq. (10), the amount of flexibility used is given by the minimum between \({Flexibility}_{t}^{p}\) and \({EnExtra}_{t}^{p}\), both regarding player \(p\) in period \(t\), in kWh. The results of Eqs. (9) and (10) are then used to calculate the cost of buying/selling energy from/to the grid in the after-market phase. Flexibility is used only in the periods that demand the buying of more energy from the grid.
$${EnExtra}_{t}^{p}= \sum_{i=0}^{N}({TransactedEnergy}_{i})-{Demand}_{t}^{p}$$
(9)
$${UsedF}_{t}^{p}= \mathrm{min}({EnExtra}_{t}^{p}, {Flexibility}_{t}^{p})$$
(10)
$${CostExtra}_{t}^{p}=\left\{\begin{array}{c}{EnExtra}_{t}^{p}*{FiT}_{t}, if {Role}_{t}^{p}=Buyer AND {EnExtra}_{t}^{p}\ge 0\\ {EnExtra}_{t}^{p}*{FiT}_{t}, if {Role}_{t}^{p}=Seller AND {EnExtra}_{t}^{p}<0\\ \left({EnExtra}_{t}^{p}-{UsedF}_{t}^{p}\right)*{ToU}_{t}, if {Role}_{t}^{p}=Seller AND {EnExtra}_{t}^{p}\ge 0\\ \left({EnExtra}_{t}^{p}-{UsedF}_{t}^{p}\right)*{ToU}_{t}, if {Role}_{t}^{p}=Buyer AND {EnExtra}_{t}^{p}<0\end{array}\right.$$
(11)
Finally, the reward is calculated by measuring the impact of the participation in the P2P market on reducing costs or increasing profits, so the formula differs for sellers and buyers. The reward is then given by:
$${r}_{t}^{p}=\left\{\begin{array}{c}{CostGrid}_{t}^{p}-{CostMarket}_{t}^{p}+{CostExtra}_{t}^{p}, if {Role}_{t}^{p}=Buyer\\ {CostMarket}_{t}^{p}-{CostGrid}_{t}^{p}-{CostExtra}_{t}^{p}, if {Role}_{t}^{p}=Seller\end{array}\right.$$
(12)
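Putting Eqs. (9)-(12) together, the after-market phase of a single player can be sketched as below. The function is a direct transcription of the equations and assumes that \({CostGrid}_{t}^{p}\) and \({CostMarket}_{t}^{p}\) were already computed as in Eqs. (7)-(8); the example values are illustrative.

```python
def after_market_reward(role: str,
                        demand_kwh: float,
                        transacted_energy_kwh: float,
                        flexibility_kwh: float,
                        cost_grid: float,
                        cost_market: float,
                        tou: float,
                        fit: float) -> float:
    """Reward of Eq. (12) after the after-market settlement of Eqs. (9)-(11)."""
    # Eq. (9): imbalance between energy traded in the P2P market and real demand.
    en_extra = transacted_energy_kwh - demand_kwh

    # Eq. (10): flexibility actually used to reduce purchases from the grid.
    used_flex = min(en_extra, flexibility_kwh)

    # Eq. (11): cost/revenue of settling the imbalance with the grid.
    if (role == "Buyer" and en_extra >= 0) or (role == "Seller" and en_extra < 0):
        cost_extra = en_extra * fit
    else:
        cost_extra = (en_extra - used_flex) * tou

    # Eq. (12): savings relative to trading exclusively with the grid.
    if role == "Buyer":
        return cost_grid - cost_market + cost_extra
    return cost_market - cost_grid - cost_extra


# Example: a buyer that bought 2.4 kWh in P2P against a real demand of 2.1 kWh.
reward = after_market_reward("Buyer", demand_kwh=2.1, transacted_energy_kwh=2.4,
                             flexibility_kwh=0.5, cost_grid=0.42, cost_market=0.29,
                             tou=0.20, fit=0.05)
# reward = 0.42 - 0.29 + 0.3 * 0.05 = 0.145 EUR of savings
```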
In order to facilitate the development and integration of the methodology in the A4SG ecosystem, the OpenAI Gym toolkit (Brockman et al. 2016) and the Ray RLlib library (Liang et al. 2017) were used. The OpenAI Gym toolkit enables the research, development, and application of RL. It integrates a large number of well-known tasks that expose a common interface, allowing direct comparison of the performance of various RL algorithms. In addition, environments that follow the OpenAI Gym settings and requirements are often efficient in training processes that involve a high number of iterations. The Ray RLlib library provides the implementation of several RL algorithms, including the ones used in the proposed methodology, i.e., DDPG and TD3. If the environments where the algorithms are applied are OpenAI Gym-compliant, then the integration between the two libraries is quite straightforward, since the agents that RLlib provides already support and aim to make this connection.
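For context, the sketch below outlines the kind of Gym-compliant environment skeleton that such an integration builds on. It follows the classic `gym.Env` interface (Brockman et al. 2016); the class name, observation size, and placeholder logic are assumptions for illustration and do not reproduce the authors' environment or its multi-agent RLlib wrapper.

```python
import gym
import numpy as np
from gym import spaces


class P2PMarketEnv(gym.Env):
    """Assumed skeleton of a Gym-compliant P2P market environment."""

    def __init__(self, obs_size: int = 16):
        super().__init__()
        # Observation of Eq. (2), flattened to a fixed-length vector.
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf,
                                            shape=(obs_size,), dtype=np.float32)
        # Actions of Eq. (3): aPrice and aQuantity, both in [0, 1].
        self.action_space = spaces.Box(low=0.0, high=1.0,
                                       shape=(2,), dtype=np.float32)

    def reset(self):
        # Return the initial observation of an episode.
        return np.zeros(self.observation_space.shape, dtype=np.float32)

    def step(self, action):
        # Placeholder: run the P2P market with the submitted bid, then the
        # after-market phase once real consumption/generation values are known.
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        reward = 0.0   # Eq. (12), computed in the period after the transaction.
        done = False
        return obs, reward, done, {}
```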