Virtualization for performance guarantees of state estimation in cyber-physical energy systems

The strong interdependence between power systems and information and communication technologies (ICT) makes cyber-physical energy systems susceptible to new disturbances. State estimation (SE) is a vital part of energy management systems and supports several monitoring, management, and control services. Failure of the SE service leads to a loss of situational awareness, which in turn has a detrimental impact on grid operation. Therefore, it is essential to maintain the performance of the SE service. Modern technologies such as virtualization provide the flexibility to reallocate, reconfigure, and manage services as a countermeasure to mitigate the impact of disturbances. This paper introduces the virtualization of the SE service as a potential approach to maintain its performance in the case of disturbances. Following a review of existing approaches for maintaining the performance of the SE service, a description of the proposed approach is provided. The benefits of virtualizing the SE service are demonstrated via a simulation test platform with an ICT-enriched CIGRE MV benchmark grid.

State estimation is a key service responsible for the real-time monitoring of the power system (PS) (Abur and Exposito 2004). It involves gathering field measurements via sensors [e.g., intelligent electronic devices (IEDs), remote terminal units (RTUs)] and processing the received measurements to estimate the system state variables, i.e., voltage magnitude and phase angle. These results are used by several other grid services, e.g., voltage control (Klaes et al. 2020). ICT disturbances such as component failures or congestion in the communication network can impact the performance of the SE service and lead to incorrect control decisions. The 2003 Northeast blackout, for example, was mainly caused by a software failure in the state estimator, which left the system operator with inaccurate situational awareness (NERC 2004). This indicates the importance of the SE service. Performance degradation of the SE service due to ICT disturbances can drive the PS into an emergency state because of the resulting loss of monitoring and situational awareness. Evidently, there is a need to guarantee the performance of the SE service in the face of disturbances.
The SE service can currently be executed in either centralized or distributed mode (mostly predefined during the design phase), and is restricted by hardware-software setups (Krüger et al. 2018). This limits the flexibility to react to disturbances in a timely manner. Centralized SE provides a more accurate estimation of state variables than distributed SE due to redundancy and interconnected measurements (Kotha and Rajpathak 2022). Its major drawbacks are the increased latency between the field and the SE server (Cosovic et al. 2017) and the existence of single points of failure (e.g., the SE server) (Alves et al. 2019). Distributed SE, on the other hand, benefits from local processing, which reduces latency and increases the guarantee of receiving measurements, especially in case of disturbances affecting the traffic load and the field measurements (Merlino et al. 2022). These paradigms, with their limited flexibility, show the existing trade-off between accuracy and reliability of the SE service in cyber-physical energy systems (CPES) (Grahn 2017). In this regard, grid function virtualization (GFV) can be used to provide flexibility and enable switching the execution mode of the SE service to maintain its performance.

Related work
Several solutions exist to help the SE service withstand ICT disturbances in CPES and maintain its performance. In the case of disturbances affecting field measurements, one way to regain observability is to use output clustering, as shown in Jevti (2020). By augmenting the measurement set (which alone is insufficient for SE) with the respective cluster variables, system observability is regained and situational awareness is maintained. In the same context, pseudo measurements are used to replace unavailable critical measurements, as investigated in Pau et al. (2017). However, pseudo measurements, which have a high standard deviation, may not be able to provide the system operator with the desired SE results (Hassan 2021). From an ICT perspective, the performance of the SE service is also affected by communication network disturbances such as congestion. Al-Rubaye et al. (2017) propose optimizing the usage of communication resources (e.g., delay) by analyzing the quality of network traffic. While this guarantees that quality-of-service (QoS) requirements are met, it does not allocate communication network resources based on the requirements of the SE service in the PS. For instance, the SE service can require more data samples in order to give better estimates in case of certain PS disturbances. Moreover, disturbances affecting the functionality of the SE service, such as software failures (NERC 2004), are not considered. These papers only consider solutions that are restricted by the hardware-software setup. Providing the SE service with the flexibility required to benefit from the available field measurements and servers in the grid has not yet been addressed in the literature.

Contribution
This paper presents an approach for virtualization of the SE service, which enables switching its execution mode between centralized and distributed, and examines its potential to maintain the performance of the SE service in case of disturbances. Operational states of the SE service are used to assess the impact of disturbances on its performance. The virtualization of the SE service is detailed in terms of its operation and implementation. Finally, the ability to maintain the performance of the SE service is evaluated with a simulation-based proof of concept using an ICT-enriched CIGRE MV benchmark grid.

Conceptual background
In this section, the foundation of the GFV concept used in the paper is summarized. This concept is an essential part of this work. GFV is used to enhance operational flexibility by enabling grid service management. It includes three layers: infrastructure, service, and management. The infrastructure layer represents the computational resources of the hardware, such as memory and processor. The service layer represents the virtualized services, which can run with flexible resource allocation. These resources are virtualized by the management layer, which is the most important part of the architecture. The management layer has two components: a service controller and a local manager on each grid hardware device.
The service controller, which is located in the control room, has a global view of the various grid services. It can monitor the computational resources (e.g., memory, CPU) of the grid hardware with the aid of ICT monitoring tools. Additionally, the management layer has a service description catalog that contains information about grid services along with their QoS requirements. Information regarding communication network resources is acquired via the interface to the communication network. The local manager monitors the status of virtualized grid services as well as the computational resources of the hardware, and provides this information to the service controller. The service controller aggregates information regarding the operation of grid services as well as the available resources. For example, in the case of disturbances (e.g., hardware failures), the service controller detects abnormal behaviour in the CPU usage. Using information from the catalog, it decides on suitable mitigating actions and sends them to the local manager. The local manager then executes these decisions (e.g., start, stop) on the virtualized grid services. Further detailed information about GFV can be found in Attarha et al. (2020).

SE service considering virtualization
This section presents the approach of SE service virtualization proposed in this paper. The work elaborated in this section is an exemplary case study that will be used later in the proof of concept. The CPES infrastructure is presented first, followed by a description of the SE service with a focus on its execution modes and operational states. Next, the management of the virtualized SE service with the aid of GFV is elaborated.

Infrastructure
In this paper, the exemplary PS model considered is the CIGRE MV benchmark grid (Alam et al. 2020). The grid consists of 15 busbars, with bus 0 considered to be the slack bus connected to the external grid. Operational technology (OT) devices (e.g., IEDs, RTUs) are located at the busbars (or substations), as shown on the left side of Fig. 1. Measurement data from OT devices is transmitted through the communication network to SE servers for processing.
The communication network topology for the aforementioned power grid model is designed based on the methodology elaborated in Moussa et al. (2017), as shown on the right side of Fig. 1. Each substation is associated with an edge router. Since there are no power lines between bus 0, bus 1 and bus 12, these three buses are located in the same substation and are thus associated with one edge router. A local area network (LAN) is used for local communication between the devices in the substation (e.g., sensors) and the edge router. For the wide area network (WAN), two communication architectures are elaborated, taking into consideration the centralized and distributed SE implementations. A central architecture is considered for centralized execution of SE, where all OT devices send their data via the core routers to the central SE server located in the control room (CR). For distributed execution of SE, a distributed architecture is considered, in which the edge routers communicate with each other via communication links if the substations are neighbors (i.e., substations that are connected by power lines). Servers for the distributed SE service are assumed to be located at bus 0, bus 5, bus 8 and bus 14.

Execution modes of SE service
The SE service estimates the PS state variables, i.e., bus voltage magnitudes and angles, based on field measurements. Typical measurements are line currents, active (P) and reactive (Q) power flows in the lines, and P and Q injections at buses. In centralized execution of the SE service, measurements are transmitted from the sensors to the SE server via the communication network. The received measurements are processed by an SE algorithm in the control room. A failure of the control room server causes the SE service to fail if there is no backup or redundant server. A weighted least squares (WLS) algorithm, which is presented in Thurner et al. (2018), is used in this paper for centralized execution of the SE service.
For the distributed execution of the SE service, the PS model addressed in this paper is assumed to be partitioned into four areas, having n_1 = 4 buses (bus 1, 2, 3, 4), n_2 = 3 buses (bus 5, 6, 7), n_3 = 4 buses (bus 8, 9, 10, 11), and n_4 = 3 buses (bus 12, 13, 14), respectively, as shown in Fig. 1. Note that this partition can be changed depending on the use case. Field measurements are transmitted locally to the appropriate SE servers via the inter-substation communication. Measurements are grouped as internal (or local) and boundary measurements. The SE servers perform the distributed SE algorithm by processing the internal measurements and exchanging the boundary measurements with the neighbor servers. For the distributed SE service, a WLS algorithm is considered as elaborated in Korres (2010).
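The WLS principle shared by both execution modes can be illustrated with a small example. The following is a minimal sketch, not the implementation used in this paper: it assumes a linear measurement model z = Hx + e with a constant matrix H, whereas practical SE uses a nonlinear model solved iteratively. The function names `solve_linear` and `wls_estimate` and the toy two-state system are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("gain matrix is singular: SE is not solvable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def wls_estimate(h, z, sigma):
    """Linear WLS estimate: minimise sum(((z_k - h_k x) / sigma_k) ** 2).

    h     -- measurement matrix (assumed constant), one row per measurement
    z     -- measured values
    sigma -- standard deviation of each measurement
    """
    m, n = len(h), len(h[0])
    w = [1.0 / s ** 2 for s in sigma]  # weight = inverse variance
    # Gain matrix G = H^T W H and right-hand side H^T W z.
    gain = [[sum(w[k] * h[k][i] * h[k][j] for k in range(m)) for j in range(n)]
            for i in range(n)]
    rhs = [sum(w[k] * h[k][i] * z[k] for k in range(m)) for i in range(n)]
    return solve_linear(gain, rhs)


# Toy system with two states and three measurements: x1, x2 and x1 - x2.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
x_hat = wls_estimate(H, z=[1.00, 0.98, 0.02], sigma=[0.01, 0.01, 0.01])
```

Because each measurement is weighted by its inverse variance, a measurement with a large standard deviation has little influence on the estimate.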

Operational states of SE service
The performance of the SE service can be assessed using three criteria: availability, latency, and correctness (Klaes et al. 2020). These criteria are used to qualitatively assess the impact of disturbances on the operation of the SE service. Availability assesses the existence of the required field measurements at a given time, which are the input for the SE service. Latency is a QoS requirement of the communication network and refers to the time between transmission and reception of data. Correctness captures the quality of input measurements, which directly impacts the quality of the SE service output. It can be impacted by disturbances such as noise interference or cyber-attacks. The states of the SE service can be defined as follows (Klaes et al. 2020; Hassan 2021):
Normal state: All requirements of the SE service are satisfied. Thus, the SE service is fully functional and can be used by the operator to get real-time situational awareness of the PS. If no disturbances have impacted the SE service, or if the disturbances that occurred have been absorbed, then the SE service is considered to be in normal state.
Limited state: In this state, SE performance is partially degraded due to certain disturbances, which have impacted the availability, latency, and/or correctness. The operator is aware that the SE service needs to be used with caution, as there is an increased risk of further degradation. This state characterizes the resiliency of the SE service.
Failed state: This state indicates that the SE service is no longer functional, i.e., the requirements on availability, latency, and/or correctness are violated. Therefore, it affects other grid services and can lead to inaccurate control actions. The operator should take suitable actions to restore the functionality of the SE service.
The operation of the SE service depends on the availability of sensors, i.e., gathering field measurements, and on the SE servers, i.e., processing the field measurements and performing the SE algorithm. The solvability of an SE service requires the availability of sufficient field measurements. According to Korres (2010) and Salau et al. (2014), the typical condition for the solvability of the SE service is ρ(H) = n for centralized mode, where ρ(H) is the rank of the measurement Jacobian matrix H and n is the number of state variables. Similarly, for distributed mode, the condition is ρ(H_i) = n_i for each area S_i. Field measurements in distributed mode include local measurements gathered via sensors and boundary measurements exchanged with the neighbor servers. Due to ICT disturbances, the available field measurements may not be sufficient to fulfill the solvability condition. Note that some disturbances such as link failures can be prevented with the distributed communication network architecture due to the local processing and the availability of alternative paths. In such cases, solvability can be satisfied by substituting the missing measurements with suitable pseudo measurements m_p, which are derived from historical data and have a higher standard deviation than field measurements (Dehghanpour et al. 2018). Therefore, using m_p increases the uncertainty in the SE results. Based on this, states for the centralized and distributed modes of SE, which provide the operator with information about its performance, can be outlined as shown in Fig. 2. Note that latency and correctness are out of the scope of this paper and will be considered in future work.
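The solvability condition and its link to the operational states can be illustrated as follows. This is a minimal sketch under stated assumptions: the Jacobian is a small constant matrix, `matrix_rank` and `se_state` are hypothetical helper names, and the mapping used here (normal when field measurements alone satisfy the rank condition, limited when pseudo measurements are needed, failed otherwise or when the server is down) is an interpretation of the description above rather than a reproduction of Fig. 2.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a matrix via Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in r] for r in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        if rank == len(m):
            break  # every row already holds a pivot
        pivot = max(range(rank, len(m)), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [v - factor * p for v, p in zip(m[r], m[rank])]
        rank += 1
    return rank


def se_state(h_field, h_pseudo, n_states, server_available=True):
    """Classify the SE state from availability only (assumed mapping)."""
    if not server_available:
        return "failed"
    if matrix_rank(h_field) == n_states:
        return "normal"   # rank(H) = n with field measurements alone
    if matrix_rank(h_field + h_pseudo) == n_states:
        return "limited"  # solvable only with pseudo measurements m_p
    return "failed"


# Toy example with n = 2 state variables.
full = [[1, 0], [1, -1]]   # all field measurements available
reduced = [[1, 0]]         # one measurement lost
pseudo = [[0, 1]]          # pseudo measurement for the missing quantity
states = [se_state(full, [], 2), se_state(reduced, pseudo, 2),
          se_state(reduced, [], 2)]
```

For distributed mode, the same check would be applied per area, i.e., to each H_i with its own n_i.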

Management
Virtualization of the SE service requires management and orchestration of virtualization tasks throughout the life-cycle of an SE service. The GFV approach examined in Krüger (2020) is used for the management of the virtualized SE service. Service controller, local manager, and service description catalog are the main modules considered in this work. The local manager monitors the availability of the virtualized SE service and the computational resources of the hardware, and provides this information to the service controller via the communication network. The communication network also provides the service controller with information regarding its resources (e.g., latency). The service controller then aggregates the received information and assesses the state of the virtualized SE service. This state provides the operator with additional information about the performance of SE. In the case of disturbances affecting the SE service (e.g., software failures), the service controller detects the performance degradation of the SE service to failed state. The service description catalog (cf. descriptor files in Attarha et al. 2020) provides the service controller with information about the SE service in terms of its execution mode specifications (e.g., server locations, slack buses). Using this information, the service controller reallocates the execution mode of the SE service and sends the reconfiguration actions to the local managers. The local managers execute the received actions and start running the new execution mode of the virtualized SE service (switching from centralized to distributed or vice versa).
Fig. 2 States of centralized and distributed SE service
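The decision step of the service controller described above can be sketched in a few lines. This is an illustrative sketch only, not the GFV implementation: the names `ModeSpec`, `CATALOG`, and `decide_reconfiguration`, the server identifiers, and the rule "switch only when the state is failed and all servers of the alternative mode are reported up" are assumptions made for the example. The server locations follow the infrastructure described earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeSpec:
    """Entry of a (hypothetical) service description catalog."""
    name: str
    servers: tuple  # servers that must be available to run this mode


CATALOG = {
    "centralized": ModeSpec("centralized", ("control_room",)),
    "distributed": ModeSpec("distributed", ("bus0", "bus5", "bus8", "bus14")),
}


def decide_reconfiguration(current_mode, se_state, server_up):
    """Return (new_mode, actions) where actions are sent to local managers.

    server_up maps a server name to True/False as reported by the
    local managers; unknown servers are treated as unavailable.
    """
    if se_state != "failed":
        return current_mode, []  # no switch needed
    for name, spec in CATALOG.items():
        if name == current_mode:
            continue
        if all(server_up.get(s, False) for s in spec.servers):
            actions = [(s, "stop") for s in CATALOG[current_mode].servers]
            actions += [(s, "start") for s in spec.servers]
            return name, actions
    return current_mode, []  # no feasible alternative mode


reports = {"control_room": False, "bus0": True, "bus5": True,
           "bus8": True, "bus14": True}
new_mode, actions = decide_reconfiguration("centralized", "failed", reports)
```

Keeping the mode specifications in a catalog, separate from the decision logic, mirrors the role of the service description catalog: adding a further execution mode would only require a new catalog entry.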

Proof of concept
The goal of virtualization of the SE service is to maintain its performance in case of disturbances by enabling flexible reconfiguration and switching of its execution mode. This section presents a simulation-based proof of concept to demonstrate the benefits of the proposed approach. An overview of the test platform is provided first, followed by a description of the scenarios considering disturbance sequences. Next, the simulation results along with the state of the SE service are presented and discussed.
Figure 3 shows the overview of the simulation test platform. The PS is simulated using Pandapower, in which the CIGRE benchmark grid is modelled for the test case (left side of Fig. 1) (Thurner et al. 2018). ICT network components, i.e., sensors, routers, and links, are simulated using EXata, a real-time communication simulator, which interfaces the PS and the SE servers (cf. right side of Fig. 1). Measurements are gathered via sensors and transmitted to the corresponding SE servers for processing. The SE service is implemented in Python in both execution modes (i.e., centralized and distributed). The centralized execution is simulated using Pandapower, while the distributed one is developed based on Korres (2010).

Fig. 3 Overview of connected tools and software in the test platform
The virtualization technology is implemented using Docker, which is a lightweight open source containerization tool. It provides the capability to build, execute, and manage containers. A Docker engine controls the local containers (Liu 2014). Docker engines participate as either worker or manager nodes. All SE servers are connected as worker nodes to the service controller, which is the manager node. Note that in the case of disturbances affecting the service controller, the operator can reassign the service controller to another suitable node (Meadusani 2018). The service controller is responsible for assessing the state of SE considering its requirements. Based on this state and information from the service description catalog (e.g., number of distributed servers, buses of the same area), the service controller decides whether or not to switch and, if necessary, reconfigures the execution mode of the SE service.

Scenarios
The proof of concept aims at demonstrating the capability of the virtualization of an SE service to maintain its performance in case of disturbances. Disturbances in the ICT system are defined as faults associated with components (e.g., sensors, links, servers). As this work serves as a proof of concept, a selected set of exemplary ICT disturbances was chosen to illustrate the resulting impact on the SE performance. Table 1 shows these selected disturbances. Sensor failures (d_s5, d_s8) can occur due to hardware or software problems, while d_link13 can result from physical damage to the communication links (e.g., fiber-optic cables). A failure in the state estimator (d_cse) can also happen because of a software bug. This causes the SE to become unavailable or to function too slowly, and thus causes a loss of situational awareness (NERC 2004).
The case study has three simulation scenarios. In scenario 1 (baseline scenario), measurements are gathered from buses and received by the SE service, which is executed in centralized mode and is located in the control room. Disturbances d_s5 and d_link13 cause measurements of bus 5 and bus 13 to become unavailable. In scenarios 2 and 3, the disturbance sequence considered is d_s8, d_link13, and d_cse. Disturbances d_s8 and d_link13 cause the loss of measurements belonging to bus 8 and bus 13. Thus, the solvability condition is not satisfied. Additionally, d_cse results in loss of functionality of the central state estimator. Note that the remaining part of the control center is not affected and is still functional. Scenario 2 (without virtualization of SE) shows the degradation of SE performance, while scenario 3 (with virtualization of SE) demonstrates the capability of the approach to mitigate the impact of disturbances by switching the execution mode of the SE service from centralized to distributed. Note that in all scenarios the SE service is initialized in centralized mode and its state is initialized as normal.

Results and discussion
For scenario 1, disturbances d_s5 and d_link13 cause measurements of bus 5 and bus 13 to become unavailable. However, the solvability condition is still met (i.e., ρ(H) = n). This is because measurements of bus 5 and bus 13 are redundant measurements, the unavailability of which does not impact the solvability condition. Accordingly, the SE service is in normal state, as proven in Hassan (2021). This shows the relevance of redundant measurements in the implementation of the SE service to increase its robustness against loss of measurements. In this case, the SE is performing well and the operator can use SE results for decision making (cf. Fig. 2). The simulation results for scenarios 2 and 3 are summarized in Fig. 4 along with the operational states of the SE service. The disturbance sequence considered in these two scenarios is d_s8, d_link13, and d_cse. Disturbance d_s8 does not affect the solvability condition due to redundancy. However, measurements belonging to bus 13 become critical. This is because measurements corresponding to bus 8 and bus 13 belong to the same critical set (Hassan 2021). Hence, the unavailability of one of the measurements makes the remaining measurements in the set critical. The next disturbance, d_link13, causes the loss of critical measurements corresponding to bus 13. Thus, the solvability condition is violated. In this case, pseudo measurements m_p (corresponding to either bus 8, bus 13 or bus 14) are available and can be used to fulfill the solvability condition. This causes the state of SE to degrade to limited (cf. Fig. 2). This is followed by a software failure in the centralized state estimator (d_cse), which causes a loss of SE functionality. It can be seen that in scenario 2 (without virtualization of the SE service), this causes the SE state to degrade to failed. However, unlike in scenario 2, the impact of the disturbance is mitigated in scenario 3 with the virtualization of the SE service.
It enables monitoring the SE state, which is then followed by a fast switching of its execution mode to distributed as a mitigation action. Due to the successful mitigation of the disturbance, the SE continues to provide estimated results, thereby maintaining its performance (i.e., it remains in limited state). This shows the capability of virtualization of the SE service to mitigate the impact of certain ICT disturbances, which otherwise would have affected its performance and the normal operation of the PS.
Fig. 4 Overview of the estimation results with the SE state for scenarios 2 and 3
Figure 5 shows the sequence of events involved in the virtualization of SE regarding monitoring the SE state and execution mode reallocation. Information on components (e.g., memory) from local managers (Docker) is sent to the service controller via the ICT system simulator (EXata). The service controller (Docker) uses this information to assess the state of the SE service. When a degradation of the SE state is detected, an alarm is raised in the service controller. The service controller gathers information about the execution mode to which the SE service will switch, along with its requirements, from local managers and a local information file. Then, the service controller first decides how to reconfigure the SE and reallocate the corresponding servers. Second, it implements this decision via the corresponding local managers. The decision is to switch the execution mode and process the SE in the distributed servers.
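The state trajectories reported for scenarios 2 and 3 can be replayed with a small state machine. This is a simplified sketch, not the simulation platform: it encodes only the outcomes described above (bus 8 and bus 13 form a critical set, pseudo measurements are assumed to be available, and a server failure is mitigated by a mode switch only when virtualization is enabled). The names `replay`, `CRITICAL_SET`, and `SEQUENCE` are hypothetical.

```python
CRITICAL_SET = {8, 13}  # buses whose measurements form one critical set

# Disturbance sequence of scenarios 2 and 3: d_s8, d_link13, d_cse.
SEQUENCE = [("sensor", 8), ("link", 13), ("server", "cse")]


def replay(disturbances, virtualized):
    """Return a list of (disturbance, mode, state) after each disturbance."""
    lost_buses = set()
    mode = "centralized"
    server_ok = True
    trace = []
    for kind, target in disturbances:
        if kind in ("sensor", "link"):
            lost_buses.add(target)  # measurements of this bus are lost
        elif kind == "server":
            if virtualized:
                mode = "distributed"  # mitigation: switch execution mode
            else:
                server_ok = False     # central estimator stays down
        if not server_ok:
            state = "failed"
        elif CRITICAL_SET <= lost_buses:
            state = "limited"  # solvable only with pseudo measurements
        else:
            state = "normal"   # remaining redundancy keeps rank(H) = n
        trace.append(((kind, target), mode, state))
    return trace


scenario_2 = replay(SEQUENCE, virtualized=False)
scenario_3 = replay(SEQUENCE, virtualized=True)
```

In this replay, the two scenarios differ only in the last step, where the mode switch keeps the SE service in limited state instead of letting it fail.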

Conclusion and future work
This paper proposes an approach for maintaining the performance of SE by switching its execution mode (i.e., centralized vs. distributed) in case of disturbances using virtualization. This is done based on monitoring the state of SE. The proposed approach has been demonstrated in a case study using the CIGRE MV benchmark grid augmented with an ICT system. Operational states, which represent the SE performance, are used to assess the impact of disturbances. Depending on the performance degradation, the operator can decide to switch the SE execution mode, thereby maintaining its performance. This is done with the aid of GFV, which enables flexible reconfiguration of grid services and reallocation of computational resources.
The operational state of SE is one of the key indicators for switching the SE mode. The assessment of SE execution modes in this paper is based on Hassan (2021). As a next step, the assessment of distributed mode considering its specifications and implementation will be addressed in further studies. Furthermore, the scalability of the presented approach will be evaluated. Large-scale PS models that contain numerous sensors and SE servers will be used to test the capabilities of the approach. A comprehensive evaluation with a methodology (e.g., design of experiments) that systematically evaluates the performance of the approach is also planned.