Comparison of deep learning algorithms for site detection of false data injection attacks in smart grids
Energy Informatics volume 7, Article number: 71 (2024)
Abstract
False Data Injection Attacks (FDIA) pose a significant threat to the stability of smart grids. Traditional Bad Data Detection (BDD) algorithms, deployed to remove low-quality data, can easily be bypassed by these attacks, which require minimal knowledge about the parameters of the power bus systems. This makes it essential to develop defence approaches that are generic and scalable to all types of power systems. Deep learning algorithms provide state-of-the-art detection for FDIA while requiring no knowledge about system parameters. However, very few works in the literature evaluate these models for FDIA detection at the level of an individual node in the power system. In this paper, we compare several recent deep learning-based models that have demonstrated high performance and accuracy in detecting the exact location of the attacked node: convolutional neural networks (CNN), Long Short-Term Memory (LSTM), attention-based bidirectional LSTM, and hybrid models. We then compare their performance with a baseline multi-layer perceptron (MLP). All the models are evaluated on the IEEE-14 and IEEE-118 bus systems in terms of row accuracy (RACC), computational time, and the memory space required for training the deep learning model. Each model was further investigated through a manual grid search to determine the optimal architecture of the deep learning model, including the number of layers and the number of neurons in each layer. Based on the results, the CNN model exhibited consistently high performance with a very short training time. LSTM achieved the second highest accuracy; however, it required a higher training time on average. The attention-based LSTM model achieved a high accuracy of 94.53% during hyperparameter tuning, while the CNN model achieved a moderately lower accuracy with only one-fourth of the training time. Finally, the performance of each model was quantified on different variants of the dataset, which varied in their \({\text{l}}_{2}\)-norm. Based on the results, LSTM and CNN obtained the highest accuracy, followed by CNN-LSTM and lastly MLP.
Introduction
The development of smart grids has enabled safe and efficient power transfer from generators to consumers. By integrating information and communication technologies (ICTs) such as sensors and Internet of Things (IoT) devices, two-way communication between the grid and consumers is facilitated, making consumers active participants in the power consumption process. This enhanced control empowers consumers, while the electric grid gains increased flexibility to adapt to power requirements through a constant data stream. With the transition to renewable energy, smart grids have become crucial due to their ability to reduce the risk of power outages, given the intermittent nature of renewable energy sources.
Various incentive and pricing-based mechanisms exist to balance demand with the available power supply. Smart grids allow for more granular and real-time control over pricing in an open power market, allowing for effective management of peak power demand and reducing pricing volatility over time. In short, smart grids equip cities to utilize energy sources with a high proportion of green energy while effectively managing consumer demand in real-time.
Moreover, smart grids exhibit superior reliability and resilience to blackouts compared to conventional grids. In the latter, when a transmission line fails, the load is shifted to other lines, potentially causing a cascade of transmission line failures and ultimately crippling the entire grid (Chopade and Bikdash 2016). In contrast, smart grids can detect and isolate faults, preventing them from affecting the entire power grid. The isolated sections of the grid can then be diagnosed easily because of the two-way communication enabled by the ICT infrastructure. Additionally, the distributed nature of smart grids allows customers to become electricity producers by feeding renewable energy into the grid through net-metering (Brown and Zhou 2019). In such cases, customers are referred to as “prosumers” rather than consumers within the smart grid framework. Figure 1 illustrates the differences between conventional and smart grids.
Due to the benefits mentioned above, substantial investments have been made in the development of smart grid systems. However, the development of this technology is still in its early stages, especially when it comes to security, making it vulnerable to various cyber-attacks. One of the most critical and harmful cyber-attacks is FDIA. To counteract these attacks, several methods have been deployed, one of which is BDD.
BDD uses state estimation (SE), which removes duplicate data and incorrect measurements, and state variables (SVs), such as voltage magnitude and phase angle, to detect anomalies at the distribution end (e.g., electricity theft).
SE is performed at each meter/node to obtain the non-noise part of measurements, i.e., SVs. These SVs are transmitted for appropriate action at the control room, such as adjusting the voltage/angle for the general area. Mathematically, for n-dimensional measurement \(\text{z}={\left({\text{z}}_{1},{\text{z}}_{2},\dots ,{\text{z}}_{\text{n}}\right)}^{\text{T}}\) and the m-dimensional system state \(x={\left({x}_{1},{x}_{2},\dots ,{x}_{m}\right)}^{T}\), the relationship is given in Eq. (1).
where \(e\) represents the noise and \(H\) is a matrix (for DC state estimation) of partial derivatives unique to each power system. The framework is adapted from Wang et al. (2020). A straightforward approach to detecting anomalous values via state estimation is to compute the residual (r) by the \({l}_{2}\)-norm between \(z\) and \(Hx\) and compare it against a pre-defined threshold (\(t\)) as shown in Eq. (2).
However, BDD is vulnerable to manipulation, exposing smart grids to cyber-attacks. The purpose of FDIA is to manipulate the state estimation while still satisfying the BDD criterion. Mathematically, the attack can be expressed by Eq. (3), with \(c\) being a nonzero vector added to the normal states.
By changing the value of \(z\), the new state estimation can then be described as in Eq. (4), where the attack vector is \(a=Hc\).
BDD can detect a random change in the value of \(z\). To bypass BDD, the attacker therefore needs an attack vector that is equal to \(Hc\). This ensures that the \({l}_{2}\)-norm of the residual, as given in Eq. (2), remains the same. Mathematically, this is described in Eq. (5), obtained by substituting Eqs. (3) and (4).
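The relations referenced as Eqs. (1) to (5) are not reproduced in this text. A sketch consistent with the surrounding definitions, assuming the standard linear (DC) state-estimation model, is given below. The symbols \(\widehat{x}\) (estimated state), \(z_a\) (compromised measurement), and \(\widehat{x}_a\) (compromised estimate) are notation assumed here. In order, the lines show the measurement model, the BDD residual test, the compromised measurement and state estimate, and the residual under attack:

\[
\begin{aligned}
z &= Hx + e \\
r &= {\Vert z - H\widehat{x}\Vert }_{2}, \qquad \text{bad data is flagged if } r > t \\
z_a &= z + a, \qquad \widehat{x}_a = \widehat{x} + c \\
{\Vert z_a - H\widehat{x}_a\Vert }_{2} &= {\Vert z + a - H(\widehat{x} + c)\Vert }_{2} = {\Vert z - H\widehat{x} + (a - Hc)\Vert }_{2} = {\Vert z - H\widehat{x}\Vert }_{2} \quad \text{when } a = Hc
\end{aligned}
\]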
If \(a\) is generated so that Eq. (5) is fulfilled, the new compromised state estimate \(\widehat{x}\) is considered correct by BDD; hence, the attack takes place and goes unnoticed.
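The argument above can be checked numerically. The following is a minimal sketch on a hypothetical three-measurement, two-state system; the matrix `H`, the noise values, the offset `c`, and the threshold `t` are all illustrative assumptions, and an unweighted least-squares estimator stands in for the state estimator. It compares the BDD residual for clean data, for a stealthy attack with \(a=Hc\), and for a naive attack on a single measurement.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def estimate_state(H, z):
    """Least-squares estimate for a two-state system: solve (H^T H) x = H^T z."""
    Ht = [list(col) for col in zip(*H)]
    A = [[sum(p * q for p, q in zip(Ht[i], Ht[j])) for j in range(2)]
         for i in range(2)]
    b = matvec(Ht, z)
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    # Cramer's rule for the 2x2 normal equations.
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def residual_norm(H, z):
    """l2-norm of z - H x_hat, the quantity BDD compares with its threshold."""
    x_hat = estimate_state(H, z)
    return math.sqrt(sum((zi - hi) ** 2 for zi, hi in zip(z, matvec(H, x_hat))))

# Hypothetical toy system: 3 measurements, 2 state variables.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [1.0, 2.0]                       # true state (assumed)
e = [0.05, -0.02, 0.01]              # small measurement noise (assumed)
z = [hx + ei for hx, ei in zip(matvec(H, x), e)]

c = [0.5, -0.3]                      # attacker's state offset (assumed)
a_stealth = matvec(H, c)             # a = Hc lies in the column space of H
a_naive = [0.5, 0.0, 0.0]            # tampering with a single reading

r_clean = residual_norm(H, z)
r_stealth = residual_norm(H, [zi + ai for zi, ai in zip(z, a_stealth)])
r_naive = residual_norm(H, [zi + ai for zi, ai in zip(z, a_naive)])

t = 0.1                              # hypothetical BDD threshold
print("clean passes BDD:  ", r_clean <= t)
print("stealth passes BDD:", r_stealth <= t)
print("naive passes BDD:  ", r_naive <= t)
```

In this toy case the stealthy attack leaves the residual unchanged up to floating-point error, while the naive attack raises it above the threshold and is flagged.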
With the advancement of AI and deep learning techniques, data-driven algorithms for FDIA detection have recently become very popular. Many research studies focusing on deep learning techniques have been published, particularly due to the availability of high computing power. Based on the literature review presented in Sect. "Literature Review", several models have recorded high accuracy and demonstrated good performance. However, some of these models require significant resources. Decision-makers therefore need a comprehensive comparison of the commonly high-performing models to understand their generalization, robustness, and computational requirements. This work contributes to the development of secure smart grids that are resilient to FDIA in the following ways:
- Identify robust deep learning models and their architectures to detect the attack at every node/site in the smart grid (using row accuracy).
- Quantify the trade-off between computational requirements and performance of different deep learning algorithms for FDIA detection, at the granularity of each node in the smart grid.
- Benchmark the performance of deep learning algorithms under different scenarios of attack severity (indicated via \({l}_{2}\)-norm).
The paper is organised as follows. Sect. "Literature Review" briefly describes the previous works in the field of FDIA detection. The background and methodology are discussed in Sect. "Background and Methodology". Sect. "Experiments and Results Discussion" explains the results obtained. Finally, the paper is concluded in Sect. "Conclusion".
Literature review
Various research works have taken different approaches to classify the types of attacks in smart grids. For instance, Musleh et al. (2020) based their classification on the attack delivery method, i.e., cyber-based, network-based, communication-based, and physical-based attacks. On the other hand, Cui et al. (2020) took a more holistic approach to account for all types of false data attacks, including FDIA, replay attacks (Ding et al. 2018), and zero dynamic attacks (Teixeira et al. 2015). In terms of the type of algorithm used for FDIA detection, a spectrum of techniques exists in the literature. Broadly, they can be divided into two categories: model-based and data-driven detection algorithms (Musleh et al. 2020). A comparison between the two is given in Appendix, Table 3.
Model-based algorithms for FDIA detection
These models do not require prior training but rely only on the processed data and the system configuration/parameters (Musleh et al. 2020). They can be further divided into quasi-static and dynamic models, depending on whether the changing nature of the system configuration is accounted for in the modeling. The difference between the two is that in dynamic models, the system’s state depends not only on the present measurements/data but also on the previous states. Mathematically, this is described in Eq. (6).
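Eq. (6) is not reproduced in this text. Assuming the usual discrete-time state-space form implied by the symbol definitions that accompany it (the time subscripts on \(z\), \(e\), and \(v\) are notation assumed here), it can be sketched as:

\[
{x}_{t} = f\left({x}_{t-1}\right) + {v}_{t}, \qquad {z}_{t} = h\left({x}_{t}\right) + {e}_{t}
\]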
where \(h\) is the non-linear power flow equation, \(f\left(.\right)\) relates the value of the state variable \(x\) at the previous time (\(t-1)\) to its value at the current time \(t\), and \(e\) and \(v\) are the error terms. Vector Autoregression (VAR) (Shi et al. 2018) and Kalman filters (KF) (Kurt et al. 2019) are two such examples. For quasi-static models, examples include Median Filtering (MF) (Lukicheva et al. 2018) and the Kriging Estimator (KE) (Kallitsis et al. 2016). This work is more concerned with the state-of-the-art deep learning approaches for FDIA detection, and those approaches fall mainly in the category of data-driven algorithms. Section "Data-Driven Algorithms for FDIA Detection" describes those algorithms in detail.
Data-driven algorithms for FDIA detection
Compared to the previous approach, this category of algorithms does not depend on the manual setting of model parameters. Instead, the algorithms learn the most suitable parameters from a dataset specific to the problem. All machine learning algorithms, and deep learning algorithms in particular, fall into this category. This dependency on historical data leads to higher memory and computational cost. However, the advantage of these models is that they are more scalable and give state-of-the-art performance, which is essential in operationally critical applications such as smart grids. In the current decade, deep learning algorithms have surpassed the performance of most traditional algorithms (Chauhan and Singh 2018). For this reason, our exploration of the literature focuses mainly on deep learning approaches. An overview of different methods proposed in the literature is given in Table 1.
Niu et al. (2019a) proposed a deep learning-based framework to detect FDI attacks using a hybrid RNN-CNN model. It used a dataset generated from the IEEE-39 power system and was fed time-series meter measurements and network traffic features. The granularity of predictions was the entire power system, i.e., whether the whole system has been compromised or not. The detection accuracy was reported at 90% for the most aggressive attack. However, the accuracy decreases significantly when the aggressiveness, measured by the number of compromised buses, decreases.
Wang et al. (2020) proposed a deep learning-based locational detection (DLLD) network to detect FDI attacks for the IEEE-14 and IEEE-118 power systems. The dataset was generated by Bi and Zhang (2014). It utilized a CNN model and documented results for a varying number of layers in the network. In contrast to Niu et al. (2019a), the granularity of predictions was set to each meter, i.e., it could identify the exact location of the compromised node(s). In terms of row accuracy (RACC), defined in the section on the evaluation metrics, DLLD gave 97% and 93% for the IEEE-14 and IEEE-118 power systems, respectively. Mukherjee et al. (2022) expanded similar work on locational detection for the IEEE-118 bus system. They experimented with the CNN model and hybrid models, i.e., CNN-LSTM and CNN-GRU. The reported RACC was 93% for the IEEE-118 bus system, the same as reported by Wang et al. (2020). The development of hybrid models for FDIA detection was also a contribution of this work, but the reported RACC for these models was only 76%, which is much lower than that of the already established CNN. Similarly, this work documented results for a varying number of layers in the network but not for a varying number of neurons in each layer.
Esmalifalak et al. (2017) used an SVM model (a data-driven algorithm) and an anomaly detection algorithm (a model-based algorithm) to detect stealth FDI attacks in the IEEE-118 bus system. The distributed SVM was based on the alternating direction method of multipliers (ADMM). For lower computational complexity, the anomaly detection algorithm used the deviation in measurements to detect the attacks instead of learning from historical data. The granularity of prediction was set to the entire power system. Detection accuracies of 82% and 78% were reported for the SVM and anomaly detection algorithms, respectively.
Recent work in the literature has expanded this application into various domains of deep learning, such as unsupervised learning and transfer learning. An unsupervised learning technique called Attention-based auto-encoder anomaly detector (A3D) was proposed by Kundu et al. (2020). It used real-world power flow data from the Texas grid, and patterns of its temporal evolution were simulated on the IEEE-14 bus system. The granularity of predictions was set to each individual meter. An \({F}_{1}\)-score of 94% was reported. Being an unsupervised learning technique, this approach needed neither a variety of data nor labels in the training data. However, this work does not extend to larger bus systems, and the review of the literature also suggests that the development of unsupervised techniques for FDIA detection is still in the initial stages of research. Xu et al. (2104) utilized transfer learning to consider the dynamic nature of real-world transmission line parameters. The simulated data on the IEEE-14 and IEEE-118 bus systems was considered as the source domain, and the power system with real-world variation in transmission line parameters was considered as the target domain. Prediction granularity was set to each meter, as in Wang et al. (2020), Mukherjee et al. (2022), and Kundu et al. (2020). A deep neural network (DNN) used both the simulated and real-world data for pre-training, while only the latter was used for fine-tuning. It gave the highest detection accuracy of 99.99%, but other metrics such as \({F}_{1}\)-score or RACC were not reported.
The use of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network was proposed by Niu et al. (2019b) and Mukherjee (2023) and tested on the IEEE-39 bus system. They used a combined attack detection mechanism consisting of a static detector, namely a state estimator, and a deep learning-based scheme. The proposed scheme had a detection accuracy above 90% for high \(\frac{k}{n}\), with \(k\) being the number of injected measurements and \(n\) the total number of measurements.
He et al. (2017) proposed a Conditional Deep Belief Network (CDBN) (Feng et al. 2024) that can detect, from real-time measurements, a type of FDIA that can bypass the SVE mechanism. Simulations were conducted to evaluate the performance of the proposed scheme. Compared with ANN-based and SVM-based models, the scheme is more resilient to different numbers of attacked measurements, different detection thresholds of SVE, and some levels of environmental noise.
Mukherjee et al. (2022) proposed a real-time FDIA identification system validated on the IEEE 14-bus system. This method utilizes the error covariance matrix. The architecture uses a nonlinear LSTM structure comprising eight layers, two of which are the input and output layers. The results of this paper showed an accuracy of 95% in detecting the presence of an attack.
Bitirgen and Filik (2023) used CNN-LSTM with particle swarm optimization on a phasor measurement unit dataset obtained from an open-source simulated power system. This approach proved to be the most accurate in their comparison, with 98.94%.
An extremely randomized trees algorithm was proposed by Majidi et al. (2022). Five hidden layers were used, with the number of neurons varying depending on the bus system. The dataset was simulated on the IEEE-14, IEEE-30, IEEE-57, and IEEE-118 power systems. Due to the scarcity of real data, datasets were generated randomly within a 30% range. A stacked autoencoder was used along with the extremely randomized trees classifier, as this decreases the computational complexity. The results showed 98%, 94%, 99%, and 99% for the IEEE-14, IEEE-30, IEEE-57, and IEEE-118 power systems, respectively.
Li et al. (2022) proposed a deep learning approach that combines the Paillier cryptosystem with federated learning. The power systems used to test this architecture are IEEE-14 and IEEE-118. All nodes were able to jointly train a Transformer-based detection model while maintaining the privacy of all the local training data. It was concluded that federated learning proved to be more secure as well as more accurate than the CNN and LSTM algorithms. The proposed method showed a precision of 99% for the IEEE-14 power system, while the CNN and LSTM reached 98% and 0.99%, respectively. This method also showed a precision of 82% for IEEE-118, compared with 60% and 61% for CNN and LSTM, respectively.
Based on the reviewed studies, CNNs, LSTMs, CNN-LSTM hybrids, and attention-based models have shown high performance in FDIA detection across various power systems. Hence, we explore them extensively in this study and compare them to an MLP model as a baseline to examine their effectiveness through a fair comparison.
Background and methodology
Data-driven algorithms for FDIA detection
The FDIA detection problem was framed as a multi-label classification problem, inspired by Wang et al. (2020). The power system data was taken at the granularity of each node, and the training data mapped each node one-to-one to its ground-truth status, i.e., whether it was compromised or not. The output was a vector of binary values whose length equals the number of measurements in the indicated bus system.
The task of the deep learning model was to learn the relationship between these two values using irregularity in the value of the directly mapped node and the connection of that node to the nearest ones. This framework is dependent only on the measurement readings data without the need for any previous state or system parameters. As explained in the next section, this framework was tested for several deep learning models. For any model, the presence of attack could be calculated using Eq. (7)
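Eq. (7) is not reproduced in this text. Assuming a simple per-node threshold rule consistent with the description (the label symbol \({\widehat{y}}_{i}\) and the strict inequality are assumptions made here), it can be sketched as:

\[
{\widehat{y}}_{i} =
\begin{cases}
1, & {O}_{i} > t \\
0, & \text{otherwise}
\end{cases}
\]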
where \({O}_{i}\) indicates the output of the model at the \({i}^{th}\) node. In our case, a threshold value (\(t\)) of 0.5 was used.
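The thresholding step can be illustrated with a short sketch. The function names and the example output values are hypothetical, and treating an output exactly equal to the threshold as "not attacked" is an assumption, since the text does not state how ties are handled.

```python
def locate_attacks(outputs, threshold=0.5):
    """Turn per-node model outputs O_i (in [0, 1]) into binary attack labels."""
    return [1 if o > threshold else 0 for o in outputs]

def attacked_nodes(outputs, threshold=0.5):
    """Return the indices of the nodes flagged as compromised."""
    return [i for i, o in enumerate(outputs) if o > threshold]

# Hypothetical sigmoid outputs for a five-measurement example.
outputs = [0.02, 0.91, 0.47, 0.50, 0.66]
labels = locate_attacks(outputs)
nodes = attacked_nodes(outputs)
print(labels)  # [0, 1, 0, 0, 1]
print(nodes)   # [1, 4]
```

Because every node receives its own label, a single forward pass yields both the presence of an attack (any label equal to 1) and its location.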
Deep learning models used
The theoretical explanation of selected deep learning models is provided here.
Multi-layer perceptron (MLP)
MLP is a feedforward artificial neural network (Bebis and Georgiopoulos 1994) that consists of multiple layers of neurons: an input layer, an output layer, and one or more hidden layers. In an MLP, each neuron is connected to all the neurons in the following layer. These models can classify non-linear data through their layered structure and the non-linear activation applied after every neuron. Our model consists of two dense layers with 180 neurons and uses the non-linear activations ReLU and sigmoid.
Convolutional neural network (CNN)
In contrast to MLP, each neuron in a CNN connects to only a few spatially adjacent neurons. During the forward pass, a matrix of shared weights, known as a kernel/filter, slides over the input data (or over the preceding layer), and during the backward pass, these kernels are adjusted. Compared to traditional approaches, the kernels are learned automatically instead of being handcrafted. The CNN architecture can regularize the neural network and prevent it from overfitting, and it uses memory efficiently because the weights/kernels are shared by multiple neurons. The architecture summary of the model used is shown in Appendix, Table 4. It comprises 1-dimensional convolution and ReLU activation layers. Dropout layers were included as well to reduce overfitting.
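The weight sharing described above can be illustrated with a minimal, framework-free sketch of a 1-dimensional convolution followed by ReLU. The kernel values and the toy readings are hypothetical; as in most deep learning frameworks, the "convolution" is implemented as a sliding dot product without flipping the kernel, and no padding is applied.

```python
def conv1d_valid(signal, kernel, bias=0.0):
    """Slide one shared kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(w * signal[i + j] for j, w in enumerate(kernel)) + bias
        for i in range(len(signal) - k + 1)
    ]

def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]

# Hypothetical standardized readings from five neighbouring nodes;
# the third one deviates from its neighbours.
readings = [0.1, 0.1, 0.9, 0.1, 0.1]

# One kernel with three shared weights (illustrative values that respond
# to a node differing from both of its neighbours).
kernel = [-1.0, 2.0, -1.0]

features = relu(conv1d_valid(readings, kernel))
print(features)  # the largest response is centred on the deviating node
```

The same three weights are reused at every position, so the number of kernel parameters does not grow with the number of measurements, unlike a dense layer. This also relies on neighbouring nodes being adjacent in the input vector.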
Long short-term memory (LSTM)
In contrast, LSTM belongs to the category of recurrent neural networks (RNN) (He et al. 2017). These networks are specifically designed to deal with sequential data such as text and time-series data. Their advantages include better learning of long-term dependencies and a reduced vanishing gradient issue. Unlike a plain RNN, an LSTM uses three gates to control the flow of information into and out of the main cell. This gives the cell the ability to remember information over an arbitrary length of time without running into the vanishing gradient problem. These gates are termed the forget gate, input gate, and output gate. An illustration of a single LSTM cell is given in Fig. 2.
The forget gate deletes/reduces the irrelevant information from previous time steps. It takes as input the previous hidden state (\({h}_{t-1}\)) and the current input (\({x}_{t}\)). Its output ranges from 0 to 1, and the amount of information retained from the previous cell state \({C}_{t-1}\) is proportional to this value. Mathematically, this gate can be represented using Eq. (8),
where \({W}_{f}\) and \({b}_{f}\) represent the weights and biases of the layer comprising the forget gate, and \(\sigma \) denotes the sigmoid activation. Secondly, the LSTM model can save new relevant information in the cell. Two components are required to make this work: a sigmoid layer similar to the one used in the forget gate (which suggests the values that will get updated) and a \(tanh\) layer (which produces new potential values that could be saved in the cell state). Mathematically, these two can be represented by Eqs. (9) and (10),
where \({CC}_{t}\) represents the list of potential values that could be added to the cell state. Now, we have the components ready and using Eq. (11), a new state \({C}_{t}\) could be calculated.
Finally, to calculate the output of the LSTM cell at the current timestep, the cell state \({C}_{t}\) (after applying \(\text{tanh})\) is multiplied by the output of the sigmoid output gate, \({o}_{t}\), as given in Eq. (12),
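The gate equations referenced in this subsection are not reproduced in this text. Assuming the standard LSTM formulation, which matches the symbols defined around them, they can be sketched as follows. The input-gate symbol \({i}_{t}\), the weights \({W}_{i},{W}_{C},{W}_{o}\), the biases \({b}_{i},{b}_{C},{b}_{o}\), and the element-wise product \(\odot \) are notation assumed here. In order, the lines give the forget gate, the input gate, the candidate values, the cell-state update, the output gate, and the cell output:

\[
\begin{aligned}
{f}_{t} &= \sigma \left({W}_{f}\left[{h}_{t-1},{x}_{t}\right] + {b}_{f}\right) \\
{i}_{t} &= \sigma \left({W}_{i}\left[{h}_{t-1},{x}_{t}\right] + {b}_{i}\right) \\
{CC}_{t} &= \tanh \left({W}_{C}\left[{h}_{t-1},{x}_{t}\right] + {b}_{C}\right) \\
{C}_{t} &= {f}_{t}\odot {C}_{t-1} + {i}_{t}\odot {CC}_{t} \\
{o}_{t} &= \sigma \left({W}_{o}\left[{h}_{t-1},{x}_{t}\right] + {b}_{o}\right) \\
{h}_{t} &= {o}_{t}\odot \tanh \left({C}_{t}\right)
\end{aligned}
\]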
where \({h}_{t}\) represents the final output from the cell. Finally, to prevent overfitting, two dropout layers were incorporated in the model design. The architecture summary of the model used is shown in Appendix, Table 5.
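A single LSTM step can be traced with a minimal scalar sketch, assuming the standard LSTM formulation with one hidden unit and one input feature. The function name, the parameter layout (a tuple of hidden weight, input weight, and bias per gate), and all numeric values are illustrative assumptions, not the configuration used in the experiments.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x_t, h_prev, c_prev, params):
    """One timestep of a scalar LSTM cell.

    params maps each gate name to (w_h, w_x, b): the weight on the previous
    hidden state, the weight on the current input, and the bias.
    """
    def pre_activation(gate):
        w_h, w_x, b = params[gate]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre_activation("forget"))         # how much old memory to keep
    i_t = sigmoid(pre_activation("input"))          # how much new content to admit
    cc_t = math.tanh(pre_activation("candidate"))   # candidate values
    c_t = f_t * c_prev + i_t * cc_t                 # updated cell state
    o_t = sigmoid(pre_activation("output"))         # how much of the cell to expose
    h_t = o_t * math.tanh(c_t)                      # cell output
    return h_t, c_t

# Illustrative parameters: the biases push the forget gate towards 1 and the
# input gate towards 0, so the cell should mostly retain its previous memory.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (0.5, 0.5, 0.0),
    "output": (0.0, 0.0, 0.0),
}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.2, c_prev=0.8, params=params)
print(round(c_t, 3))  # stays close to the previous cell state of 0.8
```

With the forget gate near 1 and the input gate near 0, the cell state is carried over almost unchanged, which is the mechanism that lets information persist across many timesteps.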
Attention-based Bi-directional LSTM
In our case, the attention-based models are an extension of the traditional LSTM model (RNN in general). By enabling the network to learn which parts of the input to attend to when producing the target values/sequences, attention overcomes the constraint of the fixed-length representation in the traditional LSTM design. In other words, it focuses on the steps where the relevant information is concentrated (Bebis and Georgiopoulos 1994). This comes with significant computational requirements, so the approach may be most practical for smaller bus systems such as the IEEE-14 bus system. In the model design, dropout layers were included to reduce overfitting. The final architecture summary of the model is presented in Appendix, Table 6.
Hybrid CNN-LSTM Model
This architecture choice is inspired by Mukherjee et al. (2022). First, a few convolutional layers extract features which are then fed to a single dense layer. Then this dense layer is followed by the LSTM layer(s). Finally, the output from LSTM layers is flattened and passed to a dense layer with sigmoid activation. The activation function is \(relu\) and \(tanh\) for convolutional and LSTM layers respectively. The exact configuration for both has been previously described. The model also includes dropout layers to mitigate overfitting as shown in the architecture summary of the model presented in Appendix, Table 7.
As the descriptions show, each model has its own specific strengths. CNNs focus on spatial patterns in the data, which makes them suitable for locational FDIA detection, and they mitigate overfitting well. LSTMs focus on patterns and sequences in power systems, while attention-based LSTMs enhance this capability by attending to the most relevant details. The hybrid CNN-LSTM leverages the strengths of both models, covering broader data patterns. Finally, MLP serves as the baseline non-linear model, providing a useful reference for this study.
Dataset
This work builds on the dataset generated by Wang et al. (2020) using the IEEE-14 and IEEE-118 bus power systems, which is publicly available (GitHub - wsyCUHK, WSYCUHK_FDIA2023). They simulated the data using MATPOWER with the topologies of the two systems, and the generated dataset mimics real-world scenarios with realistic load profiles and several noise levels, as detailed in this section. The IEEE-14 bus system consists of 20 transmission lines and 14 buses, with 19 measurements consisting of 11 flow measurements and 9 injected ones, represented by 19 independent features in the dataset. Similarly, the IEEE-118 bus system has 180 transmission lines and 118 buses. It includes 110 flow measurements and 70 injected values, totalling 180 features in the IEEE-118 dataset.
These nodes are indexed based on the network topology, i.e., nodes that are close to each other receive neighbouring indices. This is because the readings at any node are affected by the neighbouring nodes. This relationship is captured well by the CNN, as explained later in the results section.
Five different variants of the dataset were generated for each power bus system, varying in the \({l}_{2}\)-norm of the injection data, whose value ranged from 1 to 5. A lower \({l}_{2}\)-norm indicates a more subtle attack, i.e., the attacked nodes are only slightly compromised, which makes the attack difficult to detect. A higher \({l}_{2}\)-norm, on the other hand, suggests a more aggressive attacker who is detected relatively easily. This dimension of the data indicates how well the solutions developed for FDIA detection will hold against attackers in different practical scenarios. In all the experiments, \({l}_{2}\)-norm has been set to 1. The univariate distribution of the number of compromised nodes for the IEEE-118 bus system is represented via a Gaussian kernel density estimate in Fig. 3. Further details on the generation process, along with the figure of the indexed bus systems, can be found in Wang et al. (2020) and Bi and Zhang (2014).
The dataset consisted of 100,000 training instances, of which 30% were used for validation for each model. As the validation dataset is considerably large and covers a wide range of samples, it gives a reliable estimate of how well each model generalizes. The test data, consisting of 10,000 instances, was kept separate from the training/validation data. Prior to modelling, the data was standardized by removing the mean and dividing by the standard deviation, so that every feature has zero mean and unit variance. This preprocessing step prevents differences in feature scale from unduly influencing model performance.
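The standardization step can be sketched with the Python standard library alone. The function names and toy values are hypothetical, and fitting the statistics on the training split only and reusing them for the other splits is an assumption about common practice, since the text does not state which split the statistics were computed on.

```python
from statistics import mean, pstdev

def fit_standardizer(rows):
    """Compute the per-feature mean and standard deviation of the training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against constant features, whose standard deviation is zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def standardize(rows, means, stds):
    """Remove the mean and divide by the standard deviation, feature by feature."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy data with two features (illustrative values only).
train = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
test = [[4.0, 10.0]]

means, stds = fit_standardizer(train)        # statistics from training data only
train_std = standardize(train, means, stds)
test_std = standardize(test, means, stds)    # reuse the training statistics
print(train_std)
print(test_std)
```

After this transformation each training feature has zero mean and, unless it is constant, unit variance. The shape of the distribution itself is left unchanged.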
Evaluation metrics
To evaluate the performance of each trained model, two metrics were used. The first is a standard metric for multi-label classification, the \({F}_{1}\)-score, while the second is a custom metric called row accuracy (defined later). The \({F}_{1}\)-score is obtained by taking the harmonic mean of two individual measures called precision and recall. Precision is determined by dividing the number of correctly predicted positive instances by the total number of samples predicted as positive by the model. In a multi-label classification problem, the results from multiple labels must be combined into a single number. For that purpose, micro-averaging is used. In this method, the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are summed over all classes before the metric is computed. This metric can be obtained using Eq. (14),
where \(\mathcal{C}\) represents the total number of nodes, i.e., 19 and 180 for IEEE-14 and IEEE-118, respectively. The second measure, recall, is determined by dividing the number of correctly predicted positive instances by the total number of actual positive instances. The same micro-averaging strategy was used for this metric, and the result is given in Eq. (15).
Both metrics range from 0 (indicating the worst performance) to 1 (indicating a perfect score). They can be combined into a single metric called the \({F}_{1}\)-score, obtained by taking the harmonic mean of Eqs. (14) and (15), as given in Eq. (16).
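Eqs. (14) to (16) are not reproduced in this text. Assuming the standard micro-averaged definitions, with \({TP}_{c}\), \({FP}_{c}\), and \({FN}_{c}\) denoting the counts for node \(c\) (subscript notation assumed here) and the \({F}_{1}\)-score defined as the harmonic mean of precision and recall, they can be sketched as:

\[
\begin{aligned}
\text{Precision} &= \frac{\sum_{c=1}^{\mathcal{C}}{TP}_{c}}{\sum_{c=1}^{\mathcal{C}}\left({TP}_{c}+{FP}_{c}\right)} \\
\text{Recall} &= \frac{\sum_{c=1}^{\mathcal{C}}{TP}_{c}}{\sum_{c=1}^{\mathcal{C}}\left({TP}_{c}+{FN}_{c}\right)} \\
{F}_{1} &= \frac{2\cdot \text{Precision}\cdot \text{Recall}}{\text{Precision}+\text{Recall}}
\end{aligned}
\]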
The drawback of this metric is that it gives the same importance to both precision and recall and, in our case, provides limited insight into the performance of models. To deal with this issue, another metric that enforces stricter criteria of performance was used. For a single instance, a prediction by the model is only considered correct if all the locations are predicted correctly. This is usually referred to as exact match ratio or row accuracy. For the whole dataset, it could be calculated using Eq. (17).
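Eq. (17) is not reproduced in this text. Assuming the usual exact-match definition, with \({y}_{i}\) and \({\widehat{y}}_{i}\) denoting the full label vectors of the \(i\)-th instance (subscript notation assumed here), it can be sketched as:

\[
RACC = \frac{1}{n}\sum_{i=1}^{n} I\left({y}_{i} = {\widehat{y}}_{i}\right)
\]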
where \(I\) is an indicator function, and \(y\), \(\widehat{y}\), and \(n\) represent the ground truth, the predicted value, and the number of examples, respectively. Its output also ranges from 0 (indicating the worst performance) to 1 (indicating a perfect score). In comparison to the \({F}_{1}\)-score, which considers the partial correctness of each instance, \(RACC\) is much harsher: an instance is only counted as correct if all of its labels are predicted correctly.
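The difference between the two metrics can be made concrete with a small sketch. The function names and the toy labels are hypothetical, and the code assumes the standard micro-averaged \({F}_{1}\) and exact-match definitions.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over all nodes and instances."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def row_accuracy(y_true, y_pred):
    """Fraction of instances whose whole label vector is predicted exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)

# Three instances with four nodes each (illustrative labels only).
y_true = [[1, 0, 0, 1], [0, 0, 0, 0], [0, 1, 1, 0]]
y_pred = [[1, 0, 0, 1], [0, 0, 0, 0], [0, 1, 0, 0]]  # one node missed in row 3

f1 = micro_f1(y_true, y_pred)
racc = row_accuracy(y_true, y_pred)
print(round(f1, 3), round(racc, 3))  # 0.857 0.667
```

A single missed node lowers the \({F}_{1}\)-score only slightly, but it invalidates the entire third instance for row accuracy, which is why RACC is the stricter of the two.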
Experiments and results discussion
System configuration
The experiments were conducted on a machine with a Tesla K80 GPU (16 GB), 16 GB of system memory, 750 GB of disk space, and 2 vCPUs at 2.2 GHz. For model training, TensorFlow (version 1.15.2) and Keras (version 2.3.1) were used. For data pre-processing, Pandas (version 1.1.3) and NumPy (version 1.18.1) were used, and for generating visualizations, matplotlib (version 3.2.1) was used. All models were trained using a batch size of 100, the Adam optimizer, row accuracy as the metric, and a number of epochs that varied according to the specific test step, as described in the following subsections. Moreover, ReduceLROnPlateau was used as a callback to reduce the learning rate when the model did not improve. This smooths the learning curve and helps the model converge faster.
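The idea behind reducing the learning rate on a plateau can be sketched without any framework. The class below is a simplified, hypothetical illustration of the rule and not the Keras implementation: the monitored quantity (validation loss), `factor`, `patience`, and `min_lr` values are assumptions, since the text does not report the callback settings, and the Keras callback offers further options such as a cooldown period and a minimum improvement delta.

```python
class PlateauLRReducer:
    """Multiply the learning rate by `factor` after `patience` epochs
    without improvement in the monitored validation loss."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Call once per epoch; returns the learning rate for the next epoch."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr

# Hypothetical validation losses over seven epochs.
losses = [0.9, 0.8, 0.8, 0.81, 0.7, 0.7, 0.7]
reducer = PlateauLRReducer(lr=1e-3)
schedule = [reducer.step(loss) for loss in losses]
print(schedule)
```

In this toy run the learning rate is halved after the fourth epoch and again after the seventh, each time following two consecutive epochs without a new best validation loss.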
Results discussion for IEEE-14 bus system
In terms of performance, the models were judged primarily on RACC. Two factors were considered to evaluate system constraints: computational and system memory requirements, measured by the training time and the number of trainable parameters of each model, respectively. The attention-based bidirectional LSTM model had the highest RACC (94.2%), compared with MLP (83.8%), CNN (92.9%), LSTM (93.7%), and the Hybrid model (93.4%). Apart from MLP, almost all the models gave comparable RACC. Figure 4 shows the learning curves of the best-performing attention-based model and the worst-performing MLP model.
The high performance of the attention-based model came at the cost of higher memory and computation requirements. Its number of trainable parameters was \(6.2\times {10}^{5}\), almost 10 times that of the MLP. With a compromise in RACC of only 1.28%, CNN used nearly six times fewer trainable parameters. The second most expensive model in terms of system memory was the Hybrid model, with \(4.0\times {10}^{5}\) trainable parameters.
The memory requirement of each model, along with its training mechanism, correlated directly with the computational time required to train it. As expected, the attention-based model took the most time, i.e., almost 27 min, about seven times more than the MLP model. CNN, on the other hand, took only about 5 min. The comparison between all the models in terms of RACC, time taken, and the number of trainable parameters is given in Fig. 5.
Figure 6 presents the receiver operating characteristic (ROC) curve which illustrates the performance of MLP, CNN, CNN-LSTM, and LSTM models. All four methods achieved an AUC of 1, indicating their excellent discriminatory power. However, the ROC curve for the MLP method shows a slight bend away from the top left corner compared to the other curves. This suggests that the MLP method has a slightly higher false positive rate relative to the other methods, while still maintaining a high true positive rate. On the other hand, the curves for CNN, CNN-LSTM, and LSTM are closer to the top left corner, indicating their strong performance in accurately distinguishing between positive and negative instances.
Similar observations can be made from the precision-recall curves of the four models, presented in Fig. 7. The MLP curve bends away from the top right corner, indicating a higher false positive rate than the other models. The CNN curve also shows a bend, although it stays closer to the top right corner than the MLP curve. As with the ROC curves, the CNN-LSTM and LSTM curves remain closest to the top right corner, indicating their ability to maintain high precision while minimizing false positives. All models achieve perfect or near-perfect average precision scores. Overall, the models demonstrate strong discriminatory ability, but MLP and CNN show a slightly higher false positive rate than CNN-LSTM and LSTM.
It can be concluded that CNN offered the best trade-off between computational/memory requirements and performance. Attention-based models are only recommended for smaller bus systems, like IEEE-14, if system constraints are not an issue and performance is the top priority.
Results discussion for IEEE-118 bus system
Similar criteria were used to judge the models for the IEEE-118 bus system. The LSTM and CNN models had the highest RACC (81.8%), compared with MLP (75.2%), the attention-based model (63.5%), and the Hybrid model (79.7%). In contrast to the IEEE-14 bus system, the performance of the models showed higher deviation, and not all of them performed comparably.
Figure 8 plots the learning curves of the best-performing LSTM model and the worst-performing attention-based model. It is important to note that the latter model had the highest RACC on the test dataset for the IEEE-14 bus system. This indicates that attention-based models do not scale well to higher-dimensional data.
Another difference, in terms of system memory requirements, was that the number of trainable parameters was comparable across the models for the IEEE-118 bus system. As shown in Fig. 9, the attention-based model had the highest number of trainable parameters, i.e., \(8.8\times {10}^{6}\), which was only about twice the number of parameters in CNN. This number for the rest of the models lies between the two. The second most expensive model in terms of system memory was LSTM, with \(4.3\times {10}^{5}\) trainable parameters.
On the other hand, the computational time required to train each model showed a much higher deviation than the number of parameters. As expected, the attention-based model took the most time, i.e., almost 127 min, which was almost 12 times more than the MLP model. Compared with the IEEE-14 bus system, this model took almost five times longer to train.
The higher computational and memory requirements did not translate into higher performance for the IEEE-118 bus system. On the contrary, the opposite trend was observed. Both LSTM and CNN gave the same performance, but the former took almost four times as long to train. Hence, CNN is the recommended model for the IEEE-118 bus system. The comparison between all the models in terms of RACC, time taken, and the number of trainable parameters is given in Fig. 9.
Based on the ROC curves presented in Fig. 10, the MLP and CNN models demonstrated exceptional discriminatory ability, achieving an AUC of 1 with high sensitivity and low false positive rates. The CNN-LSTM model also achieved an AUC of 1. The LSTM model exhibited lower discriminative power, with an AUC of 0.98. These results highlight the strong performance of these models in distinguishing between positive and negative attack instances.
The same behaviour is seen in the precision-recall curves presented in Fig. 11, where the MLP, CNN, and CNN-LSTM models achieved an average precision (AP) score of 1. This result indicates that these models made accurate positive predictions without false positives, regardless of the threshold used to classify instances. The LSTM model, however, had an AP of 0.94, suggesting a few false positive predictions compared with the other models.
MLP has nearly perfect precision-recall and ROC curves but low accuracy. A classifier can show this combination because of class imbalance and misclassification errors. The model performs well in identifying positive instances, as reflected in the precision-recall curve, and discriminates effectively between positive and negative classes, as indicated by the ROC curve. However, the dominance of the majority class and the misclassifications result in low overall accuracy.
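One way to see why strong per-label curves can coexist with a low RACC is a rough back-of-the-envelope calculation. The sketch below assumes that label errors are independent and that every label has the same accuracy. Both are simplifying assumptions introduced here for illustration, and the label counts of 14 and 118 are hypothetical round numbers, not the actual output dimensions used in the experiments.

```python
def expected_row_accuracy(per_label_accuracy: float, num_labels: int) -> float:
    """Probability that all labels of one instance are correct,
    assuming independent and identically accurate labels."""
    return per_label_accuracy ** num_labels


# Hypothetical label counts for a small and a large system.
for num_labels in (14, 118):
    racc = expected_row_accuracy(0.99, num_labels)
    print(f"{num_labels} labels at 99% each -> row accuracy {racc:.3f}")
```

Under these assumptions, a model that gets each label right 99% of the time still gets the whole row right far less often as the number of labels grows, which is the regime in which RACC is much harsher than label-level metrics.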
In summary, the comparison between the ROC curve and Precision-Recall curve results indicates strong performance for all four classification models (MLP, CNN, CNN-LSTM, LSTM). The models demonstrate excellent discriminative ability, with minor variations in precision. CNN shows consistent performance over all the evaluation techniques, while CNN-LSTM and LSTM exhibit slightly lower precision and MLP exhibits low accuracy.
Resource-performance trade-off analysis
To quantify the trade-off between the computational requirements and the performance of the different models for FDI attack detection, a manual grid search was conducted in which the number of layers and the number of neurons in each layer were varied. The experiments above used 2 layers and 128 neurons per layer. This analysis is needed to choose the optimal configuration for any model based on the application requirements and the resources available. The number of layers was set to 1, 2, and 4, and the number of neurons in each layer was set to 64, 128, and 256, which yielded nine combinations for each model. For each combination, the test RACC, \({F}_{1}\)-Score, and training time are provided in Table 2. Only for the IEEE-118 bus system's LSTM model, attention-based model, and hybrid model, the most complex configuration refers to the number of layers equal to 4 and the number of neurons in each layer equal to 256.
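The nine configurations of the grid can be enumerated with a few lines of Python. This is only a sketch of the search space. The variable names are hypothetical, and the training and evaluation of each configuration are left out.

```python
from itertools import product

# Values taken from the grid described above.
layer_options = (1, 2, 4)
neuron_options = (64, 128, 256)

# Each configuration is a (number of layers, neurons per layer) pair.
grid = list(product(layer_options, neuron_options))

for num_layers, num_neurons in grid:
    print(f"layers={num_layers}, neurons per layer={num_neurons}")
```

The pair `(2, 128)` in this grid corresponds to the configuration used in the earlier experiments.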
For the IEEE-14 power system, the test RACC of all architectures increased consistently as the number of layers and neurons per layer increased, but only by a small amount (less than 8%). The MLP averaged 81.94% across its IEEE-14 configurations, while the CNN, LSTM, attention-based LSTM, and hybrid CNN-LSTM architectures averaged 92.05%, 93.71%, 84.52%, and 93.21%, respectively. The LSTM architecture had the best average RACC, by a margin of up to 13%, but the attention-based LSTM had the highest individual score of 94.53%, with 2 layers and 256 neurons.
For the IEEE-118 power system, the RACC averaged 72.53%, 80.84%, 81.73%, 62.47%, and 81.04% for MLP, CNN, LSTM, attention-based LSTM, and hybrid CNN-LSTM, respectively. These values are lower overall than their IEEE-14 counterparts. The LSTM model provided the best results for the IEEE-118 power system as well, by a margin of up to 23.57%.
The training time for the IEEE-14 and IEEE-118 power systems was, respectively, 4.18 and 12.80 min for MLP, 5.05 and 18.56 min for CNN, 15.00 and 52.02 min for LSTM, 32.92 and 118.73 min for attention-based LSTM, and 15.65 and 60.57 min for hybrid CNN-LSTM. The MLP is therefore the least time-consuming architecture; however, it is less accurate than the other models.
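The trade-off between accuracy and training time reported above can be summarised by checking which models are not dominated, i.e., for which no other model is at least as accurate and at least as fast. The sketch below uses the RACC and time figures quoted in this subsection. It assumes the RACC values are averages over configurations and the times are in minutes, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

# (RACC in %, training time), as quoted in the text; minutes are assumed.
ieee14: Dict[str, Tuple[float, float]] = {
    "MLP": (81.94, 4.18),
    "CNN": (92.05, 5.05),
    "LSTM": (93.71, 15.00),
    "Attention-LSTM": (84.52, 32.92),
    "CNN-LSTM": (93.21, 15.65),
}
ieee118: Dict[str, Tuple[float, float]] = {
    "MLP": (72.53, 12.80),
    "CNN": (80.84, 18.56),
    "LSTM": (81.73, 52.02),
    "Attention-LSTM": (62.47, 118.73),
    "CNN-LSTM": (81.04, 60.57),
}


def pareto_front(models: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the models that no other model beats on both RACC and time."""
    front = []
    for name, (racc, minutes) in models.items():
        dominated = any(
            o_racc >= racc and o_min <= minutes
            and (o_racc > racc or o_min < minutes)
            for other, (o_racc, o_min) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


front14 = pareto_front(ieee14)
front118 = pareto_front(ieee118)
print("IEEE-14:", front14)
print("IEEE-118:", front118)
```

On these figures, MLP, CNN, and LSTM remain on the front for both systems, with CNN sitting close to LSTM in accuracy at a fraction of its training time.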
The attention-based LSTM and the hybrid CNN-LSTM required around twice as long as the other architectures for both the IEEE-14 and IEEE-118 power systems. Training time increased directly with the number of layers and the number of neurons. For example, the LSTM architecture needed 6.09 min with one layer of 64 neurons, while 28.56 min was needed with four layers of 256 neurons.
The above analysis showed that the choice of model and its configuration was more important than the computational resources available. The better configurations consisted of a lower number of layers and a higher number of neurons in each layer. Finally, such analysis also provides insight into choosing the optimal configuration of the model depending on the application, the error tolerance requirement(s), and the computational resources available.
Scalability
The performance of the deep learning models increased substantially with the \({l}_{2}\)-norm of the generated attacks. A higher value of the \({l}_{2}\)-norm indicates that the generated FDI attacks try to cause major disruption in the power bus system and are thus more easily detected than the subtle attacks generated with a lower \({l}_{2}\)-norm. For this analysis, all models except the attention-based model were considered, because the latter does not provide a gain in performance proportional to the computational cost required to train it, as is evident from the trade-off analysis in Sect. ‘‘Resource-performance trade-off analysis’’.
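To make the role of the \({l}_{2}\)-norm concrete, the sketch below rescales one perturbation vector to several target norms. It is not the procedure used to generate the dataset variants. The function names, the random base vector, and its length of 8 are hypothetical and chosen only to show that a larger \({l}_{2}\)-norm means a larger overall deviation in the injected values.

```python
import math
import random
from typing import Dict, List


def l2_norm(vector: List[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def scale_to_l2_norm(vector: List[float], target_norm: float) -> List[float]:
    """Rescale a non-zero vector so that its l2-norm equals target_norm."""
    current = l2_norm(vector)
    if current == 0.0:
        raise ValueError("cannot scale a zero vector")
    return [x * target_norm / current for x in vector]


# Hypothetical perturbation direction with a fixed seed for repeatability.
rng = random.Random(0)
base_attack = [rng.gauss(0.0, 1.0) for _ in range(8)]

# Five variants of the same direction with l2-norms 1 to 5.
scaled_attacks: Dict[int, List[float]] = {
    norm: scale_to_l2_norm(base_attack, norm) for norm in (1, 2, 3, 4, 5)
}

for norm, attack in scaled_attacks.items():
    print(norm, round(l2_norm(attack), 6), round(max(abs(x) for x in attack), 4))
```

The largest single deviation grows in step with the norm, which is consistent with attacks of higher \({l}_{2}\)-norm being easier to detect.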
For the IEEE-14 power system, increasing the \({l}_{2}\)-norm from 1 to 5 increased RACC by 11.98%, 6.48%, 5.52%, and 6.24% for MLP, CNN, LSTM, and CNN-LSTM, respectively. The effect of the \({l}_{2}\)-norm on the performance of these models is shown in Fig. 12. The performance of the models increased with the severity of the attack, i.e., the \({l}_{2}\)-norm.
For the IEEE-118 power system, increasing the \({l}_{2}\)-norm from 1 to 5 increased RACC by 18.09%, 15.35%, 15.49%, and 16.31% for MLP, CNN, LSTM, and CNN-LSTM, respectively. This change is significantly higher for the IEEE-118 power system than for the IEEE-14 power system. For the largest \({l}_{2}\)-norm, the highest RACC, achieved by CNN, was 97.36%, which was only 2.26% lower than the highest accuracy achieved for the IEEE-14 power system. The overall trend remained the same as for the IEEE-14 power system: RACC increased with the \({l}_{2}\)-norm. The effect of the \({l}_{2}\)-norm on the performance of these models is shown in Fig. 13.
Discussion
Overall, the high performance of the LSTM and attention-based LSTM models comes at the cost of increased training time and computational resources. On the other hand, simpler models like MLP are not suitable because of their low accuracy. CNN achieved moderately lower accuracy but did so in one-fourth of the time taken by the attention-based LSTM model. Moreover, the ROC and precision-recall curves demonstrated the balanced performance of the CNN model in classifying the location of the attack, owing to its high discriminative power, as reflected in its perfect AUC scores. MLP achieved similar AUC scores but failed to attain high accuracy. This comparison provides decision-makers with a clear understanding of the trade-offs, enabling them to determine which model is most suitable for their specific application.
Despite the reliable performance of deep learning models, they often exhibit low interpretability. It can be challenging to determine which features or factors significantly influence the model’s results. However, visualization techniques, such as feature maps, can be employed to gain insights into how different features impact the model’s decisions.
Conclusion
Extensive work has recently been conducted on detecting FDIA using deep learning. This research therefore explores recent developments in this field and performs an extensive comparison of the most common high-performing models for detecting FDIA at the granularity of each node in the IEEE-14 and IEEE-118 bus systems. The CNN, LSTM, CNN-LSTM, and attention-based models were analysed and compared against MLP as the baseline model. Various evaluation criteria, namely RACC, F1 score, computational time, memory space required, PR plot, and ROC plot, were used to ensure that each model's performance was measured fairly in all aspects and to verify its robustness.
According to our results, the attention-based bidirectional model performs well for the small IEEE-14 bus system, while LSTM/CNN works best for the larger IEEE-118 bus system. CNN, on the other hand, performs well for both bus systems and takes less training time. For the intra-algorithm comparison, the configuration of each model was varied over 9 combinations of the number of layers and neurons in each layer to select an optimal configuration depending on the application, the error tolerance requirement(s), and the computational resources available. The results suggest that the best arrangement had fewer layers and a greater number of neurons in each layer. In the case of the larger IEEE-118 bus system and for systems with resource constraints, it is recommended to use a single-layer CNN model with 64 neurons. The latter configuration is more suitable for deployment on systems with higher computational and memory capacity. Finally, the effect of the \({l}_{2}\)-norm, which indicates the severity of the FDI attack, was quantified for five different variants of the dataset. The highest jump in performance was seen for the MLP model. The RACC for both the IEEE-14 and IEEE-118 bus systems showed a significant increase.
In future work, complex models can be explored such as reinforcement learning and temporal modelling. Moreover, given the criticality and sensitivity of this application, real-time test cases and data should be considered in the benchmarking.
Data availability
No datasets were generated or analysed during the current study.
References
Bebis G, Georgiopoulos M (1994) Feed-forward neural networks. IEEE Potentials 13(4):27–31. https://doi.org/10.1109/45.329294
Bi S, Zhang YJ (2014) Using covert topological information for defense against malicious attacks on DC state estimation. IEEE J Sel Areas Commun 32(7):1471–1485. https://doi.org/10.1109/JSAC.2014.2332051
Bitirgen K, Filik ÜB (2023) A hybrid deep learning model for discrimination of physical disturbance and cyber-attack detection in smart grid. Int J Crit Infrastruct Prot 40:100582. https://doi.org/10.1016/J.IJCIP.2022.100582
Brown MA, Zhou S (2019) Smart-grid policies in advances in energy systems. John Wiley & Sons, Ltd, Hoboken
Chauhan NK, Singh K (2018) A review on conventional machine learning vs deep learning. 2018 Int Conf Comput Power and Commun Technol (GUCON). https://doi.org/10.1109/GUCON.2018.8675097
Chopade P, Bikdash M (2016) New centrality measures for assessing smart grid vulnerabilities and predicting brownouts and blackouts. Int J Crit Infrastruct Prot 12:29–45. https://doi.org/10.1016/J.IJCIP.2015.12.001
Cui L, Qu Y, Gao L, Xie G, Yu S (2020) Detecting false data attacks using machine learning techniques in smart grid: a survey. J Netw Comput Appl 170:102808. https://doi.org/10.1016/j.jnca.2020.102808
Ding D, Han Q-L, Xiang Y, Ge X, Zhang X-M (2018) A survey on security control and attack detection for industrial cyber-physical systems. Neurocomputing 275:1674–1683. https://doi.org/10.1016/j.neucom.2017.10.009
Esmalifalak M, Liu L, Nguyen N, Zheng R, Han Z (2017) Detecting stealthy false data injection using machine learning in smart grid. IEEE Syst J 11(3):1644–1652. https://doi.org/10.1109/JSYST.2014.2341597
Feng H, Han Y, Li K, Si F, Zhao Q (2024) Locational detection of the false data injection attacks via semi-supervised multi-label adversarial network. Int J Electr Power Energy Syst 155:109682. https://doi.org/10.1016/j.ijepes.2023.109682
GitHub—wsyCUHK/WSYCUHK_FDIA: Locational detection of false data injection attack in smart grid: a multi-label classification approach. Accessed. 30 Jan. 2023. https://github.com/wsyCUHK/WSYCUHK_FDIA
He Y, Mendis GJ, Wei J (2017) Real-time detection of false data injection attacks in smart grid: a deep learning-based intelligent mechanism. IEEE Trans Smart Grid 8(5):2505–2516. https://doi.org/10.1109/TSG.2017.2703842
Ibraheem R, Eddin ME, Massaoudi M, Abu-Rub H (2024) Enhancing locational FDIA detection in smart grids a hyperparameter optimization analysis in 2024. 4th Int Conf Smart Grid Renew Energy (SGRE). https://doi.org/10.1109/SGRE59715.2024.10428762
Kallitsis MG, Bhattacharya S, Stoev S, Michailidis G (2016) Adaptive statistical detection of false data injection attacks in smart grids in. IEEE Global Conf Signal Inform Process (GlobalSIP) 2016:826–830. https://doi.org/10.1109/GlobalSIP.2016.7905958
Kundu A, Sahu A, Serpedin E, Davis K (2020) A3D: Attention-based auto-encoder anomaly detector for false data injection attacks. Electric Power Syst Res 189:106795. https://doi.org/10.1016/j.epsr.2020.106795
Kurt MN, Yılmaz Y, Wang X (2019) Real-time detection of hybrid and stealthy cyber-attacks in smart grid. IEEE Trans Inf Forensics Secur 14(2):498–513. https://doi.org/10.1109/TIFS.2018.2854745
Li Y, Wei X, Li Y, Dong Z, Shahidehpour M (2022) Detection of false data injection attacks in smart grid: a secure federated deep learning approach. IEEE Trans Smart Grid 13(6):4862–4872. https://doi.org/10.1109/TSG.2022.3204796
Lukicheva I, Pozo D, Kulikov A (2018) cyberattack detection in intelligent grids using non-linear filtering in 2018. IEEE PES Innov Smart Grid Technol Conf Eur (ISGT-Europe). https://doi.org/10.1109/ISGTEurope.2018.8571457
Majidi SH, Hadayeghparast S, Karimipour H (2022) FDI attack detection using extra trees algorithm and deep learning algorithm-autoencoder in smart grid. Int J Crit Infrastruct Prot 37:100508. https://doi.org/10.1016/J.IJCIP.2022.100508
Mukherjee D (2023) Detection of data-driven blind cyber-attacks on smart grid: a deep learning approach. Sustain Cities Soc 92:104475. https://doi.org/10.1016/j.scs.2023.104475
Mukherjee D, Chakraborty S, Abdelaziz AY, El-Shahat A (2022) Deep learning-based identification of false data injection attacks on modern smart grids. Energy Rep 8:919–930. https://doi.org/10.1016/J.EGYR.2022.10.270
Mukherjee D, Chakraborty S, Ghosh S (2022) Deep learning-based multilabel classification for locational detection of false data injection attack in smart grids. Electr Eng 104(1):259–282. https://doi.org/10.1007/s00202-021-01278-6
Musleh AS, Chen G, Dong ZY (2020) A survey on the detection algorithms for false data injection attacks in smart grids. IEEE Trans Smart Grid 11(3):2218–2234. https://doi.org/10.1109/TSG.2019.2949998
Niu X, Li J, Sun J, Tomsovic K (2016) Dynamic detection of false data injection attack in smart grid using deep learning. IEEE Power Energy Soc Instit Electr Electron Eng 1:1–6
Niu X, Li J, Sun J, Tomsovic K (2019a) “Dynamic detection of false data injection attack in smart grid using deep learning”, in. IEEE Power Energy Soc Innov Smart Grid Technol Conf (ISGT) 2019:1–6. https://doi.org/10.1109/ISGT.2019.8791598
Shi W, Wang Y, Jin Q, Ma J (2018) PDL: an efficient prediction-based false data injection attack detection and location in smart grid. Ann Computer Softw Appl (COMPSAC). https://doi.org/10.1109/COMPSAC.2018.10317
Teixeira A, Shames I, Sandberg H, Johansson KH (2015) A secure control framework for resource-limited adversaries. Automatica 51:135–148. https://doi.org/10.1016/j.automatica.2014.10.067
Wang S, Bi S, Zhang Y-JA (2020) Locational detection of the false data injection attack in a smart grid: a multilabel classification approach. IEEE Internet Things J 7(9):8218–8227. https://doi.org/10.1109/JIOT.2020.2983911
Xu B, Guo F, Wen C, Deng R, Zhang W-A (2014) Detecting false data injection attacks in smart grids with modeling errors a deep transfer learning based approach. arXiv preprint. https://doi.org/10.48550/ARXIV.2104.06307
Acknowledgements
The authors thank the OpenUAE Research and Development Group for their support.
Funding
Not applicable.
Author information
Contributions
QN and MA conceptualized the study and acquired resources and funds. MAA conducted the main experiments and evaluation tests and wrote the original draft. TI and RB edited and reviewed the manuscript. BA and YB also worked on conducting experiments. OA validated the results and reviewed the manuscript along with QN and MA. All authors read and approved the final manuscript.
Ethics declarations
Ethical approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Nasir, Q., Abu Talib, M., Arshad, M.A. et al. Comparison of deep learning algorithms for site detection of false data injection attacks in smart grids. Energy Inform 7, 71 (2024). https://doi.org/10.1186/s42162-024-00381-9
DOI: https://doi.org/10.1186/s42162-024-00381-9