Comparison of deep learning algorithms for site detection of false data injection attacks in smart grids
Energy Informatics volume 7, Article number: 71 (2024)
Abstract
False Data Injection Attacks (FDIA) pose a significant threat to the stability of smart grids. Traditional Bad Data Detection (BDD) algorithms, deployed to remove low-quality data, can easily be bypassed by these attacks, which require minimal knowledge about the parameters of the power bus systems. This makes it essential to develop defence approaches that are generic and scalable to all types of power systems. Deep learning algorithms provide state-of-the-art detection for FDIA while requiring no knowledge about system parameters. However, very few works in the literature evaluate these models for FDIA detection at the level of an individual node in the power system. In this paper, we compare several recent deep learning-based models that have shown high performance and accuracy in detecting the exact location of the attacked node: convolutional neural networks (CNN), Long Short-Term Memory (LSTM), attention-based bidirectional LSTM, and hybrid models. We then compare their performance with a baseline multi-layer perceptron (MLP). All the models are evaluated on IEEE14 and IEEE118 bus systems in terms of row accuracy (RACC), computational time, and memory space required for training the deep learning model. Each model was further investigated through a manual grid search to determine the optimal architecture of the deep learning model, including the number of layers and the number of neurons in each layer. Based on the results, the CNN model exhibited consistently high performance with a very short training time. LSTM achieved the second-highest accuracy; however, it required a higher training time on average. The attention-based LSTM model achieved a high accuracy of 94.53% during hyperparameter tuning, while the CNN model achieved a moderately lower accuracy with only one-fourth of the training time. Finally, the performance of each model was quantified on different variants of the dataset, which varied in their \({l}_{2}\)-norm.
Based on the results, LSTM and CNN obtained the highest accuracy, followed by CNN-LSTM and lastly MLP.
Introduction
The development of smart grids has enabled safe and efficient power transfer from generators to consumers. By integrating information and communication technologies (ICTs) such as sensors and Internet of Things (IoT) devices, two-way communication between the grid and consumers is facilitated, making consumers active participants in the power consumption process. This enhanced control empowers consumers, while the electric grid gains increased flexibility to adapt to power requirements through a constant data stream. With the transition to renewable energy, smart grids have become crucial due to their ability to reduce the risk of power outages, given the intermittent nature of renewable energy sources.
Various incentive and pricing-based mechanisms exist to balance demand with the available power supply. Smart grids allow for more granular and real-time control over pricing in an open power market, allowing for effective management of peak power demand and reducing pricing volatility over time. In short, smart grids equip cities to utilize energy sources with a high proportion of green energy while effectively managing consumer demand in real time.
Moreover, smart grids exhibit superior reliability and resilience to blackouts compared to conventional grids. In the latter case, when a transmission line fails, the load is shifted to other lines, potentially causing a cascading effect of transmission line failures and ultimately crippling the entire grid (Chopade and Bikdash Mar. 2016). In contrast, smart grids can detect and isolate faults, preventing them from affecting the entire power grid. The isolated sections of the grid can then be diagnosed easily because of the two-way communication enabled by the ICT infrastructure. Additionally, the distributed nature of smart grids allows customers to become electricity producers through renewable energy sources and net metering (Brown and Zhou 2019). In such cases, customers are referred to as “prosumers” rather than consumers within the smart grid framework. Figure 1 illustrates the differences between conventional and smart grids.
Due to the benefits mentioned above, substantial investments have been made in the development of smart grid systems. However, the development of this technology is still in its early stages, especially when it comes to security, making it vulnerable to various cyberattacks. One of the most critical and harmful cyberattacks is FDIA. To counteract these attacks, several methods have been deployed, one of which is BDD.
BDD uses state estimation (SE), which removes duplicate data and incorrect measurements, and state variables (SVs), such as voltage magnitude and phase angle, to detect anomalies at the distribution end (e.g., electricity theft).
SE is performed at each meter/node to obtain the noise-free part of the measurements, i.e., the SVs. These SVs are transmitted to the control room for appropriate action, such as adjusting the voltage/angle for the general area. Mathematically, for the n-dimensional measurement vector \(z={\left({z}_{1},{z}_{2},\dots ,{z}_{n}\right)}^{T}\) and the m-dimensional system state \(x={\left({x}_{1},{x}_{2},\dots ,{x}_{m}\right)}^{T}\), the relationship is given in Eq. (1).
\[z = Hx + e \tag{1}\]
where \(e\) represents the noise and \(H\) is a matrix (for DC state estimation) of partial derivatives unique to each power system. The framework is adapted from Wang et al. (2020). A straightforward approach to detecting anomalous values via state estimation is to compute the residual \(r\) as the \({l}_{2}\)-norm of the difference between \(z\) and \(H\widehat{x}\), where \(\widehat{x}\) is the estimated state, and compare it against a predefined threshold \(t\), as shown in Eq. (2).
\[r = {\Vert z - H\widehat{x}\Vert }_{2} > t \tag{2}\]
However, BDD is vulnerable to manipulation, exposing smart grids to cyberattacks. The purpose of FDIA is to manipulate the state estimation while still satisfying the BDD criterion. Mathematically, the attacked measurement vector \({z}_{a}\) can be expressed by Eq. (3), where \(a\) is the attack vector.
\[{z}_{a} = z + a \tag{3}\]
The resulting state estimate can be described as in Eq. (4), where \(\widehat{x}\) is the state estimate before the attack, \(c\) is a nonzero vector of deviations added to the normal states, and the attack vector is \(a=Hc\):
\[{\widehat{x}}_{a} = \widehat{x} + c \tag{4}\]
BDD can detect a random/typical change in the value of \(z\). To bypass it, the attacker needs an attack vector that is equal to \(Hc\). This works by ensuring that the \({l}_{2}\)-norm of the residual, as given in Eq. (2), remains the same. Mathematically, this is described in Eq. (5), obtained by substituting Eqs. (3) and (4).
\[{\Vert {z}_{a} - H{\widehat{x}}_{a}\Vert }_{2} = {\Vert z + a - H\left(\widehat{x} + c\right)\Vert }_{2} = {\Vert z - H\widehat{x} + \left(a - Hc\right)\Vert }_{2} = {\Vert z - H\widehat{x}\Vert }_{2} \tag{5}\]
If \(a\) is generated so that it fulfils Eq. (5), the new compromised state estimate \({\widehat{x}}_{a}\) is considered correct by BDD; hence, the attack takes place and goes unnoticed.
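The effect described above can be checked numerically. The following is a minimal sketch, assuming a small random measurement matrix `H` (a toy system, not one of the IEEE test cases), a least-squares state estimator, and hypothetical values for the injected deviation `c`. It compares the BDD residual for clean data, for a structured attack \(a=Hc\), and for unstructured bad data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy system: 6 measurements, 3 states (not an IEEE test case).
H = rng.normal(size=(6, 3))
x_true = rng.normal(size=3)
z = H @ x_true + 0.01 * rng.normal(size=6)  # measurement model z = Hx + e


def estimate_state(H, z):
    """Least-squares state estimate: the x that minimizes ||z - Hx||_2."""
    x_hat, *_ = np.linalg.lstsq(H, z, rcond=None)
    return x_hat


def residual(H, z):
    """l2-norm of the measurement residual that BDD compares to a threshold."""
    return np.linalg.norm(z - H @ estimate_state(H, z))


c = np.array([0.5, -0.3, 0.2])   # injected state deviation (assumed values)
a_stealth = H @ c                # structured attack vector a = Hc
a_random = rng.normal(size=6)    # unstructured bad data, for contrast

r_clean = residual(H, z)
r_stealth = residual(H, z + a_stealth)
r_random = residual(H, z + a_random)

print(f"clean: {r_clean:.4f}  stealth: {r_stealth:.4f}  random: {r_random:.4f}")
```

In this sketch the stealth residual matches the clean residual up to floating-point error while the state estimate shifts by exactly `c`, whereas the random perturbation inflates the residual and would be caught by a threshold test.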
With the advancement of AI and deep learning techniques, data-driven algorithms for FDIA detection have recently become very popular. Many research studies have been published focusing on deep learning techniques, particularly due to the availability of high computing power. Based on the literature review presented in Sect. "Literature Review", several models have recorded high accuracy and demonstrated good performance. However, some of these models require significant resources. Therefore, decision-makers need a comprehensive comparison of the commonly high-performing models to understand their generalization, robustness, and computational requirements. Accordingly, this work contributes to the development of secure smart grids that are resilient to FDIA in the following ways:

Identify robust deep learning models and their architectures to detect the attack at every node/site in the smart grid (using row accuracy).

Quantify the tradeoff between computational requirements and performance of different deep learning algorithms for FDIA detection—at the granularity of each node in the smart grid.

Benchmark the performance of deep learning algorithms under different scenarios of attack severity (indicated via the \({l}_{2}\)-norm).
The paper is organised as follows. Sect. "Literature Review" briefly describes previous work in the field of FDIA detection. The background and methodology are discussed in Sect. "Background and Methodology". Sect. "Experiments and Results Discussion" explains the results obtained. Finally, the paper is concluded in Sect. "Conclusion".
Literature review
Various research works have taken different approaches to classifying the types of attacks in smart grids. For instance, Musleh et al. (2020) based their classification on the attack delivery method, i.e., cyber-based, network-based, communication-based, and physical-based attacks. On the other hand, Cui et al. (2020) took a more holistic approach to account for all types of false data attacks, including FDIA as well as replay attacks (Ding et al. 2018) and zero dynamic attacks (Teixeira et al. 2015). From the perspective of the type of algorithm used for FDIA detection, a spectrum of techniques/approaches exists in the literature. Broadly, they can be divided into two categories: model-based and data-driven detection algorithms (Musleh et al. 2020). A comparison between the two is given in Appendix, Table 3.
Model-based algorithms for FDIA detection
These models do not require prior training but only rely on the processed data and system configuration/parameters (Musleh et al. 2020). They can be further divided, based on whether the changing nature of the system configuration is accounted for in the modelling, into quasi-static and dynamic models. The difference between the two is that in dynamic models, the system’s state depends not only on the present measurements/data but also on the previous states. Mathematically, this is described in Eq. (6).
\[{x}_{t} = f\left({x}_{t-1}\right) + {v}_{t}, \qquad {z}_{t} = h\left({x}_{t}\right) + {e}_{t} \tag{6}\]
where \(h\) is the nonlinear power flow equation and \(f\left(\cdot \right)\) relates the value of the state variable \(x\) at the previous time \(t-1\) to the current time \(t\); \(e\) and \(v\) are the error terms. Vector Autoregression (VAR) (Shi et al. 2018) and Kalman filters (KF) (Kurt et al. 2019) are two such examples. For quasi-static models, examples include Median Filtering (MF) (Lukicheva et al. 2018) and the Kriging Estimator (KE) (Kallitsis et al. 2016). This work is more concerned with state-of-the-art deep learning approaches for FDIA detection, which fall mainly in the category of data-driven algorithms. Section "Data-driven algorithms for FDIA detection" describes those algorithms in detail.
Data-driven algorithms for FDIA detection
Compared to the previous approach, this category of algorithms does not depend on the manual setting of model parameters. Instead, the algorithms learn the most suitable parameters from the dataset specific to the problem. All machine learning algorithms, and deep learning algorithms in particular, fall into this category. This dependency on historical data leads to higher memory and computational cost. However, the advantage of these models is that they are more scalable and give state-of-the-art performance, which is essential in operationally critical applications such as smart grids. In the current decade, deep learning algorithms have surpassed the performance of most traditional algorithms (Chauhan and Singh 2018). Therefore, while exploring the literature, we focus mainly on deep learning approaches. An overview of the different methods proposed in the literature is given in Table 1.
Niu et al. (2019a) proposed a deep learning-based framework to detect FDI attacks using a hybrid RNN-CNN model. It used a dataset generated from the IEEE39 power system and was fed time-series meter measurements and network traffic features. The granularity of predictions was the entire power system, i.e., whether the whole system has been compromised or not. The detection accuracy was reported at 90% for the most aggressive attack. However, the accuracy decreases significantly when the aggressiveness, measured by the number of compromised buses, decreases.
Wang et al. (2020) proposed a deep learning-based locational detection (DLLD) network to detect FDI attacks for the IEEE14 and IEEE118 power systems. The dataset was generated following (Bi and Zhang 2014). It utilized the CNN model and documented results for a varying number of layers in the network. In contrast to Niu et al. (2019a), the granularity of predictions was set to each meter, i.e., it could identify the exact location of the compromised node(s). In terms of row accuracy (RACC), defined in the section on the evaluation metrics, DLLD gave 97% and 93% for the IEEE14 and IEEE118 power systems, respectively. Mukherjee et al. (2022) extended this work on locational detection for the IEEE118 bus system. They experimented with the CNN model and hybrid models, i.e., CNN-LSTM and CNN-GRU. The reported RACC was 93% for the IEEE118 bus system, the same as reported by Wang et al. (2020). The development of hybrid models for FDIA detection was also a contribution of this work, but the reported RACC for these models was only 76%, which is much lower than the performance of the already established CNN. Similarly, this work documented results for a varying number of layers in the network, but not for a varying number of neurons in each layer.
Esmalifalak et al. (2017) used an SVM model (which falls in the category of data-driven algorithms) and an anomaly detection algorithm (which is a model-based algorithm) to detect stealth FDI attacks in the IEEE118 bus system. The distributed SVM was based on the alternating direction method of multipliers. For lower computational complexity, the anomaly detection algorithm used the deviation in measurements to detect the attacks instead of learning from historical data. The granularity of prediction was set to the entire power system. Detection accuracies of 82% and 78% were reported for the SVM and anomaly detection algorithms, respectively.
Recent work in the literature has expanded this application into various domains of deep learning, such as unsupervised learning and transfer learning. An unsupervised learning technique called Attention-based auto-encoder anomaly detector (A3D) was proposed by Kundu et al. (2020). It used real-world power flow data from the Texas grid, and patterns of its temporal evolution were simulated on the IEEE14 bus system. The granularity of predictions was set to each individual meter. An \({F}_{1}\)-score of 94% was reported. Being an unsupervised learning technique, this approach needed neither a variety of data nor labels in the training data. However, this work does not extend to larger bus systems, and the review of the literature also suggests that the development of unsupervised techniques for FDIA detection is still in the initial stages of research. Xu et al. (2104) utilized transfer learning to account for the dynamic nature of real-world transmission line parameters. The simulated data on the IEEE14 and IEEE118 bus systems was considered the source domain, and the power system with real-world variation in transmission line parameters was considered the target domain. Prediction granularity was set to each meter, as in (Wang et al. 2020; Mukherjee et al. 2022; Kundu et al. 2020). A deep neural network (DNN) used both the simulated and real-world data for pre-training, while only the latter was used for fine-tuning. It gave the highest detection accuracy of 99.99%, but other metrics such as the \({F}_{1}\)-score or RACC were not reported.
The use of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network was proposed by Niu et al. (2019b) and Mukherjee (2023) and tested on the IEEE39 bus system. They used a combined attack detection mechanism consisting of a static detector, namely a state estimator, and a deep learning-based scheme. The proposed scheme had a detection accuracy above 90% for high \(\frac{k}{n}\), with \(k\) being the number of injected measurements and \(n\) the total number of measurements.
He et al. (Sep. 2017) proposed a Conditional Deep Belief Network (CDBN) (Feng et al. 2024) that can detect, from real-time measurements, a type of FDIA that can bypass the SVE mechanism. Simulations were conducted to evaluate the performance of the proposed scheme. Compared with ANN-based and SVM-based models, the scheme is more resilient to different numbers of attacked measurements, different detection thresholds of SVE, and some levels of environmental noise.
Mukherjee et al. (Nov. 2022) proposed a real-time FDIA identification system validated on the IEEE 14-bus system. This method utilizes the error covariance matrix. The architecture uses a nonlinear LSTM structure comprising 8 layers, two of which are the input and output layers. The results of this paper showed an accuracy of 95% in detecting the presence of the attack.
Bitirgen and Filik (Mar. 2023) used CNN-LSTM with particle swarm optimization on a phasor measurement unit dataset. This dataset was obtained from an open-source simulated power system. This system proved to be the most accurate, with 98.94%, among the approaches compared in that work.
An extremely randomized trees algorithm was proposed by Majidi et al. (Jul. 2022). Five hidden layers were used, with the number of neurons varying depending on the bus system used. The dataset was simulated on the IEEE14, IEEE30, IEEE57, and IEEE118 power systems. Due to the scarcity of real data, datasets were generated randomly within a 30% range. A stacked autoencoder was used along with the extremely randomized trees classifier, as this decreases the computational complexity. The results showed 98%, 94%, 99%, and 99% for the IEEE14, IEEE30, IEEE57, and IEEE118 power systems, respectively.
Li et al. (Nov. 2022) proposed a deep learning approach that combines the Paillier cryptosystem with federated learning. The power systems used to test this architecture are IEEE14 and IEEE118. All nodes were able to jointly train a Transformer-based detection model while maintaining the privacy of all the local training data. It was concluded that federated learning proved to be more secure as well as more accurate than the CNN and LSTM algorithms. The proposed method showed a precision of 99% for the IEEE14 power system, while the CNN and LSTM achieved 98% and 0.99%, respectively. This method also showed a precision of 82% for IEEE118, compared with 60% and 61% for CNN and LSTM, respectively.
Based on the reviewed studies, CNNs, LSTMs, CNN-LSTM hybrids, and attention-based models have shown high performance in FDIA detection across various power systems. Hence, we explore them extensively in this study and compare them to an MLP model as a baseline to examine their effectiveness through a fair comparison.
Background and methodology
Data-driven algorithms for FDIA detection
The FDI detection problem was framed as a multi-label classification problem, inspired by Wang et al. (2020). The power system data was taken at the granularity of each node, and the training data mapped the ground truth and status of each node, i.e., whether it was compromised or not, one-to-one. The output format was a vector of binary values whose length equals the number of measurements in the indicated bus system.
The task of the deep learning model was to learn the relationship between these two values using irregularities in the value of the directly mapped node and the connection of that node to its nearest neighbours. This framework depends only on the measurement readings, without the need for any previous state or system parameters. As explained in the next section, this framework was tested for several deep learning models. For any model, the presence of an attack at the \({i}^{th}\) node, denoted \({\widehat{y}}_{i}\), could be calculated using Eq. (7)
\[{\widehat{y}}_{i} = \begin{cases} 1, & {O}_{i} \ge t \\ 0, & \text{otherwise} \end{cases} \tag{7}\]
where \({O}_{i}\) indicates the output of the model at the \({i}^{th}\) node. In our case, a threshold value (\(t\)) of 0.5 was used.
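As an illustration of this thresholding step, the sketch below converts per-node model outputs into a binary attack-location vector. The function name `locate_attacks`, the sample outputs, and the choice to treat an output exactly equal to the threshold as an attack are assumptions made for the example.

```python
from typing import List, Sequence


def locate_attacks(outputs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Map per-node model outputs to binary labels (1 = node flagged as attacked)."""
    # Assumption: an output equal to the threshold counts as an attack.
    return [1 if o >= threshold else 0 for o in outputs]


# Hypothetical sigmoid outputs for a toy system with 5 measurements.
outputs = [0.02, 0.91, 0.47, 0.50, 0.73]
labels = locate_attacks(outputs)
attacked_nodes = [i for i, flag in enumerate(labels) if flag]

print(labels)          # binary vector, one entry per measurement
print(attacked_nodes)  # indices of the flagged nodes
```

Because every node gets its own label, a single forward pass yields the locations of all compromised nodes rather than a single system-level alarm.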
Deep learning models used
The theoretical explanation of selected deep learning models is provided here.
Multilayer perceptron (MLP)
MLP is a feedforward artificial neural network (Bebis and Georgiopoulos 1994) that consists of multiple layers of neurons: an input layer, an output layer, and one or more hidden layers. In an MLP, each neuron is connected to all the neurons in the following layer. These models can classify nonlinear data through their structure of layers and the nonlinear activations after every neuron. Our models consist of two dense layers with 180 neurons and include the nonlinear activations ReLU and sigmoid.
Convolutional neural network (CNN)
In contrast to MLP, each neuron in a CNN connects to only a few spatially adjacent neurons. During the forward pass, a matrix of shared weights, known as a kernel/filter, slides over the input data (or over the preceding layer), and during the backward pass, these kernels are adjusted. Compared to traditional approaches, the kernels are learned automatically instead of being handcrafted. The CNN model architecture can regularize the neural network and prevent it from overfitting, and it uses memory efficiently because the weights/kernels are shared by multiple neurons. The architecture summary of the model used is shown in Appendix, Table 4. It comprises one-dimensional convolution and ReLU activation layers. Dropout layers were included as well to reduce overfitting.
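To make the weight-sharing idea concrete, the following is a minimal sketch of a one-dimensional convolution followed by ReLU, written in plain Python. The kernel values and the input vector are hypothetical and hand-picked for illustration; in a trained CNN the kernel would be learned from data.

```python
def conv1d_valid(x, kernel, bias=0.0):
    """Slide one shared kernel over x (no padding) and apply ReLU to each output."""
    k = len(kernel)
    out = []
    for i in range(len(x) - k + 1):
        window = x[i:i + k]
        s = sum(w * v for w, v in zip(kernel, window)) + bias
        out.append(max(s, 0.0))  # ReLU activation
    return out


# Hypothetical measurement vector with an abnormal spike at index 3.
x = [0.0, 0.1, 0.0, 2.0, 0.1, 0.0, 0.0]
# Hand-picked kernel that responds to a value differing from its two neighbours.
kernel = [-1.0, 2.0, -1.0]

features = conv1d_valid(x, kernel)
peak = features.index(max(features))  # window whose centre is x[peak + 1]
print(features, peak)
```

The same three weights are reused at every position, so the number of parameters does not grow with the number of measurements, and each output depends only on a node and its immediate neighbours in the indexing.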
Long short-term memory (LSTM)
LSTM belongs to the category of recurrent neural networks (RNN) (He et al. 2017). These networks are specifically designed to deal with sequential data such as text and time-series data. Their advantages include superior learning and a less severe vanishing gradient issue. In contrast to a plain RNN, an LSTM uses three gates to control the flow of information into and out of the main cell. This gives the cell the ability to remember information over an arbitrary length of time without running into the vanishing gradient problem. These gates are termed the forget gate, input gate, and output gate. An illustration of a single LSTM cell is given in Fig. 2.
The forget gate deletes/reduces irrelevant information from previous time steps. It takes as input the previous hidden state (\({h}_{t-1}\)) and the current input (\({x}_{t}\)). Its output ranges from 0 to 1, and the amount of information retained from the previous cell state \({C}_{t-1}\) is proportional to this value. Mathematically, this gate can be represented using Eq. (8)
\[{f}_{t} = \sigma \left({W}_{f}\cdot \left[{h}_{t-1},{x}_{t}\right] + {b}_{f}\right) \tag{8}\]
where \({W}_{f}\) and \({b}_{f}\) represent the weights and biases of the layer comprising the forget gate, and \(\sigma \) denotes the sigmoid activation. Secondly, the LSTM model can save new relevant information in the cell. Two components are required to make this work: a sigmoid layer similar to the one used in the forget gate (which selects the values that will be updated) and a \(\tanh\) layer (which produces new candidate values that could be saved in the cell state). Mathematically, these two can be represented by Eqs. (9) and (10).
\[{i}_{t} = \sigma \left({W}_{i}\cdot \left[{h}_{t-1},{x}_{t}\right] + {b}_{i}\right) \tag{9}\]
\[{CC}_{t} = \tanh\left({W}_{C}\cdot \left[{h}_{t-1},{x}_{t}\right] + {b}_{C}\right) \tag{10}\]
where \({CC}_{t}\) represents the candidate values that could be added to the cell state. With these components, the new cell state \({C}_{t}\) can be calculated using Eq. (11).
\[{C}_{t} = {f}_{t}\odot {C}_{t-1} + {i}_{t}\odot {CC}_{t} \tag{11}\]
Finally, to calculate the output of the LSTM cell at the current time step, the cell state \({C}_{t}\) (after applying \(\tanh\)) is multiplied by the output of the sigmoid output gate \({o}_{t}\), as given in Eq. (12).
\[{o}_{t} = \sigma \left({W}_{o}\cdot \left[{h}_{t-1},{x}_{t}\right] + {b}_{o}\right), \qquad {h}_{t} = {o}_{t}\odot \tanh\left({C}_{t}\right) \tag{12}\]
where \({h}_{t}\) represents the final output from the cell. Finally, to prevent overfitting, two dropout layers were incorporated in the model design. The architecture summary of the model used is shown in Appendix, Table 5.
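The gate computations can be sketched as a single time step in NumPy. This is an illustrative implementation of the standard LSTM cell, not the Keras layer used in the experiments; the dictionary layout of the weights, the layer sizes, and the random inputs are assumptions made for the example.

```python
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W[g] has shape (hidden, hidden + inputs) and b[g] has shape (hidden,)
    for each gate g in {"f", "i", "c", "o"}.
    """
    concat = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ concat + b["f"])    # forget gate
    i_t = sigmoid(W["i"] @ concat + b["i"])    # input gate
    cand = np.tanh(W["c"] @ concat + b["c"])   # candidate cell values
    c_t = f_t * c_prev + i_t * cand            # new cell state
    o_t = sigmoid(W["o"] @ concat + b["o"])    # output gate
    h_t = o_t * np.tanh(c_t)                   # cell output
    return h_t, c_t


rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3  # hypothetical sizes
W = {g: rng.normal(scale=0.5, size=(n_hidden, n_hidden + n_in)) for g in "fico"}
b = {g: np.zeros(n_hidden) for g in "fico"}

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):  # a sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, W, b)

print(h, c)
```

Setting the forget gate close to 1 and the input gate close to 0 leaves the cell state unchanged from one step to the next, which is the mechanism that lets the cell carry information across many time steps.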
Attention-based bidirectional LSTM
In our case, the attention-based model is an extension of the traditional LSTM model (RNN in general). By enabling the network to learn which parts of the input to attend to when producing the target values/sequences, it overcomes the constraint of the fixed-length representation in the traditional LSTM design. In other words, it focuses on the steps where the relevant information is concentrated (Bebis and Georgiopoulos 1994). This results in significant computational requirements, so the approach could be most useful for smaller bus systems such as the IEEE14 bus system. In the model design, dropout layers were included to reduce overfitting. The final architecture summary of the model is presented in Appendix, Table 6.
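The idea of weighting steps by relevance can be sketched with a generic attention pooling over a sequence of hidden states. This is a simplified dot-product scoring scheme chosen for illustration, not necessarily the attention layer used in the experiments; the scoring vector `w` and the random hidden states are assumptions.

```python
import numpy as np


def attention_pool(hidden_states, w):
    """Weight each step by a softmax over its score and return the weighted sum.

    hidden_states: array of shape (T, d), one row per step.
    w: scoring vector of shape (d,), learned in a real model.
    """
    scores = hidden_states @ w
    scores = scores - scores.max()                    # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the T steps
    context = weights @ hidden_states                 # weighted sum, shape (d,)
    return context, weights


rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(6, 4))  # hypothetical: 6 steps, 4 hidden units
w = rng.normal(size=4)

context, weights = attention_pool(hidden_states, w)
print(weights.round(3), context.round(3))
```

Steps with higher scores contribute more to the context vector, so the downstream classifier is not limited to the final hidden state alone. With an all-zero scoring vector the weights become uniform and the context reduces to a plain average.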
Hybrid CNN-LSTM model
This architecture choice is inspired by Mukherjee et al. (2022). First, a few convolutional layers extract features, which are then fed to a single dense layer. This dense layer is followed by the LSTM layer(s). Finally, the output from the LSTM layers is flattened and passed to a dense layer with sigmoid activation. The activation functions are ReLU and \(\tanh\) for the convolutional and LSTM layers, respectively. The exact configuration of both has been described previously. The model also includes dropout layers to mitigate overfitting, as shown in the architecture summary presented in Appendix, Table 7.
As the descriptions show, each model has its own specific strengths. CNNs focus on spatial patterns and are suitable for locational FDIA detection, with strong mitigation of overfitting. LSTMs focus on patterns and sequences in power systems, while attention-based LSTMs enhance this capability by attending to the most relevant steps. The hybrid CNN-LSTM model leverages the strengths of both models, covering broader data patterns. Finally, MLP serves as the baseline nonlinear model, providing a useful reference for this study.
Dataset
This work builds on the dataset generated by Wang et al. (2020) using the IEEE14 and IEEE118 bus power systems, which is publicly available (GitHub  wsyCUHK, WSYCUHK_FDIA2023). They simulated the data using MATPOWER with the topologies of the two systems, and the generated dataset mimics real-world scenarios with realistic load scenarios and several noise levels, as detailed in this section. The IEEE14 bus system consists of 20 transmission lines and 14 buses, with 19 measurements consisting of 11 flow measurements and 9 injected ones, represented by 19 independent features in the dataset. Similarly, the IEEE118 bus system has 180 transmission lines and 118 buses. It includes 110 flow measurements and 70 injected values, totalling 180 features in the IEEE118 dataset.
These nodes are indexed based on the network topology, i.e., nodes that are close to each other are indexed together and vice versa. This is because the readings at any node are affected by the neighbouring nodes. This relationship is captured well by the CNN, as explained later in the results section.
Five different variants of the dataset were generated for each power bus system, which varied in terms of the \({l}_{2}\)-norm of the injection data. Its value ranged from 1 to 5. A lower value of the \({l}_{2}\)-norm indicates that the attack is more subtle, i.e., the attacked nodes are only slightly compromised, which makes these attacks difficult to detect. On the other hand, a higher \({l}_{2}\)-norm suggests that the attacker is more aggressive and is relatively easy to detect. This dimension of the data indicates how well the solutions developed for FDIA detection will hold against attackers in different practical scenarios. In all the experiments, the \({l}_{2}\)-norm has been set to 1. The univariate distribution of the number of compromised nodes for the IEEE118 bus system is represented via a Gaussian kernel density estimate in Fig. 3. Further details on the generation process, along with the figure of the indexed bus systems, can be found in the work of Wang et al. (2020) and Bi and Zhang (2014).
The dataset consisted of 100,000 training instances, of which 30% were used for validation for each model. As the validation dataset is considerably large, it covers a wide range of samples and therefore gives a reliable estimate of how well each model generalizes. The test data, consisting of 10,000 instances, was kept separate from the training/validation data. Prior to modelling, the data was standardized by removing the mean and dividing by the standard deviation, so that each feature has zero mean and unit variance. This preprocessing step helps prevent the scale of the data from unduly influencing model performance.
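A minimal sketch of the split and standardization step is given below. It assumes that the mean and standard deviation are computed on the training portion only and then reused for the validation portion, which the text does not state explicitly; the synthetic data and helper names are hypothetical.

```python
import numpy as np


def fit_standardizer(train):
    """Compute the per-feature mean and standard deviation on the training split."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant features
    return mean, std


def standardize(data, mean, std):
    """Remove the mean and divide by the standard deviation, feature by feature."""
    return (data - mean) / std


rng = np.random.default_rng(0)
# Synthetic stand-in for the measurements: 1,000 samples with 19 features.
data = rng.normal(loc=5.0, scale=3.0, size=(1000, 19))

n_val = int(0.3 * len(data))              # 30% held out for validation
val, train = data[:n_val], data[n_val:]

mean, std = fit_standardizer(train)       # assumption: statistics from training data only
train_s = standardize(train, mean, std)
val_s = standardize(val, mean, std)

print(train_s.mean(axis=0).round(6)[:3], train_s.std(axis=0).round(6)[:3])
```

Reusing the training statistics for the validation and test data keeps information from those splits out of the preprocessing step.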
Evaluation metrics
To evaluate the performance of each trained model, two metrics were used. The first is a standard metric for multi-label classification, the \({F}_{1}\)-score, while the second is a custom metric called row accuracy (defined later). The \({F}_{1}\)-score is obtained by taking the harmonic mean of two individual measures called precision and recall. Precision is determined by dividing the number of correctly predicted positive instances by the total number of samples predicted as positive by the model. In a multi-label classification problem, the results from multiple labels must be averaged to produce a single number. For that purpose, micro-averaging is used. In this method, the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are summed over all classes before the metric is computed. Precision can be obtained using Eq. (14)
\[\text{Precision} = \frac{\sum_{c=1}^{\mathcal{C}}{TP}_{c}}{\sum_{c=1}^{\mathcal{C}}\left({TP}_{c} + {FP}_{c}\right)} \tag{14}\]
where \(\mathcal{C}\) represents the total number of nodes, i.e., 19 and 180 for IEEE14 and IEEE118, respectively. The second measure, recall, is determined by dividing the number of correctly predicted positive instances by the total number of actual positive instances. The same micro-averaging strategy was used for this metric, and the resulting Eq. (15) is given below.
\[\text{Recall} = \frac{\sum_{c=1}^{\mathcal{C}}{TP}_{c}}{\sum_{c=1}^{\mathcal{C}}\left({TP}_{c} + {FN}_{c}\right)} \tag{15}\]
Both metrics range from 0 (the worst score) to 1 (a perfect score). They can be combined into a single metric, the \({F}_{1}\)-score, by taking the harmonic mean of Eqs. (14) and (15), as given in Eq. (16)
\[{F}_{1} = \frac{2\cdot \text{Precision}\cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{16}\]
The drawback of this metric is that it gives the same importance to both precision and recall and, in our case, provides limited insight into the performance of the models. To deal with this issue, another metric that enforces a stricter performance criterion was used. For a single instance, a prediction by the model is only considered correct if all the locations are predicted correctly. This is usually referred to as the exact match ratio or row accuracy. For the whole dataset, it can be calculated using Eq. (17).
\[RACC = \frac{1}{n}\sum_{i=1}^{n} I\left({y}_{i} = {\widehat{y}}_{i}\right) \tag{17}\]
where \(I\) is an indicator function, and \(y\), \(\widehat{y}\), and \(n\) represent the ground truth, the predicted value, and the number of examples, respectively. Its output also ranges from 0 (indicating the worst performance) to 1 (indicating a perfect score). In comparison to the \({F}_{1}\)-score, which considers the partial correctness of each instance, \(RACC\) is much harsher, and an instance is only counted if all of its labels are predicted correctly.
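The difference between the two metrics can be seen on a small example. The sketch below computes the micro-averaged precision, recall, and \({F}_{1}\)-score together with row accuracy in plain Python; the label matrices are hypothetical, and the function names are chosen for the example.

```python
def micro_precision_recall_f1(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over all labels of all instances."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def row_accuracy(y_true, y_pred):
    """Fraction of instances whose entire label vector is predicted correctly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


# Hypothetical labels for 3 instances of a 4-node system.
y_true = [[0, 1, 0, 1], [0, 0, 0, 0], [1, 0, 0, 0]]
y_pred = [[0, 1, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0]]  # one attacked node missed

precision, recall, f1 = micro_precision_recall_f1(y_true, y_pred)
racc = row_accuracy(y_true, y_pred)
print(precision, recall, f1, racc)
```

A single missed node in the first instance lowers the micro-averaged \({F}_{1}\)-score only moderately, while row accuracy discards the whole instance, which is why RACC is the stricter of the two.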
Experiments and results discussion
System configuration
The experiments were conducted on a machine with a Tesla K80 (16 GB) GPU, 16 GB of system memory, 750 GB of disk space, and 2 vCPUs at 2.2 GHz. For model training, TensorFlow (version 1.15.2) and Keras (version 2.3.1) were used. For data preprocessing, Pandas (version 1.1.3) and NumPy (version 1.18.1) were used, and for generating visualizations, matplotlib (version 3.2.1) was used. All models were trained using a batch size of 100, the Adam optimizer, row accuracy as the metric, and a number of epochs that varied according to the specific test step, as described in the following subsections. Moreover, ReduceLROnPlateau was used as a callback to reduce the learning rate if the model did not improve. It improves the learning curve and helps the model converge faster.
Results discussion for IEEE14 bus system
In terms of performance, the models were judged primarily based on RACC. Two factors were considered to evaluate system constraints: computational and system memory requirements, i.e., time taken for training and number of trainable parameters of each model, respectively. Attentionbased bidirectional LSTM model had the highest value of RACC (94.2%) as compared to MLP (83.8%), CNN (92.9%), LSTM (93.7%), and Hybrid models (93.4%). Apart from MLP, in terms of RACC, almost all the other models gave a comparable performance. Figure 4 shows the learning curve of the bestperforming attentionbased and leastperforming MLP models.
The high performance of the attention-based model came at the cost of higher memory and computation requirements. It had \(6.2\times {10}^{5}\) trainable parameters, almost 10 times as many as the MLP. At a cost of only 1.28% in RACC, CNN used nearly six times fewer trainable parameters. The second most expensive model in terms of system memory was the hybrid model, with \(4.0\times {10}^{5}\) trainable parameters.
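To give a feel for why recurrent models carry more trainable parameters than a dense layer of the same width, the following sketch applies the standard parameter-count formulas for a fully connected layer and for an LSTM cell with a single bias vector per gate. The input width of 19 and the layer width of 128 are hypothetical values chosen for illustration; they are not the architectures evaluated in the paper.

```python
def dense_params(n_inputs, n_units):
    """Fully connected layer: one weight per input plus one bias, per unit."""
    return (n_inputs + 1) * n_units


def lstm_params(n_inputs, n_units):
    """LSTM cell: four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * n_units * (n_inputs + n_units + 1)


N_FEATURES = 19  # hypothetical input width (assumption)
UNITS = 128      # hypothetical layer width (assumption)

mlp_layer = dense_params(N_FEATURES, UNITS)
lstm_layer = lstm_params(N_FEATURES, UNITS)
bilstm_layer = 2 * lstm_layer  # a bidirectional wrapper holds two LSTM cells

print(mlp_layer, lstm_layer, bilstm_layer)
```

Under these assumptions, a single LSTM layer holds roughly 30 times as many parameters as a dense layer of the same width, and a bidirectional layer doubles that again.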
The memory requirement of each model, along with its training mechanism, correlated directly with the computational time required to train it. As expected, the attention-based model took the most time, almost 27 min, about seven times longer than the MLP model. CNN, on the other hand, took only about 5 min. The comparison between all the models in terms of RACC, time taken, and the number of trainable parameters is given in Fig. 5.
Figure 6 presents the receiver operating characteristic (ROC) curves of the MLP, CNN, CNN-LSTM, and LSTM models. All four methods achieved an AUC of 1, indicating excellent discriminatory power. However, the ROC curve for MLP bends slightly away from the top left corner compared with the other curves, which suggests a slightly higher false positive rate while still maintaining a high true positive rate. The curves for CNN, CNN-LSTM, and LSTM lie closer to the top left corner, indicating strong performance in distinguishing between positive and negative instances.
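For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the function name and the toy scores are illustrative assumptions, and the paper does not state which tool produced its curves.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy binary labels for one node and hypothetical classifier scores.
labels = [0, 0, 1, 1]
imperfect = roc_auc(labels, [0.1, 0.4, 0.35, 0.8])  # one positive ranked below a negative
perfect = roc_auc(labels, [0.1, 0.2, 0.7, 0.9])     # every positive ranked above every negative
print(imperfect, perfect)
```

An AUC of 1 therefore only says that the scores rank positives above negatives; it does not depend on the decision threshold used to turn scores into labels.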
Similar remarks apply to the precision-recall curves of the four models (MLP, CNN, CNN-LSTM, and LSTM), presented in Fig. 7. The MLP curve bends slightly away from the top right corner, indicating more false positives than the other models. The CNN curve also shows a bend, but it stays closer to the top right corner. As with the ROC curves, the CNN-LSTM and LSTM curves remain closest to the corner, indicating a strong ability to maintain high precision while minimizing false positives. All models achieve perfect or near-perfect average precision scores. Overall, the models demonstrate strong discriminatory ability, but MLP and CNN show slightly more false positives than CNN-LSTM and LSTM.
It can be concluded that CNN offered the best trade-off between computational/memory requirements and performance. Attention-based models are recommended for smaller bus systems, like IEEE14, only if system constraints are not an issue and performance is of utmost priority.
Results discussion for IEEE118 bus system
The same criteria were used for the IEEE118 bus system. The LSTM and CNN models had the highest RACC (81.8%), compared with MLP (75.2%), the attention-based model (63.5%), and the hybrid model (79.7%). In contrast to the IEEE14 bus system, the models' performance showed a larger spread, and not all of them performed comparably.
Figure 8 plots the learning curves of the best-performing LSTM model and the least-performing attention-based model. It is important to note that the latter model had the highest RACC on the test dataset for the IEEE14 bus system. This indicates that attention-based models do not scale well to higher-dimensional data.
Another difference, in terms of system memory requirements, was that the numbers of trainable parameters of the different models were comparable for the IEEE118 bus system. As shown in Fig. 9, the attention-based model had the highest number of trainable parameters, \(8.8\times {10}^{6}\), which was only about twice the number of parameters in CNN. The parameter counts of the remaining models lie between those of these two models. The second most expensive model in terms of system memory was LSTM, with \(4.3\times {10}^{5}\) trainable parameters.
On the other hand, the computational time required to train each model varied far more than the number of parameters. As expected, the attention-based model took the most time, almost 127 min, which was almost 12 times longer than the MLP model. Compared with the IEEE14 bus system, this model took almost five times longer to train.
The higher computational and memory requirements did not translate into higher performance for the IEEE118 bus system; on the contrary, the trend was the opposite. LSTM and CNN gave the same performance, but the former took almost four times as long to train. Hence, CNN is the recommended model for the IEEE118 bus system. The comparison between all the models in terms of RACC, time taken, and the number of trainable parameters is given in Fig. 9.
Based on the ROC curves presented in Fig. 10, the MLP and CNN models demonstrated exceptional discriminatory ability, achieving an AUC of 1 with high sensitivity and low false positive rates. The CNN-LSTM model also achieved an AUC of 1. The LSTM model exhibited lower discriminative power, with an AUC of 0.98. These results highlight the strong performance of these models in distinguishing between positive and negative attack instances.
The same behaviour appears in the precision-recall curves presented in Fig. 11, where the MLP, CNN, and CNN-LSTM models achieved an average precision (AP) score of 1. This result indicates that the models' positive predictions were accurate, without false positives, regardless of the threshold used to classify instances. However, the LSTM model had an AP of 0.94, suggesting a few false positive predictions compared with the other models.
MLP has nearly perfect precision-recall and ROC curves but low accuracy. A classification method can show excellent precision-recall and ROC curves while having low accuracy because of class imbalance and misclassification errors. The model performs well in identifying positive instances, as reflected in the precision-recall curve, and discriminates effectively between positive and negative classes, as indicated by the ROC curve. However, the dominance of the majority class and the misclassifications result in low overall accuracy.
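A complementary way to see how strong per-label metrics can coexist with low row accuracy is to note that a row is counted only when every label in it is correct. The sketch below assumes, purely for illustration, that each label is predicted correctly with the same probability and independently of the others; neither the independence assumption nor the label counts of 14 and 118 are taken from the paper.

```python
def expected_row_accuracy(per_label_accuracy, n_labels):
    """Expected share of fully correct rows if labels were independent (an assumption)."""
    if not 0.0 <= per_label_accuracy <= 1.0:
        raise ValueError("per_label_accuracy must lie in [0, 1]")
    return per_label_accuracy ** n_labels


PER_LABEL = 0.99  # hypothetical per-label accuracy

small_system = expected_row_accuracy(PER_LABEL, 14)   # hypothetical: 14 labels per row
large_system = expected_row_accuracy(PER_LABEL, 118)  # hypothetical: 118 labels per row
print(round(small_system, 3), round(large_system, 3))
```

Under these assumptions, the same per-label quality yields a much lower row-level score as the number of labels per instance grows, which is consistent with row accuracy being the harsher criterion.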
In summary, the ROC and precision-recall results indicate strong performance for all four classification models (MLP, CNN, CNN-LSTM, LSTM). The models demonstrate excellent discriminative ability, with minor variations in precision. CNN shows consistent performance across all the evaluation techniques, while CNN-LSTM and LSTM exhibit slightly lower precision and MLP exhibits low accuracy.
Resource-performance trade-off analysis
To quantify the trade-off between the computational requirements and the performance of the different models for FDI attack detection, a manual grid search was conducted in which the number of layers and the number of neurons in each layer were varied. The experiments above used 2 layers and 128 neurons. This analysis, however, is needed to choose the optimal configuration for any model based on the application requirements and the resources available. The number of layers was set to 1, 2, and 4, and the number of neurons in each layer was set to 64, 128, and 256, which yielded nine combinations for each model. For each combination, the test RACC, \({F}_{1}\)-score, and training time are provided in Table 2. Only for the IEEE118 bus system's LSTM, attention-based, and hybrid models does the most complex configuration refer to 4 layers with 256 neurons in each layer.
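The nine configurations of such a grid search can be enumerated and timed with a short loop. This is a minimal sketch: `placeholder_train_and_score` is a hypothetical stand-in for whatever routine builds, trains, and evaluates a model, and it returns no real score.

```python
from itertools import product
import time

LAYER_OPTIONS = (1, 2, 4)         # values stated in the text
NEURON_OPTIONS = (64, 128, 256)   # values stated in the text


def build_grid(layer_options=LAYER_OPTIONS, neuron_options=NEURON_OPTIONS):
    """Every (layers, neurons) combination, in a fixed order."""
    return [{"layers": l, "neurons": n} for l, n in product(layer_options, neuron_options)]


def run_grid(train_and_score, grid):
    """Call the scoring routine once per configuration and record wall-clock time."""
    results = []
    for config in grid:
        start = time.perf_counter()
        score = train_and_score(**config)
        elapsed = time.perf_counter() - start
        results.append({**config, "score": score, "seconds": elapsed})
    return results


def placeholder_train_and_score(layers, neurons):
    # Hypothetical stand-in: a real routine would train a model with this
    # configuration and return its test RACC. Nothing is trained here.
    return None


grid = build_grid()
results = run_grid(placeholder_train_and_score, grid)
print(len(grid), grid[0], grid[-1])
```

Recording the elapsed time next to the score for each configuration is what makes it possible to weigh accuracy against training cost afterwards.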
For the IEEE14 power system, the test RACC of all architectures increases consistently as the number of layers and the number of neurons per layer increase, though only by a small amount (less than 8%). The MLP averages 81.94% across its IEEE14 configurations, while the CNN, LSTM, attention-based LSTM, and hybrid CNN-LSTM architectures average 92.05%, 93.71%, 84.52%, and 93.21%, respectively. The LSTM architecture has the best average RACC, by a margin of up to 13%, but the attention-based LSTM has the highest individual score of 94.53%, with 2 layers and 256 neurons.
For the IEEE118 power system, the RACC averages are 72.53%, 80.84%, 81.73%, 62.47%, and 81.04% for the MLP, CNN, LSTM, attention-based LSTM, and hybrid CNN-LSTM architectures, respectively. The test RACC values are lower overall than their IEEE14 counterparts. The LSTM model provided the best results for the IEEE118 power system as well, by a margin of up to 23.57%.
The training times (in minutes) for the IEEE14 and IEEE118 power systems are, respectively, 4.18 and 12.80 for MLP, 5.05 and 18.56 for CNN, 15.00 and 52.02 for LSTM, 32.92 and 118.73 for attention-based LSTM, and 15.65 and 60.57 for hybrid CNN-LSTM. MLP is therefore the least time-consuming architecture; however, it is less accurate than the other models.
The attention-based LSTM and the hybrid CNN-LSTM required around twice as long as the other architectures for both the IEEE14 and IEEE118 power systems. Training time also increased directly with the number of layers and the number of neurons for each model. For example, the LSTM architecture needed 6.09 min for the 1-layer, 64-neuron configuration but 28.56 min for the 4-layer, 256-neuron configuration.
The above analysis showed that the choice of model and its configuration was more important than the computational resources available. The better configurations had fewer layers and more neurons in each layer. Finally, such analysis also provides insight into choosing the optimal configuration of a model depending on the application, the error tolerance requirement(s), and the computational resources available.
Scalability
The performance of the deep learning models increased substantially with the \({l}_{2}\)-norm of the generated attacks. A higher \({l}_{2}\)-norm indicates that the generated FDI attacks attempt to cause major disruption in the power bus system and are thus more easily detectable than the subtle attacks generated with a lower \({l}_{2}\)-norm. For this analysis, all models except the attention-based model were considered, because it does not provide a gain in performance proportional to the computational cost required to train it, as is evident from the trade-off analysis in Sect. ‘‘Evaluation Metrics’’.
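As a reminder of what the \({l}_{2}\)-norm measures, the sketch below computes it for an injected-error vector and rescales the vector to a chosen norm. The vector values and helper names are hypothetical, and the sketch does not reproduce how the attack datasets in this study were generated.

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def scale_to_norm(vector, target_norm):
    """Rescale a non-zero vector so that its l2-norm equals target_norm."""
    norm = l2_norm(vector)
    if norm == 0:
        raise ValueError("cannot rescale a zero vector")
    return [x * target_norm / norm for x in vector]


# Hypothetical injected error on three measurements.
attack = [3.0, 4.0, 0.0]
original_norm = l2_norm(attack)

subtle = scale_to_norm(attack, 1.0)  # same direction, smaller magnitude
severe = scale_to_norm(attack, 5.0)  # same direction, larger magnitude
print(original_norm, l2_norm(subtle), l2_norm(severe))
```

A smaller norm keeps the injected deviation closer to normal measurement variation, which matches the observation that low-norm attacks are harder to detect.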
For the IEEE14 power system, increasing the \({l}_{2}\)-norm from 1 to 5 increased RACC by 11.98%, 6.48%, 5.52%, and 6.24% for MLP, CNN, LSTM, and CNN-LSTM, respectively. The effect of the \({l}_{2}\)-norm on the performance of all these models is shown in Fig. 12. The performance of the models increased with the severity of the attack, i.e., the \({l}_{2}\)-norm.
For the IEEE118 power system, increasing the \({l}_{2}\)-norm from 1 to 5 increased RACC by 18.09%, 15.35%, 15.49%, and 16.31% for MLP, CNN, LSTM, and CNN-LSTM, respectively. This change is significantly larger than for the IEEE14 power system. For the largest \({l}_{2}\)-norm, the highest RACC achieved by CNN was 97.36%, which was only 2.26% lower than the highest accuracy achieved for the IEEE14 power system. The overall trend remained the same as for the IEEE14 power system: RACC increased with the \({l}_{2}\)-norm. The effect of the \({l}_{2}\)-norm on the performance of all these models is shown in Fig. 13.
Discussion
Overall, the high performance of the LSTM and attention-based LSTM models comes at the cost of increased training time and computational resources. On the other hand, simpler models like MLP are not suitable because of their low accuracy. CNN achieved moderately lower accuracy but did so in one-fourth of the time taken by the attention-based LSTM model. Moreover, the ROC and precision-recall curves demonstrated the balanced performance of the CNN model in classifying the location of the attack, owing to its high discrimination power, as reflected in its perfect AUC scores. MLP achieved similar AUC scores but failed to attain high accuracy. This comparison provides decision-makers with a clear understanding of the trade-offs, enabling them to determine which model is most suitable for their specific application.
Despite the reliable performance of deep learning models, they often exhibit low interpretability. It can be challenging to determine which features or factors significantly influence the model’s results. However, visualization techniques, such as feature maps, can be employed to gain insights into how different features impact the model’s decisions.
Conclusion
Extensive work has recently been conducted on detecting FDIA using deep learning. This research therefore explores recent developments in this field and performs an extensive comparison of the most common high-performing models for detecting FDIA at the granularity of each node in the IEEE14 and IEEE118 bus systems. The CNN, LSTM, CNN-LSTM, and attention-based models were analysed and compared against MLP as the baseline model. Various evaluation criteria, namely test RACC, F1 score, computational time, memory space required, PR plot, and ROC plot, were used to ensure that each model's performance was measured fairly in all aspects and to verify its robustness.
The attention-based bidirectional model performs well for the small IEEE14 bus system, while LSTM/CNN works best for the larger IEEE118 bus system, according to our results. CNN, on the other hand, performs well for both bus systems and takes less training time. For intra-algorithm comparison, the configuration of each model was varied over 9 combinations of the number of layers and the number of neurons in each layer, to select an optimal configuration depending on the application, the error tolerance requirement(s), and the computational resources available. The results suggest that the best arrangement had fewer layers and a greater number of neurons in each layer. In the case of the larger IEEE118 bus system and for systems with resource constraints, it is recommended to use a single-layer CNN model with 64 neurons. The latter configuration is more suitable for deployment on systems with higher computational and memory capacity. Finally, the effect of the \({l}_{2}\)-norm, which indicates the severity of the FDI attack, was quantified for five different variants of the dataset. The highest jump in performance was seen for the MLP model. The RACC for both the IEEE14 and IEEE118 bus systems showed a significant increase.
In future work, more complex approaches such as reinforcement learning and temporal modelling can be explored. Moreover, given the criticality and sensitivity of this application, real-time test cases and data should be considered in the benchmarking.
Data availability
No datasets were generated or analysed during the current study.
References
Bebis G, Georgiopoulos M (1994) Feedforward neural networks. IEEE Potentials 13(4):27–31. https://doi.org/10.1109/45.329294
Bi S, Zhang YJ (2014) Using covert topological information for defense against malicious attacks on DC state estimation. IEEE J Sel Areas Commun 32(7):1471–1485. https://doi.org/10.1109/JSAC.2014.2332051
Bitirgen K, Filik ÜB (2023) A hybrid deep learning model for discrimination of physical disturbance and cyberattack detection in smart grid. Int J Crit Infrastruct Prot 40:100582. https://doi.org/10.1016/J.IJCIP.2022.100582
Brown MA, Zhou S (2019) Smartgrid policies in advances in energy systems. John Wiley & Sons, Ltd, Hoboken
Chauhan NK, Singh K (2018) A review on conventional machine learning vs deep learning. 2018 Int Conf Comput Power and Commun Technol (GUCON). https://doi.org/10.1109/GUCON.2018.8675097
Chopade P, Bikdash M (2016) New centrality measures for assessing smart grid vulnerabilities and predicting brownouts and blackouts. Int J Crit Infrastruct Prot 12:29–45. https://doi.org/10.1016/J.IJCIP.2015.12.001
Cui L, Qu Y, Gao L, Xie G, Yu S (2020) Detecting false data attacks using machine learning techniques in smart grid: a survey. J Netw Comput Appl 170:102808. https://doi.org/10.1016/j.jnca.2020.102808
Ding D, Han QL, Xiang Y, Ge X, Zhang XM (2018) A survey on security control and attack detection for industrial cyberphysical systems. Neurocomputing 275:1674–1683. https://doi.org/10.1016/j.neucom.2017.10.009
Esmalifalak M, Liu L, Nguyen N, Zheng R, Han Z (2017) Detecting stealthy false data injection using machine learning in smart grid. IEEE Syst J 11(3):1644–1652. https://doi.org/10.1109/JSYST.2014.2341597
Feng H, Han Y, Li K, Si F, Zhao Q (2024) Locational detection of the false data injection attacks via semisupervised multilabel adversarial network. Int J Electr Power Energy Syst 155:109682. https://doi.org/10.1016/j.ijepes.2023.109682
GitHub—wsyCUHK/WSYCUHK_FDIA: Locational detection of false data injection attack in smart grid: a multilabel classification approach. Accessed. 30 Jan. 2023. https://github.com/wsyCUHK/WSYCUHK_FDIA
He Y, Mendis GJ, Wei J (2017) Realtime detection of false data injection attacks in smart grid: a deep learningbased intelligent mechanism. IEEE Trans Smart Grid 8(5):2505–2516. https://doi.org/10.1109/TSG.2017.2703842
Ibraheem R, Eddin ME, Massaoudi M, AbuRub H (2024) Enhancing locational FDIA detection in smart grids a hyperparameter optimization analysis in 2024. 4th Int Conf Smart Grid Renew Energy (SGRE). https://doi.org/10.1109/SGRE59715.2024.10428762
Kallitsis MG, Bhattacharya S, Stoev S, Michailidis G (2016) Adaptive statistical detection of false data injection attacks in smart grids in. IEEE Global Conf Signal Inform Process (GlobalSIP) 2016:826–830. https://doi.org/10.1109/GlobalSIP.2016.7905958
Kundu A, Sahu A, Serpedin E, Davis K (2020) A3D: Attentionbased autoencoder anomaly detector for false data injection attacks. Electric Power Syst Res 189:106795. https://doi.org/10.1016/j.epsr.2020.106795
Kurt MN, Yılmaz Y, Wang X (2019) Realtime detection of hybrid and stealthy cyberattacks in smart grid. IEEE Trans Inf Forensics Secur 14(2):498–513. https://doi.org/10.1109/TIFS.2018.2854745
Li Y, Wei X, Li Y, Dong Z, Shahidehpour M (2022) Detection of false data injection attacks in smart grid: a secure federated deep learning approach. IEEE Trans Smart Grid 13(6):4862–4872. https://doi.org/10.1109/TSG.2022.3204796
Lukicheva I, Pozo D, Kulikov A (2018) cyberattack detection in intelligent grids using nonlinear filtering in 2018. IEEE PES Innov Smart Grid Technol Conf Eur (ISGTEurope). https://doi.org/10.1109/ISGTEurope.2018.8571457
Majidi SH, Hadayeghparast S, Karimipour H (2022) FDI attack detection using extra trees algorithm and deep learning algorithmautoencoder in smart grid. Int J Crit Infrastruct Prot 37:100508. https://doi.org/10.1016/J.IJCIP.2022.100508
Mukherjee D (2023) Detection of datadriven blind cyberattacks on smart grid: a deep learning approach. Sustain Cities Soc 92:104475. https://doi.org/10.1016/j.scs.2023.104475
Mukherjee D, Chakraborty S, Abdelaziz AY, ElShahat A (2022) Deep learningbased identification of false data injection attacks on modern smart grids. Energy Rep 8:919–930. https://doi.org/10.1016/J.EGYR.2022.10.270
Mukherjee D, Chakraborty S, Ghosh S (2022) Deep learningbased multilabel classification for locational detection of false data injection attack in smart grids. Electr Eng 104(1):259–282. https://doi.org/10.1007/s00202021012786
Musleh AS, Chen G, Dong ZY (2020) A survey on the detection algorithms for false data injection attacks in smart grids. IEEE Trans Smart Grid 11(3):2218–2234. https://doi.org/10.1109/TSG.2019.2949998
Niu X, Li J, Sun J, Tomsovic K (2016) Dynamic detection of false data injection attack in smart grid using deep learning. IEEE Power Energy Soc Instit Electr Electron Eng 1:1–6
Niu X, Li J, Sun J, Tomsovic K (2019a) “Dynamic detection of false data injection attack in smart grid using deep learning”, in. IEEE Power Energy Soc Innov Smart Grid Technol Conf (ISGT) 2019:1–6. https://doi.org/10.1109/ISGT.2019.8791598
Shi W, Wang Y, Jin Q, Ma J (2018) PDL: an efficient predictionbased false data injection attack detection and location in smart grid. Ann Computer Softw Appl (COMPSAC). https://doi.org/10.1109/COMPSAC.2018.10317
Teixeira A, Shames I, Sandberg H, Johansson KH (2015) A secure control framework for resourcelimited adversaries. Automatica 51:135–148. https://doi.org/10.1016/j.automatica.2014.10.067
Wang S, Bi S, Zhang YJA (2020) Locational detection of the false data injection attack in a smart grid: a multilabel classification approach. IEEE Internet Things J 7(9):8218–8227. https://doi.org/10.1109/JIOT.2020.2983911
Xu B, Guo F, Wen C, Deng R, Zhang WA (2014) Detecting false data injection attacks in smart grids with modeling errors a deep transfer learning based approach. arXiv preprint. https://doi.org/10.48550/ARXIV.2104.06307
Acknowledgements
The authors thank the OpenUAE Research and Development Group for their support.
Funding
Not applicable.
Author information
Contributions
QN and MA conceptualized the study and acquired resources and funds. MAA conducted the main experiments and evaluation tests and wrote the original draft. TI and RB edited and reviewed the manuscript. BA and YB also worked on conducting experiments. OA validated the results and reviewed the manuscript along with QN and MA. All authors read and approved the final manuscript.
Ethics declarations
Ethical approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Nasir, Q., Abu Talib, M., Arshad, M.A. et al. Comparison of deep learning algorithms for site detection of false data injection attacks in smart grids. Energy Inform 7, 71 (2024). https://doi.org/10.1186/s42162-024-00381-9
DOI: https://doi.org/10.1186/s42162-024-00381-9