 Review
 Open Access
Anomaly detection in quasiperiodic energy consumption data series: a comparison of algorithms
Energy Informatics volume 5, Article number: 62 (2022)
Abstract
The diffusion of domotics solutions and of smart appliances and meters enables the monitoring of energy consumption at a very fine level and the development of forecasting and diagnostic applications. Anomaly detection (AD) in energy consumption data streams helps identify data points or intervals in which the behavior of an appliance deviates from normality, and may prevent energy losses and breakdowns. Many statistical and learning approaches have been applied to the task, but there is still a need to compare their performance on data sets with different characteristics. This paper focuses on anomaly detection in quasiperiodic energy consumption data series and contrasts 12 statistical and machine learning algorithms tested in 144 different configurations on 3 data sets containing the power consumption signals of fridges. The assessment also evaluates the impact of the length of the series used for training and of the size of the sliding window employed to detect the anomalies. The generalization ability of the top five methods is also evaluated by applying them to an appliance different from the one used for training. The results show that classical machine learning methods (Isolation Forest, One-Class SVM and Local Outlier Factor) outperform the best neural methods (GRU/LSTM autoencoder and multi-step methods) and generalize better when applied to detect the anomalies of an appliance different from the one used for training.
Introduction
Appliance-level energy consumption monitoring is a core component of the control system of smart buildings (Shah et al. 2019; Shaikh et al. 2014). The consumption data can either be collected directly with devices such as smart plugs, or inferred with non-intrusive load monitoring (NILM) algorithms able to break down the household aggregate consumption signal into the contributions of individual appliances (Azizi et al. 2021). The analysis of energy consumption data series enables forecasting and diagnostic applications, such as load prediction (Amasyali and El-Gohary 2018), anomaly detection (AD) (Fan et al. 2018) and predictive maintenance (Cheng et al. 2020).
AD in temporal data series is the task of identifying data points or intervals in which the time series deviates from normality. AD finds application in different fields: in healthcare, where it applies to the analysis of clinical images (Schlegl et al. 2019) and of ECG data (Chauhan and Vig 2015); in cybersecurity, where it is used for malware identification (Sanz et al. 2014); in manufacturing, where it helps monitor machines and prevent breakdowns (Kharitonov et al. 2022); and in the utility industry, where it supports the early identification of critical events such as appliance malfunctioning (Mishra et al. 2020) and water leakage (Seyoum et al. 2017; Muniz and Gomes 2022). In the energy field, AD may be combined with energy load forecasting to improve accuracy (Koukaras et al. 2021), or integrated as a component for detecting non-nominal energy fluctuations to enhance decision making in energy transfer between microgrids (An interdisciplinary 2021). Energy consumption time series can be collected from home appliances and building systems with complex periodic or quasiperiodic behavior, such as coolers, water heaters and fridges, which present specific challenges for anomaly detection. Machine learning and neural models trained on normal data may overfit to the length of the period. This makes the model sensitive even to small variations of the cycle duration, which can occur during normal functioning (Liu et al. 2020). As a consequence, the detector may emit a high number of false positive alerts when such small variations occur, and its performance may degrade significantly when it is used to detect anomalies of an appliance of the same type but with a different cycle duration.
The literature on AD in temporal data series still lacks a systematic comparison of algorithms belonging to different families on quasiperiodic data sets. Therefore, the development of an AD application in such a scenario still has to confront design decisions such as the choice of the most effective algorithm, the minimum duration of the time series to use for training, the minimum size of the signal prediction/reconstruction window needed to identify the anomalous behavior, and the portability of the chosen algorithm from one appliance to another one with “similar” behavior. This paper addresses this gap by systematically comparing the performance of 12 algorithms representative of different families of approaches. The experiments were performed on 3 distinct data sets of fridge power consumption.
The aim of the experiments is to address the following questions:

Q1 How do the selected algorithms compare in the AD task on quasiperiodic time series under multiple performance metrics?

Q2 For the algorithms that require training, what is the relationship between the length of the training series and performance?

Q3 For the algorithms that exploit a window-based approach for the prediction, what is the relationship between the length of the window and performance?

Q4 What is the generalization capability of the methods? How does performance degrade when a method trained on an appliance is tested on the time series produced by a distinct appliance of the same type?
The essential findings can be summarized as follows:

The classical ML algorithms Isolation Forest (ISOF), One-Class SVM (OC SVM), and Local Outlier Factor (LOF) outperform the best neural models (GRU/LSTM autoencoder and multi-step methods).

Two weeks of training data are sufficient for most methods, with the multi-step approaches attaining a modest improvement if one month of data is used.

The length of the prediction/reconstruction window has a different impact on neural and non-neural methods.

ISOF and OC SVM are less dependent on the training set than the neural models, whose performance decays significantly when they are tested on an appliance different from the one used for training.

The top result of all the experiments is attained by ISOF on the Fridge3 time series, trained with a subsequence of length equal to one month and with a window size of 2 \(\times\) period: Precision = 0.947, Recall = 0.965, \(\hbox {F}_{1}\) score = 0.956.
The above-mentioned findings can help better understand the requirements and performance of AD algorithms on quasiperiodic data series, so as to design more effective household energy consumption applications, e.g., by equipping the mobile apps that are nowadays bundled with smart plug products with functionalities for consumption monitoring, energy saving recommendations and alerting of potential appliance malfunctioning.
The rest of the article is organised as follows: Section “Related work” overviews the state of the art in anomaly detection. Section “Experimental settings” describes the experimental configuration, including the description of the data set and of the evaluated algorithms. Section “Experimental results” discusses the results of the performed experiments. Section “Qualitative analysis of results” qualitatively discusses a few examples of the predictions made by the reviewed methods. Finally, Section “Conclusions” draws the conclusions and illustrates our future work.
Related work
Anomaly detection in temporal data series exploits data collected with a broad spectrum of sensors in diverse fields, such as weather monitoring, natural resources distribution and consumption (e.g., water and natural gas), network traffic surveillance, and electrical load measurement (Firth et al. 2017; A platform for Open 2022; Makonin et al. 2016; Shakibaei 2020). As an example, the work in Makonin et al. (2016) discusses the use of residential smart meters for data collection and highlights how such series often exhibit anomalous behaviors. Raw data must be preprocessed before further analysis. Besides the usual operations of data cleaning and validation, a prominent task is data annotation, which associates data points or intervals with the specification of significant events, such as change points and anomalies. For example, Rimor (Rashid et al. 2018) is a time-series data annotator supporting the labelling of data with anomaly tags, which can be used as ground truth for training and evaluating predictive models.
AD can be conducted in both univariate (Braei and Wagner 2020) and multivariate time series (Su et al. 2019; Li et al. 2018; Blázquez-García et al. 2021). In the case of multivariate time series, exploiting variable correlation may be necessary to reduce the number of parameters needed to model the problem (Pena and Poncela 2006). Examples of multivariate time series dimensionality reduction techniques are principal component analysis (Cook et al. 2019; Pena and Poncela 2006), canonical correlation analysis (Box and Tiao 1977), and factor modelling (Pena and Box 1987).
AD approaches can be classified into two main families (Cook et al. 2019): non-regressive and regressive. Non-regressive approaches rely on fundamental statistical quantities computed on the time series (e.g., mean and variance) and combine them with fixed thresholds, but their effectiveness is limited (Cook et al. 2019). The authors of Kao and Jiang (2019) proposed a statistical AD framework using the Dickey-Fuller test, the Fourier transform, and the Pearson correlation coefficient to analyze periodic time series. Performance evaluation on five NAB datasets (Ahmad et al. 2017) showed that the proposed approach performs well on the NAB Jumps periodic data set and outperforms the models it was compared to. Other non-regressive techniques are ML methods for time series analysis. In Oehmcke et al. (2015) the Local Outlier Factor (LOF) method was employed to identify anomalous events in the marine domain and attained 83.4% precision. The Isolation Forest (ISOF) algorithm was applied to streaming data in Ding and Fei (2013), achieving an AUC score of 0.98 on one of the test datasets. In Zhang et al. (2008) the One-Class Support Vector Machine (OC SVM) was applied to the identification of network anomalies, and on the test set the identified outliers perfectly matched the result of human visual detection.
Regressive approaches compute a model of the time series generation process. In the case of AD, an autoregression model is used to forecast the variable of interest from its past values. Autoregressive models include methods based on Autoregressive Moving Average (ARMA) (Pincombe 2005; Kadri et al. 2016; Kozitsin et al. 2021) and on Neural Networks, such as Autoencoders (AE) (Yin et al. 2020; Li et al. 2020) and Recurrent Neural Networks (RNNs) (Canizo et al. 2019; Malhotra et al. 2015). Forecasting-based AD approaches are divided into single-step and multi-step methods depending on the number of predicted points. The former strategy is preferable for short-term forecasting (i.e., minutes, hours, and days) and the latter for long-term data series analysis.
In the electric load analysis domain, the work in Masum et al. (2018) studies the problem of time series forecasting for electric load measurements and shows that Long Short-Term Memory (LSTM), a deep learning model, outperforms AutoRegressive Integrated Moving Average (ARIMA), a statistical model, on three data sets obtained from the Open Power System Data on electric load in Great Britain, Poland, and Italy (A platform for Open 2022). Zhang et al. (2019) shows the importance of a periodicity preprocessor based on the Fast Fourier Transform (FFT) for extracting the period of smart grid time series. Pereira et al. (2018) proposes the use of Variational Autoencoders (VAE) for unsupervised anomaly detection in solar energy generation time series, and the results show that the trained model is able to detect anomalous patterns by using the probabilistic reconstruction metrics as anomaly scores. Himeur et al. (2021) surveys several Artificial Intelligence methods for anomaly detection in buildings’ energy consumption, identifying several factors (e.g., occupancy and outdoor temperature) that influence time series behavior.
In the specific field of periodic data series analysis, Zhang et al. (2020) employs a periodicity preprocessor to find the time series period and segment the data into windows. Then it exploits a combination of an RNN and a CNN to detect anomalies, achieving an \(\hbox {F}_{1}\) score near 0.9 on all the test datasets. Zhang et al. (2019) also uses a periodicity preprocessor, based on the Fourier transform, and maps multiple periods onto a single cycle to identify deviations across subsequent periods. Pereira et al. (2018) uses a BiLSTM to detect anomalies and proposes the use of attention maps to explain the results. Capozzoli et al. (2018) encodes periodic time series using letters as a data size reduction technique; the resulting classification process achieved robust results, with a global accuracy between 80% and 90%. These works show the advantages of preprocessing to exploit data periodicity and of dimensionality reduction techniques, and discuss the interpretability of results.
The proliferation of time series analysis methods and of AD-specific approaches has spawned a stream of research focused on comparing the performance of alternative techniques. For example, the work in Masum et al. (2018) compares the multi-step forecasting performance of ARIMA and LSTM-based RNN models and shows that the LSTM model outperforms the ARIMA model for multi-step electric load forecasting. Our preliminary work (Zangrando et al. 2022) compares CNN-powered and RNN-powered AD methods with One-Class Support Vector Machines and Isolation Forest on one quasiperiodic data set, using standard metrics (precision, recall, \(\hbox {F}_{1}\) score). In this paper we deepen the analysis by assessing performance under multiple metrics, investigating the impact of the training subsequence duration and of the analysis window size, and contrasting the generalization capacity of the reviewed approaches.
Experimental settings
Data set
The experiments exploit a fridge energy consumption data set collected using smart plugs. The energy consumption data have been collected in Greek residential households using BlitzWolf BW-SHP2 smart plugs, which allow exporting the time series through an API. The data collection system, the assessed algorithms and the evaluation framework were all implemented in Python. The time series in the data set record the active power consumption of three fridges for over 2 months, with 1-minute resolution. The time series have been divided into subsequences for training, validation, and testing of the methods. Table 1 summarizes the data split.
When working in normal conditions, the energy consumption curve of a fridge displays a cyclic behavior, alternating between a high consumption state (ON) and a low consumption state (OFF). Figure 1 shows an example of the consumption data of one appliance.
Data set analysis
Periodicity analysis. Normal fridge consumption shows a cyclic behavior. Periodicity analysis aims at detecting the mean period corresponding to an ON-OFF cycle and possibly to other longer patterns (e.g., seasonal effects). It is a preliminary step before the application of AD and requires a non-anomalous subseries, which can be created by manually removing anomalies from the training subsequence. The Fast Fourier Transform (FFT) is applied to the anomaly-free subsequence to map the data into the frequency domain, and the period is defined as the inverse of the frequency with the highest power in the FFT spectrum, as proposed in Kao and Jiang (2019). Table 2 summarizes the periods of the three data sets, expressed in minutes. The periods range from 45 minutes to 1 h 40 minutes. No seasonal effect can be detected because the training set covers only one month. Figure 2 shows the power spectrum computed for one of the three appliances.
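The FFT-based period estimation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name, the removal of the mean and the explicit skipping of the zero-frequency bin are our assumptions, and the toy signal is synthetic (20 minutes ON, 40 minutes OFF at 1-minute resolution).

```python
import numpy as np

def estimate_period(x, sample_minutes=1.0):
    """Dominant period (in minutes) of an anomaly-free series, from its FFT power spectrum.

    Hypothetical helper: the period is the inverse of the frequency with the
    highest power; the zero-frequency (DC) bin is skipped so that the mean
    consumption level cannot dominate.
    """
    x = np.asarray(x, dtype=float)
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2      # one-sided power spectrum
    freqs = np.fft.rfftfreq(len(x), d=sample_minutes)   # cycles per minute
    k = 1 + int(np.argmax(power[1:]))                    # strongest non-DC bin
    return 1.0 / freqs[k]

# Synthetic toy series: two weeks at 1-minute resolution, 60-minute ON-OFF cycle.
t = np.arange(60 * 24 * 14)
fridge = np.where(t % 60 < 20, 120.0, 2.0)   # 120 W ON for 20 min, 2 W OFF for 40 min
period = estimate_period(fridge)             # approximately 60.0
```

The estimate is exact here only because the toy series contains an integer number of cycles. On real quasiperiodic data it approximates the mean cycle length, with a frequency resolution of 1/(N·Δt), so longer anomaly-free segments give finer estimates.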
Ground truth annotation
For training and testing purposes, the energy consumption time series have been annotated with ground truth (GT) metadata to specify the points that deviate from normality. Three independent annotators have labeled the data points with a Boolean tag (normal/anomalous) and with a categorical label denoting the type of anomaly, using the interface shown in Figure 3.
Anomaly classes and their distribution
The anomalies have been classified into the following categories:

Continuous OFF state: the appliance remains in the low consumption state for a long time.

Continuous ON state: the appliance remains in the high consumption state for an abnormally long time.

Spike: the appliance shows an abnormal consumption peak, possibly preceded by a ramp and followed by a decay period.

Spike + Continuous: the appliance shows a consumption peak followed by a prolonged ON state.

Other: the anomaly does not follow a well-defined pattern.
Figure 4 shows the distribution of the anomaly categories in the data sets of the three fridges. The plots highlight the different anomalous behavior of the appliances. Fridge2 is mainly subject to continuous ON states. Fridge1 shows a similar pattern, but the prolonged ON states are preceded by an abrupt increase in consumption. Fridge3 exhibits more easily detectable anomalies: almost 95% of them are spikes, which are also easier to spot visually.
GT anomaly duration distribution. Figure 5 shows the distribution of GT anomaly durations in the data series of the three fridges. The distributions of Fridge1 and Fridge2 are centered close to the time series period, which suggests the presence of anomalies shorter than an ON-OFF cycle. The distribution of Fridge3 is centered around values higher than the mean ON-OFF cycle duration, which is typical of the transient behavior caused by high consumption spikes.
Compared algorithms
Algorithm list and definitions
The algorithm selection considered the most common methods used in the reviewed studies and their nature (statistical, regressive, neural) so as to achieve a balanced representation of the different approaches.

1
Basic Statistics is an extension of the method presented in Kao and Jiang (2019) for periodic series. The first step analyzes the anomaly-free training data series to determine the periodicity. Then, the anomaly-free training set is divided into non-overlapping windows of the same size as the period, and the Pearson product-moment correlation coefficient is computed on all the pairs of contiguous windows to check whether the time series is periodic within the two windows. If it is periodic, the ratio \(R_{std} = \frac{Std_{current} - Std_{previous}}{Std_{previous}}\) is computed, where \(Std_{current}\) and \(Std_{previous}\) are the standard deviations of the current and of the previous window. An anomaly occurs if \(R_{std}\) exceeds a threshold \(\tau\), defined as follows. \(R_{std}\) is calculated for each window pair in the training set, and its maximum value \(R_{max}\), i.e., the largest value observed in a non-anomalous time series, is found. Then the threshold \(\tau\) is determined on the validation set by grid search. Given a set of candidate thresholds \(\tau _\alpha = R_{max}(1+\alpha )\), with \(\alpha\) ranging from 0 to 10 with step 0.1, \(\tau\) is the value yielding the best \(F_1\) score when the anomaly definition rule is applied to the validation set. Finally, the same rule is applied to the test set using the selected threshold value.

2
AutoRegressive (AR) (Hyndman and Athanasopoulos 2021) is an autoregression model exploiting past data to predict current data. The prediction model is defined as:
$$\begin{aligned} y_t = c + \sum _{i=1}^{p} \phi _i y_{t-i} + \varepsilon _t \end{aligned}$$(1)
where \(c, \phi _i\) are the model parameters and \(\varepsilon _t\) is a white noise term. Anomalies are computed from the prediction error by thresholding.

3
AutoRegressive Integrated Moving Average (ARIMA) (Hyndman and Athanasopoulos 2021; Masum et al. 2018) is a model exploiting past data, differencing of the original time series and a linear combination of past white noise terms. An ARIMA(p, d, q) model is defined as:
$$\begin{aligned} y^\prime _t=c + \sum _{i=1}^{p} \phi _i y_{t-i}^{\prime } + \sum _{j=1}^{q} \theta _j \varepsilon _{t-j} + \varepsilon _t \end{aligned}$$(2)
where \(y^\prime _t\) is the time series differenced d times, \(\varepsilon _t\) is a white noise term and \(c, \phi _i, \theta _j\) are the model parameters. Anomalous points are defined as in AR.

4
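Local Outlier Factor (LOF) (Breunig et al. 2000) is a density-based outlier detection algorithm: it compares the local density of a point, estimated from its nearest neighbors, with the local densities of those neighbors, and points with a substantially lower density are identified as local outliers.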

5
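One-Class SVM (OC SVM) (Schölkopf et al. 1999) adapts the support vector machine (SVM) to novelty detection: it learns a boundary enclosing the normal training data, and points falling outside it are considered anomalous.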

6
Isolation Forest (ISOF) (Liu et al. 2008) is an ensemble method that builds random binary trees to isolate data points; anomalous points are isolated with fewer splits, i.e., at a shorter average path length.

7
Gated Recurrent Unit (GRU) (Chung et al. 2014) is a class of Recurrent Neural Networks (RNNs) that uses an update gate and a reset gate to decide what information should be passed to the output.

8
Gated Recurrent Unit multi-step (GRU-MS) is based on GRU and is used to predict multiple consecutive future data points.

9
Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) is another class of RNNs exploiting a cell with an input gate, an output gate and a forget gate. Both GRU and LSTM are designed to take advantage of the past context of the data and to mitigate the vanishing gradient problem of standard RNNs.

10
Long Short-Term Memory multi-step (LSTM-MS) is based on LSTM and is used to forecast several consecutive data points.

11
GRU-Autoencoder (GRU-AE) (Zhang et al. 2019) is a hybrid model coupling an autoencoder and a GRU network.

12
LSTM-Autoencoder (LSTM-AE) (Cho et al. 2014) is another hybrid model coupling an autoencoder and an LSTM network.
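A minimal sketch of the Basic Statistics detector (item 1 above) follows, assuming that the period is already known as an integer number of samples. Two details are not specified in the text, so both are our assumptions. The first is the Pearson cut-off `rho_min` used to decide that two windows are periodic. The second is the choice to flag every point of the current window when \(R_{std} > \tau\). Window pairs that are not judged periodic are simply skipped here. All helper names are hypothetical and the toy data are synthetic.

```python
import numpy as np

def rstd_pairs(x, period, rho_min=0.5):
    """(window index, R_std) for contiguous non-overlapping window pairs judged periodic.

    rho_min is a hypothetical cut-off on the Pearson coefficient.
    """
    n_win = len(x) // period
    wins = np.asarray(x[: n_win * period], dtype=float).reshape(n_win, period)
    pairs = []
    for i in range(1, n_win):
        prev, cur = wins[i - 1], wins[i]
        if prev.std() == 0 or cur.std() == 0:
            continue  # Pearson coefficient undefined on a flat window
        if np.corrcoef(prev, cur)[0, 1] >= rho_min:
            pairs.append((i, (cur.std() - prev.std()) / prev.std()))
    return pairs

def detect(x, period, tau, rho_min=0.5):
    """Flag all points of the current window when its R_std exceeds tau (assumed rule)."""
    flags = np.zeros(len(x), dtype=bool)
    for i, r in rstd_pairs(x, period, rho_min):
        if r > tau:
            flags[i * period:(i + 1) * period] = True
    return flags

def f1_score(pred, gt):
    tp = np.sum(pred & gt)
    denom = 2 * tp + np.sum(pred & ~gt) + np.sum(~pred & gt)
    return 2 * tp / denom if denom else 0.0

def fit_threshold(train, val, val_gt, period, rho_min=0.5):
    """R_max on anomaly-free training data, then grid search tau_alpha = R_max * (1 + alpha)."""
    r_max = max((r for _, r in rstd_pairs(train, period, rho_min)), default=0.0)
    alphas = np.round(np.arange(0, 101) * 0.1, 1)   # alpha = 0.0, 0.1, ..., 10.0
    f1s = [f1_score(detect(val, period, r_max * (1 + a), rho_min), val_gt) for a in alphas]
    return r_max * (1 + alphas[int(np.argmax(f1s))])

# Synthetic toy data: a 10-minute ON-OFF cycle whose amplitude varies slightly.
base = np.r_[np.full(3, 100.0), np.full(7, 2.0)]
train = np.concatenate([base * a for a in [1.0, 1.05] * 5])           # anomaly-free
val = np.concatenate([base * a for a in [1.0, 1.0, 3.0, 1.0, 1.0]])   # third window spikes
val_gt = np.zeros(len(val), dtype=bool)
val_gt[20:30] = True
tau = fit_threshold(train, val, val_gt, period=10)
```

Here \(R_{max} \approx 0.05\) and every candidate threshold isolates the spiking window, so the first one (\(\alpha = 0\)) is kept.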
Training procedure and parameter settings
The hyperparameters of the ISOF, OC SVM, LOF, and ARIMA models are set with Bayesian search employing the holdout method. For each configuration, the chosen hyperparameters are used to fit the model and the performance is evaluated on the validation set. LOF, OC SVM and ISOF are assessed using the maximum \(\hbox {F}_{1}\) score, whereas ARIMA models are assessed using the mean squared error (MSE) of the predictions. The hyperparameters yielding the maximum \(\hbox {F}_{1}\) or the lowest MSE are selected.
ARIMA is trained on anomaly-free data to learn normal patterns, as done in Yaacob et al. (2010).
ISOF, LOF and OC SVM work on spatial data, and thus the univariate time series is projected onto a space \({\mathbb {R}}^n\) with \(n \ge 1\) (Braei and Wagner 2020; Oehmcke et al. 2015). A window of size n is used to extract from the time series \(N-n+1\) vectors of n consecutive points, where N is the length of the time series. Then, the spatial algorithms are trained on the projected vectors. At test time, the test set is projected onto \({\mathbb {R}}^n\) and the score of each projected vector is computed. The anomaly score of a point in the time series is defined as the average of the anomaly scores of all the vectors that contain the point. All the neural models are trained on anomaly-free data.
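The projection onto \({\mathbb {R}}^n\) and the per-point averaging of vector scores can be sketched as follows. The helper names are hypothetical, and the per-vector scores can come from any of the spatial detectors. The commented scikit-learn lines are an assumption, since the text states only that Python was used. `IsolationForest.score_samples` returns lower values for more abnormal samples, which is why the sign is flipped there.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def project(x, n):
    """N - n + 1 vectors of n consecutive points, as a read-only view of shape (N - n + 1, n)."""
    return sliding_window_view(np.asarray(x, dtype=float), n)

def point_scores(vector_scores, n_points, n):
    """Score of each point = mean score of all projected vectors that contain it."""
    total = np.zeros(n_points)
    count = np.zeros(n_points)
    for start, s in enumerate(vector_scores):   # vector `start` covers points start..start+n-1
        total[start:start + n] += s
        count[start:start + n] += 1
    return total / count

# Possible use with scikit-learn (assumed, not confirmed by the text):
#   from sklearn.ensemble import IsolationForest
#   model = IsolationForest(random_state=0).fit(project(train, n))
#   vec = -model.score_samples(project(test, n))   # higher = more anomalous
#   scores = point_scores(vec, len(test), n)
```

Points near the ends of the series are covered by fewer vectors, so their scores average fewer values than those of interior points.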
Table 3 summarizes the relevant features and parameters of the compared methods.
Anomaly definition, GT matching, and performance metrics
Anomaly definition strategies. An anomaly definition strategy specifies how the output of the anomaly detector and the data points of the time series are compared in order to identify whether a point is anomalous. AD algorithms adopt different strategies to identify abnormal points:

Confidence: an anomaly score is directly provided as output by the model.

Absolute and Squared Error (Munir et al. 2018): the anomaly score is defined as the absolute or squared error between the input and the predicted/reconstructed value.

Likelihood (Malhotra et al. 2015): each point in the time series is predicted/reconstructed l times and associated with multiple error values. The probability distribution of the errors made by predicting on normal data is used to compute the likelihood of normal behavior on the test data, which is used to derive an anomaly score.

Mahalanobis (Malhotra et al. 2016): each point in the time series is predicted/reconstructed l times. For each point, the anomaly score is calculated as the square of the Mahalanobis distance between the error vector and the Gaussian distribution fitted from the error vectors computed during validation.

Windows strategy (Keras 2022): a score vector of dimension l is associated with each point. Each element \(s_i\) of the score vector is the mean absolute or mean squared error of the ith predicted/reconstructed window that contains the point.
A threshold \(\tau\) is then applied to the calculated score(s) for classifying the point as normal or anomalous. Table 4 shows the anomaly definition strategies of the compared methods.
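The Mahalanobis strategy can be sketched as below. This is a minimal NumPy illustration that assumes the l prediction/reconstruction errors of each point have already been collected into a matrix with one row per point. The helper names are hypothetical. Using the pseudo-inverse is our design choice, not something stated in the text: it keeps the score defined when the error covariance is singular. The threshold value is illustrative only.

```python
import numpy as np

def fit_error_gaussian(val_errors):
    """Mean and (pseudo-)inverse covariance of validation error vectors, shape (points, l)."""
    val_errors = np.asarray(val_errors, dtype=float)
    mu = val_errors.mean(axis=0)
    cov = np.atleast_2d(np.cov(val_errors, rowvar=False))  # (l, l), also when l == 1
    return mu, np.linalg.pinv(cov)

def mahalanobis_scores(errors, mu, cov_inv):
    """Squared Mahalanobis distance (e - mu)^T Sigma^-1 (e - mu) for each row e."""
    d = np.asarray(errors, dtype=float) - mu
    return np.einsum("ij,jk,ik->i", d, cov_inv, d)

# Toy example with l = 2 errors per point (synthetic values).
mu, cov_inv = fit_error_gaussian([[1, 0], [-1, 0], [0, 1], [0, -1]])
scores = mahalanobis_scores([[0.0, 0.0], [2.0, 0.0]], mu, cov_inv)
is_anomalous = scores > 3.0   # tau = 3.0 chosen only for illustration
```

Because the covariance is estimated on errors of normal data, error vectors that are large in directions where validation errors were small receive high scores.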
Anomaly detection criteria and thresholds. The criteria are the rules adopted to identify an anomaly, and they are strongly related to the nature of the algorithm. The anomaly identification criteria used by the compared methods are classified as follows:

Prediction error: prediction models identify anomalies based on the difference between the predicted value and the observed one. Anomalies are identified based on the residuals between the input and the generated data: the higher the difference, the higher the likelihood of an anomaly.

Reconstruction error: this criterion applies to all the models that aim at generating an output as close as possible to the input, such as the autoencoder-based models. As for the prediction models, the larger the residual, the higher the probability of an anomaly.

Dissimilarity: dissimilarity models classify anomalous points by comparing them with the features or the distribution of normal points, or by matching them with the clusters computed from the normal time series.
Table 4 summarizes the detection criteria used by the different algorithms.
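To make the prediction-error criterion concrete, the sketch below fits the AR model of Eq. (1) by ordinary least squares and scores each point by its absolute one-step prediction error. It is an illustration under stated assumptions: least-squares estimation, the function names and the threshold value are ours. The noise-free AR(1) toy series is synthetic, and in practice the model would be fitted on anomaly-free training data.

```python
import numpy as np

def _lags(x, p):
    """Row for target y_t holds y_{t-1}, ..., y_{t-p} (targets t = p, ..., N-1)."""
    return np.column_stack([x[p - i:len(x) - i] for i in range(1, p + 1)])

def fit_ar(x, p):
    """Least-squares estimate of [c, phi_1, ..., phi_p] in y_t = c + sum_i phi_i y_{t-i} + eps_t."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones(len(x) - p), _lags(x, p)])
    coef, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    return coef

def prediction_error(x, coef):
    """Absolute one-step prediction error |y_t - y_hat_t|; the first p points get NaN."""
    x = np.asarray(x, dtype=float)
    p = len(coef) - 1
    err = np.full(len(x), np.nan)
    err[p:] = np.abs(x[p:] - (coef[0] + _lags(x, p) @ coef[1:]))
    return err

# Synthetic check: noise-free AR(1) process y_t = 1 + 0.5 y_{t-1}, then one injected spike.
y = np.zeros(30)
for t in range(1, 30):
    y[t] = 1 + 0.5 * y[t - 1]
coef = fit_ar(y, p=1)              # approximately [1.0, 0.5]
test = y.copy()
test[10] += 5.0                    # anomalous point
err = prediction_error(test, coef)
anomalies = err > 1.0              # tau = 1.0 chosen only for illustration
```

The spike also inflates the error of the following point, because its prediction uses the spiked value. This is one reason why prediction-error detectors tend to flag short runs of points around an anomaly.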
GT matching. To classify the predictions as true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN), a point-to-point matching strategy has been adopted: the label predicted for each point is compared only with the GT label of the corresponding point in the input data series.
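Performance metrics. The evaluation adopts the most widely used machine learning metrics, precision, recall, and \(\hbox {F}_{1}\) score, defined as follows:
$$\begin{aligned} \text {Precision} = \frac{TP}{TP+FP}, \qquad \text {Recall} = \frac{TP}{TP+FN}, \qquad \hbox {F}_{1} = 2\,\frac{\text {Precision} \cdot \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$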
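A point-to-point evaluation of precision, recall and \(\hbox {F}_{1}\) can be sketched as follows. It assumes boolean arrays of predicted and GT labels aligned point by point. The function name is hypothetical, and returning 0 when a denominator is zero is our convention, not one stated in the text.

```python
import numpy as np

def point_metrics(pred, gt):
    """Point-to-point precision, recall and F1 for aligned predicted and GT labels."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.sum(pred & gt)     # anomalous in both
    fp = np.sum(pred & ~gt)    # flagged but normal in the GT
    fn = np.sum(~pred & gt)    # missed GT anomalies
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: TP = 2, FP = 1, FN = 1, so precision = recall = F1 = 2/3.
p, r, f = point_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```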
Experimental results
In this section we summarize the responses to the four questions introduced in the Introduction. For space reasons we condense the results of the 144 (12 methods \(\times\) 3 training periods \(\times\) 4 window sizes) experiments on 3 data sets and discuss only the essential findings. The complete list of results is published at the address: https://github.com/herrerasergio/ADperiodicTS.
Q1: comparative performances
Figure 6 compares the methods over all the data sets and across all the training durations and sliding window sizes. The ISOF method consistently achieves the best \(\hbox {F}_{1}\) score, followed by OC SVM and LOF. The AE and MS neural methods have comparable performance. The multi-step approaches exhibit a more consistent behavior, with smaller standard deviations, and the GRU-AE method performs slightly worse than the other approaches. The neural methods that predict only one point in the future (LSTM and GRU) have low performance and a rather inconsistent behavior. This is expected given the high sampling frequency, which makes one-step prediction ineffective for detecting anomalies. Of the remaining non-neural methods, ARIMA and Basic Statistics are positioned at the low end of the performance range.
The top result on all the experiments is attained by ISOF on the Fridge3 time series, trained with a subsequence of length equal to one month and with a window size of 2 \(\times\) period: Precision = 0.947, Recall = 0.965, \(\hbox {F}_{1}\) score = 0.956.
A special case is that of AR. The training of the method converges only for the shortest duration of the training subsequence (a half period). However, the trained model delivers on average a good \(\hbox {F}_{1}\) score. Although the values predicted by AR are grossly inaccurate, the error on the points that belong to a normal subsequence is very different from the error on the points that lie within an anomalous subsequence, which results in good AD performance.
Figure 7 shows the performance breakdown by appliance. As expected, all methods except ARIMA and Basic Statistics perform better on the Fridge3 data set, which contains more recognizable anomalies, mostly of a single type (\(\approx 95\%\) of type spike). On the Fridge1 and Fridge2 data sets the performance follows the same ranking as in Fig. 6, with the same top-4 methods (ISOF, AR, OC SVM and LOF) and almost equivalent performance of the MS and AE methods. On the Fridge3 data set the methods that predict one step in the future (LSTM and GRU) work better. This analysis highlights that the performance of the models is affected by the considered appliance. Indeed, on Fridge1 the performance varies more, while on Fridge3 it is more consistent. Moreover, ARIMA and Basic Statistics show low performance regardless of the complexity of the dataset, which suggests their inadequacy for this kind of problem.
The results are in line with those of Kharitonov et al. (2022), who compared the performance of alternative techniques for detecting failures from manufacturing machine logs and observed that k-nearest neighbors (KNN) and LOF performed better, while autoencoders could not be considered for deployment in a real-case scenario. Similarly, Elmrabit et al. (2020) found that classical machine learning techniques outperformed deep learning for the AD task on cybersecurity datasets.
Q2: training subsequence duration
Figure 8 shows the variation of the \(\hbox {F}_{1}\) score for the 10 methods that could be trained with all three subsequences (2 weeks, 3 weeks, one month). The results show that a 2-week training period is sufficient for most of the methods. Only the multi-step (MS) methods attain a very slight average performance improvement if the training period is extended to 1 month. The results on the time series of Fridge1 and Fridge2 show a similar trend. All the detailed results can be found in the project repository mentioned above.
Q3: window length
Figure 9 shows the variation of the \(\hbox {F}_{1}\) score with the sliding window size (half a period, one period, two and three periods), limited to the 9 methods that could be trained completely. The results show different patterns for neural and non-neural methods.
With ISOF and OC SVM the \(\hbox {F}_{1}\) score decreases as the window size increases. Beyond half a period the methods progressively lose effectiveness: the variance increases and the \(\hbox {F}_{1}\) score decreases. This is likely due to a worse trade-off between the noise and the contextual information enclosed in the window.
The AE methods deliver the best \(\hbox {F}_{1}\) score when the window size equals twice the period. MS methods display a similar trend, with LSTM-MS showing a slight monotonic increase up to three periods. The one-step neural methods GRU and LSTM are rather insensitive to the window size, but their performance is at the lower end of the range. The LOF approach exhibits the same trend as the AE and MS neural methods.
For the neural methods, the results at the (2 \(\times\) period) point suggest that such a duration gives sufficient context to encode the periodic features of the time series well, and that going beyond that size is either counterproductive or yields a modest benefit. In the AE methods, the negative effect of extending the window may also be due to the dimensionality reduction to a latent space performed by the neural architecture, which may become less effective when the dimension of the original space gets too large.
The results on the time series of Fridge2 and Fridge3 show a similar trend. All the detailed results can be found in the project repository.
Q4: generalization
The generalization experiments assess the top-5 methods (ISOF, OC SVM, LOF, LSTM-AE and GRU-AE) on a dataset different from the one on which they were originally trained. Each method is tested in two variants: the original version trained on the first appliance, and a version in which the threshold value is fine-tuned on the validation data series of the target appliance.
Figure 10 contrasts, for each method, the \(\hbox {F}_{1}\) score of the baseline version (i.e., trained and tested on the same dataset), the \(\hbox {F}_{1}\) score achieved by fine-tuning the threshold on the validation set of the target appliance, and the \(\hbox {F}_{1}\) score obtained without any fine-tuning. The top-performing method (ISOF) is also the one that generalizes best, even without fine-tuning the threshold. In general, ISOF and OC SVM depend less on the training set than the neural models, whose performance decays noticeably when they are tested on a different appliance. The degradation is more pronounced when the test appliance is Fridge3, whose anomalies are almost all spikes, an anomaly type absent in Fridge1 and Fridge2.
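The threshold fine-tuning step can be sketched as a one-dimensional search. Score the validation series of the target appliance, then keep the candidate threshold that maximizes the point-wise \(\hbox {F}_{1}\) score. This is an assumption about how such a search may be implemented: an exhaustive search over the observed scores, flagging scores greater than or equal to the threshold. The function names, the source threshold and the toy scores are hypothetical.

```python
from typing import Sequence, Tuple


def f1_score(labels: Sequence[int], flags: Sequence[bool]) -> float:
    """Point-wise F1 of binary anomaly flags against ground-truth labels (1 = anomaly)."""
    tp = sum(1 for y, f in zip(labels, flags) if y and f)
    fp = sum(1 for y, f in zip(labels, flags) if not y and f)
    fn = sum(1 for y, f in zip(labels, flags) if y and not f)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, F1) maximizing F1 when flagging scores >= threshold."""
    best_threshold, best_f1 = float("inf"), -1.0
    for threshold in sorted(set(scores)):
        f1 = f1_score(labels, [s >= threshold for s in scores])
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


if __name__ == "__main__":
    # Toy validation scores on the target appliance from a model trained on another one.
    val_scores = [0.10, 0.20, 0.15, 0.90, 0.80, 0.30]
    val_labels = [0, 0, 0, 1, 1, 0]
    source_threshold = 0.25   # hypothetical threshold carried over from the source appliance
    print(f1_score(val_labels, [s >= source_threshold for s in val_scores]))  # no fine-tuning
    print(tune_threshold(val_scores, val_labels))                              # (0.8, 1.0)
```

Only the threshold is adapted: the model itself is left untouched, which matches the "fine-tuned threshold" variant described above.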
Qualitative analysis of results
To give a qualitative appreciation of the different behavior of the best models, Fig. 11 directly compares the anomalies detected by ISOF, OC SVM and LSTM-AE with the GT anomalies. The detected anomalies are highlighted with a method-specific color, and the GT anomalies are circled in red.
The plots in the left column show a situation in which all three methods detect more or less the same anomalous data points, which match the GT annotations well. The plots in the right column show how the methods react to a change in the duration of the ON/OFF cycle (an acceleration in the displayed example, which may be caused by a different load of the fridge or by a change in the thermostat set point). Only the ISOF method is robust to such an occurrence. The other methods instead flag many normal points as anomalous, because they consider the entire cycle variation an anomaly. Given that the appliance time series are quasi-periodic, as shown by the power spectrum in Fig. 2, robustness to small variations of the ON/OFF cycle is a very relevant benefit of the ISOF method.
Conclusions
In this paper we have discussed the results of an experimental comparison of 12 AD methods on three quasi-periodic data series collected with smart plugs connected to three distinct fridges. The comparison first assessed the prediction performance, measured with the \(\hbox {F}_{1}\) score, which confirmed that the non-neural machine learning methods ISOF, OC SVM and LOF attain the best results, followed by the autoencoder-based and multistep neural methods (GRU-AE, GRU-MS, LSTM-AE, LSTM-MS). In particular, the ISOF method trained with a one-month subsequence and a window size of 2 \(\times\) period attained a very good result on a fridge data series containing mostly spike anomalies (Precision = 0.947, Recall = 0.965, \(\hbox {F}_{1}\) score = 0.956).
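As a quick consistency check, the reported \(\hbox {F}_{1}\) score is the harmonic mean of the reported precision \(P\) and recall \(R\):

\[
\hbox {F}_{1} = \frac{2PR}{P + R} = \frac{2 \times 0.947 \times 0.965}{0.947 + 0.965} = \frac{1.82771}{1.912} \approx 0.956
\]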
Next, we evaluated the impact of the length of the subsequence used for training the algorithms. The results show that a 2-week training period is sufficient for most of the methods, and that the AR and ARIMA algorithms did not complete training within a reasonable time on longer time series.
The impact of the sliding window size was also investigated. The non-neural ISOF and OC SVM algorithms require a shorter window (half of the period is enough), whereas the neural models deliver the best performance with a larger window (two periods in most cases).
Finally, the generalization ability of the top-performing methods was also assessed. The best method (ISOF) is also the one that preserves its performance intact when applied to a different appliance, even without fine-tuning the threshold on the target appliance.
Future work will further pursue the investigation of AD algorithms on quasi-periodic data series, focusing also on their runtime performance on hardware with memory and processing constraints. The objective is to design a timely, accurate and efficient system that dispatches mobile phone alerts to real-world users about the potential malfunctioning of home appliances.
Availability of data and materials
All the material related to this article is publicly available in the following repository: https://github.com/herrerasergio/ADperiodicTS. The datasets used for the study are private and permission for publication has not been granted; they will be included in the repository if permission is granted in the future.
Abbreviations
AD: Anomaly detection
AE: Autoencoders
AR: Autoregressive
ARIMA: Autoregressive integrated moving average
ARMA: Autoregressive moving average
BiLSTM: Bidirectional long short-term memory
CNN: Convolutional neural network
ECG: Electrocardiography
FFT: Fast Fourier transform
FN: False negative
FP: False positive
GRU: Gated recurrent unit
GRU-AE: Gated recurrent unit autoencoder
GRU-MS: Gated recurrent unit multistep
GT: Ground truth
ISOF: Isolation forest
KNN: K-nearest neighbors
LOF: Local outlier factor
LSTM: Long short-term memory
LSTM-AE: Long short-term memory autoencoder
LSTM-MS: Long short-term memory multistep
MAE: Mean absolute error
MS: Multistep
MSE: Mean squared error
NILM: Non-intrusive load monitoring
NN: Neural networks
OC SVM: One-class support vector machine
RNNs: Recurrent neural networks
SE: Squared error
SVM: Support vector machine
TN: True negative
TP: True positive
VAE: Variational autoencoders
References
A platform for Open Data of the European power system. https://open-power-system-data.org/. Accessed 3 June 2022
Ahmad S, Lavin A, Purdy S, Agha Z (2017) Unsupervised real-time anomaly detection for streaming data. Neurocomputing 262:134–147
Amasyali K, El-Gohary NM (2018) A review of data-driven building energy consumption prediction studies. Renew Sustain Energy Rev 81:1192–1205
An interdisciplinary approach on efficient virtual microgrid to virtual microgrid energy balancing incorporating data preprocessing techniques (2021). Computing 1–42
Azizi E, Beheshti MTH, Bolouki S (2021) Appliance-level anomaly detection in non-intrusive load monitoring via power consumption-based feature analysis. IEEE Trans Consumer Electron 67(4):363–371. https://doi.org/10.1109/TCE.2021.3129356
Blázquez-García A, Conde A, Mori U, Lozano JA (2021) A review on outlier/anomaly detection in time series data. ACM Comput Surveys (CSUR) 54(3):1–33
Box GE, Tiao GC (1977) A canonical analysis of multiple time series. Biometrika 64(2):355–365
Braei M, Wagner S (2020) Anomaly detection in univariate time-series: a survey on the state-of-the-art. arXiv preprint arXiv:2004.00433
Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD ’00). Association for Computing Machinery, New York, NY, USA; p. 93–104. https://doi.org/10.1145/342009.335388
Canizo M, Triguero I, Conde A, Onieva E (2019) Multi-head CNN-RNN for multi-time series anomaly detection: an industrial case study. Neurocomputing 363:246–260
Capozzoli A, Piscitelli MS, Brandi S, Grassi D, Chicco G (2018) Automated load pattern learning and anomaly detection for enhancing energy management in smart buildings. Energy 157:336–352
Chauhan S, Vig L (2015) Anomaly detection in ECG time signals via deep long short-term memory networks. In: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE; p. 1–7
Cheng JC, Chen W, Chen K, Wang Q (2020) Data-driven predictive maintenance planning framework for MEP components based on BIM and IoT using machine learning algorithms. Autom Constr 112:103087
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, et al (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078
Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555
Cook AA, Mısırlı G, Fan Z (2019) Anomaly detection for IoT time-series data: a survey. IEEE Internet Things J 7(7):6481–6494
Ding Z, Fei M (2013) An anomaly detection approach based on isolation forest algorithm for streaming data using sliding window. IFAC Proc 46(20):12–17
Elmrabit N, Zhou F, Li F, Zhou H (2020) Evaluation of Machine Learning Algorithms for Anomaly Detection. In: 2020 International Conference on Cyber Security and Protection of Digital Services (Cyber Security); p. 1–8
Fan C, Xiao F, Zhao Y, Wang J (2018) Analytical investigation of autoencoder-based methods for unsupervised anomaly detection in building energy data. Appl Energy 211:1123–1135
Firth S, Kane T, Dimitriou V, Hassan T, Fouchal F, Coleman M, et al (2017) REFIT Smart Home dataset. https://repository.lboro.ac.uk/articles/dataset/REFIT_Smart_Home_dataset/2070091
Himeur Y, Ghanem K, Alsalemi A, Bensaali F, Amira A (2021) Artificial intelligence based anomaly detection of energy consumption in buildings: a review, current trends and new perspectives. Appl Energy 287:116601
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Hyndman RJ, Athanasopoulos G (2021) Forecasting: principles and practice, 3rd edition. OTexts
Kadri F, Harrou F, Chaabane S, Sun Y, Tahon C (2016) Seasonal ARMA-based SPC charts for anomaly detection: application to emergency department systems. Neurocomputing 173:2102–2114
Kao JB, Jiang JR (2019) Anomaly detection for univariate time series with statistics and deep learning. In: 2019 IEEE Eurasia Conference on IOT, Communication and Engineering (ECICE). IEEE; p. 404–407
Keras (2022) Keras documentation: Timeseries anomaly detection using an autoencoder. https://keras.io/examples/timeseries/timeseries_anomaly_detection/. Accessed 3 June 2022
Kharitonov A, Nahhas A, Pohl M, Turowski K (2022) Comparative analysis of machine learning models for anomaly detection in manufacturing. Proc Comput Sci 200:1288–1297
Koukaras P, Bezas N, Gkaidatzis P, Ioannidis D, Tzovaras D, Tjortjis C (2021) Introducing a novel approach in one-step ahead energy load forecasting. Sustain Comput Inf Syst 32:100616
Kozitsin V, Katser I, Lakontsev D (2021) Online forecasting and anomaly detection based on the ARIMA model. Appl Sci 11(7):3194
Li D, Chen D, Goh J, Ng Sk (2018) Anomaly detection with generative adversarial networks for multivariate time series. arXiv preprint arXiv:1809.04758
Li L, Yan J, Wang H, Jin Y (2020) Anomaly detection of time series with smoothness-inducing sequential variational autoencoder. IEEE Trans Neural Netw Learning Syst 32(3):1177–1191
Liu FT, Ting KM, Zhou ZH (2008) Isolation Forest. In: 2008 Eighth IEEE International Conference on Data Mining; p. 413–422
Liu F, Zhou X, Cao J, Wang Z, Wang T, Wang H, et al (2020) Anomaly detection in quasi-periodic time series based on automatic data segmentation and attentional LSTM-CNN. IEEE Trans Knowl Data Eng
Makonin S, Ellert B, Bajić IV, Popowich F (2016) Electricity, water, and natural gas consumption of a residential house in Canada from 2012 to 2014. Sci Data 3(1):1–12
Malhotra P, Vig L, Shroff G, Agarwal P, et al (2015) Long short term memory networks for anomaly detection in time series. In: Proceedings. vol. 89; p. 89–94
Malhotra P, Ramakrishnan A, Anand G, Vig L, Agarwal P, Shroff G (2016) LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148
Masum S, Liu Y, Chiverton J (2018) Multi-step time series forecasting of electric load using machine learning models. In: International conference on artificial intelligence and soft computing. Springer; p. 148–159
Mishra M, Nayak J, Naik B, Abraham A (2020) Deep learning in electrical utility industry: a comprehensive review of a decade of research. Eng Appl Artif Intell 96:104000
Munir M, Siddiqui SA, Dengel A, Ahmed S (2018) DeepAnT: a deep learning approach for unsupervised anomaly detection in time series. IEEE Access 7:1991–2005
Muniz Do Nascimento W, Gomes-Jr L (2022) Enabling low-cost automatic water leakage detection: a semi-supervised, autoML-based approach. Urban Water J 1–11
Oehmcke S, Zielinski O, Kramer O (2015) Event detection in marine time series data. In: Hölldobler S, Peñaloza R, Rudolph S (eds) KI 2015: Advances in Artificial Intelligence. Springer International Publishing, Cham, pp 279–286
Pena D, Box GE (1987) Identifying a simplifying structure in time series. J Am Stat Assoc 82(399):836–843
Pena D, Poncela P (2006) Dimension reduction in multivariate time series. In: Advances in distribution theory, order statistics, and inference. Springer; p. 433–458
Pereira J, Silveira M (2018) Unsupervised anomaly detection in energy time series data using variational recurrent autoencoders with attention. In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE; p. 1275–1282
Pincombe B (2005) Anomaly detection in time series of graphs using ARMA processes. Asor Bull 24(4):2
Rashid H, Batra N, Singh P (2018) Rimor: Towards identifying anomalous appliances in buildings. In: Proceedings of the 5th Conference on Systems for Built Environments; p. 33–42
Sanz B, Santos I, Ugarte-Pedrero X, Laorden C, Nieves J, Bringas PG (2014) Anomaly detection using string analysis for android malware detection. In: International Joint Conference SOCO’13-CISIS’13-ICEUTE’13. Springer; p. 469–478
Schlegl T, Seeböck P, Waldstein SM, Langs G, Schmidt-Erfurth U (2019) f-AnoGAN: fast unsupervised anomaly detection with generative adversarial networks. Med Image Anal 54:30–44
Schölkopf B, Williamson RC, Smola A, Shawe-Taylor J, Platt J (1999) Support vector method for novelty detection. In: Solla S, Leen T, Müller K (eds) Advances in Neural Information Processing Systems, vol 12. MIT Press. https://proceedings.neurips.cc/paper/1999/file/8725fb777f25776ffa9076e44fcfd776-Paper.pdf
Seyoum S, Alfonso L, Van Andel SJ, Koole W, Groenewegen A, Van De Giesen N (2017) A Shazam-like household water leakage detection method. Proc Eng 186:452–459
Shah AS, Nasir H, Fayaz M, Lajis A, Shah A (2019) A review on energy consumption optimization techniques in IoT based smart building environments. Information 10(3):108
Shaikh PH, Nor NBM, Nallagownden P, Elamvazuthi I, Ibrahim T (2014) A review on optimized control systems for building energy and comfort management of smart sustainable buildings. Renew Sustain Energy Rev 34:409–429
Shakibaei P (2020) Data-driven anomaly detection from residential smart meter data
Su Y, Zhao Y, Niu C, Liu R, Sun W, Pei D (2019) Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining; p. 2828–2837
Yaacob AH, Tan IKT, Chien SF, Tan HK (2010) ARIMA Based Network Anomaly Detection. In: 2010 Second International Conference on Communication Software and Networks; p. 205–209
Yin C, Zhang S, Wang J, Xiong NN (2020) Anomaly detection based on convolutional recurrent autoencoder for IoT time series. IEEE Trans Syst Man Cybern Syst 52(1):112–122
Zangrando N, Herrera S, Koukaras P, Dimara A, Fraternali P, Krinidis S, et al (2022) Anomaly detection in small-scale industrial and household appliances. In: Maglogiannis I, Iliadis L, Macintyre J, Cortez P (eds) Artificial Intelligence Applications and Innovations. AIAI 2022 IFIP WG 12.5 International Workshops: MHDW 2022, 5G-PINE 2022, AIBMG 2022, ML@HC 2022, and AIBEI 2022, Hersonissos, Crete, Greece, June 17–20, 2022, Proceedings. IFIP Advances in Information and Communication Technology, vol 652. Springer; p. 229–240. https://doi.org/10.1007/978-3-031-08341-9_19
Zhang R, Zhang S, Lan Y, Jiang J (2008) Network anomaly detection using one class support vector machine. In: Proceedings of the International MultiConference of Engineers and Computer Scientists. vol. 1. Citeseer
Zhang C, Patras P, Haddadi H (2019) Deep learning in mobile and wireless networking: a survey. IEEE Commun Surveys Tutorials 21(3):2224–2287
Zhang L, Shen X, Zhang F, Ren M, Ge B, Li B (2019) Anomaly detection for power grid based on time series model. In: 2019 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC). IEEE; p. 188–192
Zhang S, Chen X, Chen J, Jiang Q, Huang H (2020) Anomaly detection of periodic multivariate time series under high acquisition frequency scene in IoT. In: 2020 International Conference on Data Mining Workshops (ICDMW). IEEE; p. 543–552
Acknowledgements
This work has been supported by the European Union’s Horizon 2020 project PRECEPT, under Grant agreement No. 958284.
About this supplement
This article has been published as part of Energy Informatics Volume 5 Supplement 4, 2022: Proceedings of the Energy Informatics.Academy Conference 2022 (EI.A 2022). The full contents of the supplement are available online at https://energyinformatics.springeropen.com/articles/supplements/volume-5-supplement-4.
Funding
This paper is part of the project PRECEPT (Grant agreement No. 958284), funded by the European Union’s Horizon 2020 Framework Programme.
Author information
Contributions
NZ analyzed the dataset and prepared the training/testing split of the data set; led the implementation of the algorithms and the evaluation of the models. PF designed the research and the experimental procedure, analyzed the results and made a major contribution to the writing of the manuscript. MP implemented the regressive algorithms and performed the training and evaluation of the algorithms. NOPV implemented the procedure for identifying the period in the data sets, implemented the statistical algorithm, and performed its training and evaluation. SLHG contributed to the data analysis and the design of the experiments, collaborated on the training of the algorithms and prepared the first draft of the document. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zangrando, N., Fraternali, P., Petri, M. et al. Anomaly detection in quasi-periodic energy consumption data series: a comparison of algorithms. Energy Inform 5 (Suppl 4), 62 (2022). https://doi.org/10.1186/s42162-022-00230-7
DOI: https://doi.org/10.1186/s42162-022-00230-7
Keywords
 Anomaly detection
 Time series
 Machine learning