Anomaly detection in quasi‑periodic energy consumption data series: a comparison of algorithms

The diffusion of domotics solutions and of smart appliances and meters enables the monitoring of energy consumption at a very fine level and the development of forecasting and diagnostic applications. Anomaly detection (AD) in energy consumption data streams helps identify data points or intervals in which the behavior of an appliance deviates from normality and may prevent energy losses and breakdowns. Many statistical and learning approaches have been applied to the task, but their performance still needs to be compared on data sets with different characteristics. This paper focuses on anomaly detection in quasi-periodic energy consumption data series and contrasts 12 statistical and machine learning algorithms tested in 144 different configurations on 3 data sets containing the power consumption signals of fridges. The assessment also evaluates the impact of the length of the series used for training and of the size of the sliding window employed to detect the anomalies. The generalization ability of the top five methods is also evaluated by applying them to an appliance different from that used for training. The results show that classical machine learning methods (Isolation Forest, One-Class SVM and Local Outlier Factor) outperform the best neural methods (GRU/LSTM autoencoder and multistep methods) and generalize better when applied to detect the anomalies of an appliance different from the one used for training.

AD in temporal data series is the task of identifying data points or intervals in which the time series deviates from normality. AD finds application in different fields such as healthcare, where it applies to the analysis of clinical images (Schlegl et al. 2019) and of ECG data (Chauhan and Vig 2015), cybersecurity, where it is used for malware identification (Sanz et al. 2014), manufacturing, where it helps monitor machines and prevent breakdowns (Kharitonov et al. 2022), and the utility industry, where it supports the early identification of critical events such as appliance malfunctioning (Mishra et al. 2020) and water leakage (Seyoum et al. 2017; Muniz and Gomes 2022). In the energy field, AD may be combined with energy load forecasting to improve accuracy (Koukaras et al. 2021), or integrated as a component for detecting non-nominal energy fluctuations to enhance decision making in energy transfer between microgrids (An interdisciplinary 2021). Energy consumption time series can be collected from home appliances and building systems with complex periodic or quasi-periodic behavior, such as coolers, water heaters and fridges, which present specific challenges when performing anomaly detection. Machine learning and neural models trained on normal data may overfit with respect to the length of the period. This phenomenon makes the model sensitive even to small variations of the cycle duration, which can happen during normal functioning (Liu et al. 2020). As a consequence, the detector may emit a high number of false positive alerts when such small variations occur, and its performance may degrade significantly when it is used to detect anomalies of an appliance of the same type but with a different cycle duration.
The literature on AD in temporal data series still lacks a systematic comparison of algorithms belonging to different families on quasi-periodic data sets. Therefore the development of an AD application in such a scenario still has to confront design decisions such as the choice of the most effective algorithm, the minimum duration of the time series to use for training, the minimum size of the signal prediction/reconstruction window needed to identify the anomalous behavior, and the portability of the chosen algorithm from one appliance to another one with "similar" behavior. This paper tries to fill the gap in the literature about AD in quasi-periodic time series by systematically comparing the performances of 12 algorithms representative of different families of approaches. The experiments were performed on 3 distinct data sets of fridge power consumption.
The aim of the experiments is to address the following questions:
• Q1 How do the selected algorithms compare in the AD task on quasi-periodic time series under multiple performance metrics?
• Q2 For the algorithms that require training, what is the relationship between the length of the training series and the performances?
• Q3 For the algorithms that exploit a window-based approach for the prediction, what is the relationship between the length of the window and the performances?
• Q4 What is the generalization capability of the methods? How does performance degrade when a method trained on an appliance is tested on the time series produced by a distinct appliance of the same type?
The essential findings can be summarized as follows:
• The classical ML algorithms Isolation Forest (ISOF), One-Class SVM (OC-SVM), and Local Outlier Factor (LOF) outperform the best neural models (GRU/LSTM autoencoder and multisteps methods).
• Two weeks of training data are sufficient for most methods, with the multisteps approaches attaining a modest improvement if one month of data is used.
• The length of the prediction/reconstruction window has a different impact on neural and non-neural methods.
• ISOF and OC SVM are less dependent on the training set than the neural models, which show a significant performance decay when tested on an appliance different from the one used for training.
• The top result of all the experiments is attained by ISOF on the Fridge3 time series, trained with a sub-sequence of length equal to one month and with a window size of 2 × period: Precision = 0.947, Recall = 0.965, F1 score = 0.956.
The above-mentioned findings can help to better understand the requirements and performances of AD algorithms on quasi-periodic data series so as to design more effective household energy consumption applications, e.g., by equipping the mobile apps that are nowadays bundled with smart plug products with functionalities for consumption monitoring, energy saving recommendations and alerting of potential appliance malfunctioning. The rest of the article is organised as follows: Section "Related work" overviews the state of the art in anomaly detection. Section "Experimental settings" describes the experimental configuration, including the description of the dataset and of the evaluated algorithms. Section "Experimental results" discusses the results of the performed experiments. Section "Qualitative analysis of results" discusses qualitatively a few examples of the predictions made by the reviewed methods. Finally, Section "Conclusions" draws the conclusions and illustrates our future work.

Related work
Anomaly detection in temporal data series exploits data collected with a broad spectrum of sensors in diverse fields, such as weather monitoring, natural resources distribution and consumption (e.g., water and natural gas), network traffic surveillance, and electrical load measurement (Firth et al. 2017; A platform for Open 2022; Makonin et al. 2016; Shakibaei 2020). As an example, the work in Makonin et al. (2016) discusses the use of residential home smart meters for data collection and highlights how such series often exhibit anomalous behaviors. Raw data must be pre-processed to get ready for further analysis. Besides the usual operations of data cleaning and validation, a prominent task is data annotation, which associates data points or intervals with the specifications of significant events, such as change points and anomalies. For example, Rimor (Rashid et al. 2018) is a time-series data annotator supporting the labelling of data with anomaly tags, which can be used as ground truth for training and evaluating predictive models.
AD can be conducted in both univariate (Braei and Wagner 2020) and multivariate time series (Su et al. 2019; Li et al. 2018; Blázquez-García et al. 2021). In the case of multivariate time series, exploiting variable correlation may be necessary for reducing the number of parameters needed to model the problem (Pena and Poncela 2006). Examples of multivariate time series dimensionality reduction techniques are principal components analysis (Cook et al. 2019; Pena and Poncela 2006), canonical correlation analysis (Box and Tiao 1977), and factor modelling (Pena and Box 1987).
AD approaches can be classified in two main families (Cook et al. 2019): non-regressive and regressive. Non-regressive approaches rely on the fundamental statistical quantities computed on the time series (e.g., mean and variance) and combine them with fixed thresholds, but their effectiveness is limited (Cook et al. 2019). Kao and Jiang (2019) proposed a statistical AD framework using the Dickey-Fuller test, the Fourier transform, and the Pearson correlation coefficient to analyze periodic time series. Performance evaluation on five NAB datasets (Ahmad et al. 2017) showed that the proposed approach performs well on the NAB Jumps periodic data set and outperforms the models it was compared to. Other types of non-regressive techniques are ML methods for time series analysis. The Local Outlier Factor (LOF) method has been employed to identify anomalous events in the marine domain, attaining 83.4% precision. The Isolation Forest (ISOF) algorithm has been applied to streaming data in Ding and Fei (2013), achieving an AUC score of 0.98 on one of the test datasets. In Zhang et al. (2008) a One-Class Support Vector Machine (OC-SVM) has been implemented for the identification of network anomalies; on the test set, the identified outliers perfectly match the result of human visual detection.
Regressive approaches compute a model of the time series generation process. In the case of AD, an autoregression model is used to forecast the variable of interest from its past values. Autoregressive models include methods based on Autoregressive Moving Average (ARMA) (Pincombe 2005; Kadri et al. 2016; Kozitsin et al. 2021) and on Neural Networks, such as Autoencoders (AE) (Yin et al. 2020; Li et al. 2020) and Recurrent Neural Networks (RNNs) (Canizo et al. 2019; Malhotra et al. 2015). Forecasting-based AD approaches are divided into single-step and multi-step methods depending on the number of predicted points. The former strategy is preferable for short-term forecasting (i.e., minutes, hours, and days) and the latter for long-term data series analysis.
In the electric load analysis domain, the work in Masum et al. (2018) studies the problem of time series forecasting for electric load measurements and shows that Long Short-Term Memory (LSTM), a deep learning model, outperforms AutoRegressive Integrated Moving Average (ARIMA), a statistical model, on three data sets obtained from the Open Power System Data on electric load in Great Britain, Poland, and Italy (A platform for Open 2022). Other work shows the importance of a Fast Fourier Transform (FFT) based periodicity pre-processor to extract the period in smart grid time series. Pereira et al. (2018) proposes the use of Variational Autoencoders (VAE) for unsupervised anomaly detection in solar energy generation time series; the results show that the trained model is able to detect anomalous patterns by using the probabilistic reconstruction metrics as anomaly scores. Himeur et al. (2021) surveys several Artificial Intelligence methods for anomaly detection in buildings' energy consumption, identifying several factors (e.g., occupancy and outdoor temperatures) that influence time series behavior.
In the specific field of periodic data series analysis, Zhang et al. (2020) employs a periodicity pre-processor to find the time series period and segment the data into windows. Then it exploits a combination of an RNN and a CNN to detect anomalies, achieving an F1 score near 0.9 on all the test datasets. Another approach also uses a periodicity pre-processor, based on the Fourier transform, and maps multiple periods onto a single cycle to identify deviations across subsequent periods. Pereira et al. (2018) uses Bi-LSTM to detect anomalies and proposes the use of attention maps to explain the results. Capozzoli et al. (2018) encodes periodic time series using letters as a data size reduction technique. The classification process led to robust results with a global accuracy that ranged between 80% and 90%. These works show the advantages of pre-processing to exploit the data periodicity and of dimensionality reduction techniques and discuss results interpretability.
The proliferation of time series analysis methods and of AD specific approaches has spawned a stream of research focused on comparing the performance of alternative techniques. For example, the work in Masum et al. (2018) compares the multi-step forecasting performance of ARIMA and LSTM-based RNN models and shows that the LSTM model outperforms the ARIMA model for multi-step electric load forecasting. Our preliminary work (Zangrando et al. 2022) compares CNN-powered and RNN-powered AD methods with One-Class Support Vector Machines and Isolation Forest techniques on one quasi-periodic data set, using standard metrics (precision, recall, F1 score). In this paper we deepen the analysis by assessing performances under multiple metrics, investigating the impact of the training sub-sequence duration and of the analysis window size, and contrasting the generalization capacity of the reviewed approaches.

Data set
The experiments exploit a fridge energy consumption data set collected using smart plugs. The energy consumption data have been collected in Greek residential households using the BlitzWolf BW-SHP2 smart plugs, which allow exporting the time series through an API. The data collection system, the assessed algorithms and the evaluation framework were all implemented in Python. The time series in the data set record the active power consumption of three fridges for over 2 months, with 1 minute data resolution. The time series have been divided into sub-sequences for training, validation, and testing of the methods. Table 1 summarizes the data split.
When working in normal conditions, the energy consumption curve of a fridge displays a cyclic behavior alternating between a high consumption state (ON) and a low consumption state (OFF). Figure 1 shows an example of the consumption data of one appliance.

Data set analysis
Periodicity analysis. Normal fridge consumption shows a cyclic behavior. Periodicity analysis aims at detecting the mean period corresponding to an ON-OFF cycle and possibly to other longer patterns (e.g., seasonal effects). It is a preliminary step before the application of AD and requires a non-anomalous sub-series, which can be created by manually removing anomalies from the training sub-sequence. The Fast Fourier Transform (FFT) is applied on the anomaly-free sub-sequence to map the data into the frequency domain, and the periodicity is defined as the inverse of the frequency corresponding to the highest power in the FFT, as proposed in Kao and Jiang (2019). Table 2 summarizes the periodicity, expressed in minutes, of the three data sets. The periods range from 45 minutes to 1 hour 40 minutes. No seasonal effect is found because the train set covers only one month. Figure 2 shows the power spectrum computed for one of the three appliances.
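The periodicity estimate described above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' pre-processor: the function name `dominant_period`, the mean removal step and the synthetic square-wave signal (80-minute cycle, 30 minutes ON) are assumptions introduced for the example.

```python
import numpy as np

def dominant_period(power, sample_minutes=1.0):
    """Period (in minutes) of the strongest non-zero frequency in the FFT."""
    x = np.asarray(power, dtype=float)
    x = x - x.mean()  # remove the mean so the zero-frequency bin does not dominate
    spectrum = np.abs(np.fft.rfft(x)) ** 2             # power per frequency bin
    freqs = np.fft.rfftfreq(x.size, d=sample_minutes)  # cycles per minute
    k = 1 + int(np.argmax(spectrum[1:]))               # skip bin 0 (frequency 0)
    return 1.0 / freqs[k]

# Hypothetical anomaly-free signal: 80-minute ON-OFF cycle, ON for 30 minutes,
# sampled once per minute for 100 cycles.
t = np.arange(80 * 100)
signal = np.where(t % 80 < 30, 90.0, 2.0)
period = dominant_period(signal)
print(f"estimated period: {period:.1f} minutes")
```

On real data the cycle length drifts, so the spectral peak is broader and the estimate should be read as a mean period rather than an exact value.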

Ground truth annotation
For training and testing purposes, the energy consumption time series have been annotated with ground truth (GT) metadata to specify the points that deviate from normality. Three independent annotators have labeled the data points, with a Boolean tag (normal/anomalous) and with a categorical label denoting the type of the anomaly, with the interface shown in Figure 3.
Anomaly classes and their distribution. The anomalies have been distinguished in the following categories: Continuous OFF state, when the appliance is in the low consumption state for a long time; Continuous ON state, when the appliance is in the high consumption state for an abnormally long time; Spike, when the appliance has an abnormal consumption peak possibly preceded by a ramp and followed by a decay period; Spike + Continuous, when the appliance has a consumption peak followed by a prolonged ON state; Other, when the anomaly does not follow a well-defined pattern. Figure 4 shows the distribution of the anomaly categories in the data set of the three fridges. The plots highlight the different anomalous behavior of the appliances. Fridge2 is mainly subject to continuous ON cycles. Fridge1 shows a similar pattern, but the prolonged ON states are preceded by an abrupt increase in the consumption. Fridge3 is subject to a more detectable anomalous behavior because almost 95% of the anomalies are of spike type, which are easier to detect also visually.
GT anomaly duration distribution. Figure 5 shows the GT anomaly duration distribution on the data series of the three fridges. The distributions of Fridge1 and Fridge2 are centered close to the time series period, which suggests the presence of anomalies shorter than an ON-OFF cycle. The distribution of Fridge3 is centered around values higher than the mean ON-OFF cycle duration, which is typical of the transient behavior caused by high consumption spikes.

Fig. 2 The power spectrum computed by the periodicity pre-processor (right) on the fridge energy consumption time series (left). The period detected for an ON-OFF cycle is about 80 minutes for the analyzed data set

Fig. 3 The interface of the GT anomaly annotator at work on the fridge time series. The user can specify the anomalies and add meta-data to them. The user has annotated the currently selected GT anomaly, shown in red, with the Continuous ON state label

Algorithm list and definitions
The algorithm selection considered the most common methods used in the reviewed studies and their nature (statistical, regressive, neural) so as to achieve a balanced representation of the different approaches.
1 Basic Statistics is an extension of the method presented in Kao and Jiang (2019) for periodic series. The first step analyzes the anomaly-free training data series to determine the periodicity. Then, the anomaly-free train set is divided into non-overlapping windows of the same size as the period and the Pearson product-moment correlation coefficient is computed on all the pairs of contiguous windows to check whether the time series is periodic within the two windows. If it is periodic, the ratio

$$R_{std} = \frac{|Std_{current} - Std_{previous}|}{Std_{previous}}$$

is computed. An anomaly occurs if $R_{std}$ exceeds a threshold $\tau$, defined as follows. $R_{std}$ is calculated for each window pair in the train set and the maximum value ($R_{max}$) allowed in a non-anomalous time series is found. Then the threshold $\tau$ is determined on the validation set by performing a grid search. Given a set of possible thresholds $\tau_\alpha = R_{max}(1 + \alpha)$, with $\alpha$ ranging from 0 to 10 with step 0.1, the threshold $\tau$ is defined as the value corresponding to the best F1 score obtained by applying the anomaly definition rule on the validation set. Finally, the same rule is applied to the test set using the computed threshold value.

2 AutoRegressive (AR) (Hyndman and Athanasopoulos 2021) is an autoregression model exploiting past data to predict current data. The prediction model of order $p$ is defined as:

$$y_t = c + \sum_{i=1}^{p} \varphi_i\, y_{t-i} + \varepsilon_t \qquad (1)$$

3 AutoRegressive Integrated Moving Average (ARIMA) adds differencing and moving-average terms to the AR model. The prediction model is defined as:

$$y'_t = c + \sum_{i=1}^{p} \varphi_i\, y'_{t-i} + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j} + \varepsilon_t$$

where $y'_t$ is the differenced time series, $\varepsilon_t$ is a white noise term, $c$, $\varphi_i$, $\theta_j$ are the model parameters, and $p$ and $q$ are the numbers of autoregressive and moving-average terms. Anomalous points are defined as in AR.

4 Local Outlier Factor (LOF) (Breunig et al. 2000) is a clustering algorithm based on the identification of the nearest neighbors and of local outliers.

5 One-Class SVM (OC SVM) (Schölkopf et al. 1999) is the use of support vector machine (SVM) for novelty detection.

6 Isolation Forest (ISOF) (Liu et al. 2008) is an ensemble method that creates different binary trees for isolating anomalous data points.
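The Basic Statistics rule (method 1 in the list above) can be illustrated with the following sketch. It is only an approximation under stated assumptions: the cut-off `min_corr` used to decide that two contiguous windows are periodic is hypothetical, since the text does not report the value, and windows with zero standard deviation are simply skipped because the correlation and the ratio are undefined for them.

```python
import numpy as np

def std_ratios(series, period, min_corr=0.5):
    """R_std for every pair of contiguous windows that look periodic.

    min_corr is a hypothetical Pearson cut-off (not given in the text).
    Returns a dict mapping the index of the current window to its R_std.
    """
    x = np.asarray(series, dtype=float)
    n_win = x.size // period
    windows = x[: n_win * period].reshape(n_win, period)
    ratios = {}
    for k in range(1, n_win):
        prev, cur = windows[k - 1], windows[k]
        if prev.std() == 0 or cur.std() == 0:
            continue  # constant window: correlation and ratio are undefined
        r = np.corrcoef(prev, cur)[0, 1]
        if r >= min_corr:
            ratios[k] = abs(cur.std() - prev.std()) / prev.std()
    return ratios

def candidate_thresholds(r_max):
    """tau_alpha = R_max * (1 + alpha) for alpha = 0, 0.1, ..., 10."""
    alphas = np.arange(0, 101) / 10.0
    return r_max * (1.0 + alphas)

# Toy series with period 4: the third window has a three times larger swing.
series = [0, 2, 0, 2,  0, 2, 0, 2,  0, 6, 0, 6]
ratios = std_ratios(series, period=4)
taus = candidate_thresholds(2.0)
```

In this toy case the second window reproduces the first one (ratio 0), while the third one triples the standard deviation of its predecessor (ratio 2). The grid returned by `candidate_thresholds` would then be scanned on the validation set to pick the value with the best F1 score.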

Training procedure and parameter settings
The hyperparameters of the ISOF, OC SVM, LOF, and ARIMA models are set with Bayesian search employing the hold-out set method. For each configuration, the chosen hyperparameters are used to fit the model and the performances are evaluated on the validation set. LOF, OC SVM and ISOF are assessed using the maximum F1 score, whereas the ARIMA models are assessed using the mean squared error (MSE) on predictions. The hyperparameters yielding the maximum F1 score or the lowest MSE are selected. ARIMA is trained on anomaly-free data to learn normal patterns as done in Yaacob et al. (2010).
ISOF, LOF and OC SVM work on spatial data and thus the univariate time series is projected onto a space R^n with n ≥ 1 (Braei and Wagner 2020). A window of size n is used to extract from the time series N − n + 1 vectors of length n of consecutive points, where N is the length of the time series. Then, the spatial algorithms are trained on the projected vectors. At test time, the test set is projected onto R^n and the score of each projected vector is computed. The anomaly score of a point in the time series is defined as the average of all the anomaly scores of the vectors that contain the point. For all the neural models, training is performed on anomaly-free data. Table 3 summarizes the relevant features and parameters of the compared methods.
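The projection and the per-point score averaging described above can be sketched as follows. The helper names `project` and `point_scores` are illustrative, and the vector scorer used in the example (the spread of each vector) is only a stand-in: in the experiments the scores would come from a fitted spatial detector such as ISOF, LOF or OC SVM.

```python
import numpy as np

def project(series, n):
    """Return the N - n + 1 vectors made of n consecutive points."""
    x = np.asarray(series, dtype=float)
    return np.stack([x[i:i + n] for i in range(x.size - n + 1)])

def point_scores(vector_scores, n):
    """Average, for every point, the scores of all the vectors that contain it."""
    vector_scores = np.asarray(vector_scores, dtype=float)
    N = vector_scores.size + n - 1          # length of the original series
    total = np.zeros(N)
    count = np.zeros(N)
    for start, s in enumerate(vector_scores):
        total[start:start + n] += s         # vector `start` covers n points
        count[start:start + n] += 1
    return total / count

series = [2.0, 2.0, 90.0, 90.0, 2.0, 2.0]
vectors = project(series, n=3)              # 4 vectors of length 3
# Stand-in scorer (assumption): a real detector would score each vector here.
vector_scores = vectors.max(axis=1) - vectors.min(axis=1)
scores = point_scores(vector_scores, n=3)   # one score per original point
```

Points near the borders of the series are covered by fewer vectors, so their averages rely on fewer scores than those of interior points.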

Anomaly definition, GT matching, and performance metrics
Anomaly definition strategies. An anomaly definition strategy specifies how the output of the anomaly detector and the data points of the time series are compared in order to identify whether a point is anomalous. AD algorithms adopt different strategies to identify abnormal points:
• Confidence: an anomaly score is directly provided as output by the model.
• Absolute and Squared Error (Munir et al. 2018): the anomaly score is defined as the absolute or squared error between the input and the predicted/reconstructed value.
• Likelihood (Malhotra et al. 2015): each point in the time series is predicted/reconstructed l times and associated with multiple error values. The probability distribution of the errors made by predicting on normal data is used to compute the likelihood of normal behavior on the test data, which is used to derive an anomaly score.
• Mahalanobis (Malhotra et al. 2016): each point in the time series is predicted/reconstructed l times. For each point, the anomaly score is calculated as the square of the Mahalanobis distance between the error vector and the Gaussian distribution fitted from the error vectors computed during validation.
• Windows strategy (Keras 2022): a score vector of dimension l is associated with each point. Each element s_i of the score vector is the mean absolute or mean squared error of the i-th predicted/reconstructed window that contains the point.
A threshold τ is then applied to the calculated score(s) for classifying the point as normal or anomalous. Table 4 shows the anomaly definition strategies of the compared methods.
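As an illustration of the Mahalanobis strategy followed by thresholding, the sketch below fits a Gaussian to validation error vectors and scores new error vectors with the squared Mahalanobis distance. The function names, the use of a pseudo-inverse for the covariance matrix and the toy error vectors are assumptions made for the example, not details taken from the compared implementations.

```python
import numpy as np

def fit_error_distribution(val_errors):
    """Fit mean and inverse covariance of the validation error vectors."""
    e = np.asarray(val_errors, dtype=float)
    mu = e.mean(axis=0)
    cov = np.atleast_2d(np.cov(e, rowvar=False))
    # Pseudo-inverse (assumption) keeps the sketch usable with singular covariances.
    return mu, np.linalg.pinv(cov)

def mahalanobis_score(errors, mu, cov_inv):
    """Squared Mahalanobis distance of each error vector from the fitted Gaussian."""
    d = np.asarray(errors, dtype=float) - mu
    return np.einsum("ij,jk,ik->i", d, cov_inv, d)

def classify(scores, tau):
    """A point is anomalous when its score exceeds the threshold tau."""
    return np.asarray(scores) > tau

# Toy error vectors with l = 2 predictions per point (hypothetical values).
val_errors = [[1, 0], [-1, 0], [0, 1], [0, -1]]
mu, cov_inv = fit_error_distribution(val_errors)
scores = mahalanobis_score([[2, 0], [0, 0]], mu, cov_inv)
labels = classify(scores, tau=3.0)
```

The same `classify` step applies to the other strategies once each point has been reduced to a single score.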
Anomaly detection criteria and thresholds. The detection criteria specify the quantity on which an anomaly is identified and are strongly related to the nature of the algorithm used. The anomaly identification criteria used by the compared methods are classified as:
• Prediction error: prediction models identify anomalies based on the difference between the predicted value and the observed one. Anomalies are identified based on the residuals between the input and the generated data: the higher the difference, the higher the likelihood of an anomaly.
• Reconstruction error: this criterion applies to all the models that aim at generating an output as close as possible to the input, such as the autoencoder-based models. As for the prediction models, the larger the residual, the higher the probability of an anomaly.
• Dissimilarity: dissimilarity models classify anomalous points by comparing them with the features or with the distribution of normal points or by matching them with the clusters computed from the normal time series.
Table 4 summarizes the detection criteria used by the different algorithms.
GT matching. To classify the predictions as true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN), a Point to Point matching strategy has been adopted: each anomalous point is compared only to the corresponding one in the input data series using the GT label.
Performance metrics. The evaluation adopts the most widely used machine learning metrics, precision, recall, and F1 score, defined as follows:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$
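A minimal sketch of Point to Point matching and of the three metrics is given below. The function names and the toy label sequences are illustrative, and returning 0 when a denominator is zero is a convention chosen for the example.

```python
def confusion_counts(gt, pred):
    """Compare each predicted label with the GT label at the same index."""
    tp = sum(1 for g, p in zip(gt, pred) if g and p)
    fp = sum(1 for g, p in zip(gt, pred) if not g and p)
    fn = sum(1 for g, p in zip(gt, pred) if g and not p)
    tn = sum(1 for g, p in zip(gt, pred) if not g and not p)
    return tp, fp, fn, tn

def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(gt, pred):
    tp, fp, fn, _ = confusion_counts(gt, pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)

# Toy example: 1 = anomalous, 0 = normal.
gt   = [0, 1, 1, 0, 1]
pred = [0, 1, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(gt, pred)
```

As a consistency check, `f1_from(0.947, 0.965)` rounds to 0.956, which agrees with the best result reported for ISOF on Fridge3.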

Experimental results
In this section we summarize the responses to the four questions introduced in the Introduction. For space reasons we condense the results of the 144 (12 methods × 3 training periods × 4 window sizes) experiments on 3 data sets and discuss only the essential findings. The complete list of results is published at the address: https://github.com/herrera-sergio/AD-periodic-TS. Figure 6 shows the comparison of the methods over all the data sets and across all the training duration values and sizes of the sliding window. The ISOF method consistently achieves the best F1 score, followed by OC SVM and LOF. The AE and MS neural methods have comparable performances. The multi-step approaches exhibit a more consistent behavior, yielding smaller values of the standard deviation, and the GRU-AE method performs slightly worse than the other approaches. The neural methods that predict only one point in the future (LSTM and GRU) have low performance and a rather inconsistent behavior. This is expected due to the high sampling frequency, which makes one-step prediction ineffective to detect anomalies. Of the remaining non-neural methods, ARIMA and Basic Statistics are positioned at the low end of the performance range.

Q1: comparative performances
The top result on all the experiments is attained by ISOF on the Fridge3 time series, trained with a sub-sequence of length equal to one month and with a window size of 2 × period: Precision = 0.947, Recall = 0.965, F1 score = 0.956.
A special case is that of AR. The training of the method converges only for the shortest duration of the training sub-sequence (a half period). However, the trained model delivers on average a good F1 score. It can be observed that AR grossly fails in the accuracy of the predicted values, but nonetheless the error of the points that belong to a normal sub-sequence is very different from the error of the points that lie within an anomalous sub-sequence, which results in good AD performances. Figure 7 shows the performance breakdown by appliance. As expected, all methods but ARIMA and Basic Statistics perform better on the Fridge3 data set, which contains more recognizable anomalies mostly of a single type (≈ 95% of type spike). On the Fridge1 and Fridge2 data sets the performances follow the same ranking as in Fig. 6, with the same top-4 methods (ISOF, AR, OC SVM and LOF) and almost equivalent performances of the MS and AE methods. On the Fridge3 data set the methods that predict one step in the future (LSTM and GRU) work better. This analysis highlights that the performances of the models are affected by the considered appliance. Indeed, in Fridge1 the performances are more subject to variations, while in Fridge3 they are more consistent. Moreover, ARIMA and Basic Statistics show low performances independently of the complexity of the dataset, which suggests their inadequacy for this kind of problem.
The results are in line with those of Kharitonov et al. (2022), in which the authors compare the performances of alternative techniques to detect failures using manufacturing machine logs and observe that k-nearest neighbors (KNN) and LOF performed better, while autoencoders could not be considered for deployment in a [...]. Elmrabit et al. (2020) found that classical machine learning techniques outperformed deep learning for the AD task in cybersecurity datasets.
Figure 8 shows the variation of the F1 metrics for the 10 methods that could be trained with all the three sub-sequences (2 weeks, 3 weeks, one month). The results show that the 2 weeks training period is sufficient for most of the methods. Only the multisteps (MS) methods attain a very slight average performance improvement if the training period length extends to 1 month. The results on the time series of Fridge1 and Fridge2 show a similar trend. All the detailed results can be found in the mentioned project repository.
Figure 9 shows the variation of the F1 metrics with the sliding window size (half a period, one period, two and three periods), limited to the 9 methods that could be trained completely. The results show a difference in the pattern between neural and non-neural methods.

Q3: window length
With ISOF and OC SVM the F1 score decreases when the window size increases. With a value greater than half a period the methods progressively lose effectiveness: the variance increases and the F1 score decreases. This is likely the effect of the worse trade-off between the noise and the context knowledge enclosed in the window. The AE methods deliver the best F1 score when the window size equals twice the duration of the period. A similar trend is also displayed by MS methods, with LSTM-MS showing a slight monotonic increase up to the three periods. The one-step neural methods GRU and LSTM are rather insensitive to the window size, but their performance is at the lower end of the range. The LOF approach exhibits the same trend as the AE and MS neural methods.
The value at the (2 × period) point of the neural methods shows that such a duration gives sufficient context for encoding the periodic features of the time series well and that going beyond that size is either counterproductive or yields a modest benefit. In the AE methods, the negative effect of the window size extension may be also due to the dimensionality reduction to a latent space operated by the neural architecture, which may become less effective when the dimension of the original space gets too large.

Fig. 8 Variation of the F1 score with the duration of the training sub-sequence. The AR and ARIMA methods did not complete the training with all the periods
The results on the time series of Fridge2 and Fridge3 show a similar trend. All the detailed results can be found in the mentioned project repository.

Q4: generalization
The generalization experiments assess the top-5 methods (ISOF, OC SVM, LOF, LSTM-AE and GRU-AE) on a dataset different from the one on which the methods have been originally trained. Each method is tested in two variants: the original version trained on the first appliance and a version in which the threshold value is fine-tuned on the validation data series of the target appliance. Figure 10 contrasts the F1 scores obtained by the baseline version of the algorithm, i.e., the one trained and tested on the same dataset, the F1 scores achieved by fine-tuning the threshold on the validation set of the target appliance, and the F1 scores obtained without any fine-tuning. The top performing method (ISOF) is also the one that generalizes best, even without fine-tuning the threshold. In general, ISOF and OC SVM are less dependent on the training set than the neural models, which show a significant performance decay when tested on a different appliance. The degradation is more pronounced when the test appliance is Fridge3, which has almost all anomalies of type spike, which are absent in Fridge1 and Fridge2.

Fig. 9 Variation of the F1 score with the size (in periods) of the sliding window. The AR and ARIMA methods did not complete the training with all the periods
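The threshold fine-tuning step can be sketched as a search over candidate thresholds on the validation series of the target appliance. The text does not specify the search procedure, so the exhaustive scan over the observed scores, the `score >= tau` decision rule and the toy scores below are assumptions made for illustration.

```python
def f1_at(scores, labels, tau):
    """F1 score obtained by flagging as anomalous the points with score >= tau."""
    pred = [s >= tau for s in scores]
    tp = sum(1 for p, g in zip(pred, labels) if p and g)
    fp = sum(1 for p, g in zip(pred, labels) if p and not g)
    fn = sum(1 for p, g in zip(pred, labels) if not p and g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def tune_threshold(val_scores, val_labels, candidates=None):
    """Return the candidate threshold with the best validation F1 score."""
    if candidates is None:
        candidates = sorted(set(val_scores))  # every observed score is a candidate
    best_tau, best_f1 = None, -1.0
    for tau in candidates:
        f1 = f1_at(val_scores, val_labels, tau)
        if f1 > best_f1:                      # keep the first best candidate
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1

# Hypothetical anomaly scores of a model trained on one fridge and applied
# to the validation series of another fridge (1 = anomalous in the GT).
target_scores = [0.10, 0.20, 0.15, 0.80, 0.90, 0.30]
target_labels = [0, 0, 0, 1, 1, 0]
source_tau = 0.25                             # threshold kept from the source appliance
untuned_f1 = f1_at(target_scores, target_labels, source_tau)
tuned_tau, tuned_f1 = tune_threshold(target_scores, target_labels)
```

In this toy case the threshold inherited from the source appliance produces one false positive, which the tuned threshold removes. Only the threshold changes; the model itself is not retrained.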

Qualitative analysis of results
To get a qualitative appreciation of the different behavior of the best models, Fig. 11 directly compares the anomalies detected by ISOF, OC SVM and LSTM-AE with the GT anomalies. The detected anomalies are highlighted with a color that depends on the method and the GT anomalies are circled in red.
The plots in the left column show a situation in which all the three methods are able to detect more or less the same anomalous data points. The detected points match the GT annotations well. The plots in the right column show how the methods react to a change of the duration of the ON-OFF cycle (an acceleration in the displayed example, which may be caused by a different load of the fridge or by a change in the set point of the thermostat). Only the ISOF method is robust to such an occurrence. The other methods instead signal many normal points as anomalous, because they consider the entire cycle variation as an anomaly. Given that the time series of the appliances are quasi-periodic, as shown in the power spectrum of Fig. 2, the robustness with respect to small variations of the ON-OFF cycle is a very relevant benefit of the ISOF method.

Fig. 10 Comparison of the generalization performance of the top-5 methods. The orange bar represents the baseline F1 score (i.e., training and testing done on the same dataset), the blue bar denotes the F1 score achieved by fine-tuning the threshold on the validation set of the target appliance, and the green bar shows the performances obtained using the trained algorithm without fine-tuning

Conclusions
In this paper we have discussed the results of the experimental comparison of 12 AD methods on three quasi-periodic data series collected with smart plugs connected to three distinct fridges. The comparison has first assessed the prediction performances, measured with the F1 score, which confirmed that the non-neural machine learning methods ISOF, OC SVM and LOF attain the best results, followed by the autoencoder-based and multi-step neural methods (GRU-AE, GRU-MS, LSTM-AE, LSTM-MS). In particular, the ISOF method trained with a sub-sequence of length equal to one month and with a window size of 2 × period attained a very good result on a fridge data series containing mostly spike anomalies (Precision = 0.947, Recall = 0.965, F1 score = 0.956).
Next we evaluated the impact of the duration of the sub-sequence used for training the algorithms, which shows that the 2 weeks training period is sufficient for most of the methods and that the AR and ARIMA algorithms did not complete the training within reasonable time with time series of longer duration.
The impact of the sliding window size was also investigated. Non-neural machine learning algorithms require a shorter window (half of the period is enough), whereas neural models deliver the best performance with a larger window size (two periods in most cases).
Finally, the generalization ability of the top performing methods has been assessed too. The best method (ISOF) is also the one that preserves its performances intact when applied to a different appliance, even without fine-tuning the threshold on the target appliance.
Future work will further pursue the investigation of AD algorithms on quasi-periodic data series, focusing also on their runtime performance on hardware with memory and processing constraints. The objective is designing a timely, accurate and efficient system for dispatching mobile phone alerts about the potential malfunctioning of home appliances to real-world users.