Investigating the Performance Gap between Testing on Real and Denoised Aggregates in Non-Intrusive Load Monitoring

Prudent and meaningful algorithm performance evaluation is essential for the progression of any research field. In the field of Non-Intrusive Load Monitoring (NILM), performance evaluation can be conducted on real-world aggregate signals, provided by smart energy meters, or on artificial superpositions of individual load signals (i.e., denoised aggregates). It has long been suspected that testing on these denoised aggregates yields better evaluation results, mainly because the signal is less complex. Complexity in real-world aggregate signals increases with the number of unknown/untracked loads. Although this is a known performance reporting problem, an investigation into the actual performance gap between real and denoised testing is still pending. In this paper, we examine the performance gap between testing on real-world and denoised aggregates with the aim of bringing clarity to this matter. Starting with an assessment of noise levels in three datasets, we find significant differences in test cases. We give broad insights into our evaluation setup comprising three load disaggregation algorithms, two of them relying on neural network architectures. The results presented in this paper, based on studies covering three scenarios with ascending noise levels, show a strong tendency towards load disaggregation algorithms providing significantly better performance on denoised aggregate signals. A closer look at the outcome of our studies reveals that all appliance types could be subject to this phenomenon. We conclude the paper by discussing aspects that could be causing these considerable gaps between real and denoised testing in NILM.


I. INTRODUCTION
EFFECTIVE energy management in smart grids requires a fair amount of monitoring and controlling of electrical load to achieve optimal energy utilization and, ultimately, reduce energy consumption [1]. With regard to individual buildings, load monitoring can be implemented in an intrusive or non-intrusive fashion. The latter is often referred to as Non-Intrusive Load Monitoring (NILM) or load disaggregation. NILM, dating back to the seminal work presented in [2], comprises a set of techniques to identify active electrical appliance signals from the aggregate load signal reported by a smart meter [3].
Performance evaluation of NILM algorithms can be carried out in a noised or denoised manner, where the difference lies in the aggregate signal considered as input. Whereas noised scenarios employ signals (i.e., time series) obtained from smart meters, denoised testing scenarios consider superpositions of individual appliance signals (i.e., denoised aggregates). Fig. 1 illustrates a selection of such real and denoised signals for three households found in NILM datasets. While a large proportion of contributions proposed for NILM is evaluated following noised testing scenarios, exceptions to this unwritten rule can be observed [4]. The problem lies in the complexity of the test setup, as denoised aggregates are suspected to pose simpler disaggregation problems [5]. Consequently, the hypothesis is that the same disaggregation algorithm applied to the denoised version of a real-world aggregate signal yields considerably better performance, thus communicating a distorted picture of the capabilities of the presented algorithm.
Submitted to peer review on August 18th. C. Klemenjak
This paper presents a study with a focus on the difference between denoised and real-world signal testing scenarios in the context of performance evaluation in NILM. On the basis of test runs considering data of 15 appliances extracted from three datasets with considerably different noise levels, we strive towards bringing clarity to this widely disregarded question. We incorporate one basic as well as two load disaggregation approaches based on neural networks to obtain a broad understanding of whether or not noise levels of aggregate power signals impact energy estimation performance. Finally, we discuss how the disaggregation performance is affected by signal noise levels with regard to different appliance types.
The remainder of this paper is organized as follows: We discuss related work in Section II. Section III introduces how noise levels can be measured in the context of NILM. Section IV introduces the experimental setup. We present the outcome of our studies in Section V. Section VI concludes the paper.

II. RELATED WORK
Despite the possibly far-reaching implications of this aspect for NILM, relatively little is understood about the actual performance gap between real and denoised testing. In [5], the hypothesis of denoised testing resulting in better performance was first expressed. Further, the authors introduce a measure to assess the noise level of aggregate signals. This measure has found application in a limited number of studies, in which the noise level was reported alongside the performance of load disaggregation algorithms on real-world aggregates [6], [7]. However, no comparison to the denoised testing case has been conducted. In [8], the noise levels of several NILM datasets were determined. The authors report basic parameters of several NILM datasets and find that noise levels in real aggregate signals vary significantly among observed datasets.
Fig. 1. An illustration of the difference between real-world aggregates and denoised aggregates
Few attempts have been made to evaluate NILM algorithms on both real and denoised aggregates, such as presented for the AFAMAP approach in [9]. In subsequent work [10], an improved version of denoising autoencoders for NILM has been proposed by means of comparison studies to the state of the art at that time. Although the authors have not investigated the performance gap between real and denoised testing, a tendency can be derived for this particular case in both contributions, confirming the motivation for the studies presented in this paper.

III. ASSESSING SIGNAL NOISE LEVELS IN NILM
NILM has been approached in a variety of ways that can be categorized into event detection and energy estimation approaches [11]. In this work, we put an emphasis on the energy estimation viewpoint, as it can be seen as the precursor of the event detection stage in the disaggregation process. We define NILM as the problem of generating estimates $[\hat{x}_t^{(1)}, \dots, \hat{x}_t^{(M)}]$ of the power consumption of $M$ electrical appliances at time $t$ given only the aggregated power consumption $y_t$, where the aggregate power signal $y_t$ consists of the sum of the individual appliance signals $x_t^{(i)}$ and a residual term $\eta_t$:
$$y_t = \sum_{i=1}^{M} x_t^{(i)} + \eta_t$$
The residual term comprises (measurement) noise as well as the sum of unmetered electrical load [8]. To quantify the share of unmetered load in an aggregate signal, the noise-aggregate ratio NAR was introduced in [5]. This ratio can be computed for any type of power signal, provided that readings of the aggregate and individual appliances are available. A NAR of 0.15 reports that 15% of the observed power signal can be attributed to the residual term.
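As an illustration, the NAR can be sketched in a few lines of Python. The snippet assumes the common formulation $\mathrm{NAR} = \sum_t |y_t - \sum_i x_t^{(i)}| / \sum_t y_t$, which agrees with the description above but is not spelled out in this section; the function name `noise_aggregate_ratio` and the toy readings are hypothetical.

```python
import numpy as np

def noise_aggregate_ratio(aggregate, appliances):
    """Share of the aggregate not explained by the submetered appliances.

    aggregate:  shape (T,), smart meter readings y_t in watts
    appliances: shape (M, T), submetered appliance readings x_t^(i) in watts
    """
    aggregate = np.asarray(aggregate, dtype=float)
    appliances = np.asarray(appliances, dtype=float)
    # Denoised aggregate: superposition of the individual appliance signals.
    denoised = appliances.sum(axis=0)
    # Residual term: measurement noise plus unmetered load.
    residual = np.abs(aggregate - denoised)
    return residual.sum() / aggregate.sum()

# Toy example (made-up values): two appliances and a constant 15 W unmetered load.
y = np.array([100.0, 100.0, 100.0, 100.0])
x = np.array([[60.0, 60.0, 60.0, 60.0],
              [25.0, 25.0, 25.0, 25.0]])
nar = noise_aggregate_ratio(y, x)
print(f"NAR = {nar:.2f}")  # 15% of the aggregate is attributed to the residual
```

In this toy case the unmetered load accounts for 15 W of every 100 W reading, which corresponds to the NAR of 0.15 used as an example in the text.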
It can be observed that the NAR varies considerably across datasets and households, as Table I points out, where we summarize the NAR as well as further statistics of several households found in the datasets ECO, REFIT and UK-DALE. Interestingly, the NAR ranges between a few percent, as is the case for household 2 in the ECO dataset, and an excessive 84.7% in household 5 of the same dataset. Further, there are indications that the number of submeters used in the course of the dataset collection can, but does not necessarily, have an impact on the noise level of the household's aggregate signal, since it is decisive what kind of appliances are left out during a measurement campaign (low-power appliances vs. big consumers).

IV. EVALUATION SETUP
To gain a comprehensive understanding of the impact of noise on the disaggregation performance, we select three households with ascending levels of residual noise: household 2 of the ECO dataset [12] with a NAR of 5.9%, household 5 of the UK-DALE dataset [13] with a NAR of 27.5%, and household 2 of the REFIT dataset [14] with a NAR of 65.1%. This way, we incorporate one instance each for disaggregation problems with low, moderate, and high noise levels. For every household considered, we selected five electrical appliances and made best efforts to consider a wide range of appliance types. We extracted 244 days for ECO, 82 days for UK-DALE and 275 days for REFIT while applying a sampling interval of 10 s. The amount of data used per household was governed by the availability in datasets, as can be learned from Table I. Datasets were split into training and test sets, where the majority of data was employed to train the load disaggregation algorithms. We considered the smart meter signal as present in datasets and obtained the denoised version of the aggregate, $\tilde{y}_t$, by superposition of the individual appliance signals following:
$$\tilde{y}_t = \sum_{i=1}^{M} x_t^{(i)}$$
where $x_t^{(i)}$ denotes the power signal of appliance $i$ at time $t$.
For experimental evaluations, we utilized the latest version of NILMTK, a toolkit that enables reproducible NILM experiments [15], [16]. The toolkit integrates several basic benchmark algorithms as well as load disaggregation algorithms based on Deep Neural Networks (DNN). In the course of experiments, we consider one traditional approach and two approaches based on DNNs:
The Combinatorial Optimization (CO) algorithm, introduced in [2], has been used repeatedly in literature to serve as a baseline [15]. The CO algorithm estimates the power demand of appliances and their operational mode. Similar to the Knapsack problem [17], estimation is performed by finding the combination of concurrently active appliances that minimizes the difference between the aggregate signal and the sum of power demands.
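The search performed by CO can be sketched as follows. This is a minimal illustration under stated assumptions, not the NILMTK implementation: the power level of each operational mode is given by hand rather than learned from training data, and the function name `combinatorial_optimization` as well as the appliance values are hypothetical.

```python
from itertools import product

import numpy as np

def combinatorial_optimization(aggregate, state_powers):
    """For each aggregate reading, pick one state per appliance such that the
    summed state powers are closest to the reading.

    aggregate:    shape (T,), aggregate power in watts
    state_powers: list of M lists; state_powers[i] holds the power levels
                  (watts) of appliance i, e.g. [0, 80] for off/on
    Returns an array of shape (T, M) with the estimated power per appliance.
    """
    # Enumerate every combination of appliance states (exponential in M).
    combos = np.array(list(product(*state_powers)), dtype=float)  # (K, M)
    totals = combos.sum(axis=1)                                   # (K,)
    aggregate = np.asarray(aggregate, dtype=float)
    # Index of the combination minimizing |y_t - sum of state powers|.
    best = np.abs(aggregate[:, None] - totals[None, :]).argmin(axis=1)
    return combos[best]

# Hypothetical state powers for a fridge, a kettle, and a washing machine.
states = [[0, 80], [0, 2000], [0, 50, 1200]]
y = np.array([85.0, 2090.0, 1270.0, 10.0])
estimates = combinatorial_optimization(y, states)
print(estimates)
```

Because every reading is matched independently against all state combinations, any unmetered load in the aggregate is necessarily attributed to some combination of the known appliances, which is one way residual noise can degrade the estimates of such a baseline.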
Recurrent Neural Networks (RNNs) are a subclass of neural networks that have been developed to process time series and related sequential data [18]. RNNs were first proposed for NILM in [19]; we employ the implementation presented in [20], which incorporates Long Short-Term Memory (LSTM) cells. Provided a sequence of aggregate readings as input, the RNN estimates the power consumption of the electrical appliance it was trained to detect for each newly observed input sample.
The Sequence-to-Point Optimization (S2P) technique, relying on convolutional neural networks, follows a sliding window approach in which the network predicts the midpoint element of an output time window based on an input sequence consisting of aggregate power readings [21]. The basic idea behind this method is to implement a non-linear regression between input window and midpoint element, which has been applied successfully for speech and image processing [22]. In a recent benchmarking study of NILM approaches, S2P was observed to be amongst the most advanced disaggregation techniques at that time [23].
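The sliding window scheme behind S2P can be illustrated by the way training pairs are formed. The sketch below only builds the pairs of input window and midpoint target and omits the convolutional network itself; the zero padding at the signal boundaries, the window length, and the function name `sequence_to_point_pairs` are assumptions made for illustration.

```python
import numpy as np

def sequence_to_point_pairs(aggregate, appliance, window=5):
    """Build (input window, midpoint target) pairs for sequence-to-point training.

    aggregate: shape (T,), aggregate power readings
    appliance: shape (T,), readings of the target appliance
    window:    odd window length so that a unique midpoint exists
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    aggregate = np.asarray(aggregate, dtype=float)
    appliance = np.asarray(appliance, dtype=float)
    half = window // 2
    # Zero-pad so that every time step can serve as a window midpoint.
    padded = np.pad(aggregate, half, mode="constant")
    # Row t holds the aggregate readings t-half, ..., t+half.
    inputs = np.lib.stride_tricks.sliding_window_view(padded, window)  # (T, window)
    # The target of window t is the appliance power at the window midpoint t.
    targets = appliance
    return inputs, targets

y = np.arange(1.0, 8.0)                                  # toy aggregate: 1, 2, ..., 7
x = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0])        # toy appliance signal
inputs, targets = sequence_to_point_pairs(y, x, window=3)
print(inputs[3], targets[3])
```

Each row of `inputs` would be fed to the network, which is trained to regress the corresponding entry of `targets`.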
In this study, we utilize two error metrics to assess the performance of load disaggregation algorithms. The first is a well-known metric commonly used in signal processing, the Mean Absolute Error (MAE), defined as:
$$\mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} \left| \hat{x}_t - x_t \right|$$
where $x_t$ is the actual power consumption, $\hat{x}_t$ the estimated power consumption, and $T$ represents the number of samples. The best possible value is zero and, as we estimate the power consumption of appliances, it is measured in watts. As a second metric, we incorporate a metric defined by NILM scholars in [24], the Normalized Disaggregation Error (NDE), which relates the squared estimation error to the squared actual power consumption. In contrast to the MAE, the NDE represents a dimensionless metric and, more importantly, the NDE belongs to the class of normalized metrics. This allows for fair comparisons of disaggregation performance between appliance types [8].
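Both metrics can be sketched in Python as follows. The MAE follows the definition above. For the NDE, the snippet assumes the formulation $\mathrm{NDE} = \sqrt{\sum_t (\hat{x}_t - x_t)^2 / \sum_t x_t^2}$; this is an assumption, as some works report the ratio without the square root. Function names and sample values are hypothetical.

```python
import numpy as np

def mean_absolute_error(actual, estimated):
    """Average absolute deviation between estimate and ground truth, in watts."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return np.mean(np.abs(estimated - actual))

def normalized_disaggregation_error(actual, estimated):
    """Squared error normalized by the squared ground truth (dimensionless).

    Assumes the square-root variant; requires a non-zero ground-truth signal.
    """
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return np.sqrt(np.sum((estimated - actual) ** 2) / np.sum(actual ** 2))

# Toy example (made-up values): one appliance observed for four samples.
x_true = np.array([0.0, 100.0, 100.0, 0.0])
x_est = np.array([0.0, 80.0, 120.0, 20.0])
mae = mean_absolute_error(x_true, x_est)
nde = normalized_disaggregation_error(x_true, x_est)
print(f"MAE = {mae:.1f} W, NDE = {nde:.3f}")
```

Note that the last sample is a false positive: the appliance is off but 20 W are estimated, which raises both the MAE and the NDE.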

V. RESULTS
Divided into subtables reporting MAE and NDE separately, we summarize the outcome of our investigations in Table II. For several appliances per household, we compare the disaggregation performance of CO, RNN, and S2P when applied to the real-world aggregate signal, denoted as Real, and the denoised aggregate signal Den, respectively.
In virtually all cases, we observe a strong tendency towards disaggregation algorithms providing better performance on denoised aggregate signals. This holds true for almost all households and appliances considered, though some exceptions were identified: we spot a few cases, namely the lamp in ECO, the dishwasher in UK-DALE, and the television in REFIT, showing the opposite trend for the CO algorithm. It should be pointed out that in those cases, the performance of CO on the real-world and denoised aggregate signals shows a significant gap when compared to RNN and S2P. For this reason, and because CO is a simple benchmarking algorithm, we argue that these cases can be neglected. As for RNN and S2P, we identify a single contradictory observation, namely in the case of S2P when applied to the fridge in ECO's household 2. In this particular case, we observed that testing on the real-world aggregate signal results in marginally better performance. One explanation for this could be the extremely low NAR in this scenario, 5.9%, and the fridge belonging to the category of appliances with a recurrent pattern [23].
Having identified a clear tendency towards CO, RNN, and S2P providing significantly better performance in the denoised signal case, i.e., lower MAE and NDE, we turn our attention to the open question of whether or not there exists a link between noise level and the magnitude of the performance gap between Real and Den. To investigate this further, we define the performance gap ∆NDE to be the distance between the NDE on the real aggregate signal and the NDE observed when testing on the denoised aggregate signal. We compute ∆NDE for the test cases presented in Table II and illustrate derived gaps for three appliance types in Fig. 2. We observe a positive correlation between noise level and the magnitude of ∆NDE for all three appliance types. This can be observed in the case of the basic disaggregation algorithm CO as well as the algorithms based on neural networks, RNN and S2P.
Finally, the results obtained from testing on three households with considerably different NAR levels reveal that in the majority of test runs, testing on the denoised aggregate signal leads to substantially lower estimation errors and, therefore, higher estimation accuracy, given the NAR is sufficiently high. A few cases showing the contrary trend were observed but can be reasonably explained. Interestingly, we not only observe performance gaps when estimating the power consumption of low-power appliances but also for appliances with moderate or high power consumption such as dishwashers and washing machines (Fig. 2b and Fig. 2c).
Although this apparent performance gap can be attributed to a variety of aspects, we suspect two of them to have a decisive impact on this matter: First, denoised aggregates are obtained by superposition of individual appliance signals. As such, they contain fewer appliance activations and consumption patterns than aggregates obtained from smart meters.
Particularly when estimating the power consumption of low-power appliances, such activations have the potential to hinder load disaggregation algorithms from providing accurate power consumption estimates. Such cases were repeatedly observed during our studies on REFIT, where a NAR of 65.1% was measured.
Second, we observe a substantially higher number of false positive estimates in predictions based on real-world aggregate signals than in estimates generated from denoised aggregate signals. False positives in this context mean that the NILM algorithms predicted the appliance to consume energy at times when this was not the case. Such false positives impact the outcome of performance evaluations two-fold, as they increase the disaggregation error and decrease the estimation accuracy of NILM algorithms.

VI. CONCLUSIONS
Motivated by the use of both real and denoised aggregates in the evaluation of NILM algorithms in related work, we have investigated the performance gap observed between artificial sums of individual signals and signals obtained from real power meters. First, we utilized a noise measure, the noise-aggregate ratio NAR, to determine the noise level of real-world aggregate signals found in energy datasets. We find that noise levels vary substantially between households. We give insights on the experimental setup employed in our studies, comprising one basic and two more advanced NILM algorithms applied to data from three households with ascending noise levels. Our results show that in virtually all evaluation runs, a significant performance gap between the real and the denoised signal testing case can be identified, provided a sufficiently high noise-aggregate ratio. Though some exceptions can be observed, those cases can be well explained. Interestingly, our evaluation shows considerable performance gaps for common household appliances with a comparably high energy consumption such as dishwashers and washing machines. Hence, we claim that testing on denoised aggregate signals can lead to a distorted image of the actual capabilities of load disaggregation algorithms in some cases, and ideally, its application should be well-considered when developing algorithms for real-world settings.