Performance evaluation and comparison of NILM algorithms remain open research challenges for several reasons (Pereira & Nunes, 2018; Herrero et al., 2017). It is common practice for researchers to evaluate their proposed NILM solutions on different datasets, with different criteria, and with the help of different metrics. As a consequence, a direct comparison between two proposed algorithms is virtually impossible (Nalmpantis & Vrakas, 2018). To assess the validity of their proposed NILM approach, many researchers utilise the Zeifman requirements (Zeifman, 2012). These requirements serve to evaluate whether a NILM method is applicable to home energy displays or smart meters and comprise requirements related to accuracy, real-time capabilities, the need for training, scalability, etc. A requirements catalogue similar to the Zeifman requirements could serve as a guideline for fair and meaningful performance reporting in the context of load disaggregation. To the best of our knowledge, no such requirements catalogue has been proposed.
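To illustrate how strongly reported results can depend on the chosen metric, the following sketch computes two metrics that are frequently reported in the NILM literature: an estimation accuracy on the disaggregated power signal and an F1 score on inferred on/off states. It is an illustrative sketch, not the evaluation procedure of any of the cited works; in particular, the 15 W on/off threshold is an assumption made only for this example.

```python
import numpy as np

def estimation_accuracy(y_true, y_pred):
    """Estimation accuracy in the style of Makonin & Popowich (2015):
    1 minus the absolute error normalised by twice the true consumption."""
    return 1.0 - np.abs(y_pred - y_true).sum() / (2.0 * y_true.sum())

def f1_on_off(y_true, y_pred, threshold=15.0):
    """F1 score on appliance on/off states, using an assumed 15 W threshold
    to separate 'on' from 'off'."""
    t = y_true > threshold
    p = y_pred > threshold
    tp = np.logical_and(t, p).sum()
    precision = tp / max(p.sum(), 1)
    recall = tp / max(t.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

# The same disaggregation output can look quite different under the two metrics,
# which is one reason direct comparisons across papers are difficult.
y_true = np.array([0, 0, 120, 118, 2, 0], dtype=float)
y_pred = np.array([5, 3, 90, 140, 20, 4], dtype=float)
print(estimation_accuracy(y_true, y_pred), f1_on_off(y_true, y_pred))
```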
When comparing the performance of NILM algorithms, several aspects play an important role: datasets, metrics, and benchmarking tools. Energy consumption datasets are the outcome of measurement campaigns in households and industrial facilities. The aim is not to disrupt the everyday routines of the monitored space, so that the collected data resembles reality as closely as possible (Pereira & Nunes, 2018). To enable reproducibility of results and comparison with other algorithms, researchers need to describe in detail the sections of the dataset that were used for training and evaluation and report the methods applied to clean and pre-process the datasets (Makonin & Popowich, 2015). As in other data-driven approaches, the performance of NILM approaches depends strongly on the datasets used for training and evaluation (Beckel et al., 2014). Therefore, detailed statistics of the utilised datasets should be made available alongside a published approach. Besides commonly mentioned aspects such as the duration or the number of appliances contained in the training data, researchers have proposed reporting NILM-specific aspects. To the best of our knowledge, there is no exploratory study that investigates the suggestions made by researchers with respect to dataset statistics and standardised performance metrics in an extensive manner by applying several state-of-the-art NILM algorithms to multiple energy consumption datasets.
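As one possible way of reporting such statistics, the following sketch summarises a single appliance channel with quantities that are commonly stated when datasets are described: duration, sampling period, missing data, number of activations, and mean on-power. The 15 W on-power threshold is an illustrative assumption that would have to be chosen per appliance; the sketch does not reproduce the reporting scheme of any specific cited work.

```python
import pandas as pd

def dataset_statistics(power: pd.Series, on_threshold: float = 15.0) -> dict:
    """Summarise one appliance channel for reporting alongside NILM results.

    `power` is assumed to be a power time series in watts indexed by timestamps;
    `on_threshold` (watts) is an illustrative choice, not a prescribed value.
    """
    on = power > on_threshold
    # Count activations as rising edges of the on/off signal.
    activations = int((on & ~on.shift(fill_value=False)).sum())
    return {
        "duration_days": (power.index[-1] - power.index[0]) / pd.Timedelta(days=1),
        "median_sample_period_s": power.index.to_series().diff().median().total_seconds(),
        "missing_fraction": float(power.isna().mean()),
        "activation_count": activations,
        "mean_on_power_w": float(power[on].mean()),
    }
```

Applied to every appliance channel of the training and test partitions, such a summary would make it considerably easier for others to reproduce a reported experiment.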
In the recent past, machine learning approaches for NILM have attracted a lot of attention due to breakthroughs in research disciplines such as computer vision. The authors of (Kelly & Knottenbelt, 2015) were the first to evaluate the application of deep neural networks to energy disaggregation. Three deep neural network architectures were adapted for energy disaggregation, and experiments were performed on household consumption data seen during training as well as on data from unseen households. In the presented case study, the deep neural networks achieved better F1 scores than two reference models, and all three networks achieved acceptable performance when applied to an unseen house (Kelly & Knottenbelt, 2015). The authors point out that there are many open issues such as overfitting or unsupervised pre-training. A feasibility study on the development of a generic disaggregation model is presented in (Barsim & Yang, 2018). The authors demonstrate that their generic deep disaggregation model achieves performance similar to state-of-the-art load monitoring approaches for a selection of appliance types. For single-load extraction, a fully-convolutional neural network with a fixed architecture and set of hyper-parameters was applied. Investigations such as the one presented in (Beckel et al., 2014) do not consider machine learning approaches in their comparison case studies for NILM. Particularly with regard to recent suggestions from related work for improved comparability in NILM, the authors identify the need for an extensive comparison case study that considers well-established NILM algorithms based on Hidden Markov Models as well as novel machine learning approaches based on deep neural networks. Such a comparison case study should evaluate the candidate NILM approaches on several datasets and take up suggestions made by related work, such as the proposed performance evaluation strategy and disaggregation complexity.
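For readers unfamiliar with this line of work, the following sketch shows the general shape of a convolutional single-load regressor that maps a window of aggregate mains readings to the target appliance's power at the window midpoint. It is a minimal illustration in the spirit of these approaches; the layer sizes and window length are assumptions and do not correspond to the architectures or hyper-parameters used in (Kelly & Knottenbelt, 2015) or (Barsim & Yang, 2018).

```python
import torch
import torch.nn as nn

class SeqToPointNILM(nn.Module):
    """Illustrative convolutional network mapping a window of mains power
    readings to the target appliance's power at the window midpoint.
    Layer sizes are assumptions for this sketch, not the hyper-parameters
    of any cited work."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 30, kernel_size=10, padding=5), nn.ReLU(),
            nn.Conv1d(30, 40, kernel_size=8, padding=4), nn.ReLU(),
            nn.Conv1d(40, 50, kernel_size=6, padding=3), nn.ReLU(),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(1024), nn.ReLU(),
            nn.Linear(1024, 1),  # appliance power at the window midpoint
        )

    def forward(self, mains_window: torch.Tensor) -> torch.Tensor:
        # mains_window: (batch, 1, window_length) aggregate power readings
        return self.regressor(self.features(mains_window))

model = SeqToPointNILM()
dummy = torch.randn(4, 1, 99)  # four illustrative mains windows of length 99
print(model(dummy).shape)      # torch.Size([4, 1])
```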