Transformer training strategies for forecasting multiple load time series

In the smart grid of the future, accurate load forecasts on the level of individual clients can help to balance supply and demand locally and to prevent grid outages. While the number of monitored clients will increase with the ongoing smart meter rollout, the amount of data per client will always be limited. We evaluate whether a Transformer load forecasting model benefits from a transfer learning strategy, where a global univariate model is trained on the load time series from multiple clients. In experiments with two datasets containing load time series from several hundred clients, we find that the global training strategy is superior to the multivariate and local training strategies used in related work. On average, the global training strategy results in 21.8% and 12.8% lower forecasting errors than the two other strategies, measured across forecasting horizons from one day to one month into the future. A comparison to linear models, multi-layer perceptrons and LSTMs shows that Transformers are effective for load forecasting when they are trained with the global training strategy.


Introduction
Climate change is one of the biggest challenges facing humanity, with the risk of dramatic consequences if certain limits of warming are exceeded [1]. To mitigate climate change, the energy system must be decarbonized. A difficulty in decarbonization is that renewable energy supply fluctuates depending on the weather. However, supply and demand must be balanced in the grid at every moment to prevent outages [2]. In addition, with the ongoing decentralization of the renewable energy supply and the installation of large consumers, such as electric vehicle chargers and heat pumps, low-voltage grids are expected to reach their limits [3]. Thus, to balance the grid and to avoid congestion, advanced operation and control mechanisms must be installed in the smart grid of the future [4,5]. This requires accurate forecasts on various aggregation levels, up to fine-grained low-voltage level load forecasts [5,6]. Such fine-grained load forecasts can be used for demand-side management, energy management systems, distribution grid state estimation, grid management, storage optimization, peer-to-peer trading, peak shaving, smart electric vehicle charging, dispatchable feeders, provision of feedback to customers, anomaly detection and intervention evaluation [5,7,8,9,10]. Moreover, the aggregation of fine-grained load forecasts can result in a more accurate forecast of the aggregated load [11].
With the smart meter rollout, fine-grained electrical load data will become available for an increasing number of clients. In such a scenario where load time series from multiple clients are available, different model training strategies are possible. The goal of our work is to compare training strategies for the Transformer [12], which was recently used for load forecasting [13,14,15,16,17,18].

Task definition
We address the following multiple load time series forecasting problem: At a time step $t$, given the history of the electrical load of $C$ clients $x^c_0, \ldots, x^c_t$ with $1 \le c \le C$, the goal is to predict the next $h$ electrical load values $x^c_{t+1}, \ldots, x^c_{t+h}$ for all clients $1 \le c \le C$, where $h$ is called the forecast horizon.

Contribution
We compare three training strategies for the Transformer in a scenario with multiple load time series. The training strategies are depicted in Figure 1. We compare our models with the models from related work [19,20,21,22], as well as with multiple baselines. In particular, we compare with the linear models used in [23], to determine whether Transformers are effective for load forecasting and which training strategy is the most promising one.

Paper structure
First, we describe the Related work. Then, the Transformer architecture and the training strategies are described in the Approach. This is followed by the Experimental setup, Results and a Discussion. Finally, the paper concludes with the Conclusion and future work.

Related work
This section first presents related work on long time series forecasting and load forecasting with Transformers. Most of the load forecasting literature uses local models, but a few works use global models, which are presented next. The global training strategy can be understood as a transfer learning technique. We therefore discuss transfer learning in the field of load forecasting at the end of this section.
As Transformers are often used for long time series forecasting with horizons of up to one month, various extensions to the Transformer architecture exist that aim to reduce the time and space complexity. This is done by the Informer using ProbSparse self-attention [19], by the Autoformer using auto-correlation [20], by the FEDformer using frequency enhanced decomposition [21] and by PatchTST using patching [22]. The proposed models are multivariate or local, except for the global PatchTST [22]. The experiments in these works are conducted on six datasets from different domains, including one load forecasting dataset, which we also use in our experiments (see section Datasets). A global linear model called LTSF-Linear [23] gives better results than the aforementioned multivariate Transformers. Parallel to our work, global Transformers were shown to beat the aforementioned multivariate Transformers [24]. However, this work does not optimize the model's lookback size and therefore achieves suboptimal results. PatchTST [22] is a global Transformer with patched inputs and is superior to LTSF-Linear [23] on the six datasets.
Transformer architectures for short-term load forecasting are designed to use external calendar and weather features [25,26]. An evaluation of different architectures is undertaken in [14]. Further work modifies the architecture for multi-energy load forecasting [25]. Upstream decompositions are used to improve the forecast quality [27]. These models are not compared on a common benchmark dataset, but evaluated on different datasets on city or national level. There, usually only one load time series is available, which only allows for local models. Furthermore, the models are not compared to the Transformer architectures for long time series.
Global load forecasting models are already used with convolutional neural networks [8] and N-BEATS [9]. A mixture between a multivariate and a global model is investigated in [28], where a single recurrent neural network (RNN) model is trained on randomly pooled subsets of the time series. Some works cluster the time series and then train global or multivariate models for each cluster [29,30]. PatchTST [22] is a global Transformer with patched inputs. We compare to this approach in our experiments.
The authors of [31] and [32] present current literature on transfer learning in the domain of energy systems. They define a taxonomy of transfer learning methods and discuss different strategies of using transfer learning with buildings from different domains. Two works [33,34] use transfer learning by pre-training and fine-tuning Transformers. Transferability from one building to another is tested in [33], and from one district to another in [34]. In contrast to these works, our transfer learning approach is to train a generalized model on the data from many clients, without fine-tuning for a target time series.

Approach
We use an encoder-decoder Transformer [12] as a load forecasting model. This model architecture has self-attention and cross-attention as its main components and was initially used for machine translation. It was used as a forecasting model in [35] and later adopted for load forecasting [13,14,15]. We use the model implementation from [14].
The encoder gets L vectors as input, which represent the last L time steps, where L is called the lookback size. Each input vector consists of one (in the case of local and global models) or C (in the case of multivariate models) load values, and nine additional time and calendar features. The features are the hour of the day, the day of the week and the month (all cyclically encoded with a sine and a cosine function), whether it is a workday, whether it is a holiday and whether the next day is a workday (all binary features). The input to the decoder consists of h vectors, which represent the following h time steps for which a forecast will be made. In the decoder input, the load values are set to zero, so that each value is forecasted independently from the previous forecasted values, allowing for a direct multi-step forecast instead of generating all values iteratively.
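The nine time and calendar features and the zeroed decoder input described above can be illustrated with a minimal sketch. It is not the implementation from [14]: the function names, the definition of a workday as Monday to Friday excluding holidays, and the user-supplied `holidays` set are assumptions made for illustration, and the decoder input is shown for the local and global case with a single load value.

```python
import numpy as np
from datetime import datetime, timedelta

def cyclical(value, period):
    """Encode a periodic value as a (sine, cosine) pair."""
    angle = 2.0 * np.pi * value / period
    return np.sin(angle), np.cos(angle)

def is_workday(ts, holidays):
    # Assumption: workdays are Monday to Friday that are not holidays.
    return ts.weekday() < 5 and ts.date() not in holidays

def calendar_features(ts, holidays=frozenset()):
    """Return the nine time and calendar features for one hourly time stamp."""
    hour_sin, hour_cos = cyclical(ts.hour, 24)
    dow_sin, dow_cos = cyclical(ts.weekday(), 7)
    month_sin, month_cos = cyclical(ts.month - 1, 12)
    next_day = ts + timedelta(days=1)
    return np.array([
        hour_sin, hour_cos, dow_sin, dow_cos, month_sin, month_cos,
        float(is_workday(ts, holidays)),        # is it a workday?
        float(ts.date() in holidays),           # is it a holiday?
        float(is_workday(next_day, holidays)),  # is the next day a workday?
    ])

def decoder_input(start, horizon, holidays=frozenset()):
    """Build h decoder vectors: a zero load value followed by the nine features."""
    rows = []
    for step in range(horizon):
        feats = calendar_features(start + timedelta(hours=step), holidays)
        rows.append(np.concatenate(([0.0], feats)))
    return np.stack(rows)  # shape (horizon, 10)
```

For a 24 hour horizon, `decoder_input(datetime(2014, 1, 6), 24)` yields 24 vectors of size ten whose first entry, the load, is always zero, so every forecast step sees only calendar information on the decoder side.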

Experimental setup

Datasets
As recommended in recent literature reviews on load forecasting [5,11,36], we conduct experiments on multiple datasets, namely the Electricity and the Ausgrid solar home datasets. For both datasets we make a temporal split and use the first 70% of each time series for training, the next 10% for validation, and the last 20% for testing, as in related work [20,21,22,23].
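The temporal 70/10/20 split can be sketched as follows. The function name and the rounding of the split points are assumptions; the related work may place the boundaries slightly differently.

```python
import numpy as np

def temporal_split(series, train_frac=0.7, val_frac=0.1):
    """Split one time series chronologically into train, validation and test parts."""
    n = len(series)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return series[:train_end], series[train_end:val_end], series[val_end:]
```

Because the split is chronological, the test set always lies after the training and validation data, which avoids leaking future values into training.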
The Electricity dataset is published in [37] and used in related work on long-term forecasting [19,20,21,22,23]. It is a subset of the UCI Electricity Load Diagrams dataset first presented in [38], only containing the time series without missing values. The dataset contains hourly electrical load data from 321 clients of a Portuguese energy supplier. The clients are from different economic sectors, including offices, factories, supermarkets, hotels and restaurants, among others [38]. The time series range from 2012 to 2014.
The Ausgrid solar home dataset contains solar generation and electrical load data from 300 clients of an Australian energy supplier. The clients are private houses with rooftop solar systems. The time series range from July 2010 to June 2013. We only use the electrical load data transformed into hourly resolution.

Comparison methods
We compare our models with models from related work [19,20,21,22,23], as well as with a persistence baseline, linear regression models, multi-layer perceptrons and long short-term memory networks.
• Models from related work: For Informer [19], Autoformer [20], FEDformer [21], PatchTST [22] and LTSF-Linear [23], we take the results reported in the publications where applicable, and run the code published with the papers otherwise. All parameters except for the forecast horizon are left unchanged.
• Persistence baseline: The persistence baseline takes the value from one week before the predicted hour as a forecast for the 24 hours and 96 hours horizons, and the value from one month before the predicted hour as the 720 hours forecast.
• Linear regression: For each load time series, we train a linear regression model with h outputs. The input consists of the last 336 load values and the nine time and calendar features for the current hour when the prediction is made (see Approach for a description of the features). The main difference to LTSF-Linear [23] is that the linear regression models are local models, whereas LTSF-Linear is a global model. Furthermore, the two approaches use different training algorithms and LTSF-Linear does not use time and calendar features.
• Multi-layer perceptron (MLP): As for the linear regression, we train a local MLP for each load time series. The MLPs get the last 168 load values and the nine time and calendar features of the current hour as input. Using more than 168 load values as input did not improve the results. Each MLP has two hidden layers with ReLU activation [39] and 1024 neurons per layer.
• Long short-term memory (LSTM): We train multivariate, local and global LSTM [40] models. We use the same architecture as in [41], consisting of two LSTM layers with 20 units each and a linear prediction layer. Using larger models did not improve the results.
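Of the comparison methods listed above, the persistence baseline is the simplest to state in code. The following sketch assumes hourly data, a lag of 168 hours for one week, and a lag of 720 hours for one month; the exact month length is not specified in the text, so 720 is an assumption, as is the function name.

```python
import numpy as np

def persistence_forecast(history, horizon, lag):
    """Forecast each future hour with the value observed `lag` hours earlier.

    history holds the loads x_0, ..., x_t; the forecast for x_{t+k} is x_{t+k-lag}.
    """
    history = np.asarray(history)
    if horizon > lag:
        raise ValueError("horizon must not exceed lag, otherwise values are unknown")
    if len(history) < lag:
        raise ValueError("history is shorter than the lag")
    start = len(history) - lag
    return history[start:start + horizon]

# Lags assumed for the three horizons: one week (168 h) and one month (720 h).
LAG_FOR_HORIZON = {24: 168, 96: 168, 720: 720}
```

A call such as `persistence_forecast(history, 24, LAG_FOR_HORIZON[24])` returns the 24 values observed exactly one week before the hours to be predicted.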

Training details
All models are trained with the AdamW optimizer [42] using the mean squared error loss. We use a batch size of 128 and a learning rate of 0.0001 with 1000 warm-up steps and cosine decay with γ = 0.8. When testing different lookback sizes L, we find one week to be optimal for the multivariate Transformer and the local Transformers. For the global Transformer, the results improve with increasing lookback size until L = 336 (two weeks), and stay almost the same for L = 720 (one month). For Transformer models with two weeks input and one month output, the batch size has to be reduced to 64 due to the quadratic memory consumption of the model. For the multivariate Transformer, the batch size is set to 32 as in related work [19,20,21]. The validation error is evaluated every 10,000 training steps and at the end of every epoch. We use early stopping to end the training when no more improvement on the validation set is seen for ten evaluations. For the MLPs, the initial learning rate is set to 0.001 and decayed with γ = 0.5 after every epoch.
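The early stopping rule with a patience of ten evaluations can be sketched as a small helper. The class name and interface are hypothetical, and a strict improvement over the best validation error so far is assumed to reset the counter.

```python
class EarlyStopping:
    """Signal a stop after `patience` evaluations without validation improvement."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, val_error):
        """Record one validation result; return True when training should stop."""
        if val_error < self.best:
            self.best = val_error
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

In a training loop, `update` would be called at every validation point, that is every 10,000 training steps and at the end of every epoch, and training ends as soon as it returns `True`.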

Metric
As in related work [19,20,21,22,23], every load time series is standardized by subtracting its mean and dividing by its standard deviation, and the metrics are computed on these standardized time series. For every hour $t \in T_{\text{test}}$ in the test set, a forecasting model predicts the next $h$ hourly loads $\hat{y}^c_t = \hat{y}^c_{t,t+1}, \ldots, \hat{y}^c_{t,t+h}$ for time series $c$. Then, the mean absolute error (MAE) between the predictions $\hat{y}^c = \{\hat{y}^c_i \;\forall\, i \in T_{\text{test}}\}$ and the ground truth $y^c = y^c_1, \ldots, y^c_{T_{\text{test}}}$ is computed. As the final result, the MAE averaged across all $C$ load time series, the $T_{\text{test}}$ evaluation time points and the $h$ forecasting steps is reported.
The mean squared error (MSE) is computed analogously, using the squared residuals instead of the absolute residuals.
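The two metrics can be sketched as follows, assuming the predictions and the ground truth are stored as arrays of shape (C, number of test time points, h). The array layout and function names are illustrative assumptions.

```python
import numpy as np

def standardize(series):
    """Standardize one load time series to zero mean and unit standard deviation."""
    return (series - series.mean()) / series.std()

def mae(y_true, y_pred):
    """Mean absolute error over all time series, test time points and forecast steps."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mse(y_true, y_pred):
    """Mean squared error over the same three axes."""
    return float(np.mean((y_true - y_pred) ** 2))
```

Since every series is standardized before the errors are computed, clients with a large absolute load do not dominate the average.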

Forecast accuracy
Table 2 shows the MAE results on the two datasets.

Computational cost
The training times are given in Table 3.
Among the multivariate models, Autoformer [20] and FEDformer [21] give better results than the multivariate Transformer. It remains an open question whether these architectures are also better global models than the standard Transformer and PatchTST [22]. Another promising architecture is the Temporal Fusion Transformer [26]. In previous work with just one aggregated time series, the Informer [19] also gave better results than the Transformer [14].

Comparison with the state of the art:
The global Transformer achieves a better result for short-term forecasting on the Electricity dataset than related work [19,20,21,22,23], and achieves close results to the best results from PatchTST [22] for longer horizons and on the Ausgrid solar home dataset. However, to establish a state of the art for short-term and medium-term load forecasting, a comparison to other forecasting models must be undertaken, including models that are not based on the Transformer architecture and that are more sophisticated than our baselines. Using weather data could improve the forecasts, because some electrical load patterns, such as the usage of electrical heating, are weather-dependent. Weather features could affect which model gives the best results, because some models might be better at capturing these dependencies than others.
Linear models: As in related work [23], we observe that linear models are strong baselines. The linear regression is in five out of six cases the best local model and only outperformed by the MLP for the one day horizon on the Electricity dataset. No general answer can be given on whether the local linear regression models are better or the global LTSF-Linear is better, because each variant is better on one dataset.
Task complexity: For longer horizons, the global Transformer's performance compared to the linear models deteriorates. This can be due to the increasing complexity when the model forecasts many values simultaneously. We chose a direct multi-step forecasting model because good results were achieved with this procedure before [22,23]. However, other multi-step forecasting procedures, such as iterative single-step and iterative multi-step forecasting [43,44], could be beneficial for long-term forecasting because they reduce the number of forecasted values per model run.
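The difference between direct multi-step forecasting and iterative single-step forecasting can be made concrete with a toy sketch. The two model arguments are hypothetical callables standing in for trained forecasting models; nothing here reproduces the procedures of [43,44] in detail.

```python
import numpy as np

def direct_forecast(multi_step_model, history, lookback):
    """One model run returns all h values at once."""
    window = np.asarray(history[-lookback:])
    return np.asarray(multi_step_model(window))

def iterative_forecast(one_step_model, history, horizon, lookback):
    """Run a one-step model h times, feeding each forecast back as input."""
    window = list(history[-lookback:])
    forecast = []
    for _ in range(horizon):
        next_value = float(one_step_model(np.asarray(window)))
        forecast.append(next_value)
        window = window[1:] + [next_value]  # slide the window over the new forecast
    return np.array(forecast)
```

The iterative variant predicts only one value per model run, which simplifies the task of each run, but forecast errors can accumulate because later steps depend on earlier forecasts.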
Transfer learning: According to the definition of transfer learning in [31], the global training strategy can be seen as a transfer learning method, because the model must transfer knowledge between different types of buildings with different consumption patterns. Pre-training on other tasks than forecasting or on less similar data from domains other than electricity, as well as fine-tuning for a time series of interest, could improve the results. An advantage of the global model is that it can be applied to new time series without retraining. In [15] it was shown that the Transformer generalizes better to new time series than other approaches, but the forecasts are still better when training data from the target time series is available.

Other forecasting tasks:
The Transformer model and the different training strategies are not designed for load forecasting in particular, but can also be applied to other forecasting tasks. We hypothesize that the global training strategy can also be beneficial for other datasets containing multiple time series with similar patterns.

Conclusion and future work
We compare three Transformer training strategies for load forecasting on two datasets with multiple years of data for several hundred clients. We show that the multivariate training strategy used in related work on forecasting with Transformers [19,20,21] is not optimal, and it is better to use a global model instead. This shows that the right training strategy is crucial to get good results from a Transformer. Our approach achieves better results than related work [19,20,21], and comes close to the best results from PatchTST [22]. In particular, our approach gives better results than the linear models from [23] for one day to four days forecasting horizons, which shows that, with the right training strategy, Transformers are effective for load forecasting. However, simple linear models give decent results for both short-term and medium-term horizons and train much faster than the Transformers.
In the future, more sophisticated Transformer architectures could be tested with the global training strategy. A comparison to other forecasting methods could be undertaken, and weather data could be incorporated into the models to see how it affects the results. Experiments with other datasets and varying amounts of training data could show under which circumstances the global Transformer model is better than other approaches. Additionally, transfer learning from other tasks and datasets could be tested.

Figure 1: The three training strategies, with models depicted as networks. An example with three load time series, four days input and one day output is shown. (a) Multivariate: one model processes all load time series simultaneously; (b) local: separate models (blue, orange, green) process each load time series; (c) global: one model (black) processes all load time series one at a time.
Future work could experiment with different datasets with varying amounts of data to see how much training data is needed for the global model to surpass the local models. A compromise between local and global models could be established by first clustering similar time series and then training one global model per cluster. The cluster-specific models would have less training data than the global model, but could benefit from the training data being more similar. Potentially, the global training strategy could also be beneficial for other forecasting tasks than load forecasting.

Figure 2: Architecture of the Transformer forecasting model. The input and output dimensions differ for the multivariate model and the local and global models. The shown dimensions refer to the Electricity dataset with 321 clients.
1. A multivariate model training strategy, where a single model gets all load time series as input and forecasts all load time series simultaneously.
2. A local model training strategy, where a separate univariate model is trained for each load time series.
3. A global model training strategy, where a generalized univariate model is used to forecast each load time series separately.

Table 1: Training strategy details for the Electricity dataset with 321 load time series, 2.1 years training data and nine time and calendar features. For the local models, training data is the amount of training data per model. Columns: training strategy, models, input size, output size, training data.

The input vectors to the encoder and the decoder are first fed through linear layers to increase the dimensionality to the hidden dimension of the model $d_{\text{model}}$. Both the encoder and the decoder consist of multiple layers with eight self-attention heads, and the decoder layers have eight additional masked cross-attention heads. Finally, a linear layer transforms the h decoder output vectors into a forecast with h × 1 (for local and global models) or h × C (for multivariate models) values. We varied the number of encoder and decoder layers and the hidden dimension $d_{\text{model}}$, and found three layers with $d_{\text{model}} = 128$ to give the best results. The full model architecture is shown in Figure 2.

Training strategies
We compare multivariate, local and global Transformers. The training strategies are depicted in Figure 1 and are further explained in the following. Details on the inputs, outputs, number of models and training data size for each training strategy are given in Table 1.
• Multivariate training strategy: In the input to the model, each time step is represented by a vector of size C + f, where C is the number of load time series and f is the number of calendar features. The model forecasts C values for the next h time steps, i.e. its output consists of h vectors of size C. A single model is used to forecast all time series simultaneously.
• Local training strategy: Local models get only one time series as input and generate a forecast for this time series. In the input, each time step is represented by a vector with f + 1 entries for the f calendar features and the electrical load value. C separate models are trained for the C time series, each using the training data from one time series.
• Global training strategy: The global approach is a single model that generalizes for all load time series. The model gets one load time series as input and generates a forecast for that load time series. In contrast to the local models, only one global model is trained on samples from all load time series, and this model is used to forecast all load time series. This results in C times as much training data for the global model as for a local model. To generate forecasts for all C time series, the global model is used C times with the history of one load time series as input.
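How the three training strategies differ in their training samples can be sketched with array shapes. The function names and the sliding-window construction with a stride of one hour are assumptions for illustration; the load values are placed before the calendar features in each input vector, which is also an arbitrary choice.

```python
import numpy as np

def window_samples(loads, feats, lookback, horizon):
    """Slide a window over time.

    loads: (T, n) load values, feats: (T, f) calendar features.
    Returns encoder inputs (N, lookback, n + f) and targets (N, horizon, n).
    """
    inputs, targets = [], []
    for t in range(lookback, loads.shape[0] - horizon + 1):
        past = slice(t - lookback, t)
        inputs.append(np.hstack([loads[past], feats[past]]))
        targets.append(loads[t:t + horizon])
    return np.stack(inputs), np.stack(targets)

def multivariate_samples(loads, feats, lookback, horizon):
    """One training set; each input vector has size C + f."""
    return window_samples(loads, feats, lookback, horizon)

def local_samples(loads, feats, lookback, horizon):
    """C separate training sets, one per client; input vectors have size 1 + f."""
    return [window_samples(loads[:, [c]], feats, lookback, horizon)
            for c in range(loads.shape[1])]

def global_samples(loads, feats, lookback, horizon):
    """One pooled training set with C times as many samples as a local set."""
    per_client = local_samples(loads, feats, lookback, horizon)
    inputs = np.concatenate([x for x, _ in per_client])
    targets = np.concatenate([y for _, y in per_client])
    return inputs, targets
```

With C clients, the global set contains C times as many samples as each local set while keeping the same univariate input size, which matches the data advantage of the global strategy described above.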
On the Electricity dataset, the global Transformer is the best model for the 24 hours horizon, and PatchTST is the best model for longer horizons. On the Ausgrid solar home dataset, PatchTST is the best model for all three horizons. The global Transformer beats the local Transformers and the multivariate Transformer across all tested horizons. On average, it reduces the error by 21.8% compared to the multivariate Transformer and by 12.8% compared to the local Transformers. Compared to the best local model, the linear regression, it reduces the error by 2.9%. Compared to the best multivariate model, FEDformer, it reduces the error by 15.4%. All multivariate models, including Informer, Autoformer, FEDformer and the multivariate Transformer, perform poorly and do not beat the persistence baseline with a lag of one week. The local linear regression models are slightly better than the global linear model, LTSF-Linear, on the Electricity dataset, but the reverse holds on the Ausgrid solar home dataset. The MLP is in five out of six cases a bit worse than the linear regression, with a 1.5% larger error on average. The local LSTMs are better than the local Transformers, but the Transformer is better as a multivariate model and as a global model (except for the one month horizon on the Electricity dataset). The forecast errors are lower on the Electricity dataset than on the Ausgrid dataset, which is a more fine-grained dataset containing single private houses.

As Table 3 shows, the local Transformer models need by far the longest time to train. Their training time increases sharply with longer forecast horizons. The multivariate Transformer trains fast and is even faster than the MLPs for short horizons. Training a global Transformer is much faster than training the many local Transformers but takes longer than the linear regression, MLP and the multivariate Transformer. The LSTM always trains faster than the Transformer with the same training strategy.

Discussion

Best Transformer training strategy:
On the two datasets, the global Transformer is superior to the multivariate and local Transformers. We hypothesize that this is a result of the larger number of training samples for the global model (see Table 1). The Transformer benefits from more training data, even if the training data comes from different sources. The multivariate models on the other hand are prone to overfitting. PatchTST is the best model in five out of six cases. However, the difference to the global Transformer is small. This shows that the success of PatchTST is mainly a result of its global training strategy. Its improvement upon the global Transformer can be due to the patching mechanism, a better hyperparameter configuration, or the encoder-only architecture.

Table 2: MAE results on the two datasets, with 24, 96 and 720 hours forecast horizon. MV = multivariate, L = local, G = global. The best results are highlighted in bold and the best results per training strategy are highlighted in italic.

Table 3: Training times in hours, measured on a machine with an Nvidia RTX 3090 GPU.

Table 4: MSE results on the two datasets, with 24, 96 and 720 hours forecast horizon. The best results are highlighted in bold and the best results per training strategy are highlighted in italic.