Probabilistic forecast of electric vehicle charging demand: analysis of different aggregation levels and energy procurement

Electric vehicles (EVs) are expected to be vital in transitioning to a low-carbon energy system. However, integrating EVs into the power grid poses significant challenges for grid operators and energy suppliers, especially regarding the uncertainty and variability of EV charging demand. Accurate forecasting of EV charging demand is essential for optimal power system integration, yet previous studies have often only considered point predictions that are inadequate for risk assessment. Therefore, this paper compares different probabilistic forecasting models for the short-term prediction of EV charging demand at various aggregation levels, using a large and novel data-set of over 350,000 charging processes at more than 500 locations across Germany. The performance of both machine learning and deep learning methods is evaluated against a naïve benchmark model, and the impact of data availability on the forecasting models is investigated. Further, the paper examines the effects of forecast accuracy on energy procurement, which has so far received minor attention in the literature. The results show that machine learning methods such as Ada Boosting and Random Forest yield robust results with a normalized root mean square error of 0.42 and 0.41 and a mean absolute scaled error of 0.36 and 0.34 at the highest aggregation level. Furthermore, the results show the influence of different site compositions on the forecast quality and how many charging points are likely to yield a robust forecast. Energy and fleet managers can use the described method to reliably predict the required energy quantities for fleets of sufficient size and procure them at low risk.


Introduction
Electromobility is in the fast lane. Significant obstacles to electromobility, such as range anxiety, battery life, and sustainability concerns, have been overcome, or work is underway to remove them (ev.energy 2023; Recurrent Auto 2024; Regett 2020; Wohlschlager et al. 2022). Global sales of electric vehicles amounted to around 10 million in 2022, with expected growth of over 30%, corresponding to about 14 million vehicles in 2023 (OECD 2023). Ultimately, electromobility, in combination with smart charging, helps integrate renewable energies into the system and thus reduce transport emissions, which is vital to meeting the EU's climate neutrality objectives (Duscha et al. 2019). As a result, more electric consumers, such as electric vehicles (EVs) and heat pumps, will be added to the distribution grids. However, grid integration presents the energy sector with significant challenges (Gemassmer et al. 2021; Müller 2023). In contrast to the grid capacity requirements and regulatory challenges of electromobility, the flexible storage capacity in electric vehicles offers considerable potential for the system and the users (Müller 2023; Kern 2023). Today, electric vehicles can contribute their charging flexibility through smart charging, which supports the integration of renewables and reduces costs and CO2 emissions. In the future, bidirectional electric vehicles can offer more flexibility by charging and discharging the battery. From an energy system perspective, this will happen in a cost-optimal way, with end customers, in turn, earning money or reducing their charging costs through the flexibility they provide (Kern 2023).
The following work is therefore dedicated to comparing different forecasting models for the short-term prediction of the charging energy demand of electric vehicles and the effects of forecast accuracy on energy procurement.
Since time series forecasting has long been a significant area of study, numerous prediction approaches have been developed. It is common to classify forecasting techniques as either statistical or machine learning-based. Nonetheless, because most machine learning algorithms rely on maximum likelihood estimators, they are also statistical. Barker (2020) instead distinguishes between structured and unstructured approaches and notes that both categories still require clarification. Stochastic approaches require prior knowledge of the attributes of the forecast target. Regression techniques, on the other hand, are more data-driven and do not rely on prior information about the time series (Athiyarath et al. 2020). In the field of electric vehicles, many studies focus on predicting the energy demand of the battery, such as Shen et al. (2022), Mediouni et al. (2022), and Chen et al. (2020), on predicting charging station occupancy, such as Ostermann et al. (2022), Aghsaee et al. (2023), and Hecht et al. (2021), or on predicting the charging load, using different methods and approaches.
For example, Xydas et al. (2013) proposed a support vector machine (SVM) model to forecast the EV charging demand using travel and driving patterns. They evaluate the accuracy of their method against a Monte Carlo forecasting technique and show that their SVM model has a mean absolute percentage error of 3.69% compared to 8.99% for the Monte Carlo model. Yi et al. (2022) use a long short-term memory (LSTM) model and a deep learning sequence-to-sequence (seq2seq) approach to forecast the monthly commercial EV charging demand. They use real-world datasets from the State of Utah and the City of Los Angeles to validate their models, showing that the seq2seq model significantly outperforms other models in multi-step prediction (Yi et al. 2022). Zhu et al. (2019) compare different deep learning approaches to forecast the super-short-term charging load of plug-in EVs. Their results for twelve examples over several time steps demonstrate that deep learning methods, primarily LSTM, obtain high accuracy in super-short-term plug-in EV load forecasting (Zhu et al. 2019). As power suppliers do not have information about factors affecting a single car, such as car type, state of charge (SOC), driving behavior, and destination, Kim and Kim (2021) focused on forecasting daily energy consumption using historical charging data, weather, and day effects. They use statistical methods such as the autoregressive moving average (ARMA) and the autoregressive integrated moving average (ARIMA) as well as deep learning methods like the LSTM model, using past values and exogenous variables. Kim and Kim (2021) studied the importance of features on three different geographic scales and found that the discrepancies between the statistical and machine learning approaches were not distinct in the case of microscale data with high variability (Kim and Kim 2021). Xie et al. (2011) use neural networks to forecast the daily EV charging station load by training the model on data from similar historical days. Majidpour et al. (2016) compare forecasts of the EV charging load based on customer profiles and on charging station measurements and show that both datasets yield comparable forecasting errors. The customer-profile-based prediction is faster due to less preprocessing. However, this data is prone to privacy invasion (Majidpour et al. 2016). Their modified pattern sequence-based forecasting model has a symmetric mean absolute percentage error of 6.28% for the charging record and 7.85% for the station record. Besides models such as ARIMA and LSTM, Koohfar et al. (2023) [...] (2021) explore a hierarchical probabilistic electric vehicle load forecasting approach at low-level and high-level resolutions. Using real charging data, they demonstrate that their approach outperforms non-hierarchical methods in hour-ahead and day-ahead forecasting of EV energy consumption and increases the skill of probabilistic forecasting by up to 9.5%. Rathore et al. (2023) use various machine learning models such as Random Forest (RF), XGBoost, and neural network models to predict energy consumption from the historical charging data of the EVs. Their RF and XGBoost models yield the best predictive results.
Energy and fleet managers are faced with the challenge of predicting the quantities of energy required to charge electric vehicles and subsequently procuring them as cheaply as possible. However, previous studies have often only considered point forecasts, which are inadequate for risk assessment. For energy and fleet managers and grid operators, for example, probabilistic forecasts can be advantageous as they show how confident the model is in its prediction. This information can be incorporated into the risk assessment. However, predicting the energy quantities probabilistically is not sufficient on its own, as these quantities also have to be procured as economically as possible on spot markets. This paper, therefore, examines different procurement options based on the forecasts and considers the effects of forecast inaccuracies, which have not yet been sufficiently addressed in the literature. The basis for this work is a new data set with over 350,000 charging processes at more than 500 locations across Germany. So far, the literature has mostly considered models for either small or large fleets and has focused on point forecasts. This paper compares models based on different charging point numbers and geographical aggregation levels and evaluates the prediction quality based on point and quantile forecasts. To ensure the comparability of the models, we use a naive benchmark model and relate our metrics to its results. In addition, to evaluate the probabilistic predictive quality of the models, we report the pinball score (PS) and the interval score (IS). The forecast is made for the next 24-h horizon with a resolution of 15 min. We use walk-forward validation to build robust models that come close to a real-world application. Furthermore, the influence of a shortened data set on the models is examined. A random composition of the sites is also analyzed to provide information on when a group composition yields better prediction results, and a detailed analysis of the characteristics of these group compositions is performed. Finally, we examine the effects of forecast inaccuracies on energy procurement and different procurement strategies in detail.
The paper is structured in the following way: in the "Materials and methods" section, we describe the data set and the features used. Further, we explain the methodology of how we developed our models and which metrics we used to evaluate their performance on the task of predicting the charging load. Subsequently, the methodology for analyzing the effects of forecast inaccuracies on energy procurement is presented. The "Results" section details the model performance results for various aggregation levels, different training lengths, the effects of random site aggregation, and energy procurement. Finally, the "Discussion and conclusion" section presents the discussion of the results and the conclusion.

Materials and methods
For our analysis, we use data from the charging and energy management system ChargePilot, developed by The Mobility House. The data consists of over 350,000 charging sessions from over 500 locations or sites. The data begins on 01.06.2022 and covers almost 1 year of charging sessions until 06.05.2023. However, not all sites contain a year's data, as they have only been added over time. Figure 1 shows the methodology for the first part of the paper.
The raw data consists of the attributes listed in Table 1. Not all obtained attributes are listed, only the ones essential for the analysis.
First, we check for errors in the data, such as a plug-out time before the plug-in time, implausible charging powers, or missing values. However, the data did not show any of those errors. Next, we transform the charging sessions into a time series format of 15-min resolution per charger, rounding the plug-in and plug-out times to the nearest quarter hour. We assume that the electric vehicle is charged at its maximum charging power upon being plugged in and that the charging power is reduced to zero once the targeted energy consumption is reached. Adding up the time series of every charger at one site yields the 15-min resolution charging power time series per site, which serves as our target variable. For each of these time series, we use the features described in Table 2 as input for the models.
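The transformation described above can be illustrated with a minimal sketch. The function names, the half-up rounding rule, and the handling of the last partially used interval (reported as mean power over the 15 min) are assumptions made for illustration and are not taken from the original implementation.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=15)
STEP_HOURS = 0.25  # length of one time step in hours


def round_to_quarter(ts: datetime) -> datetime:
    """Round a timestamp to the nearest quarter hour (half rounds up)."""
    base = ts.replace(minute=0, second=0, microsecond=0)
    seconds = (ts - base).total_seconds()
    steps = int(seconds / 900 + 0.5)  # 900 s = 15 min
    return base + steps * STEP


def session_to_series(plug_in, plug_out, energy_kwh, max_power_kw):
    """Return {interval start: mean power in kW} for one charging session.

    Assumption: the vehicle charges at maximum power from plug-in until
    the session energy has been delivered, then the power drops to zero.
    """
    start, end = round_to_quarter(plug_in), round_to_quarter(plug_out)
    series, remaining, t = {}, energy_kwh, start
    while t < end:
        step_energy = min(remaining, max_power_kw * STEP_HOURS)
        series[t] = step_energy / STEP_HOURS  # mean power in this interval
        remaining -= step_energy
        t += STEP
    return series


def aggregate_site(charger_series):
    """Sum the per-charger series of one site into a site-level series."""
    total = {}
    for series in charger_series:
        for t, power in series.items():
            total[t] = total.get(t, 0.0) + power
    return total
```

For example, an 11 kWh session at an 11 kW charger plugged in from 08:07 to 10:02 is mapped to eight intervals from 08:00 to 09:45, of which the first four carry 11 kW and the rest carry zero.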
The first two added features facilitate identifying trends and patterns in the data over a weekly or daily time frame while accounting for the specified lag. The third and fourth features help capture trends and patterns in the data over specified time intervals. We initially included several lag features. However, correlation analysis showed that the ones in Table 2 had the highest correlation with the target variable. The holiday feature is assigned 1 if the corresponding date is a public holiday in Germany; otherwise, it is 0. Furthermore, we extracted the days of the week from the timestamps (Mon-Sun) and used one-hot encoding to convert the categorical variables into numerical ones. Jump discontinuities are a problem for machine learning algorithms using cyclical data. Therefore, in the final step of data preparation, we applied cyclic feature encoding to the periodic time-of-year and time-of-day features. A simple method is to split each feature into a sine and a cosine part. Since we use a rolling 1-week feature, the input data for the models starts on 08.06.2022. We also tested German weather data as an input feature on a subsample of the data. However, this neither added value nor improved forecast accuracy. Other studies suggest using weather data as an input feature. Therefore, using regional weather data might further improve the forecast.
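The sine and cosine split mentioned above can be sketched as follows. The helper names and the choice of 96 quarter hours per day as the period are illustrative assumptions; the point is only that the last step of a day ends up close to the first step of the next day, which removes the jump discontinuity.

```python
import math


def cyclic_encode(value, period):
    """Map a periodic feature onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def one_hot_weekday(weekday):
    """One-hot encode a weekday index (0 = Monday, ..., 6 = Sunday)."""
    return [1 if i == weekday else 0 for i in range(7)]


# Quarter hour 95 (23:45) and quarter hour 0 (00:00) are neighbours
# after encoding, although their raw values differ by 95.
late = cyclic_encode(95, 96)
midnight = cyclic_encode(0, 96)
gap = math.hypot(late[0] - midnight[0], late[1] - midnight[1])
```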
To compare the effect of various fleet sizes and aggregation levels on the prediction quality, we aggregate the time series per (A) site, (B) postal code, (C) TSO zone, and (D) portfolio, meaning all sites combined. For (A) and (B), we chose five different sites and five postal codes. The five sites have 3, 4, 8, 14, and 145 charging points. Further, for (E), we randomly sample from all sites to investigate the effect of random group compositions and different fleet sizes in more detail. This is done for group sizes of 10, 15, 20, 25, 30, 40, 50, 75, and 100. The random sampling is done 100 times per group size.
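The random group formation in (E) could be sketched as below. The function name, the fixed seed, and sampling without replacement within a group are assumptions for illustration; the text does not state how the draw was implemented.

```python
import random


def sample_groups(site_ids, group_sizes, n_draws, seed=0):
    """Draw n_draws random site groups for every group size.

    Assumption: sites are drawn without replacement within a group,
    while different groups may share sites.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    groups = []
    for size in group_sizes:
        for _ in range(n_draws):
            groups.append(tuple(sorted(rng.sample(site_ids, size))))
    return groups
```

Each returned tuple identifies the sites whose time series would then be summed and normalized before being passed to a model.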
Next, we split the data into training, validation, and test sets in the ratios 75%, 15%, and 10%. To investigate the effect of less available data, we shorten the data set by moving its start to the following dates: 01.09.2022, 01.12.2022, and 01.02.2023. By manually setting the start of the test set to 04.04.2023, we ensure that the models trained with a shortened data set are compared on the same test set. However, this analysis is limited to the aggregation levels (A)-(D) due to computational restrictions. Classical tree-based machine learning models are limited in extrapolating beyond the value range seen during training. To account for this, we normalize the charging power by dividing it by the number of charge points, which compensates for the trend in the energy charged that is caused by additional charge points and sites. In addition, this allows us to visualize the charging power, as showing absolute values would otherwise raise concerns about commercial confidentiality.
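A minimal sketch of the normalization and the chronological split is given below. The function names and the integer percentage arguments are assumptions for illustration; the study additionally pins the test set start to a fixed date, which is not reproduced here.

```python
def normalize_per_charge_point(power_kw, n_charge_points):
    """Divide the aggregated power by the number of active charge points.

    Both inputs are lists aligned on the same 15-min time steps, so a
    growing number of charge points does not appear as a trend.
    """
    return [p / n for p, n in zip(power_kw, n_charge_points)]


def chronological_split(series, train_pct=75, val_pct=15):
    """Split a time-ordered series into train, validation, and test parts.

    The order is preserved (no shuffling), so the test part always lies
    after the data used for training and validation.
    """
    n = len(series)
    i_train = n * train_pct // 100
    i_val = n * (train_pct + val_pct) // 100
    return series[:i_train], series[i_train:i_val], series[i_val:]
```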
As a benchmark model, we use a naïve model (Naïve WD Mean), which takes the mean of the same weekday in the corresponding quarter hour. We use the following machine and deep learning models for our analysis: Linear Regression (LinR), Bagging, Gradient Boosting (GradientB), Ada Boosting (Ada), Random Forest (RF), convolutional neural network (CNN), neural network (NN), and long short-term memory (LSTM). The underlying concepts of the models are described in detail in Breiman (2001), Freund and Schapire (1996), Friedman (2002), Hatalis et al. (2017), Wang and Raj (2017), and Sharkawy (2020). The models were selected to encompass a variety of machine learning and deep learning algorithms. While LinR represents a relatively simple linear model, RF and Bagging are non-linear tree-based ensemble learning techniques. Further, we include the non-linear tree-based models Ada and GradientB, which use boosting. The NN architecture represents one of the simpler deep learning models. CNNs were first developed to analyze images, but they can also be used to predict time series. Due to their particular architecture, LSTMs learn when to memorize and when to ignore past information and are therefore widely used in time series forecasting. We used the Python scikit-learn implementations for the machine learning models and implemented the deep learning models in PyTorch (Pedregosa et al. 2012; Paszke et al. 2019). Deep learning models have the innate capacity to recognize and retain patterns over a wide range of time scales, unlike typical machine learning models, which may depend on manually designed lag features to account for temporal patterns. Owing to these properties, deep learning models can handle temporal dependencies and derive additional features on their own. The basic NN model comprises three linear layers with dropout applied after the initial layer. ReLU functions serve as activation functions between the layers, a pattern retained in the subsequent models. The LSTM model features an LSTM layer with dropout, followed by two linear layers. In the CNN model, two convolutional layers with a kernel size of four are followed by a max-pooling layer and two final linear layers.
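The Naïve WD Mean benchmark introduced above can be sketched in a few lines. The function names and the dictionary-based profile are illustrative assumptions; the idea is only to average all training values that share the same weekday and quarter hour.

```python
from collections import defaultdict
from datetime import datetime


def slot_key(ts: datetime):
    """Identify a time step by (weekday, quarter hour of the day)."""
    return ts.weekday(), ts.hour * 4 + ts.minute // 15


def fit_naive_wd_mean(timestamps, values):
    """Average the training values per weekday and quarter hour."""
    sums, counts = defaultdict(float), defaultdict(int)
    for ts, value in zip(timestamps, values):
        key = slot_key(ts)
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def predict_naive_wd_mean(profile, timestamps):
    """Look up the stored mean for every requested time step."""
    return [profile[slot_key(ts)] for ts in timestamps]
```

A profile fitted on two Mondays with 2 kW and 4 kW at 08:00 would predict 3 kW for 08:00 on any later Monday.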
Due to the temporal structure of the data, hyperparameter tuning, a critical step in maximizing the performance of machine learning models, becomes more difficult in the context of time series forecasting. In time series forecasting, the walk-forward validation technique is frequently employed to mimic real-world situations in which the model is trained on past data and subsequently evaluated on future data points. First, the model is trained using historical data. As our prediction horizon is 24 h, we predict the first day of the validation set and compare the predicted values with the actual values using the mean squared error as the performance metric. Next, we move the time window forward by 24 h and update the training set with the actual values. We then repeat the training and prediction procedure for each subsequent window, updating the model iteratively and assessing its performance at every step. This procedure is also used for testing. We evaluate different combinations of hyperparameters using a grid search, validating each set of hyperparameters through the walk-forward validation process and calculating the average performance across all windows. The combination of hyperparameters that produces the best overall performance is then selected. We test our final model on the test set, which is not used for hyperparameter optimization, to assess the model's generalization performance. The hyperparameters used for the grid search are listed in Table 5.
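The walk-forward procedure with a grid search can be sketched as follows. To keep the example self-contained, a toy model (the mean of the last k observations, with k as the only hyperparameter) stands in for the actual learners; this model and all function names are assumptions for illustration only. In the study, one window would correspond to 96 steps of 15 min.

```python
def walk_forward(series, n_train, horizon, fit, predict):
    """Average MSE over consecutive forecast windows.

    After each window, the actual values are appended to the history
    and the model is refitted before the next window is predicted.
    """
    errors, start = [], n_train
    while start + horizon <= len(series):
        model = fit(series[:start])            # train on all past data
        forecast = predict(model, horizon)     # predict the next window
        actual = series[start:start + horizon]
        errors.append(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / horizon)
        start += horizon                       # move the window forward
    return sum(errors) / len(errors)


def make_last_k_mean(k):
    """Toy stand-in model: forecast the mean of the last k observations."""
    def fit(history):
        window = history[-k:]
        return sum(window) / len(window)
    return fit


def predict_flat(model, horizon):
    """Repeat the fitted level for every step of the horizon."""
    return [model] * horizon


def grid_search(series, n_train, horizon, ks):
    """Score every candidate k by walk-forward MSE and pick the best."""
    scores = {k: walk_forward(series, n_train, horizon, make_last_k_mean(k), predict_flat)
              for k in ks}
    return min(scores, key=scores.get), scores
```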
Due to computational limitations, we apply this procedure only to aggregation levels (A)-(D). Furthermore, the deep learning models are not subjected to hyperparameter tuning due to computational costs, which limits their full potential. For the probabilistic forecasts, we predict the 5% and 95% quantiles. In ensemble models that optimize individual estimators, quantile predictions were derived by extracting quantiles from the predictions of the individual estimators. This approach is also suitable for Ada, where the quantiles need to consider the weights of the estimators. However, this quantile estimation method is not feasible for models that optimize the entire ensemble based on a specified loss function, such as Gradient Boosting. In these instances, the model must be trained with a different loss function, and the pinball loss was chosen for making quantile predictions. The same applies to the deep learning models.
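Extracting a quantile from the predictions of individual estimators, with estimator weights as needed for Ada, can be sketched as a weighted quantile. The definition used here (the smallest prediction whose cumulative weight reaches the target share) is one common convention and an assumption; the original implementation may interpolate differently.

```python
def weighted_quantile(predictions, weights, tau):
    """Return the tau-quantile of estimator predictions for one time step.

    predictions: one prediction per ensemble member
    weights:     one non-negative weight per member (all equal for
                 unweighted ensembles such as bagging or random forests)
    """
    pairs = sorted(zip(predictions, weights))
    threshold = tau * sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= threshold:
            return value
    return pairs[-1][0]  # guard against rounding at tau close to 1
```

With equal weights, this reduces to an ordinary empirical quantile over the ensemble members, which corresponds to the unweighted case described above.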
We evaluate the models based on the following evaluation metrics: root mean squared error (RMSE), normalized RMSE (nRMSE), mean absolute error (MAE), mean absolute scaled error (MASE), R2, pinball score (PS), and interval score (IS).
The mean absolute error (MAE) is one of the most widely used error measures and is defined as (Hyndman and Athanasopoulos 2021):
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \qquad (1)$$
where $n$ is the number of observations, $y_i$ is the actual value, and $\hat{y}_i$ is the model's prediction. Another frequently used metric is the RMSE, which is defined as (Hyndman and Athanasopoulos 2021):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \qquad (2)$$
The nRMSE is a frequently used statistic for assessing a predictive model's accuracy, especially in regression analysis or forecasting. It offers a relative measure of the error between predicted and actual values and is a normalized form of the RMSE. The MASE is often used to evaluate a forecasting model's accuracy and is the normalized form of the mean absolute error. We normalize based on our Naïve WD Mean model, meaning that values above 1 are worse than the benchmark model and values below 1 are better than the benchmark model. This makes it easier to compare the performance of our models. The R2 is frequently employed to measure a regression model's goodness of fit since it shows how well the model's predictions correspond with the actual data. A value of one is a perfect fit, and zero indicates that the model does not explain any of the variability in the target variable. According to James et al. (2021), R2 is defined as:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2} \qquad (3)$$
where $\bar{y}$ is the mean of the target. Let $\tau$ be the target quantile and $q_{i,\tau}$ the quantile forecast. The $\mathrm{PS}_\tau$, which evaluates the upper and lower quantile separately, can then be defined according to Koenker and Machado (1999) as:
$$\mathrm{PS}_\tau = \frac{1}{n} \sum_{i=1}^{n} \begin{cases} \tau \left( y_i - q_{i,\tau} \right) & \text{if } y_i \ge q_{i,\tau} \\ (1 - \tau) \left( q_{i,\tau} - y_i \right) & \text{if } y_i < q_{i,\tau} \end{cases} \qquad (4)$$
The PS is a metric that quantifies the difference between the actual and the predicted quantile value, weighted according to the quantile level. It evaluates the accuracy of the prediction interval, with different quantiles generating distinct values. Better model performance is indicated by a lower PS, which penalizes deviations from actual values within the predicted quantile range less severely. Another metric to assess probabilistic forecasts is the interval score (IS), which considers the width of the prediction interval, also known as sharpness (Hatalis et al. 2017). The IS is typically used together with the PS to assess the prediction model's total predictive uncertainty because, on its own, it cannot adequately characterize the reliability of the prediction interval (Hatalis et al. 2017). The narrower the interval and the closer it is to the actual observations, the smaller the interval score.
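The metrics described above can be sketched in plain Python as follows. Two points are assumptions rather than statements from the text: the normalization of nRMSE and MASE is implemented as a simple ratio to the corresponding benchmark error, and the interval score uses the common form in which the interval width is penalized by 2/alpha times the distance of an observation outside the interval (alpha = 0.1 for the 5% and 95% quantiles).

```python
import math


def mae(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def relative_to_benchmark(model_error, benchmark_error):
    """Assumed normalization: values below 1 beat the naive benchmark."""
    return model_error / benchmark_error


def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def pinball_score(y_true, q_pred, tau):
    """Mean pinball loss of a quantile forecast at level tau."""
    total = 0.0
    for y, q in zip(y_true, q_pred):
        diff = y - q
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total / len(y_true)


def interval_score(y_true, lower, upper, alpha=0.1):
    """Mean interval score: width plus penalty for missed observations."""
    total = 0.0
    for y, lo, up in zip(y_true, lower, upper):
        score = up - lo
        if y < lo:
            score += 2 / alpha * (lo - y)
        elif y > up:
            score += 2 / alpha * (y - up)
        total += score
    return total / len(y_true)
```

The asymmetry of the pinball score is visible for the 95% quantile: an observation 2 units above the forecast costs 1.9, whereas an observation 2 units below costs only 0.1.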
Figure 2 shows the methodology for the second part of this work, in which we examine the effects of forecast inaccuracies and different trading strategies on energy procurement based on the German energy market. For the analysis, we use the German hourly day-ahead energy price, and for the intraday prices, we use the ID1 and ID3 prices from the ENTSO-E Transparency Platform (2024).
Based on the time series for the entire portfolio, we create a model for day-ahead (DA) procurement and two models for intraday procurement. The models differ in terms of their input features, the starting point of the forecast, the forecast horizon, and the resolution. In Germany, the day-ahead auction closes at noon for the next day, and hourly products can be traded. In continuous intraday trading, 15-min products can be traded up to 5 min before the start of delivery. The DA model has an offset of 12 h to the start of the forecast, a forecast horizon of 24 h, and an hourly resolution. The day-ahead model is designed so that it does not contain any lag features that provide the model with information from future observations, thereby preventing data leakage. The 12 h offset results from the gate closure of the day-ahead market, where we assume that the forecast and actual trading are instantaneous. The hourly resolution results from the hourly traded products. Thus, the day-ahead model represents the best possible forecast for buying the hourly products for the next 24 h on the day-ahead market. The intraday models have (a) a lag of 1 h and a forecast horizon of 1 h, and (b) a lag of 15 min and a forecast horizon of 15 min. Both intraday models have an additional 1-h lag feature as input and a resolution of 15 min. Intraday model b was chosen because it represents the best possible model with a resolution of 15 min. Since our time series is based on a 15-min resolution, it makes no difference whether we assume a 5-min or a 15-min lag, as in both cases the last actual value is available to the model as information. Intraday model a was chosen to match the hourly day-ahead products with an hourly forecast horizon and to provide a less-informed comparison model than b. Since model a has a lag of 1 h, it has less information available than model b. The energy procurement process is as follows. For various quantiles, we buy the energy amount forecasted by the DA model for the respective hours of the next day at the given day-ahead price. Next, depending on the intraday model, we buy or sell the difference between the predicted intraday value and the DA forecast (a) 1 h or (b) 15 min before delivery at the ID1 or ID3 intraday price to balance the position. The ID1 index is the weighted average price of all continuous trades completed within the last trading hour; for the ID3, the previous 3 h are taken into account. Thus, we obtain the total energy procurement costs on the spot market for the energy per charge point in the portfolio during the test set. Afterward, we compare the actual values with the predicted ones to assess how much balancing energy would be needed to cover the difference. The balancing group managers in Germany are obliged to minimize their balancing group deviation. They must refrain from strategically exploiting the balancing energy; otherwise, there is a risk of high penalties. Therefore, we only examine the effects on the amount of energy that must be covered by balancing energy, which should be as low as possible, and not the associated costs or revenues resulting from the balancing energy price. However, in other European countries, incorporating the balancing market into the strategy would be possible.
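The procurement steps described above can be sketched as a simple cost calculation. The function name, the units, and the simplification that all series are already aligned on the same delivery intervals (hourly day-ahead quantities split onto the intraday grid beforehand) are assumptions made for illustration.

```python
def procurement(da_forecast, id_forecast, actual, da_price, id_price):
    """Spot market cost and remaining balancing energy.

    All arguments are lists with one entry per delivery interval.
    Quantities are energies (e.g. kWh per charge point), prices are in
    currency per energy unit. Assumption: all series share one grid.
    """
    # Step 1: buy the day-ahead forecast at the day-ahead price.
    da_cost = sum(q * p for q, p in zip(da_forecast, da_price))
    # Step 2: trade the difference between the intraday forecast and the
    # day-ahead position; negative differences are sales and reduce cost.
    id_cost = sum((f - q) * p for f, q, p in zip(id_forecast, da_forecast, id_price))
    # Step 3: whatever the intraday forecast still misses must be covered
    # by balancing energy; only its amount is evaluated, not its price.
    balancing = sum(abs(a - f) for a, f in zip(actual, id_forecast))
    return da_cost + id_cost, balancing
```

Passing a higher or lower quantile of the day-ahead forecast as `da_forecast` shifts volume between the day-ahead and intraday markets, which is how different procurement strategies could be compared under this sketch.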

Results
The first part of this section reports the results of the various models based on the different aggregation levels. The second part of this section describes the outcomes of the energy procurement.

Model performance for various aggregation levels
This section compares the results of the different forecasting models described in the previous sections. The presented results correspond to the models' performance on the test set. The test set starts on 04.04.2023 at 00:00 and ends on 06.05.2023 at 23:45. The predictions are made at midnight with a 24-h horizon and 15-min resolution. Figure 3 shows the power per charging point in kW for (A) a random site, (B) all sites in the zip code, (C) all sites in the TSO zone, and (D) all sites aggregated, during the test set.
The power per charging point for the site displayed on top has high peaks during the day on weekdays. During the night and on weekends, no charging events occur. The site shown in A is also dominant at the zip code level B, as the time series are similar but not identical. For example, small charging events can be seen on Mondays and Sundays. The aggregation at the TSO level differs significantly from that at the zip code level, where charging power is already very regular. Comparing the TSO zone with the aggregation of all sites, it is noticeable that the charging power is smoothed out even further in the afternoons and on the weekends. Further, the simultaneity factor decreases with higher aggregation. Assuming an average charging capacity of 11 kW, the simultaneity factor is around 10%. Figure 4 shows the actual power per charger in W for the entire portfolio in yellow, the predicted power of the Ada model in blue, and the 95% quantile in light blue for the test set. Overall, the forecast follows the actual power, and the 95% quantile also follows the actual power with low dispersion, apart from a few exceptions. Large quantile deviations are noticeable on 10.04.2023 and 01.05.2023. Nevertheless, the prediction is close to the actual value. This can be explained by the fact that both days are national holidays in Germany and this information is available to the models as a feature. However, as the data covers less than a full year, the model is uncertain, as can be seen from the deflection of the quantile.
Figure 5 shows the nRMSE (top) and the MASE (bottom) as a boxplot for the different aggregation levels and the various models. The models are arranged from left to right in the same order as they appear in the legend from top to bottom.
Almost all models have values above 1 for the MASE and nRMSE for both the individual sites and the zip codes, which means that the benchmark model is better in some cases. Looking at the MASE and the nRMSE of the models for the TSO zones and the entire portfolio, it becomes clear that aggregation significantly increases the prediction quality.
The ensemble models Bagging, Ada, and RF excel in accurate point predictions, reflected by low nRMSE and MASE, and exhibit robust quantile prediction, as evidenced by the comparatively low PS. The Ada model has the lowest nRMSE with 0.355 and the lowest PS for the high quantile with 2.767, while the RF has the lowest MASE with 0.411 and an R2 of 0.954. Furthermore, these three provide narrower prediction intervals, as indicated by a lower IS. The three deep learning models demonstrate moderate performance in point prediction, with the CNN achieving the lowest value among them at 0.538, which is significantly better than the benchmark and better than the LinR. However, they are worse than the ensemble models, indicating a potential need for further refinement in capturing underlying patterns. We have chosen our hyperparameters to cover as wide a range as possible, but not every model has an equally large hyperparameter space, so some models may be at a disadvantage. Specifically, looking at Table 5, this can be seen in the number-of-estimators parameter; for example, our RF model has a range of 500, 750, and 1000, GradientB of 50, 100, 150, 200, 250, 300, and 350, and Ada of 10, 50, and 100. Large numbers of estimators can lead to overfitting. The models may not yet have reached their full potential and could achieve even better results depending on the aggregation level. The fact that Ada and RF performed best in the overall portfolio does not necessarily mean that these two are generally the best. In particular, a variety of test pipelines that test different feature compositions, such as weather data, different lag features, and district-specific holidays, but also, for example, different encoding strategies for the cyclic time features and weekdays, in combination with larger hyperparameter spaces, would possibly improve the performance of all models. However, these steps were limited by the lack of computational power. Further, the deep learning models face challenges in accurately predicting quantiles, reflected by a higher PS. As mentioned in the "Materials and methods" section, we did not fully fine-tune and optimize the deep learning architectures; doing so may enhance their predictive capabilities. Comparing the results of the individual sites (see Table 5), it is noticeable that only the values of site E, with over 100 charging points, are significantly better than those of the naive model. While sites A, B, and D all have a MASE greater than 1, some models at site C achieve a better value than the benchmark. This is interesting for two reasons. On the one hand, site C has fewer charging points than site D. On the other hand, the CNN model for the MASE and the LSTM for the nRMSE achieve the best values, although they perform worse than the machine learning models at the other aggregation levels.
In conclusion, Bagging, RF, and Ada perform best; they are particularly good at quantile estimation and point prediction with narrow prediction intervals. The deep learning models (LSTM, CNN, and NN) show moderate performance with room for improvement, especially regarding quantile prediction and the width of the prediction intervals. Comparing the results of Tables 6, 7, and 8, finer aggregation levels (such as the zip code and site level) tend to pose more challenges for the models. The ensemble models Bagging, GradientB, and Ada maintain their robust performance across different aggregation levels, while RF shows performance degradation, indicating challenges in handling finer-grained data. The same is true for the deep learning models, as they exhibit more sensitivity to data granularity. The choice of the best-performing model may depend on the specific aggregation level and the trade-off between computational efficiency and predictive accuracy.

Model performance for different training lengths
To investigate the influence of available data on the prediction performance, we train the models on different data lengths. Figure 6 displays the nRMSE (top) and MASE (bottom) as a boxplot for the models over all aggregation levels for different start dates of the training set. The start date 08.06.2022 represents the results for the complete data set described in the "Model performance for various aggregation levels" section. As mentioned in the "Materials and methods" section, the test set is the same for all data lengths and models to ensure comparability.
Comparing the different start dates, it is initially noticeable that the deep learning models perform particularly poorly with a start date of 01.02.2023. All models perform better with more data, although the dynamics and behavioral patterns can change when charging stations are added at individual sites. However, a certain saturation can be observed, as the minimum values of the models with the start date 08.06.2022 do not improve significantly compared to 01.09.2022. It would be interesting to investigate whether a further significant improvement occurs if more historical data is added, for example, to map seasonality. Unfortunately, we do not have more data. Shorter data sets can lead to overfitting, but training on them is less computationally expensive, which is of particular importance when implementing real applications. Overall, it remains a case-by-case decision whether the simple benchmark model is superior to machine learning models in the case of limited historical data.
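The experimental setup, with several training start dates and one fixed test set, can be sketched as follows. The three start dates are taken from the text; the test period boundaries and the load values are hypothetical placeholders, since the actual split is defined in the "Materials and methods" section.

```python
from datetime import date, timedelta

# Start dates of the training set as named in the text.
TRAIN_STARTS = [date(2022, 6, 8), date(2022, 9, 1), date(2023, 2, 1)]
# Hypothetical test period; the real boundaries are not restated in this section.
TEST_START = date(2023, 4, 1)
TEST_END = date(2023, 4, 30)

def make_daily_samples(first_day, last_day):
    """Build placeholder (day, load) samples; the load values are synthetic."""
    samples = []
    day = first_day
    while day <= last_day:
        samples.append((day, 100.0 + 10.0 * day.weekday()))
        day += timedelta(days=1)
    return samples

def split(samples, train_start, test_start):
    """Return (train, test); the test set does not depend on train_start."""
    train = [s for s in samples if train_start <= s[0] < test_start]
    test = [s for s in samples if s[0] >= test_start]
    return train, test

samples = make_daily_samples(TRAIN_STARTS[0], TEST_END)
splits = {start: split(samples, start, TEST_START) for start in TRAIN_STARTS}
for start, (train, test) in splits.items():
    print(start.isoformat(), "train days:", len(train), "test days:", len(test))
```

Keeping the test set identical across all start dates is what makes the error metrics of the different training lengths directly comparable.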

Analysis of random site aggregation
By forming random groups of different sizes from all the sites, aggregating them, and using them as input for the models, we further investigate the influence of fleet size and the number of charging points on the prediction quality. The random groups consist of 10, 15, 20, 25, 30, 40, 50, 75, and 100 sites, and the random draw is repeated 100 times. We thus formed a total of 36,500 different group compositions. Due to this large number, we did not use hyperparameter tuning for the randomly composed time series. Based on the previous results, we used only the Ada model for the analysis, as it was among the best. Figure 7 shows the nRMSE (a) and MASE (b) for all group compositions according to their number of charging points, together with the frequency distribution.
Looking at the frequency distribution, it becomes clear that the benchmark model is superior to the Ada model only in a few exceptional cases for the group sizes mentioned above. Most groups have between 150 and 400 charging points and a MASE or nRMSE between 0.55 and 0.8. Although the MASE and nRMSE drop significantly to around 0.5 as the number of charging points increases, a few group compositions perform poorly despite many charging points. The group with a MASE and nRMSE of about 0.65 at 900 charging points is particularly striking. At the same time, however, some random group compositions achieve an nRMSE or MASE of 0.55 or less with just around 200 charging points.
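The random grouping procedure can be sketched in a few lines. The group sizes and the number of draws per size are taken from the text; the site data, the seed, and the normalization by the number of charging points in a group are assumptions for illustration, not the authors' implementation.

```python
import random

GROUP_SIZES = [10, 15, 20, 25, 30, 40, 50, 75, 100]  # sizes named in the text
N_DRAWS = 100  # repetitions per group size, as in the text

def draw_groups(site_ids, sizes, n_draws, rng):
    """Draw n_draws random groups (without replacement) for every group size."""
    return [rng.sample(site_ids, size) for size in sizes for _ in range(n_draws)]

def aggregate(group, loads, chargers):
    """Sum the site loads per time step and normalize by the charging points."""
    n_points = sum(chargers[s] for s in group)
    n_steps = len(loads[group[0]])
    series = [sum(loads[s][t] for s in group) / n_points for t in range(n_steps)]
    return series, n_points

rng = random.Random(42)
# Synthetic stand-in data: 150 hypothetical sites with 2 to 20 charging points.
site_ids = list(range(150))
chargers = {s: rng.randint(2, 20) for s in site_ids}
loads = {s: [chargers[s] * rng.uniform(0.0, 500.0) for _ in range(24)]
         for s in site_ids}

groups = draw_groups(site_ids, GROUP_SIZES, N_DRAWS, rng)
aggregated = [aggregate(g, loads, chargers) for g in groups]
series, n_points = aggregated[0]
print("charging points in first group:", n_points)
```

Each aggregated series would then be passed to the forecasting model, and the resulting error metric plotted against the number of charging points in the group.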
The question, therefore, arises as to which characteristics the remarkably predictable groups exhibit and whether these can be determined in advance. Charging point operators or aggregators could then ensure that their balancing groups are composed to meet these characteristics. To investigate this question, we used a wavelet analysis to examine the time series of two group compositions of ten sites each, one with an nRMSE above one and one with an nRMSE of 0.55. Wavelet analysis is a mathematical method that breaks down signals or functions into their frequency components for analysis. Unlike conventional Fourier analysis, wavelet analysis captures both frequency and temporal localization. It analyzes data at various scales and reveals features at varied resolutions by using small, wave-shaped functions known as wavelets. This allows the identification of transient features in the data. Wavelet analysis is a powerful tool for extracting information from complex signals, with applications in many domains, such as signal processing, image analysis, and compression methods. It also has the advantage that the dynamics of charging behavior, which change to some extent over time, become visible. Figure 8 displays the resulting wavelet plots for the group of ten sites with an nRMSE of 0.55 (a) and the group with an nRMSE above one (b); the y-axis depicts the period in days, and the x-axis represents the time. The yellow areas highlight the occurrence at specific times and periods and illustrate how the frequencies contribute to the signal.
The left wavelet plot shows a strong periodicity of 1 day and 7 days. It can also be seen that the signal is weaker at the beginning and increases from September onwards. In addition, the period around Christmas is recognizable, in which the periodicity visibly decreases. The right-hand wavelet plot initially shows little periodicity, especially around November 2022. From the end of January, an increased daily and weekly periodicity can be seen, but more clearly separated than in plot (a). This is because some of the sites were only integrated into the load management system at this time. The displayed wavelet plot thus enables a quick and easy visualization to analyze possible patterns, transients, and frequency components within the signal across different scales. The plot also shows that a shorter data set might lead to better results for the group composition (b).
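To make the idea of a periodicity at 1 day and 7 days concrete, the following sketch computes the time-averaged power of a Morlet wavelet transform at selected periods for a synthetic hourly series. It is not the authors' implementation: the Morlet wavelet with centre frequency 6, the direct summation, and the test signal are assumptions chosen for illustration. A dedicated wavelet library would normally be used for a full scalogram.

```python
import cmath
import math

OMEGA0 = 6.0  # Morlet centre frequency (a common default, assumed here)

def morlet_power(series, period, step=6):
    """Time-averaged Morlet wavelet power of `series` at one Fourier period.

    `period` is given in samples; the transform is evaluated every `step` samples.
    """
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    # Scale whose Morlet wavelet has the requested Fourier period.
    scale = period * (OMEGA0 + math.sqrt(2.0 + OMEGA0 ** 2)) / (4.0 * math.pi)
    norm = math.pi ** -0.25 / math.sqrt(scale)
    half_width = int(4 * scale)  # the Gaussian envelope is negligible beyond this
    powers = []
    for t0 in range(0, n, step):
        lo, hi = max(0, t0 - half_width), min(n, t0 + half_width + 1)
        coeff = 0j
        for t in range(lo, hi):
            eta = (t - t0) / scale
            # Complex conjugate of the Morlet mother wavelet at eta.
            coeff += centred[t] * norm * cmath.exp(-1j * OMEGA0 * eta - 0.5 * eta * eta)
        powers.append(abs(coeff) ** 2)
    return sum(powers) / len(powers)

# Synthetic hourly load over four weeks with a daily and a weaker weekly cycle.
hours = range(24 * 28)
load = [1.0 + math.sin(2 * math.pi * h / 24) + 0.5 * math.sin(2 * math.pi * h / 168)
        for h in hours]

power = {p: morlet_power(load, p) for p in (24, 72, 168)}
for p, value in power.items():
    print(f"period {p / 24:.0f} d: mean power {value:.3f}")
```

In this synthetic case, the power at the 1-day and 7-day periods is far larger than at the 3-day period, which is the kind of contrast the yellow bands in a wavelet plot make visible over time.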
It is advantageous for energy providers if the portfolio is as extensive as possible. However, energy and fleet managers require a location-specific forecast, and a charging management system requires a charging-station-specific forecast. Whether a single location or a group composition can be predicted well can be estimated by looking at the wavelet plot, as described above. Alternatively, a correlation analysis can determine the correlation between the lag features and the charging load. A strong correlation indicates that the location or group composition can be predicted with improved accuracy.
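A correlation analysis of this kind can be sketched as the Pearson correlation between the load and its own lagged values. The lags of 12, 24, and 168 hours and the synthetic load profile are assumptions for illustration; the lag features actually used are described in the "Materials and methods" section.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def lag_correlation(series, lag):
    """Correlation between the series and itself shifted by `lag` steps."""
    return pearson(series[lag:], series[:-lag])

def synthetic_load(n_hours):
    """Hypothetical hourly load: daytime charging, reduced on weekends."""
    load = []
    for h in range(n_hours):
        daily = max(0.0, math.sin(2 * math.pi * (h % 24) / 24))
        is_weekday = (h // 24) % 7 < 5
        load.append(daily * (1.0 if is_weekday else 0.3))
    return load

load = synthetic_load(24 * 7 * 4)
correlations = {lag: lag_correlation(load, lag) for lag in (12, 24, 168)}
for lag, r in correlations.items():
    print(f"lag {lag:>3} h: r = {r:.2f}")
```

For this strictly weekly pattern, the 168-hour lag correlates almost perfectly with the load, the 24-hour lag somewhat less because of the weekday and weekend transitions, and the 12-hour lag negatively.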

Energy procurement
As described in the "Materials and methods" section, we use two intraday models and a day-ahead model with a 12-h offset and 60-min resolution to examine energy procurement on the day-ahead and intraday markets. For this analysis, we use the median of the model predictions. The testing period is the same as mentioned in the "Model performance for various aggregation levels" section. The day-ahead model has an MAE of 26.39 W/charger and an RMSE of 53.30 W/charger. The intraday model with an offset of 60 min (intra60) and a resolution of 15 min has an MAE of 15.67 W/charger and an RMSE of 35.77 W/charger, while the intraday model with an offset of 15 min (intra15) and a resolution of 15 min has an MAE of 13.94 W/charger and an RMSE of 29.33 W/charger.
Figure 9 shows the day-ahead price in red, the intraday ID1 price in blue, and the balancing price in orange.
The upper graph shows only the day-ahead and intraday prices, while the lower graph also shows the balancing price, with the y-axis scaled differently. The day-ahead price shows no significant outliers and ranges between −8.82 and 207.92 €/MWh. The intraday price, on the other hand, covers a significantly wider range of −1323.6 to 595.71 €/MWh. The lower graph makes clear that the balancing price shows even larger outliers: the minimum value is −6082.56 €/MWh, and the maximum is 9853.34 €/MWh. For example, if a trader had bought the required charging energy on the day-ahead market on 10.04.2023 at 11:00, they would have received €8.82 for each MWh. On the intraday market, they would have received €252.96 at 11:00 and €1323.6 at 11:45 for each MWh consumed. By contrast, at 17:00 on the same day, a MWh would have cost €65.54 on the day-ahead market and €14.71 on the intraday market. On 11.04.2023 around 06:00, however, the day-ahead price is significantly below the intraday price. The price difference here is just under €370/MWh. At the same time, the balancing price was just under €690/MWh, so every additional MWh required and not previously procured leads to considerable additional costs. The gap becomes even more extreme at 19:00, as the balancing price here is just under €7600/MWh, which is around 54 times higher than the day-ahead price at the same time. If the forecast is significantly too low at these times and too little energy is procured, this leads to considerable additional costs. Procurement purely on the intraday market would lead to significantly higher costs here. These enormous fluctuations clearly show the potential of smart energy procurement and the additional flexibilization of loads, for example, by postponing charging processes.
For the entire portfolio, the procured amount is 144.96 kWh of charging energy per charger in the test period. First, we assume that we have perfect foresight and purchase all the necessary charging energy once entirely on the day-ahead market and once entirely on the intraday market at the ID1 and ID3 prices. This results in the following electricity costs for charging per charger: day-ahead €15.87, intraday ID1 €16.24, and intraday ID3 €16.17. Table 4 lists the results. This makes it clear that, for the test period, it would be most cost-effective on average to procure all energy on the day-ahead market if one knew in advance exactly how much energy one would need, which in reality is not the case when procuring charging energy for a portfolio.
In the test period, procuring as little energy as possible would be financially advantageous, as the balancing energy price is primarily negative. As explained in the "Materials and methods" section, it is not permitted in Germany to systematically and deliberately exploit balancing energy to gain financial advantages. If this were permitted, it would be possible to speculate on the negative price peaks, with a certain degree of risk, by procuring very little energy at these times. The negative price peaks would lead to such large profits that it would be cheaper overall than procuring on the day-ahead market, which holds for both the test set and the whole data period. Therefore, in the following, we look not at the total costs, including the balancing price, but at the influence of the two intraday models on balancing energy, as this amount must be minimal. The costs for procurement on the day-ahead market based on the day-ahead forecast are €15.637 per charger. After the purchase or sale of the deviation from the forecast of the intra60 model, the additional costs per charger are €0.227 for the ID1 and €0.211 for the ID3, resulting in energy costs per charger of €15.865 and €15.850. As a result of the more accurate forecast of the intra15 model, more energy has to be procured, ultimately leading to higher costs. For the intra15 model, the costs per charger amount to €0.36 for ID1 and €0.34 for ID3, totaling €16.00 and €15.98. However, balancing energy quantities per charger of 12.40 kWh are required for the intra60 model and only 9.74 kWh for the intra15 model. To investigate the impact of a more accurate intraday forecast on costs and balancing energy, we reduced the error of the intra15 forecast by 5, 10, 15, and 20% compared to the actual value. The following costs in € per charger for intraday ID1 procurement result for the adjustment: 0.363, 0.366, 0.369, and 0.371. The total balancing energy in kWh per charger amounts to 9.3, 8.8, 8.3, and 7.8. Thus, a reduction in the necessary balancing energy of 20% increases the intraday procurement costs by only 3%. Relative to the total costs, the additional costs are even less than 0.1%. In addition, this avoids the risk of being exposed to the sometimes very high balancing energy prices, such as on 20.04.2024. The average expected value of the forecast must be used for intraday procurement, as otherwise there is a strategic over- or under-procurement and thus an exploitation of the balancing energy price.
To examine the effects of under- or over-procurement on the day-ahead market, we use the 5% and 95% quantiles of the forecast of the day-ahead model and then procure the difference on the intraday market again. In this case, our analysis shows that, for our predicted load energy in the test period, it is more favorable to procure the lower quantile and then sell or buy the difference than to buy the upper quantile and then sell or buy the difference. However, the difference in the resulting total costs is less than 1%. In order to procure energy with as little risk as possible, energy and fleet managers should purchase the average forecast energy on the day-ahead market. On the one hand, static procurement of charging energy results in lower costs. On the other hand, managers are not forced to buy or sell large quantities in the event of significant fluctuations in the intraday market. In particular, large quantities to be balanced out can exacerbate the price difference in situations with little liquidity. It is crucial to make the intraday volume forecast as accurate as possible because, as shown, the costs increase only marginally, while the balancing energy required decreases to the same extent as the forecast improves. Further, when flexibility is taken into account, the optimization and cost reduction potential increases substantially, not only by shifting charging to times with low prices but also through arbitrage trading.
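The two-stage procurement logic can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: all series share one time resolution, energy is given in MWh per time step, prices in €/MWh, and every number is hypothetical. The function `shrink_error` mimics the experiment in which the forecast error is reduced by a fixed percentage relative to the actual value.

```python
def procurement(actual, da_forecast, id_forecast, da_price, id_price):
    """Return day-ahead cost, intraday cost (EUR), and balancing energy (MWh)."""
    da_cost = sum(q * p for q, p in zip(da_forecast, da_price))
    # Positive deviation: buy on the intraday market; negative deviation: sell.
    id_cost = sum((f_id - f_da) * p
                  for f_id, f_da, p in zip(id_forecast, da_forecast, id_price))
    # Whatever the intraday forecast misses must be covered by balancing energy.
    balancing_energy = sum(abs(a - f_id) for a, f_id in zip(actual, id_forecast))
    return da_cost, id_cost, balancing_energy

def shrink_error(forecast, actual, reduction):
    """Move the forecast towards the actual value by the given error reduction."""
    return [a + (1.0 - reduction) * (f - a) for f, a in zip(forecast, actual)]

# Hypothetical values for three time steps.
actual = [1.0, 2.0, 1.5]
da_forecast = [1.2, 1.8, 1.5]
id_forecast = [1.1, 1.9, 1.5]
da_price = [50.0, 60.0, 40.0]
id_price = [55.0, 70.0, 35.0]

base = procurement(actual, da_forecast, id_forecast, da_price, id_price)
improved_forecast = shrink_error(id_forecast, actual, 0.20)
improved = procurement(actual, da_forecast, improved_forecast, da_price, id_price)

print("base:     DA %.2f EUR, ID %.2f EUR, balancing %.3f MWh" % base)
print("improved: DA %.2f EUR, ID %.2f EUR, balancing %.3f MWh" % improved)
```

In this toy example, reducing the intraday forecast error by 20% lowers the balancing energy by the same 20%, while the intraday cost rises because a larger deviation from the day-ahead position is traded.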

Discussion and conclusion
This section discusses the potential limitations and directions for future work. In line with previous studies, our results confirm that machine learning methods are suitable for predicting the charging load of electric vehicles and that the prediction improves with increasing fleet size. However, previous studies often consider only point predictions, whereas this paper also applies probabilistic predictions. Ensemble models, especially Ada, Bagging, and Random Forest, were shown to be robust across different aggregation levels, making them a reliable choice for different scenarios. These findings are comparable to those of Rathore et al. (2023). Although the machine learning models performed best and were superior to the benchmark model, our deep learning models do not reach their full potential. This can be recognized by comparing the results of the models in Table 3 and Tables 6, 7, and 8. With additional adjustments to the deep learning architectures and additional hyperparameter tuning, they could improve their adaptability to different levels of data granularity and achieve better results. Furthermore, exploring additional features and feature transformations can improve model performance, especially for deep learning models at finer levels of aggregation. Regarding the limitation to the algorithms used, it could be argued that different models such as support vector machines, k-nearest neighbors, or especially state-of-the-art deep learning architectures such as PLCNet or the temporal fusion transformer might achieve better results (Lim et al. 2019; Farsi et al. 2021). Including not only national but also state-specific holidays as a feature could further improve the results of all models. Further, regional weather data as an input feature might improve the forecast accuracy, as we initially tested only with Germany-wide data.
The analysis of historical data of different lengths has shown that, as expected, more data provides improved results, but a certain degree of saturation is indicated. In the future, it would be interesting to investigate the influence of annual data, as this would allow the models to better account for seasonal effects. For new sites that have little or no historical data, one possible approach would be transfer learning. Here, a global model is trained on the existing sites and then used to predict new sites that do not yet have sufficient historical data. In addition, data augmentation techniques can artificially generate data to have more training data available and may ultimately reduce overfitting. These approaches are further possibilities for future research.
The random aggregation of the individual sites has shown that robust results compared to the benchmark model are achieved from approximately 200 chargers. For the random aggregation of the individual sites, it would also be interesting to investigate how the models react to different behavioral dynamics, for example, when chargers are added to the existing sites or new sites are included. Future research could aggregate sites by clustering hard-to-predict sites instead of randomly aggregating them or by training a global model on all data and then applying it to individual sites.
Concerning the procurement of energy volumes, it was shown that an improved forecast leads to less balancing energy, while the procurement costs increase to a much smaller extent. It would be interesting for future work to investigate how procurement can be carried out in countries where it is also permitted to optimize expenses based on the balancing energy price. Additionally, this paper only examines static energy procurement without using the flexibility of electric vehicles, which results from the possibility of shifting charging processes. Future work should investigate how flexibility can be predicted. For example, one approach could be to predict the charging load and whether a vehicle plugs in as a time series per charging point to determine the shifting potential per charging point. In addition, the amount of energy and the plugging duration could be determined as a regression for each plugging process, and the two models could be combined to increase reliability. The flexibility prediction and subsequent optimal energy procurement become particularly complex when bidirectional charging is considered, as it significantly changes the boundary conditions in energy procurement. A combination of prediction and optimization models, e.g., with reinforcement learning, is an approach that future research should examine.
The results presented relate to the charging processes of German companies, primarily from the commercial sector. If data from public or private charging stations is considered, different results will be obtained due to the significantly different charging behavior. By normalizing the charging load to the number of charging points, we have accounted for the ramp-up of electric vehicles. Furthermore, the results are transferable to other countries with similar companies. It should be noted that the charging load may change in the future due to the following factors. First, bidirectional charging will play an increasingly important role, allowing not only charging but also discharging. This will not only change the load but also give users an even greater incentive to keep their vehicles plugged in for long periods of time. Second, further advances in battery technology will increase the charging speed and the battery capacity, which will directly impact the charging load if vehicles have to charge less frequently but for longer, provided the power remains the same.
As companies might have limited time and resources, the following factors should be considered when implementing charging load forecasting models, or any forecast model. Sometimes naive models, such as the mean per day and time, yield accurate results with hardly any implementation and maintenance effort. At the same time, they are easy to understand, and the computational costs are comparatively small. Therefore, in a real-world application, the added value of a more accurate prediction should always be compared to the additional effort of developing and maintaining more complex models.

Fig. 1 Methodology for comparison of different models for various aggregation levels

Fig. 2 Methodology for analyzing effects of forecast inaccuracies on energy procurement

Fig. 3 Power per charging point for different aggregation levels during the test set

Fig. 7 Hexplot for nRMSE (a) and MASE (b) for different group compositions according to their number of charging points

Fig. 8 Wavelet plot for a group of ten sites with an nRMSE of 0.55 (a) and above one (b)

Fig. 9 Day-ahead, intraday ID1, and balancing price for the test set

use a transformers-based deep learning model to predict EV charging demand. However, they only forecast on a daily resolution. Van Kriekinge et al. (2021) apply a deep neural network to forecast the day-ahead charging demand of EVs in 15-min resolution. Additional features such as calendar and weather information reduce the root mean square error (RMSE) and mean absolute error (MAE) by 19.22% and 28.8%, respectively. Their final model has an MAE lower than 1 kW for the day-ahead horizon. While most studies focus on point forecasts, Buzna et al. (

Table 1 Data fields of the raw charging session data

Table 2 Included features

Boxplot for nRMSE (top) and MASE (bottom) for different aggregation levels and models

quality. Bagging, Ada, and RF have the most favorable MASE and nRMSE. Table 3 lists the metrics for the models for the portfolio. The best value is marked in bold, and the second best is underlined. The other model metrics for the different aggregation levels are listed in Tables 6, 7, 8.

Table 3 Metrics for the models based on portfolio level

Table 4 Results for energy procurement. *in € per charger, **in kWh per charger

costs per charger are €0.227 for the ID1 and €0.211 for the ID3, resulting in energy costs per charger of €15.865 and €15.850. As a result of the more accurate forecast of the intra15 model, more energy has to be procured, ultimately leading to higher costs. For the intra15 model, the costs per charger amount to €0.36 for ID1 and €0.34 for ID3, totaling €16.00 and €15.98. However, balancing energy quantities per charger of 12.40 kWh are required for the intra60 model and only 9.74 kWh for the intra15. To investigate the impact of a more accurate intraday forecast on costs and balancing energy, we reduced the error of the intra15 forecast by 5, 10, 15, and 20% compared to the actual value. The following costs in € per charger for intraday ID1 procurement result for the adjustment: 0.363, 0.366, 0.369, and 0.371. The total balancing energy in kWh per charger amounts to

Table 6 Metrics of the models for the TZO zones

Table 7 Metrics of the models for the zip codes

Table 8 Metrics of the models for the individual sites. *Unitless, **in %, ***in W