### Distribution model

#### Model training

In this subsidiary step, four machine learning algorithms are trained on the training data: OCSVM (One-Class Support Vector Machine), logistic regression, random forest, and a neural network. For each algorithm, a set of well-established hyperparameters is defined, and cross-validation is performed over their combinations. For the three binary algorithms, the dataset, which is imbalanced by the unequally distributed pseudo-absence data, is balanced using class weighting. Strong regularization was chosen for all algorithms to counteract overfitting. Due to monotonicity in the data, the sigmoid activation function was chosen for the OCSVM and the neural network. With just one hidden layer, the neural network is deliberately kept simple in this study. The number of hidden neurons is chosen to be about twice the number of input neurons, which allows the model to learn a greater depth of detail.
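The setup above can be sketched as follows. This is a minimal illustration, not the study's exact configuration: the data, hyperparameter values, and regularization strengths are placeholders, and the study additionally cross-validates over hyperparameter combinations, which is omitted here for brevity.

```python
# Sketch of the four-model training setup (data and hyperparameters are
# illustrative placeholders, not the values used in the study).
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))          # 8 input features per grid cell
y = rng.integers(0, 2, size=200)       # 1 = presence, 0 = pseudo-absence

n_features = X.shape[1]
models = {
    # one-class model with sigmoid kernel, trained on presence cells only
    "ocsvm": OneClassSVM(kernel="sigmoid", nu=0.1),
    # small C = strong regularization, to counteract overfitting
    "logreg": LogisticRegression(C=0.1, class_weight="balanced"),
    "rf": RandomForestClassifier(n_estimators=100, max_depth=5,
                                 class_weight="balanced", random_state=0),
    # one hidden layer, ~2x the input width, logistic (sigmoid) activation
    "mlp": MLPClassifier(hidden_layer_sizes=(2 * n_features,),
                         activation="logistic", alpha=1.0,
                         max_iter=2000, random_state=0),
}

models["ocsvm"].fit(X[y == 1])          # one-class: presences only
for name in ("logreg", "rf", "mlp"):    # binary: balanced via class weights
    models[name].fit(X, y)
```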

#### Model validation

To limit the spatial autocorrelation of the data (see Griffith 1992), spatial cross-validation is used for model validation (see Roberts et al. 2017). The subsets for this validation are generated from the municipalities (see Fig. 1). Thus, the algorithms are always trained on one subset of municipalities, their parameters optimized, and then validated on a subset of “unseen” municipalities.
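Grouping the folds by municipality can be sketched with scikit-learn's `GroupKFold`, which guarantees that no municipality appears in both the training and the validation split. The data and municipality labels below are synthetic stand-ins.

```python
# Spatial cross-validation sketch: folds are formed from municipality IDs,
# so validation cells always come from "unseen" municipalities.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 2, size=300)
municipality = rng.integers(0, 10, size=300)   # 10 synthetic municipalities

scores = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=municipality):
    # no municipality is in both the training and the validation split
    assert set(municipality[train_idx]).isdisjoint(municipality[test_idx])
    clf = LogisticRegression(class_weight="balanced").fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
```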

### Results of trial runs and model selection

#### ROC curves

For the ROC curves (Fig. 2), the test data are analyzed according to the cells classified as true positive and false positive. The resulting curves describe, over all thresholds of the output score, the relation between the true and false positive rates, and give an initial indication of the quality of the models. Regarding the false positive rate, it must be noted that the negative examples are the pseudo-absence data. These are a large random selection of background data and are therefore “positively contaminated” to a certain degree: the stronger this “contamination” of the data, the flatter the curves.
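The construction of such a curve can be shown in a few lines; the labels and scores below are synthetic stand-ins for the actual test data.

```python
# ROC sketch: true-positive vs. false-positive rate over all thresholds
# of the output score (labels and scores are illustrative).
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])     # 1 = presence, 0 = pseudo-absence
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one point per threshold
roc_auc = auc(fpr, tpr)                           # area under the curve
```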

#### Model selection

Based on the test performance as measured by AUC, the distribution of the output score, and the interpretability of the models, logistic regression proves to be the most suitable algorithm for this use case, despite a slightly lower AUC than the neural network.

### Evaluation and depiction of distribution model

#### Evaluation of coefficients

One of the questions addressed in this paper is which external factors influence the distribution of wallboxes. It is answered in this section by a coefficient analysis. The left side of Fig. 3 shows in which direction and to what degree each coefficient influences the model; the right-hand window shows the variability of this value under cross-validation. A high degree of variability implies correlations or multicollinearity in the data. In summary, numerous PV installations, high purchasing power and many “GREEN” voters per cell favor the prevalence of e-mobility, whereas extremely large or small population densities, large building plots, low purchasing power and a limited age group militate against its adoption.
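Such a coefficient analysis can be sketched as below: the sign and magnitude of each logistic-regression coefficient give the direction and degree of influence, and the spread across cross-validation folds gives the variability. The feature names and data are illustrative placeholders, not the study's actual variables.

```python
# Coefficient-analysis sketch: mean effect and cross-fold variability of
# each coefficient (features and data are synthetic placeholders).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
features = ["pv_installations", "purchasing_power", "green_voters",
            "population_density", "plot_size"]
X = rng.normal(size=(250, len(features)))
# synthetic target driven positively by the first two features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=250) > 0).astype(int)

coefs = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = LogisticRegression(C=0.1).fit(X[train_idx], y[train_idx])
    coefs.append(clf.coef_[0])
coefs = np.array(coefs)                 # shape: (folds, features)

mean_effect = coefs.mean(axis=0)        # direction and degree of influence
variability = coefs.std(axis=0)         # high spread hints at multicollinearity
```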

#### Geographical depiction

In this section, the results of the logistic regression are presented geographically. In Fig. 4, the predicted distribution is represented on a high-resolution geographic map. The result is a 100 × 100 m geographic grid showing the probability of wallbox occurrence in each cell, depending on factors in the cell's surroundings. As Fig. 4 shows, the wallboxes known at the time of the study are mostly located in cells for which the model outputs a high probability.
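Producing such a raster amounts to evaluating the fitted model once per grid cell and reshaping the probabilities to the map layout. The fitted model, grid dimensions and cell features below are synthetic placeholders.

```python
# Sketch: turning the model's probability output into a 100 m raster
# (model, grid size, and cell features are illustrative placeholders).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X_train = rng.normal(size=(200, 4))
y_train = rng.integers(0, 2, size=200)
model = LogisticRegression().fit(X_train, y_train)

ny, nx = 25, 40                                   # 25 x 40 = 1000 grid cells
grid_features = rng.normal(size=(ny * nx, 4))     # one feature row per cell
probs = model.predict_proba(grid_features)[:, 1]  # P(wallbox) per cell
prob_grid = probs.reshape(ny, nx)                 # map-ready 100 m raster
```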

### Simulation of wallbox distribution

For a blanket simulation of the impact of wallboxes on the power grid, the first step is to simulate an appropriate predicted distribution. For this, the outputs of the distribution model from the foregoing section are taken and weighted random sampling is prepared on their basis. With the grid connection data and the geographic grid of probability values, a probability can be assigned to each connection as a weighting factor. With these elements, a distribution simulation run can be executed: in successive simulation rounds, each grid connection is drawn from the set of all connections according to its assigned probability. Each round corresponds to a relative share of market penetration. A further simulation parameter is the influence of the model: to raise it, the simulation rounds are repeated more often for each penetration level, while the most frequently selected connections are retained. In this way, the influence of the model can be increased and the degree of randomness lessened. For multiple simulations with different numbers of rounds, round counts of 1, 20 and 100 prove the most appropriate. In the following simulations, the influence of the model is expressed as a textual parameter:
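The sampling scheme described above can be sketched as follows: connections are drawn without replacement, weighted by their model probability, and repeating the draw over more rounds while keeping the most frequently drawn connections strengthens the model's influence over pure chance. The connection count, weights and penetration level are synthetic placeholders.

```python
# Distribution-simulation sketch: weighted sampling of grid connections,
# repeated over several rounds; more rounds = more model influence,
# less randomness (weights are synthetic placeholders).
import numpy as np

rng = np.random.default_rng(4)
n_connections = 500
weights = rng.random(n_connections)          # model probability per connection
weights /= weights.sum()                     # normalize to a distribution

penetration = 0.10                           # 10 % market penetration
n_select = int(penetration * n_connections)  # connections per round

def simulate(n_rounds):
    """Count how often each connection is drawn over n_rounds and keep
    the n_select most frequently selected ones."""
    counts = np.zeros(n_connections, dtype=int)
    for _ in range(n_rounds):
        drawn = rng.choice(n_connections, size=n_select,
                           replace=False, p=weights)
        counts[drawn] += 1
    return np.argsort(counts)[-n_select:]    # most frequently selected

low_influence = simulate(1)      # 1 round: mostly random
high_influence = simulate(100)   # 100 rounds: strongly model-driven
```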