Optimization of district heating production with thermal storage using mixed-integer nonlinear programming with a new initialization approach

Non-convex scheduling of energy production allows for more complex models that better describe the physical nature of the energy production system. Solutions to non-convex optimization problems can only be guaranteed to be local optima. For this reason, there is a need for methodologies that consistently provide low-cost solutions to the non-convex optimal scheduling problem. In this study, a novel Monte Carlo Tree Search initialization method for branch and bound solvers is proposed for the production planning of a combined heat and power unit with thermal heat storage in a district heating system. The optimization problem is formulated as a non-convex mixed-integer program, which is incorporated in a sliding time window framework. Here, the proposed initialization method offers lower-cost production planning compared to random initialization for larger time windows. For the test case, the proposed method lowers the yearly operational cost by more than 2,000,000 DKK per year. The method is one step in the direction of more reliable non-convex optimization that allows for more complex models of energy systems.


Introduction
Scheduling of district heating production is a well-studied problem in literature (Deng et al. 2017;Lésko et al. 2018;Gopalakrishnan and Kosanovic 2015;Rong and Lahdelma 2007). It is vital to the daily operation of district heating systems in order to keep operational costs low and ensure proper functioning to meet demand at all times. Additionally, the optimal scheduling problem is used for techno-economic assessment of new potential investments and when building new District Heating (DH) systems (Elsido et al. 2017a). Generally, there are four formulations of the optimal scheduling problem, linear programming (LP) (Lozano et al. 2009;Rong and Lahdelma 2005), mixed-integer linear programming (MILP) (Söderman and Pettersson 2006;Arcuri et al. 2007), non-linear programming (NLP) (Bindlish 2016) and mixed-integer nonlinear programming (MINLP) (Deng et al. 2017;Lésko et al. 2018). The advantage of MILPs and LPs is that they can be solved with commercially available solvers for a global optimum. Despite this feature, using linear models to describe physical systems that typically are highly non-linear introduces error into the model, as a linear model of a non-linear system can at best only be a good approximation. Non-linear models on the other hand, while allowing better systems descriptions are inherently more difficult to solve. For non-linear models that are also non-convex, the problem is even worse, because a solution to a non-convex problem can only be guaranteed to be locally optimal. A common technique for dealing with mixed-integer non-linear models is to approximate the models as mixed-integer linear programs via linearization in order to guarantee that the solution is globally optimal (Lésko et al. 2018;Elsido et al. 2017b). The accuracy of a linear approximation is dependent on how many linear segments a nonlinear function is divided into. This introduces a trade-off between the complexity and the accuracy of the approximation. The more segmented the piece-wise linearization is the better the accuracy. However, each segment introduces additional variables to the model which makes the model bigger and more complex.
As an alternative to linear approximation, several methods exist for solving mixedinteger non-linear programs which do not rely on linearizing the problem. The methods can generally be divided into two groups, derivative-based and derivative-free. Derivative-free algorithms for mixed-integer non-linear programs include evolutionary, genetic, swarm intelligence algorithms (Elsido et al. 2017b;Boukouvala et al. 2016;Luo et al. 2007). Derivative-based methods employed in commercially available software for MINLP include cutting planes, branching and bounding (Boukouvala et al. 2016). As the solution of non-convex optimization problems is not guaranteed to be globally optimal, a common technique to find a good local solution is to repeatedly solve the optimization problem (Tveit et al. 2009;Savola et al. 2007). The employment of this method benefits from computationally efficient solvers and methods that consistently yield solutions with a low optimality gap.
In a recent study (Makkonen and Lahdelma 2006), the authors propose the use of the Power Simplex branch-and-bound algorithm to solve the non-convex scheduling problem of a combined heat and power unit operation. The problem is divided into hourly subproblems, which can be solved sequentially. This is a feasible choice in part because the systems do not have a thermal energy storage unit. The authors emphasize that the Power Simplex branch-and-bound algorithm is efficient because of its capability to reuse parent nodes to calculate child nodes. Rong and Risto (Rong and Lahdelma 2007) proposes an envelope-based branch-and-bound algorithm that employs a pruning technique to improve the computation speed of solving non-convex non-linear mixedinteger program for production planning of a combined heat and power plant. Similar to Simo and Risto (Makkonen and Lahdelma 2006) the problem is formulated as on hourly optimization. The authors succeed in decreasing the computation time significantly compared to ILOG CPLEX 9.0 MIP solver and the Power Simplex branch-andbound solver. Gopalakrishnan and Kosanovic (Gopalakrishnan and Kosanovic 2015) proposes a hybrid genetic algorithm for the solution of the non-convex optimal scheduling problem of a combined heat and power plant. It combines genetic algorithms for exploring the integer solutions space and employs a gradient-search for exploiting isolated regions of the solution space. The authors highlight that the proposed algorithm outperforms classical branch-and-bound algorithms in terms of capability to find integer feasible solutions and moreover that the optimality gap of solutions with the proposed method is lower than with branch-and-bound. Lastly, the authors emphasize that the proposed methods are computationally more efficient than classical branch-andbound algorithms.
In two studies (Tveit et al. 2009;Savola et al. 2007), the repeated initialization of the non-convex solver is based on random initial solutions. This method is intuitively good at probing the solution space, but its speed can be questioned. To improve the computational efficiency of solving non-convex mixed-integer non-linear problems, (Soares et al. 2015) proposes a warm start method in combination with the Differential Search and the Quantum Particle Swarm Algorithms. The warm start methods are based on the solution of the convex relaxation of the non-convex problem. The authors found that the warm start method combined with evolutionary and swarm intelligence algorithms is capable of drastically reducing computation time.
This paper proposes a new warm start initialization procedure based on a stochastic discrete tree search, called Monte Carlo Tree Search (MCTS), which constructs initial feasible solutions for a multi-period steady-state scheduling problem in a DH system with a combined heat and power unit and thermal storage. The authors are not aware of any other studies where tree search has been used as an initialization method. The proposed method is more consistent in finding good solutions than random initialization when the problem size grows. Therefore, the method constitutes an improvement over random initialization when planning for longer periods (12-14 h), which enables smarter storage use and lower operational cost.
The paper is structure as follows; 1) a mathematical model is first presented which describes the district heating system for which the scheduling problem is solved, 2) an optimization problem is defined based on the developed model, 3) the new Monte Carlo Tree Search initialization methods is introduced, 4) the experimental setup is defined, 5) the results are presented and discussed and 6) a conclusion is made.

District heating system model development
In this section, a mathematical model of a DH system consisting of a combined heat and power unit (CHP) with Thermal Energy Storage (TES) is developed. The model will briefly be presented, but not explained in detail as the model is simply regarded as a test suite for the optimization methods tested. The system is modeled as a quasidynamic system with a time step of 1 h and it is illustrated in Fig. 1.
The production-side is located on the upper half of the figure where the plant supplies thermal energyṖ H to the system given by 1). The plant operator then has the option to either charge or discharge the TES resulting in the thermal energy flowQ flow1 to or from the pipe node given in (2). The node has been marked with a blue ring on the figure. The direction of the flow depends on the binary variable x i . (3) gives the thermal energy supplied to the transmission line while (4) and (5) are the energy and mass balance equations for the node respectively. Lastly, (6) describes the temperature of the water flowing to or from the storage, which also depends on the flow direction.
The heat loss per increment pipe is proportional to the temperature difference between fluid temperature and ground temperature. For simplicity, it is assumed that the fluid temperature is constant during the whole pipi segment and only drops at the outlet. This gives the forward and return heat loss expressed in (7) and (8) respectively. Eqs. (9), (10) and (11) describe the temperature of the water flow in terms of the forward heat lossΦ f;i the heat demandḊ H and the return heat lossΦ r;i respectively. The U par L factor has been derived empirically to match data for one of the transmission lines of an actual CHP plant. Lastly, the pump work of the system is given in (12) as a third-order polynomial, which also has been derived empirically from data for a transmission line at a district heating company.
The TES is modeled as a lumped body meaning that the temperature is assumed uniform across the entire volume. The energy balance of the storage Q s, i is expressed in (13). The thermal energy added or removed from the storageQ s;pipe;i is given by (14) and the heat loss is given by Newtons Law of Cooling in (15) where the heatconducting surface area A s, i is varying with the mass as given in (16). The mass balance of the storage is given in eq. (17) and the temperature of the storage is expressed in (18).
The electricity produced by the CHP unit is modeled as shown in Fig. 2. The model is a simplified version of an actual CHP unit at a Danish district heating company. The figure shows that the plant has two modes of operation;Ṗ H; min1 ≤Ṗ H ≤Ṗ H; max1 anḋ P H; min2 ≤Ṗ H ≤Ṗ H; max2 . This is expressed mathematically in (19) by introducing the binary variable z.
Optimization problem The hourly cost of production is defined in (20), where the pump work,Ẇ P;i , and the electricity production,Ṗ E;i , is given in (12) and (19) respectively. EP i is the spot market electricity price and the second term, therefore, accounts for the revenue from the sale of electricity while the first term, accounts, for the fuel cost, where FP i is the fuel price andḞ i is the fuel consumption given by (21). The objective function of the MINLP is thus defined by summing the costs for each time step in (22). Minimize In (21) an equivalent electricity production is calculated from the heat production by multiplying the heat production with the ratio between electricity and heat production at constant fuel consumption, C v (The Danish Energy Agency and Energinet 2019). The production is then divided by the electrical efficiency η E . The constraints for the optimization problem are: The authors have decided to use a Branch-and-Bound solver named Apopt which is implemented in the APMonitor Optimization Suite. The optimization suite is free to Fig. 2 Illustration of the two line-segments used to model the electricity production as function of heat production. The binary z-variable controls which line is used use, offers free cloud computing services, and is compatible with Matlab and Python (APMonitor 2020a). The Apopt solver uses a combination of an active set method and Branch-and-Bound to manage the integer variables (John et al. 2014; APMonitor 2020b). As steady-state optimization of large problems can be time-consuming, the authors have chosen to adopt a sliding time window method for dividing the optimal scheduling problem into smaller more manageable problems which can be solved in sequence. The framework for the optimization can be seen in Fig. 3. In this framework, the optimization of each window is solved Try max times and the best solution to each subproblem is kept for the construction of the final solution.
The sliding time window method introduces two new hyperparameters, namely the length of the sliding window, W L , and the stride, W S , where W S ≤ W L . If the stride is less than the window length it means, there is an overlap between every pair of adjacent windows. If this is the case, the suboptimization of the latter window takes precedence over the region of overlap. Another important consideration is that the minimum look ahead at any discrete time step is given by W L − W S . As an example, consider a sliding time window optimization with W L = 5 and W S = 5, in this example the planning of the 5th hour will not have accounted for any subsequent hours. If W S = 1 instead, the scheduling at each discrete time step will have considered the subsequent 4 h.

Monte Carlo tree search initialization method
The implementation of MCTS in this work requires a static set of actions. This set of actions is composed of several layers, one for each discrete time step in the optimization period. Each layer consists of a number of actions, which are sets of decision variables. E.g. for the optimization problem presented in this work, there are 5 decision variables chosen as T v ,ṁ v ,ṁ x , x, z, which means an action could be defined as e.g. a = {60, 3, 1, 1, 0}. Given some state of the system s 0 = {m s, 0 , T s, 0 }, defined by the mass contained and the temperature of the storage, the action would then transition the system into a new state. Figure 4 visualizes an exemplified version of the actions available at each discrete time step. In the figure, each horizontal row is a layer consisting of several actions, shown as grey dots. The number of actions per time increment may be varied across time steps. Additionally, uniform noise is added to each action to ensure variance in the initial solution. It was found that increasing the number of discretizations for the variableṁ x improved the overall tree search for this problem. The static set of actions can be seen as a discrete mapping of the solution space. The task of the MCTS algorithm is therefore to search this set of actions to find good feasible candidate solutions in the discrete solution space.
When the set of actions for each hour has been created, it is deployed in an MCTS algorithm. The algorithm developed in this work is based on the Upper Confidence Bounds applied to Trees-algorithm (UCT) from (Kocsis and Szepesvári 2006) and the implementation in (Maddison et al. 2016). In this algorithm, each state-action pair (s, a) at simulation time t is associated with an expected reward Q t (s, a) and an exploration bias u t (s, a). Typically, the expected reward Q t (s, a) is calculated by averaging over all the backpropagated rewards that have passed through the node. Each reward is found in the selection phase when a leaf node is encountered. When this happens, random simulation is initiated from the selected leaf node until a terminal state is reached, determining the reward. The random simulation thus acts as a state evaluation function and is advantageous in situations where domain-specific evaluation functions are not available. This method was proved to guarantee an optimal policy when simulation time goes to infinity in (Kocsis and Szepesvári 2006). However, as argued in (Ramanujan and Selman 2011), in cases where a domain-specific evaluation function does exist, the search can be made more efficient and less time-consuming by replacing the random simulation with an evaluation function. In the case of an optimization problem, an objective measure already exists in the form of an objective function, which justifies the before-mentioned replacement. For the implementation in this study, Q t (s, a) will also not represent a reward, but rather a cost as the problem solved is formulated as a minimization problem. Additionally, preliminary experiments in this work confirmed, that calculating Q t (s, a) based on the minimum value among the child nodes of state-action pair (s, a), instead of averaging over backpropagated values, yielded better results as also discussed in (Ramanujan and Selman 2011). Therefore, Q t (s, a) is defined as shown in (23) while the exploration bias u t (s, a) is kept in the original form presented in (Kocsis and Szepesvári 2006), as given by (24) In (23), the expected cost Q t (s, a) of taking action a in state s is found by finding the minimum value among state-action pairs (s a , b) for the resulting child state s a . The set of actions available in state s a is given by M(s a ). In (24), M(s) is the set of actions available in state s and N t (s, a) is the visit count of taking action a in state s. The equation, therefore, compares the total visit count of the parent state s to the child state resulting from action a. The function is designed such that it grows when an action is picked less than the alternatives and therefore encourages exploration. c is a hyperparameter controlling the trade-off between exploration and exploitation. The action selected in each state, called the olicy π(s), is given by (25) based on both how promising the action looks; Q t (s, b), and the degree of exploration; u t (s, b). Figure 5 summarizes the algorithm in a flowchart.
To exemplify how the algorithm works, an illustration has been made in Fig. 6. The nodes drawn represent the system states while the edges represent actions transitioning the system from one state to another. As seen in Fig. 6a, where two nodes have already been expanded, the tree is traversed by iteratively selecting the best candidate among child nodes using (25). Each node stores two values in memory, Q t (s, a) and N t (s, b) which are updated each time backpropagation passes through the node. The selection is repeated until a child node is observed to also be a leaf node. In this case, a random child leaf node is then selected and expanded as shown in Fig. 6b. In this expansion, the feasibility of the chosen child node is first evaluated after which the scaled average objective value Q t, leaf is calculated and backpropagated through the parent nodes using (23). If the node is evaluated as infeasible, all child nodes can be pruned.
The average objective value is defined as the average objective of all the nodes traversed, including the frequently expanded node. This set of nodes is given as the set I in (26), where n is the number of nodes traversed. This is done to ensure comparability of the objective value when it is propagated back through the search tree. When the average f c has been calculated, it is scaled to a value between 0 and 1 according to (27). Here, f c, min and f c, max are the pre-determined lower and upper bounds for the objective function. They are found by treating (20) as a linear program with the variables W P, i , P E, i , and P H, i . The linear program is minimized and maximized separately for each of the two production modes in (19) for each hour i ∈ {1. . N}. This results in 4N linear programming solutions among which the minimum and maximum values, f c, min and f c, max , are extracted. This scaling makes it more convenient to tune the hyperparameter c in (24), as most use-cases of MCTS are within the domain of board-games where an outcome often is represented by 0 and 1 -loss and win. When the backpropagation reaches the root node, the selection starts over as shown in Fig.  6c, followed by expansion and backpropagation as shown in Fig. 6d.

Experimental setup
In order to test the effectiveness of the MCTS initialization method, it is incorporated in the optimization framework described in Fig. 3 as the "Find feasible initial solution"step. It is bench-marked against random initialization using the Apopt Branch-and-Bound solver (BB). Additionally, the MCTS initial solution will be fed to the Apopt Non-Linear-Problem solver (NLP), which is the local solver used for each subproblem in the branch and bound algorithm. The three methods are tested on an optimization period of 48 h with a high variance in electricity price EP and a low variance in heat de-mandḊH as shown in Fig. 7. As explained earlier in Fig. 3, this period is divided into several smaller optimization problems which are solved sequentially using the sliding time window approach. A stride of 5 is used for all simulations while the window Fig. 7 Electricity price (top) and heat demand (bottom) of the problem length W L is varied from 5 to 14. Because of the stochastic nature of both MCTS initialization and random initialization, each sub-optimization is solved for 5 different initial solutions before moving the time window, i.e. Try max = 5. The hyper-parameter c is initially set to 0.1 and is decreased by a factor of 0.9 for each 5e+ 4 iteration until a solution is found after which c is held constant for additional 5e+ 5 iterations. This is done to ensure that a solution is found within reasonable time and memory limits while still allowing for exploration. The whole framework illustrated in Fig. 3 is repeated 5 times for each time window length W L , and for each of the 3 methods. Figure 8 shows the best operational schedule for each of the three methods. It is evident, that all methods yield operational plans that work to mitigate and or take advantage of fluctuations in the electricity spot price by using the TES to allow for overproduction of heat in hours with high electricity price and conversely to underproduce in hours with low electricity price, covering the heat deficit with stored thermal energy. Even though the best scheduling is markedly different for the three methods, the relative difference in the total cost is only approximately 2%. The best solution is found by the MCTS+BB method which gives a total cost of 1098 TDKK over the 48 h period. Comparing this solution with the worst obtained solution of 1199 TDKK, among all 45 simulations, instead gives a relative difference of approximately 9%. This gives a difference of 18.25 mDKK/year, which shows the importance of choosing a method that consistently produces low-cost operational plans.

Results
To study the ability of each method to consistently produce low-cost operational plans, Fig. 9 is introduced, which shows boxplots of the cost and computation time of each method as a function of the size of the sliding time window. A key indicator of the consistency of a method in this presentation is the placement of the quartiles and the interquartile range. The interquartile range shows the spread in the operational cost of and time of the solutions of each method. The interquartile range of Random + BB Fig. 8 Best scheduling for each method. The MCTS+NLP solution was found with W L = 13 and a cost of 1120 TDKK. The MCTS+BB solution was found with W L = 12 and a cost of 1098 TDKK. The Random+BB solution was found with W L = 14 and a cost of 1099 TDKK increases as the time window increases, which is likely caused by the increase in the size of the solution space that makes it more difficult for the method to consistently yield low-cost solutions. Large time windows enable scheduling methods to make better use of the thermal storage to take advantage of fluctuations in the electricity price and heat demand, which is seen as the total cost decreases as the size of the sliding time window increases for all methods. A weakness of the Random+BB method is therefore that the consistency decreases with the time window size. MCTS + BB does not exhibit the same tendency, which indicates that it provides better consistency when the size of the solution space increases. This is more evident from Fig. 10, where the simulation results for windows in range 12 to 14 have been merged for the three methods. Here, the interquartile range is shifted down for MCTS + BB compared to Random + BB indicating that it is more feasible to use MCTS+BB for larger time windows. In fact, if the yearly cost is calculated assuming that the median cost of each method is applicable for the entire year, the cost of scheduling with MCTS+BB would be 2,182,700 DKK lower than Random+BB corresponding to 1.7 DKK/MWh.  α 2 Second order coefficient in pump work equation α 3 Third order coefficient in pump work equatioṅ Φ f Thermal power loss from return path of transmission linė Φ r Thermal power loss from return path of transmission linė D H Sum of heat demand at consumer and distributioṅ F Fuel consumptioṅ m v Outgoing mass flow rate at combined heat and power planṫ m x Mass flow rate to/from thermal storagė m f Forward mass flow rate in transmission linė P E Electricity production at combined heat and power planṫ P H Thermal power production at combined heat and power planṫ Q flow1 Thermal energy flow from node to the thermal storagė Q flow2 Thermal energy flow from node to the transmission linė Q s;loss Thermal heat loss from thermal storagė Q s;pipe Thermal energy flow in/out of the thermal storagė W p Pump work η E Electrical efficiency of combined heat and power plant in condensing mode π(s) Policy state s a action A s Area of heat conducting surface in thermal storage, which varies with contained mass c v Constant relating heat and electricity production at a constant fuel consumption C p1 Constant pressure specific heat of water with unit MJ ton°C: C p2 Constants pressure specific heat of water with unit MWh ton°C EP Electricity price i Subscript, hour M(s) Action set of state s m s Contained mass in thermal storage MC Marginal cost of fuel consumption Q t (s, a) Expected cost of action a in state s Q s Energy content of thermal storage s State T a Ambient temperature used to calculate storage heat loss T s Soil temperature used to calculate transmission heat loss T v Temperature of outgoing water flow at combined heat and power plant T x Temperature of water flow in pipe connecting to thermal T f1 Forward temperature in transmission line from node 1 T f2 Temperature of district heating water at forward transmission line outlet T r1 Temperature of district heating water at return transmission line inlet T r2 Temperature of district heating water at return transmission line outlet u t (s, a) Exploration bias of action a in state s U par L Thermal power loss coefficient of transmission line W L Length of the sliding time window W S Stride of the sliding time windows x Binary variable describing whether the thermal storage is charging or discharging z Binary variable describing the mode of operation of the combined heat and power plant