Spatially Explicit Correction of Simulated Urban Air Temperatures Using Crowdsourced Data


ABSTRACT: Urban climate model evaluation often remains limited by a lack of trusted urban weather observations. The increasing density of personal weather stations (PWS) makes them a potentially rich source of data for urban climate studies that address the lack of representative urban weather observations. In our study, we demonstrate that carefully quality-checked PWS data not only improve urban climate models' evaluation, but can also serve for bias-correcting their output prior to any urban climate impact studies. After simulating near-surface air temperatures over London and south-east England during the hot summer of 2018 with the Weather Research and Forecasting (WRF) model and its Building Effect Parameterization with the Building Energy Model (BEP-BEM) activated, we evaluated the modelled temperatures against 402 urban PWS and showcased a heterogeneous spatial distribution of the model's cool bias that was not captured using official weather stations only. This finding indicated a need for spatially explicit urban bias corrections of air temperatures, which we performed using an innovative method based on machine learning to predict the model's biases in each urban grid cell. This bias-correction technique is the first to consider that modelled urban temperatures follow a non-linear spatially heterogeneous bias that is decorrelated from urban fraction. Our results showed that the bias-correction was beneficial for daily-minimum, -mean, and -maximum temperatures in the cities. We recommend that urban climate modellers further investigate the use of quality-checked PWS for model evaluation and derive a framework for bias-correction of urban climate simulations that can serve urban climate impact studies.

SIGNIFICANCE STATEMENT: Urban climate simulations are subject to spatially heterogeneous biases in urban air temperatures. Common validation methods using official weather stations do not suffice for detecting these biases.
Using a dense set of personal weather stations in London, we detect these biases before proposing an innovative way of correcting them with machine learning techniques. We argue that any urban climate impact study should use such a technique where possible, and that urban climate scientists should continue investigating paths to improve these methods.

Introduction
Although the decades since the 1960s have seen an increase in the body of literature on urban climates (Oke et al. 2017), the scales of applicability and the transferability of the studies' outcomes are often limited. This can partially be attributed to the lack of observations representative of the variety of existing urban climates in cities. To address this limitation, two major solutions were proposed over the past 20 years: firstly, the development of urban surface energy balance schemes coupled to regional climate models (e.g., Masson (2000), Martilli et al. (2002), Wouters et al. (2016)); and secondly, the increased interest in crowd-sourced and low-cost weather sensors (e.g., Muller et al. (2015), Chapman et al. (2017), Fenner et al. (2017), Meier et al. (2017)). After proper validation and parameterization, urban climate models (UCMs) offer an unprecedented opportunity to represent the impact of cities on a wide variety of weather variables at very high spatial and temporal resolutions. This has been further supported by the recent development of global standardized land use land cover datasets designed for urban climate studies that permit their parameterization in cities formerly deprived of these data (see the World Urban Database and Access Portal Tools (WUDAPT) project; Ching et al. (2018), Demuzere et al. (2022)). Likewise, after proper filtering and quality control (Fenner et al. 2021), crowd-sourced personal weather sensors (PWS) permit the extension of sensing networks into urban environments that were formerly not studied, even though PWS often do not meet the standards imposed by official meteorological offices for the siting of weather stations. Several studies have since demonstrated their range of applications (e.g., Fenner et al. (2019), Venter et al. (2020), Potgieter et al. (2021), Benjamin et al. (2021), Varentsov et al. (2021), Venter et al. (2021), Brousse et al. (2022)).
One of the major limitations induced by the lack of official weather stations in cities is that quantifying existing uncertainties as a function of urban climate archetype is not feasible. This means that urban environments are poorly evaluated and have a higher chance of being inaccurately modelled, because studies currently assume that UCMs perform similarly for all types of urban environments that compose a city. In the face of this challenge, crowd-sourced PWS could improve the evaluation of UCMs, as Hammerberg et al. (2018) demonstrated over Vienna. But the potential of PWS may be even greater, particularly when used jointly with or in parallel to UCMs. In fact, a recent study by Sgoff et al. (2022) improved the weather forecasting of the Icosahedral Nonhydrostatic Model (ICON; Zängl et al. (2015)) at a horizontal resolution of 2 km over Germany by assimilating the air temperature and relative humidity data provided by PWS at 2 m height.
Although data assimilation occurs at runtime, PWS could also be used to bias-correct urban climate simulations as a post-processing step. Oleson et al. (2018) already noted the need for a global dataset of urban weather observations to properly bias-correct simulated urban climates. We indeed expect urban climate simulations to have systematic biases that can be induced for a variety of reasons, such as: urban canopy parameters (Demuzere et al. 2017; Hammerberg et al. 2018; Zonato et al. 2020); the complexity of urban climate models (Grimmond et al. 2011; Loridan and Grimmond 2012; Lipson et al. 2021); the time at which the simulation is initialised (Bassett et al. 2020); the choice of initial and boundary conditions for lateral and vertical forcing (Brisson et al. 2015); or the choice of model parameterizations, such as the two evaluated in this work (see Methods). Hence, UCMs will always present a certain degree of uncertainty that has to be allowed for prior to performing urban climate impact studies that use climatic variables derived from modelled simulations to estimate the impact of the urban climate on other outcomes (e.g., mortality, biodiversity). Using PWS could thus be beneficial for obtaining realistic urban weather data for present and future urban climates that can be used to perform urban climate impact studies and guide decision-making.
In this study, we propose to leverage the increasingly dense network of PWS over south-east England since 2015 (Brousse et al. 2022) to evaluate and bias-correct urban climate simulations that were run for the hot summer of 2018, the hottest summer on average in the UK. Common practices in bias-correction include adding the mean bias to the modelled variable distribution or applying a separate correction to each quantile of the distribution (Maraun and Widmann 2018).
Model biases are usually measured at official weather stations at rural sites, thereby assuming that the urban heat island phenomenon is accurately represented by the UCM (e.g., Lauwaet et al. (2015), or Oleson et al. (2018)). Some studies, however, have tried to account for the urban effect by linearly transforming the bias-correction coefficient via an urbanization ratio calculated at each grid cell, as in Wouters et al. (2017) over Belgium. Assuming that urban climate simulation biases cannot be linearly related to the urban fraction only (here defined as the total non-natural fraction of a model grid cell that composes an urban canyon: streets, roofs, and building walls), we decided to test whether urban in-situ observations can be used to perform an urban-specific bias-correction of air temperatures driven by machine learning.
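For contrast, the linear approach discussed above can be sketched as follows. This is a hedged, minimal illustration of the idea of scaling a single rural bias coefficient by the urban fraction of each grid cell; the function name and inputs are ours, not from Wouters et al. (2017):

```python
import numpy as np

def linear_urban_correction(t2, rural_bias, frc_urb):
    """Subtract a single rural bias coefficient, scaled linearly by the
    urban fraction of each grid cell, from the modelled T2 field."""
    return np.asarray(t2, float) - rural_bias * np.asarray(frc_urb, float)

# A cell with urban fraction 1.0 receives the full correction,
# a fully natural cell receives none:
corrected = linear_urban_correction([20.0, 20.0], -1.0, [0.0, 1.0])  # → [20.0, 21.0]
```

Such a scheme assumes the bias scales linearly with urban fraction, which is exactly the assumption our machine-learning approach relaxes.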
We chose to use machine learning regressors to correct the air temperature biases because machine learning allows us to perform spatially explicit bias-corrections that are directly derived from the observed biases at all PWS locations and that are related to a set of spatially explicit covariates.
Machine learning regressors of varying complexity allow for the statistical discretisation of a single relationship between the covariates and the variety of biases. To our knowledge, such a technique has never been proposed as a viable approach for the bias-correction of urban climate simulations, probably because of the lack of observations in urban areas. We hereby hypothesize that such an innovative bias-correction method would be beneficial for urban heat impact studies by improving the UCM outputs on which they rely. Such innovations are needed to better assess the heat burden in cities (Nazarian et al. 2022).
To respond to these issues through the scope of urban near-surface temperatures, we: i) evaluated the ability of the complex three-dimensional UCM embedded in WRF, the Building Effect Parameterization coupled with its Building Energy Model (BEP-BEM), to accurately represent the urban impact on air temperatures under two boundary layer schemes for the summer of 2018 in south-east England, using official weather stations and PWS separately to show their added value for detecting spatially heterogeneous urban temperature biases; ii) used machine learning regressions to predict the models' daily air temperature biases in the urban environment and bias-correct the two simulations of part i, which allowed us to determine an optimal time-step at which the bias-correction should be performed to optimize the outputs; and iii) compared the two bias-corrected products against daily air temperatures predicted using only PWS measurements to investigate how realistic the bias-corrected products are. In parallel, to illustrate the benefit gained from the bias-correction for impact studies, we showcase how the bias-correction leads to different population-weighted temperatures in the Greater London area. We also estimated the number of PWS necessary to achieve optimal machine learning regressor performance and tested the added value of official weather stations for bias-correction.
It is important to consider that our study does not try to estimate whether a bias-corrected model product is better than a product predicted from observations for urban climate impact studies. We simply try to demonstrate that any urban climate impact work that is based on urban climate modelling should pursue a spatially explicit bias-correction specific to urban areas.

Methods

a. Model setup and region of interest
We focused our study on the south-eastern parts of England, centred over the metropolis of London, host to approximately 9 million inhabitants. We chose to model the impact of urbanization using three nested domains. All domains used the same physical and dynamical parameterizations, which we obtained from preliminary testing done over the two hottest days of the summer of 2018, the 26th and 27th of July 2018 (see Appendix A). We thereby used the WRF single-moment 3-class microphysics scheme (Hong et al. 2004), the Dudhia shortwave and RRTM longwave schemes (Dudhia 1989; Mlawer et al. 1997), and the revised MM5 surface layer scheme (Jiménez et al. 2012). In the first domain, the Kain-Fritsch convection scheme was activated (Kain 2004) and then turned off in the second and third domains, which were at convection-permitting scales. We set the model top at 50 hPa with an additional 5000 m damping layer and subdivided the atmosphere into 56 vertical layers. We used the Noah-MP land surface scheme (Niu et al. 2011; Yang et al. 2011) in its default parameterization over 4 soil layers. Urban morphological parameters were derived from the local climate zone (LCZ) map of Demuzere et al. (2019). Thermal and radiative parameters are also directly derived from the LCZ classification and follow those used by Stewart et al. (2014), who used these parameters for the city of Basel, Switzerland. Each parameter for roofs, walls and roads is related to the modal LCZ of each 1 km grid cell via the URBPARM_LCZ.TBL (see Table 1). We decided to keep the roughness length for momentum and the lower boundary temperature of roofs, walls, and roads identical across each LCZ. We fixed the roughness length at 1.0 × 10⁻⁴ m for walls and at 0.01 m for roofs and roads. This does not mean that the effective roughness length at the bulk level does not differ between urban morphologies: although the materials composing them are considered identical in the drag they impose on the flow, their density and height will matter.
Urban canyons with buildings above 25 m and others with buildings below 5 m will effectively have different roughness lengths. For the lower boundary temperatures, we set them at 299 K for the roofs and walls and at 293 K for the road. We chose to deactivate air conditioning in our simulation because air conditioning systems are not common in residential areas across London and the surrounding cities, which compose the major part of the land use land cover.
In this study, two potential planetary boundary layer (PBL) schemes are compared in terms of performance and need for bias correction: the Bougeault-Lacarrère scheme (BouLac; Bougeault and Lacarrere (1989)), commonly used for urban simulations with BEP-BEM, and the YSU scheme recently coupled to BEP-BEM (Hong et al. 2006; Hong and Kim 2008; Hendricks et al. 2020). Although we found that the latter performed better over the two hottest days of summer 2018 (see Appendix A), we decided to keep a simulation with BouLac, as YSU has only been applied over Dallas (Wang and Hu 2021) whereas BouLac has been used in multiple studies already (e.g., Salamanca (2001)). The three-dimensional PBL scheme, also available for BEP-BEM simulations, is disregarded in this study since it is especially intended for mountainous terrain (Zonato et al. 2022), and we are modelling the relatively flat terrain of south-east England.

b. Model evaluation prior to bias correction
We evaluated the model's performance against air temperature measurements at 2 m from 35 official weather stations of the UK Met Office MIDAS network (Sunter (2021), UKMO (2021); Figure 1, lower panel). To address the lack of official observations within the urban environment, we used Netatmo PWS to complement the model evaluation (Figure 1, lower panel). The Netatmo PWS measurements were obtained through the Netatmo App developer API and were collected for all PWS contained within the innermost WRF domain and running over the 2015 to 2020 period (more information can be found in Brousse et al. (2022)).
Prior to the evaluation, unrealistic PWS measurements were filtered out using the CrowdQC v1.0 R package from Grassmann et al. (2018). After quality-checking the PWS, we also applied an additional filter in which we removed PWS that did not have sufficient temporal data coverage or that were not located in an urban pixel according to WRF. Only PWS that have less than 4 hours per day without data and that are located in urban pixels with an urban fraction greater than 0 are retained, where the WRF land-use land-cover at 1 km horizontal resolution refers to an LCZ. This ensures that we do not include measurements that are not representative of the daily variations in air temperatures or of built-up environments. Additionally, the prior filtering performed using the CrowdQC package also ensures that measurements that are not representative of outdoor thermal variations (e.g., indoor sensors) or that result from defective sensors are removed. Overall, the filtering step is necessary to ensure that our model outputs are evaluated against measurements of sufficient quality and that the subsequent bias-correction is deprived of unnecessary noise in the data that could lower its performance. This resulted in a sample of 402 PWS usable for model evaluation and bias correction. Of these, 354 were located in WRF grid cells classified as LCZ 6, 30 in LCZ 5, 8 in LCZ 2, 6 in LCZ 8, 3 in LCZ 9 and 1 in LCZ 3.
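The additional coverage and land-cover filter described above can be sketched as follows. This is a minimal illustration with hypothetical inputs (an hourly temperature series per station and the WRF urban fraction of its grid cell), not the actual processing code:

```python
import pandas as pd

def keep_station(hourly: pd.Series, frc_urb: float) -> bool:
    """Retain a PWS only if every day has fewer than 4 missing hourly
    values and the station lies in a WRF pixel with urban fraction > 0."""
    if frc_urb <= 0:
        return False
    # Count missing hourly values per calendar day
    missing_per_day = hourly.isna().groupby(hourly.index.date).sum()
    return bool((missing_per_day < 4).all())
```

A station with five missing hours on any single day would be dropped even if its overall coverage is high, since its daily temperature cycle is no longer reliably sampled.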
Each model simulation was evaluated using a set of common statistical indicators: the root mean squared error (RMSE), the mean absolute error (MAE), the mean bias error (MB), Spearman's coefficient of correlation (r) and the square of Pearson's coefficient of correlation (r²). These metrics were obtained using the Python scikit-learn and scipy stats packages from Pedregosa et al. (2011) and Virtanen et al. (2020).
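As a sketch, these indicators can be computed with the two packages named above (the helper function and its name are ours):

```python
import numpy as np
from scipy import stats
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluation_metrics(obs, mod):
    """RMSE, MAE, mean bias (model minus observation), Spearman's r,
    and squared Pearson's r between observed and modelled series."""
    obs, mod = np.asarray(obs, float), np.asarray(mod, float)
    return {
        "RMSE": float(np.sqrt(mean_squared_error(obs, mod))),
        "MAE": float(mean_absolute_error(obs, mod)),
        "MB": float(np.mean(mod - obs)),
        "r": float(stats.spearmanr(obs, mod)[0]),
        "r2": float(stats.pearsonr(obs, mod)[0] ** 2),
    }
```

A negative MB indicates a cool bias of the model relative to the observations, the dominant case in our simulations.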

c. Bias correction using personal Netatmo weather stations
In our study, we propose an innovative method to bias-correct urban temperatures at a horizontal scale of 1 km by using machine learning regression. The advantage of machine learning regression compared to more common bias-correction strategies (e.g., the definition of a single bias coefficient) is that we are able to relate our model output biases to spatially varying and explicit sets of parameters. In our case, we make the assumption that the spatial variation in the bias of the model depends only upon the spatial morphological inputs to the UCM. These include the urban fraction, the surface height, the average building height, the building surface to plan area fraction (λb), the plan area fraction (λp) and the frontal area fraction (λf). Using this set of predictive covariates, we train our regressors to predict the bias in the modelled air temperature at 2 m (T2) based on observed biases at urban PWS locations. This way, we are able to bias-correct the modelled temperatures in each urban pixel based on the predicted bias (T2 − bias). Our bias-correction does not make use of official MIDAS weather stations, as their use was found detrimental to the bias correction following an analysis on sample size and sensor types given in Appendix B.
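The workflow can be sketched as follows, with synthetic data standing in for the six morphological covariates and the observed biases (all array names and values are hypothetical; the study trains several regressor types, of which a random forest is shown here):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Covariates at the 402 PWS grid cells: urban fraction, surface height,
# building height, lambda_b, lambda_p, lambda_f (synthetic placeholders)
X_pws = rng.random((402, 6))
bias_pws = rng.normal(-1.0, 0.5, 402)   # observed bias (model minus PWS), in °C

# Learn the mapping from urban morphology to model bias
reg = RandomForestRegressor(n_estimators=200, random_state=0)
reg.fit(X_pws, bias_pws)

# Predict the bias in every urban grid cell and subtract it from the model field
X_urban = rng.random((5000, 6))          # covariates of all urban cells
t2_model = rng.normal(20.0, 2.0, 5000)   # simulated T2 in those cells, in °C
t2_corrected = t2_model - reg.predict(X_urban)
```

Because the learned biases are negative (a cool bias) on average, the corrected field ends up warmer than the raw model output, with a spatial pattern driven by the covariates rather than a single flat offset.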
We chose to bias-correct the simulated daily minimum, mean and maximum T2 (T2min, T2mean, and T2max) using filtered PWS observations in London and south-east England. The daily temporal scale is considered optimal as it combines a higher spatial density of measurements compared to hourly data with a lower computational requirement; it is also a commonly used temporal scale in urban climate impact studies. Lastly, to illustrate the potential benefit of bias-correcting modelled air temperatures prior to urban heat impact studies, we calculate the average population-weighted temperatures, based on the United Kingdom census data from 2011, in Greater London before and after the bias-correction.
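The population-weighted temperature referred to above is a simple weighted mean over grid cells; a minimal sketch (function name ours):

```python
import numpy as np

def population_weighted_temperature(t2, population):
    """Mean temperature over grid cells, weighted by each cell's population."""
    t2, population = np.asarray(t2, float), np.asarray(population, float)
    return float(np.sum(t2 * population) / np.sum(population))

# A hot cell hosting most of the population dominates the average:
population_weighted_temperature([20.0, 30.0], [1.0, 3.0])  # → 27.5
```

Weighting by population rather than area makes the metric sensitive to exactly the urban cells where our bias-correction changes the temperatures most.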

Results

a. WRF simulation evaluation
When we evaluate the two model simulations against MIDAS official weather stations only, they perform similarly, demonstrating a systematic negative bias of ∼0.55°C on average. In general, the bias is larger at night, and, at non-urban stations, performances are similar. Hence, looking only at the models' performances using standard in-situ observations does not provide information on which model represents the urban climate more accurately.
On the other hand, comparison with PWS observations identifies differences in performance in urban areas between the models, as shown by the performance metrics plotted in Figures 3 and C1.
The BouLac simulation has a stronger cool bias of −1.46°C ± 0.6°C on average in the urban area. In parallel, the correlation with the PWS is lower in the suburban areas and higher in the centre of the city. This could mean that YSU accurately represents the urban temperatures on average due to compensating effects, which we do not intend to evaluate in this study. Nevertheless, this shows how PWS are beneficial for capturing the spatial heterogeneity of each model's performance and therefore supports the use of a spatially varying bias-correction.

b. Bias correction of urban climate simulations
Over our domain of study covering south-east England during the summer of 2018, both models are subject to a cold bias of ∼−0.5°C on average according to official stations and of ∼−1.0°C to ∼−1.5°C according to PWS. But as demonstrated above, the bias of the models against PWS observations has substantial spatial variation, and so the bias correction for urban heat impact studies should be spatially explicit.
We find that all machine learning regressors give similar performance (Figure 4; values numerically given in Tables C1 and C2), whether they were trained on the summer time-mean of daily-minimum, -mean or -maximum temperatures ("mean" in Figure 4) or on the temperatures at each daily time-step ("tstep"). All bias-corrections were, however, beneficial compared to the raw model outputs, and were at most marginally improved (by 0.05°C for YSU daily-maximum temperatures, for example) by the time-step bias-correction.
Interestingly, the spatial correlation between the bias-corrected and the observed temperatures also varies between regressors. Comparing the spatial differences of the bias-corrected products in relation to the complexity of each regressor, we find that although each regressor performs similarly on average, important disparities exist between the outputs. For example, when looking at the average bias-correction imposed on daily-minimum temperatures after training the regressors at each time-step, the Lasso and the Ridge regressors impose a flat bias-correction, similar to the dummy regression, while the random forest and gradient boosting regressors' degrees of freedom result in a spatially diverse bias-correction (Figure 5 and Figures C5 and C6). Besides, the linear regression imposes an average bias-correction spatially correlated to the modal LCZ. In general, the signal is consistent across the regressors, apart from the Lasso and the dummy regression: for YSU, central London requires a stronger bias-correction, by 1°C to 2°C, compared to the suburban areas, where the bias-correction is around 0.5°C; for BouLac, the central bias-correction is lower than for YSU. We find that these spatial tendencies also hold for daily-maximum and daily-average temperatures, supporting our hypothesis of a systematic bias correlated to spatially explicit input parameters. The spatial differences in bias-correction are however less important for daily-maximum temperatures, the time of day at which the urban heat island is also expected to be weakest. Finally, we find that the bias-corrected BouLac simulation corresponds spatially to temperatures predicted using PWS more closely than YSU does, something we find consistently across all regressors (Figure 6 and Figures C7 to C11).
As an example, when comparing the average bias-corrected products using the time-step-trained random forest regressor, we can see that YSU urban heat is more homogeneously distributed than BouLac's or than the temperatures predicted from PWS only. BouLac's bias-corrected product shows stronger urban heat in central London compared to suburban areas, consistent with the predicted temperatures. Nonetheless, BouLac's suburban areas are hotter by 0.5°C to 1.0°C than those predicted with PWS only, although this remains less pronounced than in YSU. Lastly, we can see that both bias-corrected products show similar trends when compared to the PWS-only predicted temperatures, with hotter suburban areas and cooler secondary cities and coastlines. Again, this does not show which product between the PWS-only predicted temperatures and the bias-corrected products is better, since we do not evaluate this here.
These results show that bias-correction of modelled air temperatures changes their spatio-temporal distributions. When focusing on the potential impact bias-correction may have on estimates of urban heat impacts on health, we find that using the random forest regression trained at each daily time-step leads to an increase in average population-weighted temperature of 0.77°C in the YSU case, and of 1.24°C in the BouLac case. Raw model outputs thereby underestimate the impact of heat on the urban population.

Discussion
In this study, we argue that the joint use of data from crowd-sourced personal weather stations (PWS) and urban climate models (UCMs) can add value to urban climate research, and in particular to urban climate impact research. This is supported by two major outcomes of our case study of London during the summer of 2018. First, we showed that evaluation of urban climate simulations using PWS enables the detection of spatially varying systematic biases in urban areas related to the UCMs' parameterization, which are not detectable using only official weather stations.
Second, we demonstrated that PWS, combined with detailed morphological data derived from LCZ maps, can be used to derive a spatially varying bias-correction via commonly used machine-learning regressors. This latter point has major implications for urban climate impact research, and especially for future urban climate impact studies, as we hereby propose the first bias-correction technique that considers the existence of a non-linear spatially heterogeneous bias in modelled urban climates. Dense PWS networks have already proven valuable for deriving urban climate indicators (Brousse et al. 2023). For example, Venter et al. (2020) found that a density of one PWS per square kilometre is optimal for predicting seasonal air temperature in Oslo. Such networks hence permit the detection of systematic biases that would otherwise pass undetected.
Therefore, to support the development of PWS as a source of urban weather observations for model evaluation, urban climate scientists should identify an optimal density of PWS for UCM evaluation, to define which cities are in need of urban weather observations, and to start instigating common frameworks and standards.
We consider our study innovative and supportive of future advances in the field because it presents the first bias-correction technique for urban environments that considers that the accuracy of the simulated UHI is spatially heterogeneous, owing to the complexity of urban surfaces and the lack of a linear correlation between urban environmental archetypes and temperatures at local scales.
Aided by the expanding fields of crowd-sourced weather observations through PWS, machine learning, and potentially deep learning, we infer that our work should serve as the basis of future research aiming to improve the bias-correction of urban climate models using PWS. For instance, we did not find any machine learning regressor to be more efficient than the others at predicting the model bias. This could be explained by the rather restricted set of covariates we used for training the regressors, as well as by the coarse horizontal resolution of 1 km at which the covariates were aggregated to be consistent with the model's spatial resolution. Higher spatial resolutions and more specific satellite earth observations could be used to improve the regressors' performance, following up on the work by Venter et al. (2021), for example. When modelling the near-surface UHI, their regressor achieved performance similar to ours, with an RMSE of 1.05°C and a Pearson's r² of 0.23; it is important to note that these are performance metrics for predictions of temperature rather than of urban climate model bias. Although the joint use of the model's input parameters and earth observations as covariates could be beneficial, particular attention should be given to the choice of earth observations, since these should not be decorrelated from the model's physics and dynamics, as the purpose remains bias-correction.
Independent of the set of covariates used in this study, we found that the regressors' performances greatly improved when trained on more than a certain number of PWS (∼90) before plateauing.
Because of this, future research should investigate how machine learning regressors could benefit from unfiltered PWS data and other PWS data sources. Interestingly, we found that official sources of data like MIDAS were detrimental to the regressors, potentially because official weather stations tend to be placed in open fields or parks, without surrounding built-up areas, to increase measurement accuracy. This would explain why our regressors tended to further increase the systematic cool bias when using only MIDAS stations for training, as parks are typically cooler at night and on average than the more urbanised areas where PWS are located. In addition, we found that training regressors at the daily time-step did not outperform a training with the summer time-mean. Further methodological developments could also help integrate these spatial heterogeneities for an improved bias-correction.
In general, we support the use of quality-checked PWS observations for the bias-correction of urban climate simulations. As shown in this case study, model outputs prior to any bias-correction could lead to under- or over-estimation of urban heat impacts on public health. We indeed find that for the summer of 2018 in London, average population-weighted temperatures, which are directly correlated with heat-related mortality, were higher after bias-correcting the model outputs. This suggests that there could be a higher urban heat-related mortality during this period that would not be captured without bias-correction. This simple example shows that bias-correction of urban climate simulations could have important implications for calculating the exposure of urban citizens to heat or for estimating urban heat-related mortality. Although preferring bias-corrected model outputs to urban air temperatures predicted from earth observations for present-day urban heat impact studies is not covered in this study, and must be further explored, we still argue that bias-correction should be done prior to any urban heat impact study that uses climate model outputs.
This argument is especially valid for future climate projections at the urban scale, and we encourage future research to investigate how to transfer present urban bias-correction coefficients to simulated future urban climates. In doing so, bias-corrected simulations could help target areas where heat mitigation or adaptation strategies would be most beneficial, as their efficiency depends on their location and scale of implementation (Yang and Bou-Zeid 2019; Broadbent et al. 2022).
We also suggest that our methods could be extended to other fields of urban climatology and urban air quality. Several devices already offer the possibility to obtain information on air quality, precipitation or wind speed, to name a few (De Vos et al. 2020). Hence, bias-correction of regional climate models' outputs using crowd-sourced data should not be restricted to air temperatures only.

Conclusions
We demonstrate that the higher density of quality-controlled data from personal weather stations improves the evaluation of urban climate models and enables a spatially explicit bias-correction of simulated urban air temperatures.

Model sensitivity testing over the two hottest days of summer 2018
Prior to running the 3-month simulation, we tested the model's sensitivity to a set of parameterizations to assess which configuration would perform best for the 3-month simulation. From there, our simulations tested: i) the use of YSU, recently coupled to the BEP-BEM model (Hendricks et al. 2020), instead of Bougeault-Lacarrère; ii) the use of the more complex land surface scheme Noah-MP in its default parameterization instead of the default Noah land surface model; iii) forcing by ERA5 reanalysis data at 25-km horizontal resolution instead of ERA-Interim; and iv) the reduction of soil moisture by 50 % and its increase by 200 %, following suggestions provided by Martilli et al. (2021). We chose not to test the impact of urban canopy parameters in this case, to keep our simulations standardized and universally coherent through the LCZ scheme.
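For reference, the sensitivity experiments above can be summarized as a simple configuration table. This is only an illustrative sketch: the experiment keys are ours (not WRF namelist options), and we read "increase by 200 %" as a scaling factor of 2.0.

```python
# Baseline configuration, following the text above.
baseline = {
    "pbl_scheme": "Bougeault-Lacarrere (BouLac)",
    "land_surface": "Noah",
    "forcing": "ERA-Interim",
    "soil_moisture_scaling": 1.0,
}

# Each experiment changes exactly one element relative to the baseline.
experiments = {
    "i_ysu_pbl":  {**baseline, "pbl_scheme": "YSU"},
    "ii_noahmp":  {**baseline, "land_surface": "Noah-MP (default options)"},
    "iii_era5":   {**baseline, "forcing": "ERA5 (25 km)"},
    "iv_soil_dry": {**baseline, "soil_moisture_scaling": 0.5},
    "iv_soil_wet": {**baseline, "soil_moisture_scaling": 2.0},
}
```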
Our baseline followed the parameterization of Heaviside et al. (2015), whose simulation used the same microphysics, cloud, convection, and radiation schemes as ours. We found that all steps taken from this original parameterization were beneficial to the model's performance. Through an intermediate simulation in which we tested the BouLac turbulence scheme again after step iii, we found that YSU still performed better.

Sensitivity of machine learning regressors to data quality and quantity
Before running our bias-correction and our bootstrapping, we needed to evaluate how the regressors' performance degrades with the quality and quantity of PWS data available for training. In this case, we evaluate the bias-corrected temperatures against the observed temperatures. We chose not to run over daily time steps, as this would be too computationally expensive.
We followed a bootstrapping procedure in which 20 % of the PWS temperature data were randomly selected and kept for testing the regressors' performance. Random samples with increasing ratios of the remaining 80 % of PWS temperature data and covariates were used to train the regressors 25 times. We ensured that the randomly sampled 20 % and the ratios were kept constant between regressors.
We first started with 1 % of the remaining 80 % and increased the ratio in steps of 1 % until reaching 10 % of the remaining 80 %. Steps of 10 % were then used until reaching 90 % of the remaining 80 %.
We chose these steps because we expected the regressors' performance to increase rapidly with a small amount of data before plateauing with larger amounts. Then, to test the added value of urban PWS density and data, we trained the same regressors on the modelled bias at the locations of the 10 urban MIDAS stations and evaluated the bias-correction against the 20 % of PWS data kept for evaluation at each bootstrapping step. As a comparison, we also evaluated the WRF output prior to bias-correction against the same 20 % of PWS temperature data at each bootstrapping step, to demonstrate the added value of bias-correction using a given amount of PWS.
We found that all regressors benefited from a greater amount of PWS data, which reduced the root mean squared error (RMSE), the mean absolute error (MAE), and the mean bias (MB) on average, and also reduced the variability of performance between bootstrap samples. Only gradient boosting showed slightly deteriorated performance when more than 30 % of the 80 % of PWS data were used for training (96 PWS), probably due to overfitting. Below 40 PWS, all models performed poorly. We also showed that training the regressors on official MIDAS data only led to a poor bias-correction for both summertime average daily-minimum and daily-mean temperatures.
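The three evaluation metrics used here have standard definitions, which can be written compactly (the function and example values below are ours, for illustration only):

```python
import numpy as np

def evaluation_metrics(predicted, observed):
    """MB, MAE and RMSE between bias-corrected and observed temperatures (degC)."""
    err = np.asarray(predicted, dtype=float) - np.asarray(observed, dtype=float)
    return {
        "MB": float(np.mean(err)),                  # mean bias
        "MAE": float(np.mean(np.abs(err))),         # mean absolute error
        "RMSE": float(np.sqrt(np.mean(err ** 2))),  # root mean squared error
    }

# Example: a uniform residual cool bias of -0.5 degC gives
# MB = -0.5, MAE = 0.5, RMSE = 0.5.
m = evaluation_metrics([19.5, 21.0, 23.0], [20.0, 21.5, 23.5])
```

Note that MB keeps the sign of the error (negative for a cool bias), while MAE and RMSE do not, which is why all three are reported together.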
For the daily maximum, no clear benefit was demonstrable, which was also the case with PWS, and which could be explained by the lower UHII during the hot hours of the day, as discussed in the manuscript. We argue that this general outcome is explicable by the standard location of MIDAS stations.

Fig. B1. Regressors' performance for bias-correction of the summer average daily-minimum air temperature depending on the number of weather stations used for training. The performance is evaluated with the mean absolute error (MAE; °C), root mean squared error (RMSE; °C), and mean bias (MB; °C). Blue dots represent the WRF model performance prior to bias-correction, orange dots the performance of the WRF model after bias-correction using MIDAS official weather stations, and grey dots the performance of the WRF model after bias-correction using subsets of the available Netatmo personal weather stations. Small lighter dots show the performances measured at each bootstrapping step (n = 25), and large darker dots the average over all bootstraps. Here the WRF model was run with the Bougeault-Lacarrère boundary layer scheme (BouLac).

Additional Figures and Tables
This section presents all the figures and tables that are not given in the main text.

Table C1. Performance metrics used in Figure 4 for the model using BouLac prior to the bias-correction (WRF) and for all the different regressors (random forest: RF; linear regression: LR; ridge regression: RD; lasso regression: LA; gradient boosting: GB; and dummy regression: DU). Each regression is assigned a suffix: "avg" for regressions trained on the summertime mean of daily-minimum, -mean, or -maximum temperatures, and "tstep" for those trained on the temperatures at each daily time step.

Table C2. Performance metrics used in Figure 4 for the model using YSU prior to the bias-correction (WRF) and for all the different regressors (random forest: RF; linear regression: LR; ridge regression: RD; lasso regression: LA; gradient boosting: GB; and dummy regression: DU). Each regression is assigned a suffix: "avg" for regressions trained on the summertime mean of daily-minimum, -mean, or -maximum temperatures, and "tstep" for those trained on the temperatures at each daily time step.