Average Annual Air Pollution Levels: Model Data, 2024

Methodology

Modelling and Forecasting – How the Process Works

The core forecasting model uses XGBoost, a machine learning method based on gradient-boosted decision trees. This method has proven particularly effective at identifying complex relationships across a wide range of data sources.

A separate model is trained for each relevant air pollutant: NO₂, PM₁₀, and PM₂.₅. Each model can therefore focus on the influencing factors and typical fluctuations specific to its pollutant. To stay current, the models are retrained monthly, which lets them account for changes in traffic behaviour, weather patterns, and other environmental variables.
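
The one-model-per-pollutant setup can be sketched as follows. This is only an illustration: `MeanRegressor` is a hypothetical stand-in that predicts the training mean, the data are made up, and the function name `train_per_pollutant` is an assumption. Any regressor offering `fit` and `predict`, such as an XGBoost regressor, could be passed in as `make_model` instead.

```python
from statistics import mean

POLLUTANTS = ("NO2", "PM10", "PM2.5")


class MeanRegressor:
    """Stand-in model (assumption): always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def train_per_pollutant(features, targets, make_model=MeanRegressor):
    """Fit one independent model per pollutant.

    features: list of feature rows shared by all pollutants
    targets:  dict mapping pollutant name -> list of measured values
    """
    return {p: make_model().fit(features, targets[p]) for p in POLLUTANTS}


# Made-up illustrative values, not real measurements.
X = [[0.0], [1.0], [2.0], [3.0]]
y = {
    "NO2": [20.0, 22.0, 24.0, 26.0],
    "PM10": [15.0, 15.0, 17.0, 17.0],
    "PM2.5": [8.0, 9.0, 10.0, 9.0],
}

models = train_per_pollutant(X, y)
no2_prediction = models["NO2"].predict([[4.0]])[0]
```

Under this layout, monthly retraining amounts to calling `train_per_pollutant` again with the latest data and replacing the stored models.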

Forecasting is carried out on a high-resolution grid with 50 × 50 metre cells, and each grid cell receives its own predicted air pollution level. The fine resolution is key to capturing even small-scale variations in air quality across the city.
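
To illustrate what a 50 × 50 metre grid implies, the sketch below maps a point to its grid cell and back to the cell centre. It assumes coordinates given in metres in a projected coordinate system with a freely chosen origin; the function names are hypothetical and not taken from the actual system.

```python
import math

CELL_SIZE_M = 50  # grid resolution stated in the text


def cell_index(x_m, y_m, origin_x_m=0.0, origin_y_m=0.0):
    """Return (column, row) of the cell that contains the point."""
    col = math.floor((x_m - origin_x_m) / CELL_SIZE_M)
    row = math.floor((y_m - origin_y_m) / CELL_SIZE_M)
    return col, row


def cell_centre(col, row, origin_x_m=0.0, origin_y_m=0.0):
    """Return the centre coordinates (in metres) of a cell."""
    x = origin_x_m + (col + 0.5) * CELL_SIZE_M
    y = origin_y_m + (row + 0.5) * CELL_SIZE_M
    return x, y


def cells_per_square_km():
    """Number of grid cells covering one square kilometre."""
    per_side = 1000 // CELL_SIZE_M  # 20 cells along each side
    return per_side * per_side


example_cell = cell_index(120.0, 49.9)      # (2, 0)
example_centre = cell_centre(*example_cell)  # (125.0, 25.0)
```

At this resolution one square kilometre is covered by 20 × 20 = 400 cells, each of which needs its own prediction.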

The model uses a wide range of features, including:

temporal variables – year, day of the week, time of day, and specific periods like school holidays or public holidays,
meteorological data – temperature, wind direction and speed, precipitation, and other weather-related variables,
spatial factors – building density, proportion of green space, and the road network, all of which affect how pollutants disperse,
traffic data – predicted vehicle counts and speeds for each grid cell, and
past measurements – lagged variables or ‘lags’, which allow the model to identify and respond to short-term trends.
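
Two of the feature groups listed above, temporal variables and lags, can be sketched in a few lines. The holiday set, the chosen lag offsets, and all names are assumptions for illustration; the source does not specify how the actual features are encoded.

```python
from datetime import date, datetime

# Illustrative calendar only (assumption); a real one would come from an
# official source and also cover school holidays.
PUBLIC_HOLIDAYS = {date(2024, 1, 1), date(2024, 5, 1)}


def temporal_features(ts):
    """Derive calendar features from a timestamp."""
    return {
        "year": ts.year,
        "day_of_week": ts.weekday(),  # Monday = 0, Sunday = 6
        "hour": ts.hour,
        "is_public_holiday": ts.date() in PUBLIC_HOLIDAYS,
    }


def lag_features(series, index, lags=(1, 2, 24)):
    """Return earlier values of a series relative to position `index`.

    A lag that reaches before the start of the series yields None.
    """
    return {
        f"lag_{k}": series[index - k] if index - k >= 0 else None
        for k in lags
    }


# Made-up hourly readings, not real measurements.
readings = [10.0, 12.0, 11.0, 13.0]
row = {
    **temporal_features(datetime(2024, 5, 1, 8)),
    **lag_features(readings, index=3, lags=(1, 2)),
}
```

Here `row` combines the calendar features for 1 May 2024, 08:00, with the two preceding readings, which is the kind of input that lets a model react to short-term trends.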

Traffic Forecasting as an Auxiliary Model

Since traffic is one of the key factors influencing air quality, a separate traffic model is trained. It also relies on XGBoost and is tasked with predicting vehicle volumes and speeds across the urban area. These predictions then feed into the air pollution model.
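
The chaining of the two models can be pictured as a two-stage pipeline. In the sketch below, both `traffic_model` and `air_model` are toy stand-ins with invented coefficients (assumptions, not the trained XGBoost models); only the data flow between the stages reflects the description above.

```python
def traffic_model(row):
    """Toy stand-in (assumption) for the traffic sub-model."""
    rush_hour = row["hour"] in (7, 8, 16, 17)
    count = row["road_length_m"] * (3.0 if rush_hour else 1.0)
    speed = 30.0 if rush_hour else 50.0
    return {"vehicle_count": count, "vehicle_speed_kmh": speed}


def air_model(row):
    """Toy stand-in (assumption) for one pollutant model."""
    return 10.0 + 0.05 * row["vehicle_count"] - 0.5 * row["wind_speed_ms"]


def forecast_cell(base_row, weather_row):
    """Run both stages for a single grid cell and time step."""
    # Stage 1: the traffic forecast only sees temporal and spatial inputs.
    traffic = traffic_model(base_row)
    # Stage 2: predicted traffic joins the other features of the air model.
    air_row = {**base_row, **weather_row, **traffic}
    return air_model(air_row)


example = forecast_cell(
    {"hour": 8, "road_length_m": 100.0},
    {"wind_speed_ms": 4.0},
)
```

Keeping the stages separate means the traffic model can be retrained or replaced without touching the pollutant models, as long as it keeps producing the same output fields.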

The traffic model is retrained quarterly rather than monthly, because the most recent quality-assured traffic data only become available after a delay. Weather data are deliberately excluded from this sub-model, as they have little relevance for short-term traffic developments.

Model Quality Assessment – How Effective Is It?

To ensure that the model’s predictions are reliable in practice, model quality is assessed using several statistical metrics:

The MAE (mean absolute error) indicates the average deviation between predicted and actual values.
The RMSE (root mean square error) gives more weight to large deviations, highlighting whether major errors occur in isolated instances.
The coefficient of determination or R-squared reflects how well the model explains the actual variability in the data.
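
The three metrics above follow their standard textbook definitions and can be computed as in this minimal sketch. The example values are made up and chosen so that one large error stands out; they are not results from the Berlin model.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up values: three small errors (2, 2, 0) and one large error (8).
actual = [20, 30, 40, 50]
predicted = [22, 28, 40, 58]

example_mae = mae(actual, predicted)       # 3.0
example_rmse = rmse(actual, predicted)     # sqrt(18), about 4.24
example_r2 = r_squared(actual, predicted)  # 1 - 72/500 = 0.856
```

The single error of 8 raises the RMSE well above the MAE, which is how the RMSE reveals that large deviations occur in isolated instances.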

The evaluation is conducted out-of-sample, meaning it is based on new, unseen data rather than the data used for model training. This approach provides a realistic assessment of the model’s predictive accuracy in everyday use.
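
One common way to obtain unseen data for time-stamped measurements is to hold out everything after a cutoff date. The sketch below assumes such a chronological split; the source does not state how the held-out data are actually selected, and the records and cutoff shown are made up.

```python
from datetime import datetime


def chronological_split(records, cutoff):
    """Split records into training data (before the cutoff) and
    held-out evaluation data (from the cutoff onwards)."""
    train = [r for r in records if r["timestamp"] < cutoff]
    held_out = [r for r in records if r["timestamp"] >= cutoff]
    return train, held_out


# Made-up monthly records, not real measurements.
records = [
    {"timestamp": datetime(2024, month, 1), "no2": 20.0 + month}
    for month in range(1, 7)
]

train, held_out = chronological_split(records, cutoff=datetime(2024, 5, 1))
```

With this cutoff, the January to April records are used for training, while the May and June records are reserved for computing the error metrics.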

The Berlin air pollution forecasting model is a strong example of how state-of-the-art data analysis and machine learning can effectively support urban planning. By integrating current measurements, weather and traffic forecasts, and detailed urban data, it delivers high-resolution, scientifically sound assessments of air quality – citywide and in real time.

Regular updates, the integration of numerous influencing factors, and the ability to evaluate the impact of individual measures render this system a powerful instrument for environmental analysis and data-driven policy development.

Environmental Atlas Contact

Berlin Senate Department for Urban Development, Building and Housing
Mr. Hartbecke

Tel.: (030) 90173-5337