Dissertation: Towards more accurate environmental prediction – research reveals the underlying structures behind air pollution and weather
Environmental data, such as air pollutant concentrations, temperatures, and precipitation levels, are collected simultaneously from thousands of measurement stations around the world. The statistical analysis of these large datasets is exceptionally challenging, as modeling must account for both the dependencies between variables and complex spatial and temporal structures.
- Ozone is a good example of a harmful air pollutant whose formation is a multi-stage phenomenon. It does not arise directly from emissions, but forms in complex chemical reactions through the combined effect of combustion-related emissions and sunlight. The amounts of emissions and sunlight are in turn strongly influenced by the season, geographic region, and several other weather-related variables. Therefore, when modeling and predicting ozone concentrations, it is important to take all of these underlying factors into account, explains Doctoral Researcher Mika Sipilä from the University of Jyväskylä.
The method revealed familiar phenomena in a new way
In his statistics dissertation, Doctoral Researcher Mika Sipilä developed machine-learning-based methods that search for statistically independent latent variables underlying the data. Together, these latent variables capture all the essential information from the observed data. A key strength of the methods is that they leverage the spatial and temporal structure of the data to identify the latent variables – something that has previously been a theoretically very difficult problem in the context of nonlinear modeling.
- From air pollution and weather data, the method succeeded in identifying three easily interpretable latent components. One described combustion-related emissions, another precipitation and humidity, and the third the photochemical process triggered by sunlight in which ozone is formed. The method was thus able to automatically discover structures in the data that closely match current scientific understanding of the phenomena underlying air pollution and weather, explains Sipilä.
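As a rough illustration of how temporal structure can help separate independent latent components, the following Python sketch uses AMUSE, a classical linear second-order blind source separation technique. It is a simplified stand-in and not the dissertation's method, which is based on nonlinear identifiable variational autoencoders. The simulated sources, the mixing matrix, and all function and variable names are assumptions made for this example only.

```python
import numpy as np

def amuse(x, lag=1):
    """Linear second-order source separation (AMUSE); rows of x are time points."""
    xc = x - x.mean(axis=0)
    # Whiten: make the observed variables uncorrelated with unit variance.
    evals, evecs = np.linalg.eigh(np.cov(xc, rowvar=False))
    whitener = evecs @ np.diag(evals ** -0.5) @ evecs.T
    z = xc @ whitener
    # Symmetrized autocovariance of the whitened data at the chosen lag.
    c_lag = z[:-lag].T @ z[lag:] / (len(z) - lag)
    c_lag = (c_lag + c_lag.T) / 2
    # Its eigenvectors give the rotation that separates the sources.
    _, rotation = np.linalg.eigh(c_lag)
    unmixing = rotation.T @ whitener
    return xc @ unmixing.T, unmixing

# Hypothetical data: three independent AR(1) sources with different memory.
rng = np.random.default_rng(0)
n = 5000
phi = np.array([0.9, 0.5, -0.5])
noise = rng.standard_normal((n, 3))
sources = np.zeros((n, 3))
for t in range(1, n):
    sources[t] = phi * sources[t - 1] + noise[t]

# Hypothetical linear mixing of the sources into three observed variables.
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.3, 1.0, 0.4],
                   [0.2, 0.6, 1.0]])
observed = sources @ mixing.T

estimated, _ = amuse(observed)
# Absolute correlations between estimated components (rows) and true sources (columns).
match = np.abs(np.corrcoef(estimated.T, sources.T)[:3, 3:])
print(match.round(2))
```

If the separation works, each row of `match` should contain one value close to 1, meaning that every estimated component lines up with one simulated source up to sign, scale, and ordering.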
Latent variables enhance forecasting
The developed methods can be used to predict the values of observed variables both into the future and for locations where no measurement stations exist.
- Because the latent variables are independent of one another, they can be modeled individually, which makes prediction computationally efficient while still yielding highly accurate results, clarifies Sipilä.
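The idea of modeling independent latent variables one at a time can be sketched in Python as follows. This is a minimal illustration under strong assumptions, not the dissertation's forecasting procedure: the latent components are simulated, each is given its own simple AR(1) model, and a hypothetical linear `mixing` matrix maps latent forecasts back to four observed variables, whereas the actual methods are nonlinear. All names and values are invented for the example.

```python
import numpy as np

def fit_ar1(series):
    """Least-squares AR(1) coefficient for a single zero-mean series."""
    return float(series[:-1] @ series[1:] / (series[:-1] @ series[:-1]))

def forecast_latents(latents, steps):
    """Forecast each latent column separately with its own AR(1) model."""
    coefs = np.array([fit_ar1(latents[:, j]) for j in range(latents.shape[1])])
    # Row h of `powers` holds coef ** (h + 1) for every component.
    powers = coefs[None, :] ** np.arange(1, steps + 1)[:, None]
    return powers * latents[-1], coefs

# Hypothetical latent components: independent AR(1) series.
rng = np.random.default_rng(1)
n = 5000
true_phi = np.array([0.9, 0.5, -0.5])
noise = rng.standard_normal((n, 3))
latents = np.zeros((n, 3))
for t in range(1, n):
    latents[t] = true_phi * latents[t - 1] + noise[t]

# Hypothetical linear map from three latents to four observed variables.
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.3, 1.0, 0.4],
                   [0.2, 0.6, 1.0],
                   [0.7, 0.1, 0.3]])

latent_forecast, coefs = forecast_latents(latents, steps=5)
obs_forecast = latent_forecast @ mixing.T  # forecasts for the observed variables
print(coefs.round(2))
print(obs_forecast.shape)
```

Because each component is fitted on its own, the cost grows linearly with the number of latent variables, and no cross-dependencies between components need to be estimated.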
The public defense of MSc Mika Sipilä’s doctoral dissertation in statistics, “Identifiable variational autoencoders for modeling spatial and spatio-temporal data”, will be held on 15 May 2026 at 12:00 in Agora Auditorium 3. The opponent is Professor Andreas Artemiou (University of Limassol) and the custos is Assistant Professor Sara Taskinen (University of Jyväskylä). The defense will be held in English.
The dissertation is available online at: https://urn.fi/URN:ISBN:978-952-86-1479-1.