Penalizing complexity for better stats

Founded on the Occam’s razor principle of “the simpler the better,” a statistical package has been developed by KAUST researchers that optimally adjusts flexible statistical models for spatiotemporal data. The approach, implemented in widely used statistics software, will help researchers to make more accurate predictions from observational datasets.

Statistical methods are the principal tools researchers use to make sense of observational data. Some datasets have straightforward statistical representations, such as the distribution of heights in a population, with an average and a roughly equal distribution of higher and lower values. However, many environmental phenomena do not follow such simple “Gaussian” distributions and require more flexible statistical interpretations.

A non-Gaussian distribution with a different shape can be derived by training a model to fit the observed data using various statistical methods. However, on its own, this approach can produce odd results that stray from the underlying phenomena.

“Trained non-Gaussian models tend to ‘overfit’ the data too easily, particularly when the data size is not large,” says Rafael Cabral, a statistician who studied for his Ph.D. at KAUST with Professor Håvard Rue. “By training too well on observed data, this approach can capture random fluctuations rather than the underlying pattern.”

This situation can arise when a model has too much flexibility relative to the available data, and so tends toward more complex non-Gaussian models when a simpler Gaussian model might actually be the best choice. Cabral, with David Bolin and Håvard Rue, devised a framework to keep this training process in check^[1].

“To address this, we made use of Bayesian learning, which allows us to incorporate prior knowledge and beliefs about the data into our model in a rigorous, quantified way,” says Cabral. “The output is then a compromise between the evidence provided by the data and our prior beliefs, which penalizes model complexity and gives preference to the simpler Gaussian model, in line with the Occam’s razor principle.”
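The compromise Cabral describes can be illustrated with a toy calculation. The sketch below is not the team's R-INLA or Stan code: it assumes a simple Student-t model as a stand-in for a flexible non-Gaussian model, uses a hypothetical flexibility parameter `eta` (zero means exactly Gaussian), and applies an exponential prior with an arbitrarily chosen rate `PRIOR_RATE` as the penalty. It compares the value of `eta` preferred by the data alone with the value preferred once the prior is included.

```python
import math
import random


def student_t_loglik(data, eta, scale=1.0):
    """Log-likelihood of zero-mean data under a Student-t model.

    eta = 1 / degrees_of_freedom measures departure from the Gaussian:
    eta == 0 is exactly Gaussian, larger eta means heavier tails.
    """
    if eta == 0.0:
        const = -0.5 * math.log(2.0 * math.pi) - math.log(scale)
        return sum(const - 0.5 * (x / scale) ** 2 for x in data)
    nu = 1.0 / eta
    const = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
             - 0.5 * math.log(nu * math.pi) - math.log(scale))
    return sum(const - 0.5 * (nu + 1.0) * math.log1p((x / scale) ** 2 / nu)
               for x in data)


# Small sample that is truly Gaussian (toy data, fixed seed).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(30)]

# Candidate flexibility values: eta from 0 (Gaussian) to 0.5 (heavy tails).
grid = [i / 100.0 for i in range(51)]

# Hypothetical penalty strength: rate of an exponential prior on eta.
PRIOR_RATE = 20.0


def log_posterior(eta):
    # Log-likelihood plus the log of the exponential prior (up to a constant).
    return student_t_loglik(data, eta) - PRIOR_RATE * eta


# Data alone: pick the eta that fits the sample best.
ml_eta = max(grid, key=lambda eta: student_t_loglik(data, eta))
# Data plus prior: the same search, with departures from Gaussian penalized.
map_eta = max(grid, key=log_posterior)

print("maximum-likelihood eta:", ml_eta)
print("penalized (MAP) eta:   ", map_eta)
```

Because the penalty grows with `eta`, the penalized estimate can never be larger than the unpenalized one, so the fitted model moves away from the Gaussian only when the data provide enough evidence to outweigh the prior.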

The research team demonstrated the utility of their approach by applying it to spatial temperature and pressure data, achieving a non-Gaussian model with better predictions than other approaches by controlling the model’s flexibility.

Importantly, the team has implemented their model in standard statistical software to provide automated control over the application of flexible statistical models.

“This implementation provides researchers with a method for applying non-Gaussian models to real data, with the advantage that it is computationally efficient and fairly automatic to use in software,” says Bolin, who leads KAUST’s Stochastic Processes and Applied Statistics Group.

The statistical code is publicly available for R-INLA and Stan via GitHub (stan-dev/connect22-space-time).

Reference

Cabral, R., Bolin, D. and Rue, H. Controlling the flexibility of non-Gaussian processes through shrinkage priors. Bayesian Analysis (2022).

ABOUT THE AUTHOR

Rafael Medeiros Cabral

Alum

In Håvard Rue's Bayesian Computational Statistics and Modeling research group, Rafael's research interests are mainly focused on data science and Bayesian and computational statistics. He carried out this research in collaboration with KAUST's Professor David Bolin and Professor Håvard Rue.