Using metabolomics data in predicting diabetes risk

In this blog post, doctoral researcher Eetu Kiviniemi discusses his recently published study, which assessed the role of circulating metabolic biomarkers in predicting the risk of type 2 diabetes.

What is a risk prediction model?

The FINDRISC risk prediction model, developed in Finland in 2003, is a statistical model that calculates the risk, or probability, of developing type 2 diabetes within the next ten years. The model is based on a small set of clinical and questionnaire-type variables, such as sex, age, body mass index, and elevated glucose levels determined by a doctor. The model is widely used internationally, and you can test it yourself on the Finnish Diabetes Association website. Despite the small set of explanatory variables, the model discriminates effectively between high-risk and low-risk individuals.

A better risk prediction model?

Could the model be improved by leveraging more detailed measurements? Metabolic variables measured from a blood sample are related to metabolism and contain comprehensive information about the concentrations of different amino acids and lipids in the blood. Put simply, it's a very detailed blood test. If we used the metabolic variables in addition to the clinical variables in the FINDRISC model, could the model's discrimination ability be further improved? We investigated this in our recently published study.

How to build a risk prediction model?

At its core, a statistical model is just a linear combination of the included variables: each variable is multiplied by its own coefficient, and the products are added together. Building a model means determining the (regression) coefficient value for each variable. Statistical models are built with development data, and their discrimination ability can be assessed with separate test data.
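The linear combination described above can be sketched in a few lines of Python. This is only an illustration: the coefficient values, the three variables, and the use of a logistic function to turn the sum into a probability are all assumptions made for the example, not details taken from the study.

```python
import math

# Hypothetical intercept and coefficients, chosen only for illustration.
# They are NOT the values estimated in the study.
INTERCEPT = -9.0
COEFFICIENTS = {"age": 0.04, "bmi": 0.08, "glucose": 0.5}


def linear_predictor(person):
    """Intercept plus the sum of coefficient * value over the included variables."""
    return INTERCEPT + sum(COEFFICIENTS[name] * person[name] for name in COEFFICIENTS)


def predicted_risk(person):
    """Map the linear predictor to a probability (assumed logistic link)."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(person)))


# Two made-up individuals.
lower = {"age": 40, "bmi": 24.0, "glucose": 5.0}
higher = {"age": 60, "bmi": 32.0, "glucose": 6.2}

print(round(predicted_risk(lower), 3), round(predicted_risk(higher), 3))
```

In this sketch, "building the model" would correspond to estimating INTERCEPT and COEFFICIENTS from the development data instead of writing them by hand.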

As the development dataset, we used the Finnish FINRISK2002 cohort (around 5,000 individuals). To assess the generalizability of the results, we leveraged multiple test datasets: the FINRISK cohorts from 1997, 2007 and 2012, as well as the 46-year collection of the NFBC1966 cohort based in Oulu (around 17,000 individuals in total). Using multiple distinct and independent test datasets gives an interesting framework in which we can both develop the models and test their generalizability in the same study.

To start with, we need a baseline model: something to compare the more complex models against. We constructed the baseline model using 15 clinical variables: age, sex, alcohol use, smoking, waist circumference, BMI, systolic and diastolic blood pressure, HDL ("good cholesterol"), total cholesterol, triglycerides, glucose, blood pressure medication, lipid-lowering medication, and family history of diabetes. The baseline model is thus based on variables that could be measured at a basic health checkup. It includes slightly different variables than the FINDRISC model and works very well in the test cohorts.

Altogether, more than 150 metabolic variables were included in the cohorts. A risk with a large set of explanatory variables is overfitting, where the constructed model follows the structures of the development data too closely and consequently is no longer able to generalize to test data. Different statistical methods can be used to avoid overfitting, such as model selection (methods that limit the number of variables selected in the model, e.g. stepwise regression) and/or penalization (methods that limit the magnitude of the variable coefficients, e.g. ridge regression). Some methods, such as LASSO regression and elastic net, do both model selection and penalization.
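To make the difference between these penalties concrete, the sketch below applies the textbook shrinkage rules that hold in the simplified case of uncorrelated, standardized predictors. It is an illustration of the idea, not the fitting procedure used in the study; the coefficient values and penalty strengths are hypothetical, and the exact scaling of the penalty differs between software packages.

```python
def ridge_shrink(beta, lam):
    """Ridge: scale the coefficient toward zero; it never becomes exactly zero."""
    return beta / (1.0 + lam)


def lasso_shrink(beta, lam):
    """LASSO: soft-thresholding; small coefficients are set exactly to zero."""
    magnitude = max(abs(beta) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if beta > 0 else -magnitude


def elastic_net_shrink(beta, l1, l2):
    """Elastic net (naive form): soft-threshold first, then scale like ridge."""
    return lasso_shrink(beta, l1) / (1.0 + l2)


# Hypothetical unpenalized coefficients for four variables.
unpenalized = [2.0, -0.8, 0.3, -0.1]

for beta in unpenalized:
    print(
        beta,
        round(ridge_shrink(beta, 1.0), 3),
        round(lasso_shrink(beta, 0.5), 3),
        round(elastic_net_shrink(beta, 0.5, 0.5), 3),
    )
```

Ridge keeps every variable but with a smaller coefficient, whereas the LASSO and elastic net rules drop the two weakest variables entirely, which is the sense in which they combine model selection and penalization.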

In addition, we tested whether using the metabolic variables in addition to, or possibly instead of, the clinical variables makes a difference in terms of model performance. A total of 18 models were constructed and compared to the baseline. Model performance was assessed based on discrimination ability (do the models predict a higher risk for those who develop diabetes than for those who don't?) and calibration (are the predicted risks in line with the observed probabilities of developing diabetes?).
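The two performance measures can be illustrated with a short sketch. It assumes that discrimination is summarized with the area under the ROC curve (AUC, also called the C-statistic) and that calibration is checked by comparing mean predicted and observed risk in groups ordered by predicted risk. Both are common choices, but the text above does not specify the exact measures used, and the data below are made up.

```python
def auc(risks, outcomes):
    """Share of (case, non-case) pairs where the case has the higher predicted risk.

    Ties count as half. A value of 0.5 means no discrimination, 1.0 is perfect.
    """
    cases = [r for r, y in zip(risks, outcomes) if y == 1]
    non_cases = [r for r, y in zip(risks, outcomes) if y == 0]
    wins = 0.0
    for case_risk in cases:
        for other_risk in non_cases:
            if case_risk > other_risk:
                wins += 1.0
            elif case_risk == other_risk:
                wins += 0.5
    return wins / (len(cases) * len(non_cases))


def calibration_table(risks, outcomes, n_groups):
    """Return (mean predicted risk, observed proportion) per risk-ordered group."""
    ordered = sorted(zip(risks, outcomes))
    size = len(ordered) // n_groups
    rows = []
    for g in range(n_groups):
        start = g * size
        # The last group absorbs any leftover individuals.
        end = (g + 1) * size if g < n_groups - 1 else len(ordered)
        chunk = ordered[start:end]
        mean_predicted = sum(r for r, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_predicted, observed))
    return rows


# Made-up predicted risks and outcomes (1 = developed diabetes).
risks = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.60, 0.80]
outcomes = [0, 0, 0, 1, 0, 0, 1, 1]

print(auc(risks, outcomes))
print(calibration_table(risks, outcomes, 2))
```

A well-calibrated model has group-wise predicted and observed values close to each other, while a model with good discrimination has an AUC clearly above 0.5. A model can do well on one measure and poorly on the other, which is why both are reported.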

Metabolic variables did not improve over the baseline

The result of the study was somewhat negative: the metabolic variables did not improve on the baseline in a way that would have any practical usefulness. In single cohorts, a small, statistically significant improvement in discrimination was observed with some models, but this did not generalize to the other cohorts. No clear improvement was seen in terms of calibration either.

Why did the metabolic variables not improve over the baseline?

Beforehand, we thought that some improvement would be found; previous studies have found associations between many of the metabolic variables and the risk of type 2 diabetes. Why then did the inclusion of the metabolic variables not lead to clearly improved models?

The main issue is the multicollinearity of the data. The metabolic variables are highly correlated with the clinical variables, i.e. the two sets of variables carry largely overlapping information. The metabolic variables therefore do not include enough new information to clearly improve model performance over the baseline. Many of the metabolic variables could be used instead of some of the clinical variables, but this does not translate into improved performance of the models.
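As a rough illustration of what "highly correlated" means here, the sketch below computes the Pearson correlation between one clinical variable and one metabolic variable. The variable names and all values are hypothetical and only mimic the kind of overlap described above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical values for six individuals: a clinical triglyceride
# measurement and a metabolic lipid measure from the detailed blood test.
clinical_triglycerides = [0.8, 1.1, 1.4, 1.9, 2.5, 3.0]
metabolic_lipid_measure = [0.35, 0.50, 0.58, 0.85, 1.05, 1.30]

r = pearson(clinical_triglycerides, metabolic_lipid_measure)
# 1 - r^2 is the share of the metabolic variable's variance that is not
# already explained (linearly) by the clinical variable.
print(round(r, 3), round(1.0 - r ** 2, 3))
```

When the correlation is close to 1, only a small share of the variation in the metabolic variable is left over as new information, so adding it to a model that already contains the clinical variable can change the predictions only slightly.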

Author:

Eetu Kiviniemi

Research article: Kiviniemi et al. 2025. Developing risk prediction models for type 2 diabetes and assessing the role of circulating metabolic biomarkers in five independent Finnish cohorts with over 22,000 individuals. Journal of Clinical Epidemiology, 188, 111978.

Created 31.10.2025 | Updated 6.2.2026

