Decoding health and disease
The purpose of this blog is to present, discuss and share knowledge on topics related to population health and medical research. Blog posts are written mainly by researchers from the Faculty of Medicine.
The FINDRISC risk prediction model, developed in Finland in 2003, is a statistical model that calculates the risk, or probability, of developing type 2 diabetes within the next ten years. The model is based on a small set of clinical and questionnaire-type variables, such as sex, age, body mass index, and elevated glucose levels found by a doctor. The model is widely used internationally, and you can test it yourself on the Finnish Diabetes Association website. Despite the small set of explanatory variables, the model discriminates effectively between high-risk and low-risk individuals.
Could the model be improved by leveraging more accurate measurements? Metabolic variables are measured from a blood sample and give comprehensive information about the concentrations of different amino acids and lipids in the blood. Put simply, it's a very detailed blood test. If we used the metabolic variables in addition to the clinical variables in the FINDRISC model, could the model's discrimination ability be further improved? We investigated this in our recently published study.
At its core, a statistical model is basically just a linear combination of the included variables: each variable is multiplied by its own coefficient, and the products are added together. Building a model means determining the (regression) coefficient value for each variable. Statistical models are built with development data, and their discrimination ability can be assessed with separate test data.
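To make the idea of a linear combination concrete, here is a minimal sketch in Python. The variable names, the coefficient values and the intercept are all hypothetical and chosen only for illustration; they are not taken from FINDRISC or from our study. The logistic transformation that turns the sum into a probability is likewise an assumption, used here because it is a common choice for risk models.

```python
import math

# Hypothetical intercept and coefficients, for illustration only
# (not estimated from any real data).
INTERCEPT = -10.0
COEFFICIENTS = {"age": 0.04, "bmi": 0.10, "glucose": 0.50}


def linear_predictor(person):
    """Multiply each variable by its own coefficient and add everything together."""
    return INTERCEPT + sum(COEFFICIENTS[name] * person[name] for name in COEFFICIENTS)


def predicted_risk(person):
    """Assumed logistic link: map the linear predictor to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(person)))


person_a = {"age": 45, "bmi": 24.0, "glucose": 5.0}
person_b = {"age": 60, "bmi": 32.0, "glucose": 6.5}

for label, person in (("A", person_a), ("B", person_b)):
    print(label, round(linear_predictor(person), 2), round(predicted_risk(person), 3))
```

In this sketch, "building the model" would correspond to estimating the numbers in COEFFICIENTS from the development data instead of typing them in by hand.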
As the development dataset we used the Finnish FINRISK2002 cohort (around 5,000 individuals). To assess the generalizability of the results, we leveraged multiple test datasets: the FINRISK cohorts from 1997, 2007 and 2012, as well as the 46-year collection of the NFBC1966 cohort based in Oulu (around 17,000 individuals in total). Using multiple distinct and independent test datasets gives an interesting framework in which we get to develop the models and test their generalizability in the same study.
To start with, we need a baseline model against which the more complex models can be compared. We constructed the baseline model using 15 clinical variables: age, sex, alcohol use, smoking, waist circumference, BMI, systolic and diastolic blood pressure, HDL (“good cholesterol”), total cholesterol, triglycerides, glucose, blood pressure medication, lipid-lowering medication, and family history of diabetes. The baseline model is thus based on variables that could be measured at a basic health checkup. It includes slightly different variables than the FINDRISC model and works very well in the test cohorts.
Altogether, more than 150 metabolic variables were included in the cohorts. A risk with a large set of explanatory variables is overfitting, where the constructed model follows the structures of the development data too closely and, as a consequence, no longer generalizes to test data. Different statistical methods can be used to avoid overfitting, such as model selection (= methods that limit the number of variables selected in the model, e.g. stepwise regression) and/or penalization (= methods that limit the magnitude of the variable coefficients, e.g. ridge regression). Some methods, such as LASSO regression or elastic net, do both model selection and penalization.
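The difference between pure penalization and penalization with model selection can be sketched with a simplified textbook case. Assume the explanatory variables are standardized and uncorrelated with each other; then, under one common way of writing the penalty, ridge regression divides every unpenalized coefficient by the same factor, whereas LASSO subtracts a fixed amount from its magnitude and sets it to zero if nothing is left. The coefficient values and variable names below are hypothetical, and real data (including ours) do not satisfy the uncorrelatedness assumption, so this is only an illustration of the principle.

```python
def ridge_shrink(coef, penalty):
    """Ridge in the uncorrelated case: shrink proportionally, never exactly to zero."""
    return coef / (1.0 + penalty)


def lasso_shrink(coef, penalty):
    """LASSO in the uncorrelated case: soft-thresholding, small coefficients become zero."""
    magnitude = max(abs(coef) - penalty, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if coef > 0 else -magnitude


# Hypothetical unpenalized coefficients (illustration only).
unpenalized = {"glucose": 0.90, "bmi": 0.40, "metabolite_x": 0.05, "metabolite_y": -0.03}
penalty = 0.1

ridge = {name: ridge_shrink(c, penalty) for name, c in unpenalized.items()}
lasso = {name: lasso_shrink(c, penalty) for name, c in unpenalized.items()}

# Variables that LASSO keeps in the model (non-zero coefficient).
selected = [name for name, c in lasso.items() if c != 0.0]

print("ridge:", {name: round(c, 3) for name, c in ridge.items()})
print("lasso:", {name: round(c, 3) for name, c in lasso.items()})
print("selected by LASSO:", selected)
```

Ridge keeps all four variables with smaller coefficients, while LASSO drops the two variables whose unpenalized coefficients are smaller in magnitude than the penalty.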
In addition, we tested whether using the metabolic variables in addition to, or possibly instead of, the clinical variables makes a difference in terms of model performance. A total of 18 models were constructed and compared to the baseline. Model performance was assessed based on discrimination ability (= do the models predict a higher risk for those who develop diabetes than for those who don't) and calibration (= are the predicted risks in line with the observed probabilities of developing diabetes).
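The two performance concepts can be sketched in a few lines of Python. Discrimination is measured here with the AUC, the share of (case, non-case) pairs in which the person who developed diabetes received the higher predicted risk. Calibration is checked by splitting people into groups by predicted risk and comparing the mean predicted risk with the observed proportion of cases in each group. Both are common choices but not necessarily the exact measures used in our study, and the eight predicted risks and outcomes below are made up for the example.

```python
def auc(risks, outcomes):
    """Share of (case, non-case) pairs where the case has the higher risk; ties count as half."""
    cases = [r for r, y in zip(risks, outcomes) if y == 1]
    non_cases = [r for r, y in zip(risks, outcomes) if y == 0]
    score = 0.0
    for case_risk in cases:
        for other_risk in non_cases:
            if case_risk > other_risk:
                score += 1.0
            elif case_risk == other_risk:
                score += 0.5
    return score / (len(cases) * len(non_cases))


def calibration_table(risks, outcomes, n_groups=2):
    """Return (mean predicted risk, observed proportion of cases) for each risk group."""
    pairs = sorted(zip(risks, outcomes))
    size = len(pairs) // n_groups
    rows = []
    for g in range(n_groups):
        start = g * size
        end = (g + 1) * size if g < n_groups - 1 else len(pairs)
        group = pairs[start:end]
        mean_predicted = sum(r for r, _ in group) / len(group)
        observed = sum(y for _, y in group) / len(group)
        rows.append((mean_predicted, observed))
    return rows


# Made-up predicted risks and outcomes (1 = developed diabetes), illustration only.
risks = [0.05, 0.10, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70]
outcomes = [0, 0, 0, 1, 0, 0, 1, 1]

print("AUC:", round(auc(risks, outcomes), 3))
for mean_predicted, observed in calibration_table(risks, outcomes):
    print("predicted", round(mean_predicted, 3), "observed", round(observed, 3))
```

An AUC of 0.5 corresponds to guessing and 1.0 to perfect separation. A well-calibrated model has predicted and observed values close to each other in every group.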
The result of the study was somewhat negative: the metabolic variables did not improve on the baseline in a way that would have any practical usefulness. In single cohorts, a small statistically significant improvement in discrimination was observed with some models, but this did not generalize to the other cohorts. No clear improvement was seen in terms of calibration either.
Beforehand, we thought that some improvement would be found; previous studies have found associations between many of the metabolic variables and the risk of type 2 diabetes. Why then did the inclusion of the metabolic variables not lead to clearly improved models?
The main issue is the multicollinearity of the data. The metabolic variables are highly correlated with the clinical variables, i.e. the two sets of variables largely carry the same information. The metabolic variables do not include enough new information to clearly improve model performance over the baseline. Many of the metabolic variables could be used instead of some of the clinical variables, but this does not translate to improved performance of the models.
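The effect of multicollinearity can be illustrated with a small simulation. The sketch below uses simulated data rather than our cohorts, and for simplicity an ordinary linear regression with a continuous outcome instead of a diabetes risk model. A hypothetical "metabolic" variable is generated as a noisy copy of a "clinical" variable, and we compare how much of the outcome's variation is explained (R squared) with the clinical variable alone and with both variables together.

```python
import random

random.seed(1)
n = 500

# Simulated data (illustration only): the metabolic variable is a noisy copy
# of the clinical variable, and the outcome depends on the clinical variable.
clinical = [random.gauss(0, 1) for _ in range(n)]
metabolic = [c + random.gauss(0, 0.3) for c in clinical]
outcome = [c + random.gauss(0, 1) for c in clinical]


def center(values):
    """Subtract the mean so that no intercept term is needed."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


x1, x2, y = center(clinical), center(metabolic), center(outcome)

correlation = dot(x1, x2) / (dot(x1, x1) * dot(x2, x2)) ** 0.5

# R squared with the clinical variable only.
r2_one = dot(x1, y) ** 2 / (dot(x1, x1) * dot(y, y))

# R squared with both variables: solve the 2x2 least-squares normal equations.
s11, s12, s22 = dot(x1, x1), dot(x1, x2), dot(x2, x2)
s1y, s2y = dot(x1, y), dot(x2, y)
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
r2_two = (b1 * s1y + b2 * s2y) / dot(y, y)

gain = r2_two - r2_one
print("correlation between the variables:", round(correlation, 3))
print("R squared, clinical only:", round(r2_one, 4))
print("R squared, clinical + metabolic:", round(r2_two, 4))
print("gain:", round(gain, 4))
```

Because the second variable is almost a copy of the first, adding it barely changes the share of variation explained, even though on its own it would be clearly associated with the outcome.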
Author:
Eetu Kiviniemi
Research article: Kiviniemi et al. 2025. Developing risk prediction models for type 2 diabetes and assessing the role of circulating metabolic biomarkers in five independent Finnish cohorts with over 22,000 individuals. Journal of Clinical Epidemiology, 188, 111978.