Sources of Health Data

Finland has an extensive network of national health data registries, biobanks, and research centers that collect and manage large datasets. Other data sources include open access datasets and real-world data from customers and companies. This chapter describes the contents of these data sources and explains how to apply for access to them.

Contents of this page:

Birth cohorts
Biobanks
Patient records
Regional open data
Open access datasets
Data collected by citizens
Data collected by companies
Real-world data
Augmented (synthetic) data
 

Birth cohorts

Birth cohorts comprise collections of data and biological samples from large population studies followed over long time periods, enabling studies of lifelong health and of the roles of genetic and environmental factors across several generations. The datasets have been collected solely for research purposes, with a general interest in public health and longitudinal studies.

The Finnish national maternity cohorts 1987 and 1997 comprise all children born in Finland during these years, as well as their biological parents. The dataset is very large, comprising over 228 009 persons. The data has been gathered from the birth register maintained by the Finnish Institute for Health and Welfare (THL) and combined with data from other registers.

Data for the Northern Finland Birth Cohorts, NFBC1966 and NFBC1986, has been collected at regular intervals from more than 20 000 persons. The data consists of health care records, questionnaires, and clinical examinations, as well as data on the participants' parents and offspring (70 000 persons in total). Two cohorts of aging individuals, Oulu35 and Oulu45, and the population study The Young in Northern Finland also contribute to the cohort data.

How to get the data from birth cohorts

For national birth cohort data, see https://thl.fi/fi/tutkimus-ja-kehittaminen/tutkimukset-ja-hankkeet/kansallinen-syntymakohortti-1987/using-the-1987-fbc-data

For Northern Finland birth cohort data, see the instructions at https://www.oulu.fi/nfbc/node/47960

Note that these data cannot be transferred or handed over to countries outside the EU or the EEA. The same principle also applies to biobank data.

Effects of Secondary Use of Health and Social data Act

None. Research permits for birth cohort data will be requested as before. The primary use of cohort data is research, and that use is based on the consent of the participants and the research program of the cohorts.

 

Biobanks

The purpose of the biobanks is to serve future research and product development. Biobanks collect and store biological samples (tissue, blood, pathological samples) combined with health data of the donors, who have given consent to this.

Current Finnish biobanks have been established by several actors: health care districts, universities, the Finnish Institute for Health and Welfare (THL), and the private healthcare company Terveystalo. Together they form the Finnish Biobank Cooperative (FinBB), which provides infrastructure for data processing, legal matters, and communications. It also acts as a single point of entry to biobank materials for academic and industrial researchers.

In the near future, genomic data from the sample donors will be included in the biobank datasets. The DNA samples are sequenced and analyzed by the FinnGen project, which aims to produce close-to-complete genome variant data for 500 000 individuals from Finland and combine it with individual health data from several national registries. After the study is completed, the genomic data produced during the project will be owned by the Finnish biobanks and will remain available to researchers and companies.

How to get the data from biobanks

List of Finnish biobanks and contact information for requests

It is also possible to reach all six Finnish hospital biobanks by completing just one feasibility and access request through one service. FINGENIOUS™ helps both researchers and businesses to investigate the availability of biobank samples and related data.

If you want to use data or material from the Biobank Borealis of Northern Finland, you can also contact it directly.

Effects of Secondary Use of Health and Social data Act

The research permit process will change once this Act enters into force, because Biobank Borealis is administered by Oulu University Hospital. If patient data from the hospital is part of the intended research, the process described in chapter 2.3 will apply. Note that a renewed Biobank Act is also currently under preparation.

 

Patient records

In Finland, patient documents are created and updated electronically by health care professionals in private, public, and occupational healthcare. The medical data include patient journal records, diagnoses, risk factors, laboratory test results, X-ray examinations, medicine prescriptions, etc. Patient records are uploaded from local databases to the national Patient Data Repository (Kanta), where citizens can view their own health records via the My Kanta (OmaKanta) service. Dental care organizations have also recently joined or are currently joining the Kanta service.

Access to the collected health records is controlled by legislation (Laki sosiaali- ja terveydenhuollon asiakastietojen sähköisestä käsittelystä, the Act on the Electronic Processing of Client Data in Healthcare and Social Welfare). A health care professional can view and update the records without the patient’s consent only if there is a care relationship with the patient. To ensure confidentiality, only authorized healthcare professionals with special smart cards (strong authentication) can access the Kanta records. Browsing of patient data is recorded in log files, from which it is possible to track who accessed the data and when.

Medical records can be handed over to another health care organization only if the patient has consented to this. Citizens can manage consents and refusals concerning their own health data in the My Kanta service.

How to get the data in patient records

Patient documents can be used for research only via a research request process defined by each healthcare organization. As an example, a permit to use patient data in a typical clinical study may require some or all of the following documents:

  • research plan

  • decision from the funding authority

  • information sheet to the patient

  • consent from the patient

  • potential agreements for outsourced services (ostopalvelusopimukset)

  • consent from the organization’s Ethical Committee

  • potential other consents from national authorities, such as THL, Valvira etc.

  • privacy statement (tietosuojaseloste)

Note that the required documents vary in different healthcare organizations and depend on the type of research (more information in the links listed below).

The links above list only university hospitals and hospital districts. Contact other healthcare providers directly, as their research permit instructions are not necessarily published on their web pages.

Effects of Secondary Use of Health and Social data Act

The Act has a major effect, as anonymized person-level data can be made available for research and, in aggregated form, also for development and innovation. Findata, operating at the Finnish Institute for Health and Welfare (THL), is the data permit authority. Public healthcare providers may also hand over their data permit authority to Findata, but this is not compulsory; it depends on the organization's willingness to do so.
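The difference between person-level data and aggregated data can be illustrated with a minimal sketch. The field names, values, and the `aggregate` helper below are hypothetical and made up for illustration; the sketch assumes the rows are already anonymized and does not describe how Findata actually prepares or delivers data.

```python
from collections import Counter

# Hypothetical, already anonymized person-level rows: one row per person.
# Field names and values are invented for this illustration.
person_level = [
    {"age_group": "40-49", "diagnosis": "E11"},
    {"age_group": "40-49", "diagnosis": "E11"},
    {"age_group": "40-49", "diagnosis": "I10"},
    {"age_group": "50-59", "diagnosis": "E11"},
]


def aggregate(rows, keys):
    """Count how many rows share each combination of the given keys."""
    return Counter(tuple(row[key] for key in keys) for row in rows)


# Aggregated form: only group sizes remain, no individual rows.
counts = aggregate(person_level, ["age_group", "diagnosis"])
for group, n in sorted(counts.items()):
    print(group, n)
```

In the aggregated output each line describes a group of persons rather than one individual, which is the form referred to above for development and innovation use.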

The Act applies to

  • health data stored in Kanta and Omakanta

  • register data from healthcare and welfare services  

  • data on social benefits and pensions (from Kela and the Finnish Centre for Pensions)

  • several national registries, including those of THL

  • basic information on persons and buildings (from the Population Register Centre)

More detailed information is available on the pages of the Ministry of Social Affairs and Health (STM).

 

Regional open data 

The City of Oulu provides open access datasets in various categories, including health and social services (metadata only in Finnish).

City of Oulu Open Data
Open datasets

 

Open access datasets

While clinical and register data remain the most useful data sources for scientists in the healthcare domain, open access datasets are also increasingly used, particularly in machine learning and machine vision projects, which require vast amounts of training data.

Collections of open access healthcare datasets are available in several web portals (see Table 1). Often a Creative Commons license defines how the dataset may be used:

  • CC0 frees content globally with no restrictions

  • CC BY = attribution; the author/owner of the data must be cited

  • CC NC = noncommercial

  • CC SA = share alike.  Licensees may distribute derivative works only under a license identical ("not more restrictive") to the license that governs the original work.

  • CC ND = no derivative works
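The license elements above are often combined into one license code, such as CC BY-NC-SA. The following is a minimal sketch of how such a code could be decoded before reusing a dataset. The `CC_ELEMENTS` dictionary and the `describe_cc_license` function are hypothetical names written for this example; they are not part of any Creative Commons tool, and the sketch is no substitute for reading the actual license text.

```python
# Illustrative mapping of Creative Commons license elements to their meaning.
CC_ELEMENTS = {
    "BY": "attribution: the author/owner of the data must be cited",
    "NC": "noncommercial use only",
    "SA": "share alike: derivative works must use an identical license",
    "ND": "no derivative works",
}


def describe_cc_license(code):
    """Return the list of conditions implied by a code such as 'CC BY-NC-SA'."""
    # Drop the leading "CC" and any surrounding spaces or hyphens.
    normalized = code.upper().replace("CC", "", 1).strip(" -")
    if normalized in ("0", "ZERO"):
        return []  # CC0: no restrictions
    elements = normalized.replace("-", " ").split()
    unknown = [e for e in elements if e not in CC_ELEMENTS]
    if unknown:
        raise ValueError(f"Unknown license element(s): {unknown}")
    return [CC_ELEMENTS[e] for e in elements]


for condition in describe_cc_license("CC BY-NC-SA"):
    print(condition)
```

An empty result, as for CC0, means that no conditions apply.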

When publishing your own dataset, check which license type is suitable. A simple tool for selecting the right license type can be found at https://creativecommons.org/choose/ (in English) and https://creativecommons.fi/valitse/ (in Finnish).

Table 1. Collections of open access datasets for machine learning. Click a name to go directly to the dataset portal.

Name | Description
Gengo.ai | 18 free Life Sciences & Medical datasets
UCI Machine Learning Repository | 106 datasets in Life Sciences
GitHub | Large curated list of dataset collections for imaging, diagnoses, precision medicine, medical speech and health records etc.
Kaggle | Search for “health” yields 87 datasets
Bytescout | 50 Big Data providers with one remarkable dataset from each provider
Bigmi | List of datasets to build predictive models
Enigma | Large collection with several categories, including health and imaging. Dataset sizes tend to be too small for machine learning
Openi | History of medicine collection, orthopedic illustrations
1000 Genomes Project | Detailed catalog of human genetic variation. Subset of the Amazon project

Find more open access health and genome datasets here.

Effects of Secondary Use of Health and Social data Act

None.

 

Data collected by citizens

Citizens are increasingly using digital and mobile devices, such as pedometers, glucose meters, and sleep quality and heart-rate monitors, to monitor their health and wellness. There is growing interest in using these data to help service providers create more targeted, preventive, and personalized healthcare solutions.

The wellness data collected by consumers is currently stored mostly in the internal memory of the device and/or in the cloud service of the device manufacturer. In the Nordic countries, some attempts have been made to build central storage for wellness data gathered by citizens. One of the most notable is Kanta Personal Health Records (PHR), where data gathered from wellbeing applications can be stored. The applications are provided by third parties and must be approved by the Kanta Services after acceptance testing.

The up-to-date list of applications which use Kanta Personal Health Records can be found at www.kanta.fi/en/list-of-applications.

Effects of Secondary Use of Health and Social data Act

The Act applies only when the data collected by citizens is stored in Omakanta Personal Health Records (PHR). Anonymized person-level data from Omakanta can be made available for research and, in aggregated form, also for development and innovation.

Findata is the data permit authority, operating at the Finnish Institute for Health and Welfare (THL).

More about the secondary use at ministry webpage: stm.fi/en/secondary-use-of-health-and-social-data

 

Data collected by companies

Companies in the healthcare and wellness sector collect and store customer data, which makes them data controllers. The General Data Protection Regulation (GDPR) sets the legal limits and rules for this in EU countries. According to the GDPR, companies can collect health and wellness data only if the customer has given informed consent. At the time their data is collected, customers must be informed about

  • who your company/organisation is (your contact details, and those of your Data Protection Officer (DPO), if any);

  • why your company/organisation will be using their personal data (purposes);

  • the categories of personal data concerned;

  • the legal justification for processing their data;

  • for how long the data will be kept;

  • who else might receive it;

  • whether their personal data will be transferred to a recipient outside the EU;

  • that they have a right to a copy of the data (right to access personal data) and other basic rights in the field of data protection (see complete list of rights);

  • their right to lodge a complaint with a Data Protection Authority (DPA);

  • their right to withdraw consent at any time;

  • where applicable, the existence of automated decision-making and the logic involved, including the consequences thereof

Source:  https://ec.europa.eu/info/law/law-topic/data-protection/reform/rules-business-and-organisations/principles-gdpr/what-information-must-be-given-individuals-whose-data-collected_en

When a company wants to give a third party access to the data (such as a university research group in a collaboration project), this possibility must also be stated in the customer's informed consent. If it is not clearly stated, a new round of consent acquisition is needed.

Effects of Secondary Use of Health and Social data Act

For innovation and development activities, companies will be able to receive ready-combined aggregate data more comprehensively and quickly through Findata.

See more also at Legislation concerning health data.

 

Real-world data

Real-world datasets can combine data from various sources, such as

  • data gathered from various consumer devices in the wellness sector (activity meters, pedometers, wearables, etc.)
  • data from insurance, billing, and similar systems
  • data from electronic health records
  • data from environmental or socioeconomic registries
  • data from social media

There is no consensus on the definition of real-world data (RWD). According to one definition, RWD covers all data other than that gathered in randomized clinical trials. This definition is used especially in the pharmacological context.

Real-world data (RWD) and real-world evidence (RWE) are playing an increasing role in healthcare decisions. The US Food and Drug Administration (FDA) has published several guides on how to use real-world data; see https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence

 

Augmented (synthetic) data

Augmented data is based on a real dataset, but the number of data points is increased by using data augmentation techniques. The main use case for augmented data in healthcare is advanced machine learning analysis of radiological images, such as X-ray images, magnetic resonance images (MRI), or computed tomography (CT) scans. The original dataset is often too small and does not capture enough variation, so data augmentation is needed to help the model generalize better. The typical goal is to train a model that can perform classification tasks, such as recognizing malignant tumors in MRI/CT scans or identifying eye diseases in retinal images.

Some traditional data augmentation techniques for image data include

  • rotating the image
  • flipping the image on vertical or horizontal axis
  • cropping a section from the image and resizing it
  • random color manipulation
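The techniques listed above can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions: an image is represented as a list of rows of grayscale intensities (0 to 255), color manipulation is reduced to a random brightness shift, and all function names are hypothetical. Real projects typically use an image processing or deep learning library instead.

```python
import random


def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img):
    """Mirror the image left to right (flip about the vertical axis)."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Mirror the image top to bottom (flip about the horizontal axis)."""
    return [list(row) for row in img[::-1]]


def crop_and_resize(img, top, left, height, width):
    """Crop a height x width window and scale it back to the original
    size using nearest-neighbour sampling."""
    out_h, out_w = len(img), len(img[0])
    return [
        [img[top + (r * height) // out_h][left + (c * width) // out_w]
         for c in range(out_w)]
        for r in range(out_h)
    ]


def jitter_brightness(img, rng, max_shift=30):
    """Add one random offset to every pixel, clamped to the range 0-255."""
    shift = rng.randint(-max_shift, max_shift)
    return [[min(255, max(0, p + shift)) for p in row] for row in img]


def augment(img, rng):
    """Return five augmented variants of one image."""
    h, w = len(img), len(img[0])
    # Pick a random window covering roughly half of each dimension.
    top, left = rng.randint(0, h // 2), rng.randint(0, w // 2)
    return [
        rotate90(img),
        flip_horizontal(img),
        flip_vertical(img),
        crop_and_resize(img, top, left, h - h // 2, w - w // 2),
        jitter_brightness(img, rng),
    ]


image = [[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120], [130, 140, 150, 160]]
variants = augment(image, random.Random(0))
print(len(variants), "augmented images from one original")
```

Each original image yields several training examples that keep the same label, which is how augmentation increases the number of data points.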

A more advanced data augmentation technique for image analysis is the generative adversarial network (GAN), a type of deep learning model. As with other augmentation techniques, a GAN starts from the original image data and generates more data points for the training set of the neural network. Given a training set, it learns to generate new data with the same statistics as the training set. For example, a GAN trained on photographs can generate new photographs that look real to the human eye.

A GAN is composed of two competing deep neural networks: the generative network produces synthetic candidates, such as images, from random input, while the discriminative network evaluates them by trying to tell them apart from the real training data. The two networks compete with each other, and the desired end result is a generator network that produces realistic outputs.
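The competition between the two networks can be sketched with a deliberately tiny toy example. The sketch below is not a deep network and does not produce images: it assumes one-dimensional numbers as "data", a linear generator with two parameters, and a logistic discriminator with two parameters, with gradients written out by hand. All names and parameter values are hypothetical. It only illustrates the alternating update pattern, and like real GAN training it may oscillate rather than settle.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def generate(gen, z):
    """Generator: maps random noise z to a synthetic sample (toy linear model)."""
    a, b = gen
    return a * z + b


def discriminate(disc, x):
    """Discriminator: estimated probability that x is a real sample."""
    w, c = disc
    return sigmoid(w * x + c)


def d_loss(disc, real, fake):
    """Low when real samples score near 1 and synthetic samples near 0."""
    return (-sum(math.log(discriminate(disc, x)) for x in real) / len(real)
            - sum(math.log(1.0 - discriminate(disc, x)) for x in fake) / len(fake))


def g_loss(disc, gen, noise):
    """Low when the discriminator scores synthetic samples as real."""
    return -sum(math.log(discriminate(disc, generate(gen, z))) for z in noise) / len(noise)


def discriminator_step(disc, real, fake, lr):
    """One gradient descent step on d_loss with respect to (w, c)."""
    w, c = disc
    gw = gc = 0.0
    for x in real:
        err = discriminate(disc, x) - 1.0  # derivative of -log D(x) w.r.t. the logit
        gw += err * x / len(real)
        gc += err / len(real)
    for x in fake:
        err = discriminate(disc, x)  # derivative of -log(1 - D(x)) w.r.t. the logit
        gw += err * x / len(fake)
        gc += err / len(fake)
    return (w - lr * gw, c - lr * gc)


def generator_step(disc, gen, noise, lr):
    """One gradient descent step on g_loss with respect to (a, b)."""
    w, _ = disc
    a, b = gen
    ga = gb = 0.0
    for z in noise:
        err = discriminate(disc, generate(gen, z)) - 1.0
        ga += err * w * z / len(noise)
        gb += err * w / len(noise)
    return (a - lr * ga, b - lr * gb)


def train(real_sampler, steps=200, batch=32, lr=0.05, seed=0):
    """Alternate discriminator and generator updates."""
    rng = random.Random(seed)
    gen, disc = (1.0, 0.0), (0.5, 0.0)
    for _ in range(steps):
        real = [real_sampler(rng) for _ in range(batch)]
        noise = [rng.gauss(0, 1) for _ in range(batch)]
        fake = [generate(gen, z) for z in noise]
        disc = discriminator_step(disc, real, fake, lr)
        gen = generator_step(disc, gen, noise, lr)
    return gen, disc


# "Real" data here is simply numbers drawn around 4.0 (an arbitrary choice).
gen, disc = train(lambda rng: rng.gauss(4.0, 1.0))
print("generator parameters:", gen)
```

In an actual medical imaging application both models would be deep neural networks built with a deep learning framework, but the alternating structure of the training loop is the same.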

The GAN is a fairly recent approach in machine learning (introduced in 2014 by Goodfellow et al.), but it has rapidly gained popularity for various classification problems, including medical imaging. For more on the GAN technique, see:

https://www.sciencedirect.com/science/article/pii/S0925231218310749

https://ieeexplore.ieee.org/abstract/document/8363576

 

____

This Guide is provided for informational purposes only. It should not be used for legal guidance, nor should it be considered legal advice or an interpretation of any existing legislation.

Creative Commons license CC BY

Last updated: 26.3.2021