Data Preprocessing

Properties of health data, anonymization and pseudonymization, data cleaning and transformation, and data reduction for machine learning

Properties of health data

The adoption of the Kanta service has required healthcare providers to record many health data variables in a structured form. Structured data attributes include, for example, diagnoses (with ICD-10 codes), patient risks, and medication data (with ATC codes). The patient journal, on the other hand, is recorded as unstructured text, which makes data mining and analysis harder than with structured data.

Quality issues with health datasets are the same as with any other datasets. The most common problems are

  • lacking observations of certain variables of interest

  • incorrect data values (“noise”)

  • redundant data (the same data value has been recorded several times)

  • data variables were recorded in different scales in different studies

  • only aggregate data is available (limits the potential for analytics, especially predictions)

  • different formats

  • different names for the same variables

Data preprocessing techniques improve the quality of data and can turn a previously unusable dataset into a useful one. However, applying preprocessing in a meaningful way requires some understanding of the healthcare data variables involved.

Anonymization and pseudonymization

Pseudonymization is a procedure by which personally identifiable information, such as name and social security number, is replaced by artificial identifiers, or pseudonyms. After pseudonymization, data can no longer be attributed to a specific person without the use of additional information, which should be kept carefully separate from the personal data.
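
As a minimal sketch of the idea, the following Python example replaces an identifier with a pseudonym derived by keyed hashing (HMAC-SHA256). Keyed hashing is only one possible way to generate pseudonyms and is an assumption made here, as are the field names, the sample records and the key. The secret key plays the role of the "additional information" and would have to be stored separately from the data.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from an identifier with HMAC-SHA256."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability


def pseudonymize_records(records, secret_key, id_field="ssn", drop_fields=("name",)):
    """Drop direct identifiers and add a pseudonym to each record."""
    result = []
    for rec in records:
        new = {k: v for k, v in rec.items()
               if k != id_field and k not in drop_fields}
        new["pseudonym"] = pseudonymize(rec[id_field], secret_key)
        result.append(new)
    return result


# Hypothetical key and records, for illustration only.
SECRET_KEY = b"example-key-kept-separately"
records = [
    {"name": "Example Patient A", "ssn": "ID-0001", "diagnosis": "I10"},
    {"name": "Example Patient B", "ssn": "ID-0002", "diagnosis": "J45"},
    {"name": "Example Patient A", "ssn": "ID-0001", "diagnosis": "J45"},
]
pseudonymized = pseudonymize_records(records, SECRET_KEY)
```

The same person receives the same pseudonym in every record, so records can still be linked for analysis, but re-identification requires access to the key.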

Pseudonymization is generally regarded as an insufficient de-identification method for confidential health data. Therefore, anonymization is needed to protect the privacy of an individual and to make re-identification from health data impossible, or at least very difficult. Anonymization refers to processing personal data in such a way that it becomes permanently impossible to identify individuals from it.

A wide range of anonymization techniques is available. However, the Article 29 Data Protection Working Party's Opinion 05/2014 provided the following shortlist of anonymization method families to be considered the most suitable for protecting an individual's privacy:

  • randomization methods

  • generalization methods

Randomization can be achieved in two ways: 1) by removing specific identifiers from the dataset, so that they lose their association with the other values in the original row, or 2) by replacing original data items with randomly generated data.
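
The two approaches can be sketched in Python as follows. This is an illustrative reading of the description above, not a reference implementation: the record fields and the choice of random hexadecimal tokens as replacement values are assumptions.

```python
import secrets


def remove_identifiers(records, identifier_fields):
    """Approach 1: drop the identifier fields from every record."""
    return [{k: v for k, v in rec.items() if k not in identifier_fields}
            for rec in records]


def replace_with_random(records, fields, token_bytes=8):
    """Approach 2: overwrite the given fields with random tokens."""
    result = []
    for rec in records:
        new = dict(rec)
        for field in fields:
            if field in new:
                # The token is unrelated to the original value.
                new[field] = secrets.token_hex(token_bytes)
        result.append(new)
    return result


# Hypothetical records, for illustration only.
records = [
    {"patient_id": "ID-0001", "home_town": "Kempele", "diagnosis": "I10"},
    {"patient_id": "ID-0002", "home_town": "Oulu", "diagnosis": "J45"},
]
without_ids = remove_identifiers(records, {"patient_id"})
randomized = replace_with_random(records, ["patient_id"])
```

Unlike a pseudonym, a random replacement value cannot be mapped back to the original and does not link records of the same person.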

In generalization, data values such as age (46) and home town (“Kempele”) are replaced with more general values like age range 45-50 years and “Oulu region”.
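
A minimal Python sketch of this generalization step is shown below. The five-year bin width follows the example above; the town-to-region lookup table is a hypothetical, incomplete mapping used only for illustration.

```python
def generalize_age(age: int, width: int = 5) -> str:
    """Replace an exact age with the range that contains it."""
    lower = (age // width) * width
    return f"{lower}-{lower + width}"


# Hypothetical lookup table; a real one would cover all municipalities.
TOWN_TO_REGION = {
    "Kempele": "Oulu region",
    "Oulu": "Oulu region",
}


def generalize_town(town: str) -> str:
    """Replace a home town with a broader region."""
    return TOWN_TO_REGION.get(town, "Other")


record = {"age": 46, "home_town": "Kempele"}
generalized = {
    "age_range": generalize_age(record["age"]),
    "region": generalize_town(record["home_town"]),
}
```

Wider ranges and larger regions give stronger privacy protection but leave less detail for analysis.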

The Article 29 Data Protection Working Party's Opinion 05/2014 also discusses noise addition, permutation, differential privacy, aggregation, k-anonymity, l-diversity, and t-closeness as potential methods for anonymization. It is outside the scope of this paper to discuss their strengths and weaknesses, but as an overall conclusion, anonymization methods should be chosen with expertise. The choice is often a compromise between data utility and privacy that should be evaluated case by case.

A thorough description of the anonymization methods and recommendations can be found in

Data cleaning and transformation

Data is cleaned by correcting or removing incorrect data values, for example by fixing typographical errors or by validating values against a known list of correct ones. Other methods of data cleaning include:

  • removing duplicate values

  • data enhancement, where data is made more complete by adding related information. For example, appending complete diagnosis names to ICD-10 codes

  • normalization/scaling: adjusting values measured on different scales to a notionally common scale

  • harmonization: combining multiple data sources into an integrated, unambiguous dataset
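
Three of the methods above (duplicate removal, data enhancement and scaling) can be sketched in plain Python as follows. The record fields, the sample values and the two-entry ICD-10 lookup table are assumptions made for illustration; min-max scaling is used here as one common scaling choice.

```python
# Tiny illustrative lookup; a real one would come from the full ICD-10 list.
ICD10_NAMES = {
    "I10": "Essential (primary) hypertension",
    "J45": "Asthma",
}


def remove_duplicates(records):
    """Keep the first occurrence of each identical record."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def add_diagnosis_names(records, names=ICD10_NAMES):
    """Data enhancement: append the diagnosis name to each ICD-10 code."""
    return [{**rec, "diagnosis_name": names.get(rec["icd10"], "unknown")}
            for rec in records]


def min_max_scale(values):
    """Scale values linearly to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical records, for illustration only.
records = [
    {"patient": "P1", "icd10": "I10", "systolic_bp": 160},
    {"patient": "P2", "icd10": "J45", "systolic_bp": 120},
    {"patient": "P1", "icd10": "I10", "systolic_bp": 160},  # duplicate
    {"patient": "P3", "icd10": "I10", "systolic_bp": 140},
]
unique = remove_duplicates(records)
enhanced = add_diagnosis_names(unique)
scaled_bp = min_max_scale([rec["systolic_bp"] for rec in unique])
```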

Data cleaning is often a time-consuming phase, which can be sped up by using appropriate software, such as MATLAB's preprocessing tools or some of the open access tools listed in

Data transformation is often needed when multiple datasets are integrated into one large, combined dataset. It refers to the process of converting data from the format of a source file into the format required by a destination system. Depending on the scope of the required changes, the process can be simple or complex.
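
As a small illustration, the Python sketch below converts a semicolon-separated source file with day-first dates and decimal commas into JSON records with ISO dates and numeric values. Both the source layout and the destination field names are hypothetical.

```python
import csv
import io
import json
from datetime import datetime

# Hypothetical source data: semicolon-separated, dd.mm.yyyy, decimal comma.
SOURCE_CSV = (
    "patient_id;visit_date;weight_kg\n"
    "P1;31.12.2020;70,5\n"
    "P2;05.01.2021;82,0\n"
)


def transform_row(row):
    """Convert one source row into the destination format."""
    visit = datetime.strptime(row["visit_date"], "%d.%m.%Y").date()
    return {
        "patientId": row["patient_id"],
        "visitDate": visit.isoformat(),
        "weightKg": float(row["weight_kg"].replace(",", ".")),
    }


def transform_csv(text):
    """Read the source CSV text and return destination records."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return [transform_row(row) for row in reader]


destination = transform_csv(SOURCE_CSV)
destination_json = json.dumps(destination, indent=2)
```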

Data reduction for machine learning

Data collection from genomic DNA sequencing, health devices and large-scale networks results in extremely large datasets, called big data. Handling the big data stream effectively, so that it can be stored and analyzed in a meaningful way, is a key challenge. Data reduction methods are necessary because a large number of variables causes the curse of dimensionality: the computational resources needed to discover meaningful knowledge and recognizable patterns grow rapidly with the number of variables. The goal of data reduction is to reduce the volume of the dataset while still producing similar or even better analytical results than with the original dataset.

Dimension reduction is the key element in data reduction for machine learning analytics. It is a process of extracting a set of principal variables from the dataset under consideration. It can be done in two different ways:

  • by keeping only the most relevant variables from the original dataset (feature selection) or

  • by forming combinations of the original variables and keeping those combinations that retain most of the information in the original dataset (feature extraction)
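
The difference between the two approaches can be sketched in plain Python. The example below uses variance ranking as a simple feature selection criterion and the first principal component, computed by power iteration, as a feature extraction step. Both technique choices and the sample data are assumptions made for illustration; in practice a numerical library would normally be used.

```python
import math
from statistics import mean, pvariance


def columns(rows):
    """Return the dataset column by column."""
    return [list(col) for col in zip(*rows)]


def select_by_variance(rows, k):
    """Feature selection: keep the k original variables with highest variance."""
    variances = [pvariance(col) for col in columns(rows)]
    ranked = sorted(range(len(variances)), key=lambda i: variances[i], reverse=True)
    keep = sorted(ranked[:k])
    return keep, [[row[i] for i in keep] for row in rows]


def first_principal_component(rows, iterations=200):
    """Feature extraction: one combined variable via power iteration."""
    means = [mean(col) for col in columns(rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    n, d = len(rows), len(means)
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    # Project each centered row onto the component direction.
    scores = [sum(c * w for c, w in zip(row, vec)) for row in centered]
    return vec, scores


# Hypothetical data: columns 0 and 1 are strongly related, column 2 barely varies.
rows = [
    [1.0, 2.0, 5.0],
    [2.0, 4.1, 5.1],
    [3.0, 5.9, 4.9],
    [4.0, 8.0, 5.0],
]
kept_columns, selected = select_by_variance(rows, k=2)
weights, scores = first_principal_component(rows)
```

Feature selection returns a subset of the original columns, whereas feature extraction returns a new variable (here `scores`) that is a weighted combination of all of them.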

Many commonly used dimension reduction algorithms are available, and some are even built into the machine learning algorithms that perform the actual data analysis, such as Random Forest. A good overview of dimension reduction methods can be found in