September 17, 2020
12 mins

Data preparation in healthcare: Part 1

The quality of data in machine learning algorithms

Machine Learning (‘ML’) algorithms are only as good as their data. These algorithms build models from sample data, known as ‘training data,’ to make predictions or decisions. Their learning capacity is therefore limited by the quality of that training data, which depends on how it was prepared, and by the methods used to train them.

Data science analysis centers around the design of an ML algorithm. To develop one, the Data Scientist needs data that has been cleaned, during data preparation, of specific ‘artifacts’: undesirable features that occur as a result of the preparative procedure. Data cleaning requires particular expertise related to each data modality (e.g., Magnetic Resonance Imaging (‘MRI’), Computed Tomography (‘CT’) scans, histology slides, and Electronic Health Records (‘EHRs’)). These data cleaning and processing skills are a significant part of the Data Scientist’s expertise.

Poor-quality data causes two significant issues: first, lower model performance; and second, induced bias, which indirectly harms performance.

In healthcare, data quality is especially fundamental because:

  • A wide variety of protocols, machines, and people are involved, making data collection and preparation incredibly complicated.
  • ML is new in healthcare. Existing data preparation protocols are therefore not yet designed to cope with ML-associated constraints.
  • Most ML datasets are retrospective cohorts, meaning the data was initially collected for one use and then reused for ML analysis.
  • Healthcare is multi-centric. There are numerous hospitals and, therefore, heterogeneity.

In our series of blogs, Data Preparation in Healthcare, we present some examples that account for the diversity of data quality issues and data preparation practices that we encounter during our daily work in ML. In Part One, we focus on cleaning experimental healthcare data.

Cleaning experimental healthcare data

Collecting healthcare information often requires rigorous experimental procedure due to the following factors:

  • Healthcare data is difficult to access. For example, to examine the tissue on a histology (microscopy) slide, the tissue must first be removed through surgery.
  • Healthcare data can be challenging to extract. For instance, before a histology sample can be examined under the microscope, it must be removed and then prepared; this involves fixing, staining, and mounting the sample (see figure 1).
  • Researchers need large quantities of information. Numerous samples are required for analysis because medical data is ‘high-dimensional’: each sample is described by a large number of variables. For more information, stay tuned for our upcoming blog post on transfer learning and the curse of high dimensionality.


Histology is a good way of illustrating the small variabilities in data preparation procedures. The process of preparing a histology slide for pathologist analysis requires many steps, as shown in figure 1. Below we discuss the possible data variabilities.

1. Markers or noise introduced during data preparation

Specimen preparation is the step that involves adding the biological tissue onto the glass slide. This step is a typical source of issues specifically related to unwanted ‘artifacts’ contaminating the slide, such as hairs, as seen in figure 2. It is also common to find the annotations of pathologists on the slides, as seen in figure 3.

A histology slide contaminated with hairs. Source

A histology slide contaminated with written markers. Source

These artifacts affect the ML algorithm in two ways:

  • Firstly, they affect the training performance of the ML algorithm. The unnecessary information confuses the algorithm during training, thus reducing training performance. We know that no cells exist under the ‘R’ marked on the slide (see figure 3). Thus, we should remove the ‘R’ from the image before using it to train the algorithm; and
  • Secondly, they introduce a source of bias. Imagine we want to train an algorithm to diagnose whether a tissue has a small or large cyst, and that every slide with a large cyst happens to carry an annotation. In this case, the algorithm may incorrectly learn that any image with an annotation contains a large cyst. It will then not have learned anything related to the cyst itself, and the results will be meaningless. If we provide the algorithm with a slide with a large cyst, but without an annotation, the algorithm will fail.

Bias caused by annotations on slides.
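
One way to spot this kind of confounding before training is to compare how often annotations appear within each diagnostic label. The snippet below is a minimal sketch on a made-up toy cohort; the field names `label` and `has_annotation`, the function names, and the 0.5 gap threshold are all hypothetical choices for illustration, not part of any real pipeline.

```python
from collections import Counter


def annotation_rate_by_label(slides):
    """Return the fraction of annotated slides within each label."""
    totals, annotated = Counter(), Counter()
    for slide in slides:
        totals[slide["label"]] += 1
        annotated[slide["label"]] += int(slide["has_annotation"])
    return {label: annotated[label] / totals[label] for label in totals}


def looks_confounded(rates, max_gap=0.5):
    """Flag the cohort if annotation rates differ strongly between labels.

    max_gap is an arbitrary illustrative threshold, not a recommended value.
    """
    return max(rates.values()) - min(rates.values()) > max_gap


# Toy cohort (made-up values): every large-cyst slide carries a marker.
slides = [
    {"label": "large_cyst", "has_annotation": True},
    {"label": "large_cyst", "has_annotation": True},
    {"label": "small_cyst", "has_annotation": False},
    {"label": "small_cyst", "has_annotation": False},
]

rates = annotation_rate_by_label(slides)
flagged = looks_confounded(rates)
```

If the annotation rate is close to 1 for one label and close to 0 for the other, as in this toy cohort, the presence of an annotation alone predicts the label, and a model can learn that shortcut instead of the cyst.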

Such a bias (as depicted in figure 4), with an annotation on each slide, is obvious and can be prevented by using a segmentation algorithm to separate the annotations from the tissue sample. However, in more complex cases, the source of bias is less clear. For example, some scanners induce small distortions or grid-like structures that are invisible to us but easily recognizable by an ML algorithm. The algorithm can then exploit these small changes in the same way, producing a similar bias. In other cases, the damage goes beyond bias: parts of the sample become inaccessible to data curation algorithms altogether. Imagine, for example, that two slides have been folded (as in figure 5 below). The information in the fold will be unusable for an ML algorithm, which will be limited to the information still visible on the slide.
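
As a rough illustration of separating annotations from tissue, the sketch below masks very dark pixels in an RGB tile and paints them white. It assumes the marker ink is much darker than the stained tissue, which will not hold for every pen color or stain; the function name `mask_dark_ink` and the threshold of 80 are hypothetical, and a real segmentation algorithm would be considerably more robust than this heuristic.

```python
import numpy as np


def mask_dark_ink(image, max_channel_threshold=80):
    """Whiten very dark pixels in an RGB uint8 array of shape (H, W, 3).

    A pixel is treated as ink when even its brightest channel is below the
    threshold. This is a simplifying assumption for illustration only.
    Returns the cleaned copy and the boolean ink mask of shape (H, W).
    """
    ink_mask = image.max(axis=-1) < max_channel_threshold
    cleaned = image.copy()
    cleaned[ink_mask] = 255  # replace ink with white background
    return cleaned, ink_mask


# Toy tile: pinkish "tissue" with two dark "marker" pixels.
tile = np.full((4, 4, 3), (220, 150, 200), dtype=np.uint8)
tile[1, 1:3] = (20, 20, 20)

cleaned, ink_mask = mask_dark_ink(tile)
```

Returning the mask alongside the cleaned image makes it possible to log how much of each slide was removed, which helps catch slides where the heuristic has erased genuine tissue.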

A folded histology slide. Source

2. Data Preparation Slide Fixation Methods

The fixation step can introduce a similar source of bias. This step stops further biological reactions and preserves the biological tissue. The tissue is either fixed in formalin and embedded in paraffin (Formalin-Fixed Paraffin-Embedded, ‘FFPE’) or frozen. There are advantages and disadvantages to each method, but each one preserves cellular structures differently. As a result, an algorithm trained on FFPE slides will generally not transfer its performance to frozen tissues, and the solution is not merely an algorithmic transformation. The fixation method must therefore be considered by the pathologist when selecting their dataset.

FFPE slides (left) versus frozen slides (right). Source

3. Data Preparation Staining Methods

Finally, bias can arise from staining, which is needed to color the sample and thus make it interpretable. Each center will inevitably stain its slides differently, and each scanner will read coloration differently. It is therefore necessary to apply several steps of color correction to normalize the dataset and reduce those biases as much as possible.
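
To give a feel for what one color correction step can look like, the sketch below shifts and rescales each RGB channel of a tile so that its mean and standard deviation match chosen reference values. This is a deliberately simplified stand-in, not the method used on any particular dataset: established stain normalization techniques work in other color spaces or on estimated stain vectors. The function name `match_channel_stats` and the reference statistics are hypothetical.

```python
import numpy as np


def match_channel_stats(source, target_mean, target_std):
    """Match per-channel mean and std of an RGB uint8 tile to reference values.

    source: array of shape (H, W, 3); target_mean, target_std: arrays of shape (3,).
    """
    src = source.astype(np.float64)
    mean = src.mean(axis=(0, 1))
    std = src.std(axis=(0, 1))
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero on flat channels
    out = (src - mean) / std * target_std + target_mean
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


# Toy tile with random values, and made-up reference statistics.
rng = np.random.default_rng(0)
tile = rng.integers(100, 200, size=(32, 32, 3), dtype=np.uint8)
target_mean = np.array([180.0, 120.0, 170.0])
target_std = np.array([20.0, 25.0, 15.0])

normalized = match_channel_stats(tile, target_mean, target_std)
```

In practice the reference statistics would be computed from a chosen reference slide or center, so that tiles from every center are mapped toward the same color distribution.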

Images from the Camelyon16 challenge dataset from two different medical centers highlighting the staining (color) differences.


In this first part of our ‘Data Preparation in Healthcare’ series, we looked at an example of how the preparation of data samples can impact the performance of Machine Learning algorithms in Histology. At Owkin, we have extensive experience working with histology samples, and our ‘Fit-for-AI’ datasets have been expertly curated by leading academics to eliminate bias. Next up, in Part Two of this three-part series, we will examine other data modalities where the quality of the preparation of each data sample is essential, specifically High-Throughput Screening and Radiology.

Pierre Manceron
Charles Maussion
Kathryn Schutte
Maxime He
Antonia Trower