Deep Learning based medical image segmentation — Part 5 — Data Preprocessing
Before we begin training the deep learning model, we need to preprocess the MRI images and their labels. Most of this work, such as choosing the target spacing, resampling the images, and normalizing intensities, is performed for us by nnUNet. However, nnUNet expects the labels in the segmentation maps (label files) to be consecutive integers. We will need to look at the training label files that we downloaded from TCIA and see if anything needs to be changed.
Segmentation Map (Label) File Preprocessing
Let’s take a look at a tumor segmentation file. For this, we will use the ITK-SNAP software. I am doing this on my Windows laptop.
After installing ITK-SNAP, open a training image with File → Open Main Image. Here, I opened images_structural\UPENN-GBM-00006_11\UPENN-GBM-00006_11_T1.nii.gz
Now overlay the segmentation file with Segmentation → Open Segmentation. Here, I opened patient 6’s ground truth tumor segmentation file images_segm\UPENN-GBM-00006_11_segm.nii.gz
Reading the paper published with this dataset¹, we know that there are 3 labels —
- Necrotic Tumor Core (NCR) — Red area in the image
- Enhancing Tumor (ET) — Yellow area in the image
- Peritumoral Edematous/Infiltrated tissue (ED) — Edema for short, green area in the image
Each of these areas will have an integer value associated with it, and nnUNet requires them to be consecutive integers, with the black background having a value of zero.
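The requirement is easy to state as a check. Here is a minimal sketch in plain Python; find_missing_labels is a hypothetical helper name, not part of nnUNet, and it takes the unique label values found in a segmentation file.

```python
def find_missing_labels(unique_values):
    """Return the integers missing from the range 0..max(unique_values).

    Hypothetical helper, not an nnUNet function. An empty result means the
    labels are consecutive and start at 0 (the background).
    """
    present = {int(v) for v in unique_values}
    if not present:
        return []
    expected = set(range(max(present) + 1))
    return sorted(expected - present)


# Consecutive labels: nothing is missing
print(find_missing_labels([0, 1, 2, 3]))  # []
# Gaps are reported in ascending order
print(find_missing_labels([0.0, 2.0, 5.0]))  # [1, 3, 4]
```

The values can be floats, as they are when a label array is read as floating point; the helper converts them to integers before comparing.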
Now, let’s open the segmentation file with Python and see what the integer labels for each of these segmented areas are.
- Open a new Colab notebook — 02_t501_glio_process_labels.ipynb
- Mount the Google Drive
from google.colab import drive
drive.mount('/content/drive')
- Import the required packages. We will use the nibabel package to read and write the segmentation files.
import nibabel as nib
import numpy as np
import os
import pathlib
- Set up the file path and make sure you can access the segmentation files
base_img_path = '/content/drive/MyDrive/TCIA/nnUNet/nnUNet_raw/Dataset501_Glioblastoma/labelsTr/'
tr_label_files = os.listdir(base_img_path)
tr_label_files.sort()
# print(tr_label_files)
print(len(tr_label_files))
- Check the label values for each training image. This is going to print the min and max label values, and also the actual unique label values.
for file in tr_label_files:
    file_name = base_img_path + file
    # Load image
    img = nib.load(file_name)
    print("-"*100)
    print("Patient label file: ", file)
    # Store image as a numpy array
    img_data = img.get_fdata()
    # Check array min, max and unique values for image label values
    print("Before label check")
    print(np.amin(img_data), np.amax(img_data))
    print(np.unique(img_data))
----------------------------------------------------------------------------------------------------
Patient label file: 9.nii.gz
Before label check
0.0 4.0
[0. 1. 2. 4.]
----------------------------------------------------------------------------------------------------
Patient label file: 21.nii.gz
Before label check
0.0 4.0
[0. 1. 2. 4.]
----------------------------------------------------------------------------------------------------
Patient label file: 29.nii.gz
Before label check
0.0 4.0
[0. 1. 2. 4.]
----------------------------------------------------------------------------------------------------
- As we can see, the unique label values are 0, 1, 2 and 4. Since nnUNet needs these values to be consecutive, it will fail unless we update the label value 4 to 3. We can do that using nibabel. We will read the segmentation file, update the label value 4 to 3, and write the file back to our labelsTr folder.
for file in tr_label_files:
    file_name = base_img_path + file
    # Load image
    img = nib.load(file_name)
    print("-"*100)
    print("Patient label file: ", file)
    # Store image as a numpy array
    img_data = img.get_fdata()
    # Check array min, max and unique values for image
    print("Before label check")
    print(np.amin(img_data), np.amax(img_data))
    print(np.unique(img_data))
    # Where label is 4, reset it to 3 so nnUNet doesn't complain about non-consecutive labels
    img_data[img_data == 4.0] = 3
    # Check min, max and unique values again
    print("After label conversion")
    print(np.amin(img_data), np.amax(img_data))
    print(np.unique(img_data))
    # Convert array back to a NIfTI image, reusing the original affine and header
    new_img = nib.Nifti1Image(img_data, img.affine, img.header)
    # The new path is the same as the input path, so the original file is overwritten
    new_img_path = base_img_path + file
    print('Saved to: ' + new_img_path)
    # Save the updated label file
    nib.save(new_img, new_img_path)
----------------------------------------------------------------------------------------------------
Patient label file: 100.nii.gz
Before label check
0.0 4.0
[0. 1. 2. 4.]
After label conversion
0.0 3.0
[0. 1. 2. 3.]
Saved to: /content/drive/MyDrive/TCIA/nnUNet/nnUNet_raw/Dataset501_Glioblastoma/labelsTr/100.nii.gz
----------------------------------------------------------------------------------------------------
Patient label file: 102.nii.gz
Before label check
0.0 4.0
[0. 1. 2. 4.]
After label conversion
0.0 3.0
[0. 1. 2. 3.]
Saved to: /content/drive/MyDrive/TCIA/nnUNet/nnUNet_raw/Dataset501_Glioblastoma/labelsTr/102.nii.gz
----------------------------------------------------------------------------------------------------
Patient label file: 105.nii.gz
Before label check
0.0 4.0
[0. 1. 2. 4.]
After label conversion
0.0 3.0
[0. 1. 2. 3.]
Saved to: /content/drive/MyDrive/TCIA/nnUNet/nnUNet_raw/Dataset501_Glioblastoma/labelsTr/105.nii.gz
----------------------------------------------------------------------------------------------------
- We are now done with the label file preprocessing and can move on to the nnUNet preprocessing steps.
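For datasets with more than one gap in the label values, the same idea can be generalized. The following is a rough sketch in plain Python, not something nnUNet or nibabel provides; build_consecutive_mapping and remap are hypothetical names, and the sketch assumes the background value 0 is among the values passed in.

```python
def build_consecutive_mapping(unique_values):
    """Map each label value to its rank, so the result is 0, 1, 2, ...

    Hypothetical helper, not part of nnUNet or nibabel.
    """
    ordered = sorted({int(v) for v in unique_values})
    return {old: new for new, old in enumerate(ordered)}


def remap(values, mapping):
    """Return a new list with every label replaced through the mapping."""
    return [mapping[int(v)] for v in values]


mapping = build_consecutive_mapping([0.0, 1.0, 2.0, 4.0])
print(mapping)                          # {0: 0, 1: 1, 2: 2, 4: 3}
print(remap([0, 4, 4, 1, 2], mapping))  # [0, 3, 3, 1, 2]
```

Build the mapping once from the label values seen across all training files rather than per file. Otherwise a file that happens to lack one label would get a different mapping from the rest.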
Generating metadata about our dataset
- Before we begin using nnUNet, we need to create a metadata (information about the data) file that will provide nnUNet with some basic configuration parameters. This file has to be called dataset.json. The nnUNet package provides a handy utility to generate this. Let’s generate this file now.
- Open a new Colab notebook — 03_t501_glio_gen_dataset_json.ipynb
- Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')
- Install the batchgenerators Python package, which is required by nnUNet. This is a pip command we are running on the Google Colab virtual machine (VM), so it needs to be prefixed by an !
!pip install batchgenerators
- Install the nnUNet v2 package
!pip install nnunetv2
- Set up the parameters required by the metadata file generation utility. We will provide the channel names for our MRI sequences, and also the label value mapping information. The label values follow the convention of the dataset: 1 is NCR, 2 is ED, and 3 (originally 4, before our remapping) is ET.
# Set up all the parameters required by the metadata generation utility
output_folder = '/content/drive/MyDrive/TCIA/nnUNet/nnUNet_raw/Dataset501_Glioblastoma'
channel_names = {0: 'T1', 1: 'T1GD', 2: 'T2', 3: 'FLAIR'}
# Label values after remapping 4 to 3: 1 = NCR, 2 = ED, 3 = ET
labels = {'background': 0, 'NCR': 1, 'ED': 2, 'ET': 3}
file_ending = '.nii.gz'
# Extra descriptive entry; it is written to dataset.json as-is under this key
region_class_order = {
    'background': 0,
    'whole tumor': (1, 2, 3),   # NCR + ED + ET
    'tumor core': (1, 3),       # NCR + ET
    'enhancing tumor': 3
}
num_training_cases = 117
dataset_name = 'UPENN-GBM'
description = 'UPENN-GBM tumor segmentation'
- Import the metadata generation utility and run it to produce the dataset.json file
from nnunetv2.dataset_conversion.generate_dataset_json import generate_dataset_json
generate_dataset_json(output_folder=output_folder, channel_names=channel_names, labels=labels,
                      file_ending=file_ending, region_class_order=region_class_order,
                      num_training_cases=num_training_cases, dataset_name=dataset_name,
                      description=description)
- You should now see a dataset.json file in the MyDrive/TCIA/nnUNet/nnUNet_raw/Dataset501_Glioblastoma folder. It should look something like this.
{
    "channel_names": {
        "0": "T1",
        "1": "T1GD",
        "2": "T2",
        "3": "FLAIR"
    },
    "labels": {
        "background": 0,
        "NCR": 1,
        "ED": 2,
        "ET": 3
    },
    "numTraining": 117,
    "file_ending": ".nii.gz",
    "name": "UPENN-GBM",
    "description": "UPENN-GBM tumor segmentation",
    "region_class_order": {
        "background": 0,
        "whole tumor": [
            1,
            2,
            3
        ],
        "tumor core": [
            1,
            3
        ],
        "enhancing tumor": 3
    }
}
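Before handing the file to nnUNet, it can be worth a quick sanity check of your own. Below is a minimal sketch that covers only the rules discussed in this post: consecutive label values, background equal to 0, and numTraining matching the number of label files. The function name check_dataset_json is hypothetical, and the sketch assumes every label value is a single integer, as it is here.

```python
import json
import os


def check_dataset_json(path, labels_dir=None):
    """Return a list of problems found in a dataset.json file (empty if none).

    Hypothetical helper, not part of nnUNet. Assumes integer label values.
    """
    with open(path) as f:
        meta = json.load(f)

    problems = []
    values = sorted(meta["labels"].values())
    if values != list(range(len(values))):
        problems.append(f"label values are not consecutive from 0: {values}")
    if meta["labels"].get("background") != 0:
        problems.append("background must have the value 0")

    if labels_dir is not None:
        # Count the label files that carry the declared file ending
        n_files = sum(name.endswith(meta["file_ending"])
                      for name in os.listdir(labels_dir))
        if n_files != meta["numTraining"]:
            problems.append(
                f"numTraining is {meta['numTraining']} but {labels_dir} "
                f"holds {n_files} label files")
    return problems


# Example usage with the folder from this post (uncomment in Colab):
# base = '/content/drive/MyDrive/TCIA/nnUNet/nnUNet_raw/Dataset501_Glioblastoma'
# print(check_dataset_json(base + '/dataset.json', base + '/labelsTr'))
```

An empty list means none of these particular checks failed. It does not replace the integrity check that nnUNet itself runs in the next step.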
Verifying dataset integrity and preprocessing
- Before model training, nnUNet checks that the required folders and files are set up correctly. It also preprocesses the images to standardize and normalize them. Both steps are performed by the nnUNetv2_plan_and_preprocess utility. The first time you run it, you should also check the dataset integrity.
- Open a new Colab notebook — 04_t501_glio_verify_data_integrity.ipynb
- Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')
- Install the nnUNet v2 package. Since the Colab notebook attaches to a new VM, we have to install this package every time we create a new notebook, or update the runtime type.
!pip install nnunetv2
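- Set up the environment variables and run the dataset verification and preprocessing. Remember, variables set through os.environ live only in the current runtime session: they are visible to later cells and to ! commands in this notebook, but they are not saved with it. Set them again in every new notebook and after every runtime restart.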
import os
os.environ['nnUNet_raw'] = "/content/drive/MyDrive/TCIA/nnUNet/nnUNet_raw"
os.environ['nnUNet_preprocessed'] = "/content/drive/MyDrive/TCIA/nnUNet/nnUNet_preprocessed"
os.environ['nnUNet_results'] = "/content/drive/MyDrive/TCIA/nnUNet/nnUNet_results"
# --verify_dataset_integrity is only needed the first time you preprocess the data
# After a successful plan and preprocess run, the flag can be dropped
!nnUNetv2_plan_and_preprocess -d 501 --verify_dataset_integrity
Fingerprint extraction...
Dataset501_Glioblastoma
Using <class 'nnunetv2.imageio.simpleitk_reader_writer.SimpleITKIO'> as reader/writer
####################
verify_dataset_integrity Done.
If you didn't see any error messages then your dataset is most likely OK!
####################
Using <class 'nnunetv2.imageio.simpleitk_reader_writer.SimpleITKIO'> as reader/writer
100% 117/117 [02:29<00:00, 1.28s/it]
Experiment planning...
2D U-Net configuration:
{'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 105, 'patch_size': array([192, 160]), 'median_image_size_in_voxels': array([172., 137.]), 'spacing': array([1., 1.]), 'normalization_schemes': ['ZScoreNormalization', 'ZScoreNormalization', 'ZScoreNormalization', 'ZScoreNormalization'], 'use_mask_for_norm': [True, True, True, True], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': (2, 2, 2, 2, 2, 2), 'n_conv_per_stage_decoder': (2, 2, 2, 2, 2), 'num_pool_per_axis': [5, 5], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True}
Using <class 'nnunetv2.imageio.simpleitk_reader_writer.SimpleITKIO'> as reader/writer
3D fullres U-Net configuration:
{'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 2, 'patch_size': array([128, 160, 112]), 'median_image_size_in_voxels': array([140., 172., 137.]), 'spacing': array([1., 1., 1.]), 'normalization_schemes': ['ZScoreNormalization', 'ZScoreNormalization', 'ZScoreNormalization', 'ZScoreNormalization'], 'use_mask_for_norm': [True, True, True, True], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': (2, 2, 2, 2, 2, 2), 'n_conv_per_stage_decoder': (2, 2, 2, 2, 2), 'num_pool_per_axis': [5, 5, 4], 'pool_op_kernel_sizes': [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 1]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'unet_max_num_features': 320, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': False}
Plans were saved to /content/drive/MyDrive/TCIA/nnUNet/nnUNet_preprocessed/Dataset501_Glioblastoma/nnUNetPlans.json
Preprocessing...
Preprocessing dataset Dataset501_Glioblastoma
Configuration: 2d...
100% 117/117 [05:28<00:00, 2.81s/it]
Configuration: 3d_fullres...
100% 117/117 [05:25<00:00, 2.78s/it]
Configuration: 3d_lowres...
INFO: Configuration 3d_lowres not found in plans file nnUNetPlans.json of dataset Dataset501_Glioblastoma. Skipping.
- If you have followed along so far and set up all your folders and files as described in the earlier articles, you should see a message like the one above saying that your dataset is most likely OK. If there are issues such as mismatched file or folder names, or training images without corresponding label files, this step will fail, and you will need to fix the error before proceeding.
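- After verifying the dataset integrity, nnUNet plans the training configurations. Here, it produced plans for 2d and 3d_fullres and skipped 3d_lowres. nnUNet only adds the low resolution configuration when the images are much larger than the 3d_fullres patch. In this dataset the patch (128 x 160 x 112 voxels) already covers most of the median image (140 x 172 x 137 voxels).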
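The planning log above contains the two numbers needed to see how much of a typical image one 3d_fullres patch covers. Here is a small sketch of that arithmetic; the helper name patch_coverage is made up for illustration and is not an nnUNet function.

```python
from math import prod

# Values copied from the 3D fullres planning log above
patch_size = (128, 160, 112)
median_image_size = (140, 172, 137)


def patch_coverage(patch, image):
    """Fraction of the image volume that a single patch covers."""
    return prod(patch) / prod(image)


coverage = patch_coverage(patch_size, median_image_size)
print(f"{coverage:.1%}")  # 69.5%
```

A patch that already covers roughly 70% of the median image leaves little for a low resolution stage to add, which is consistent with nnUNet skipping 3d_lowres here.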
This was a long post, but I hope you were able to follow along. We are now ready to begin model training, where we can sit back and let nnUNet do the work. We will continue the process in our next post.