Deep Learning based medical image segmentation — Part 4 — Data Setup
Now that we have downloaded the Glioblastoma MRI dataset from TCIA, saved it to our Google Drive, and created the base folders required by nnUNet, we are ready to set up the dataset for pre-processing and training.
First, we need to randomly split the dataset into training and test cohorts. A few points around the train:test split —
- We will not have a separate validation cohort since nnUNet utilizes a 5-fold cross-validation technique.
- I have not analyzed the image data for any specific clinical characteristics (such as tumor size or grade) to determine whether a random train:test split is the right approach, or whether the images should be stratified for better representation. For now, we will go with a random split and can evaluate that choice later. (Grade is unlikely to matter here anyway, since glioblastomas are all Grade 4 gliomas.)
- The authors of the original paper have only released 147 ground truth expert segmentation maps out of the full 630 patient cohort. We will limit ourselves to this group of 147 patients so we can compare our model generated segmentation with the expert segmentation.
- I have randomly chosen 20% of the 147 patient cohort to be my test set. If you run my code, your randomized test set and eventual results may be different from mine. If you want to exactly duplicate my results, I will provide the test patient IDs and you can use the same set instead of randomly generating them.
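If you want your own runs to be repeatable without hard-coding patient IDs, one option (my suggestion, not part of the original workflow) is to seed Python's random number generator before shuffling. The seed value and the sample IDs below are arbitrary placeholders:

```python
import random

# Seeding makes random.shuffle() deterministic: the same seed always
# produces the same ordering, and therefore the same train:test split.
random.seed(42)  # 42 is an arbitrary choice

sample = ['UPENN-GBM-00001', 'UPENN-GBM-00002', 'UPENN-GBM-00003',
          'UPENN-GBM-00004', 'UPENN-GBM-00005']
random.shuffle(sample)
print(sample)  # identical order on every run
```

Note that seeding only guarantees reproducibility on your own machine and Python version; to match my exact results, use the test patient IDs I provide.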
Alright, let’s get into it. We need to set up four different sets of files in the right file-name format and move them to the right folders for nnUNet to process —
- Training segmentation files
- Test segmentation files — not used by nnUNet, just for better organization
- Training MRI scans
- Test MRI scans
Preparation
- Create a new Colab notebook — 01_t501_glio_data_setup.ipynb
- Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')
- Import necessary Python packages
import os
import random
Randomizing the training:test data cohorts
- The Ground Truth segmentation files will drive all our next steps. We have 147 files. Let’s get the patient ID number from the file names, and put them in a list. Then we can randomly select 20% of the list to be our test dataset, and 80% to be our training dataset. Based on the patient IDs selected, we will need to move the MRI scans and labels for those patients to the training or testing folders. Remember, if you run this code, your randomly selected test dataset will be different from mine. Also nnUNet will not use the ground truth segmentation (label) files in the test dataset for its processing.
# Read all the ground truth segmentation files into a list
segm_src_dir = '/content/drive/MyDrive/TCIA/UPENN-GBM/images_segm/'
segm_file_list = os.listdir(segm_src_dir)
# Shuffle the list so the order is randomized
random.shuffle(segm_file_list)
# Set up the training data ratio and split the list at that point
train_ratio = 0.8
elements = len(segm_file_list)
train_elements = int(elements * train_ratio)
# Create new list of segmentation files that will be used for training
# nnUNet does not need the segmentation files for the test dataset, but we will need it later for running our metrics
train_segm_list = segm_file_list[:train_elements] # Training set - From 0 - 80%
test_segm_list = segm_file_list[train_elements:] # Test set - From 80 - 100%
# Check the element count and that the order is randomized
print(train_segm_list)
print(len(train_segm_list))
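As a sanity check, the two cohorts should be disjoint and together cover all 147 files. A self-contained sketch of that check (the generated file names below are stand-ins for the real ones):

```python
import random

# Stand-in for the 147 ground truth segmentation file names
segm_file_list = [f'UPENN-GBM-{i:05d}_11_segm.nii.gz' for i in range(1, 148)]
random.shuffle(segm_file_list)

train_ratio = 0.8
train_elements = int(len(segm_file_list) * train_ratio)
train_segm_list = segm_file_list[:train_elements]
test_segm_list = segm_file_list[train_elements:]

# The two cohorts must not overlap, and together must cover every file
assert set(train_segm_list).isdisjoint(test_segm_list)
assert len(train_segm_list) + len(test_segm_list) == len(segm_file_list)
print(len(train_segm_list), len(test_segm_list))  # 117 30
```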
Training segmentation (label) files:
- Use Python to generate Unix/Linux commands to move the training segmentation files to the right folders. This is a quick-and-dirty way of moving the right files to the right directories; you can look at the nnUNet code for a more elegant approach using Python packages. nnUNet requires the segmentation files to be named patientID.nii.gz. We will use Python print statements to generate the Unix file-move commands.
tr_segm_target_dir = '/content/drive/MyDrive/TCIA/nnUNet/nnUNet_raw/Dataset501_Glioblastoma/labelsTr/'
# Filename is like this - UPENN-GBM-00428_11_segm.nii.gz
for file in train_segm_list:
    tr_segm_file = file.split('_')[0]        # returns UPENN-GBM-00428
    patient_id = tr_segm_file.split('-')[2]  # returns 00428
    just_id = int(patient_id)                # returns 428 - int() drops the leading zeroes
    print('!mv ' + segm_src_dir + file + ' ' + tr_segm_target_dir + str(just_id) + '.nii.gz')
- Copy all the move commands and paste them into a new cell. Run the cell. All your training label files are now moved to the labelsTr folder.
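If you would rather skip the copy-paste step, the same move can be done directly in Python with `shutil.move`. A minimal sketch, using a hypothetical helper with the same directory variables as above:

```python
import os
import shutil

def move_segm_files(file_list, src_dir, target_dir):
    """Rename segmentation files to nnUNet's patientID.nii.gz format and move them."""
    for file in file_list:
        # UPENN-GBM-00428_11_segm.nii.gz -> 428
        just_id = int(file.split('_')[0].split('-')[2])
        shutil.move(os.path.join(src_dir, file),
                    os.path.join(target_dir, f'{just_id}.nii.gz'))

# Example call, assuming the variables defined earlier:
# move_segm_files(train_segm_list, segm_src_dir, tr_segm_target_dir)
```

Unlike the printed `!mv` commands, this runs in one cell with no manual step, but it also gives you no chance to eyeball the commands first.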
Test segmentation files
We can do the same for the segmentation files (label files) belonging to the test cohort, although nnUNet will not use it. It’s just a nice way of organizing all our files for later use.
# This is not required by nnUNet, but I am setting it up to use later for our metrics
ts_segm_target_dir = '/content/drive/MyDrive/TCIA/nnUNet/nnUNet_raw/Dataset501_Glioblastoma/labelsTs/'
for file in test_segm_list:
    ts_segm_file = file.split('_')[0]        # Split at the first underscore and discard the _11_segm.nii.gz part
    patient_id = ts_segm_file.split('-')[2]  # Remove the UPENN-GBM prefix and keep just the ID
    just_id = int(patient_id)                # int() drops the leading zeroes
    print('!mv ' + segm_src_dir + file + ' ' + ts_segm_target_dir + str(just_id) + '.nii.gz')
- Copy the move commands generated for the test segmentation files into a new cell and run them.
Training MRI scans
- Now that we have moved the segmentation files, let’s focus on the training MRI scans. They need to be moved into the imagesTr folder. nnUNet requires MRI scan files to be named patientID_XXXX.nii.gz, where XXXX is a 4-digit channel identifier for the MRI sequence. For example, the raw T1 scan for patient #428 is named UPENN-GBM-00428_11_T1.nii.gz; it will need to become 428_0000.nii.gz.
- Our mapping of the MRI sequence to nnUNet sequence is — T1:0000, T1-GD:0001, T2:0002, and FLAIR:0003. Now, let’s find the MRI scans for patients in our training cohort and move them to the right nnUNet folder.
# Set up source and target directories for training files
mri_img_src_dir = '/content/drive/MyDrive/TCIA/UPENN-GBM/images_structural/'
tr_img_target_dir = '/content/drive/MyDrive/TCIA/nnUNet/nnUNet_raw/Dataset501_Glioblastoma/imagesTr/'
# Loop through training patient list and generate the move commands
# Each patient has their own folder under "images_structural", that looks like this - UPENN-GBM-00428_11
for patient in train_segm_list:
    patient_id = patient.split('_')[0]                # UPENN-GBM-00428_11_segm.nii.gz -> UPENN-GBM-00428
    patient_dir = patient_id + '_11'                  # Add the "_11" to get the MRI folder name
    patient_dir_path = mri_img_src_dir + patient_dir  # Full path of the images source directory
    just_id = int(patient_id.split('-')[2])           # Get the patient ID # and drop the leading zeroes
    print('!mv ' + patient_dir_path + '/' + patient_dir + '_T1.nii.gz ' + tr_img_target_dir + str(just_id) + '_0000.nii.gz')
    print('!mv ' + patient_dir_path + '/' + patient_dir + '_T1GD.nii.gz ' + tr_img_target_dir + str(just_id) + '_0001.nii.gz')
    print('!mv ' + patient_dir_path + '/' + patient_dir + '_T2.nii.gz ' + tr_img_target_dir + str(just_id) + '_0002.nii.gz')
    print('!mv ' + patient_dir_path + '/' + patient_dir + '_FLAIR.nii.gz ' + tr_img_target_dir + str(just_id) + '_0003.nii.gz')
- Copy the move commands into a new cell and run them.
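The four repetitive print statements per patient can also be driven by a small dictionary encoding the modality-to-channel mapping, which makes the mapping easy to audit or extend. A sketch with a hypothetical helper (the folder names in the example call are placeholders):

```python
# Modality suffix in the UPENN-GBM file names -> nnUNet 4-digit channel suffix
MODALITY_MAP = {'T1': '0000', 'T1GD': '0001', 'T2': '0002', 'FLAIR': '0003'}

def mv_commands_for_patient(patient_dir, src_root, target_dir):
    """Generate the Unix mv commands for one patient's four MRI scans."""
    just_id = int(patient_dir.split('_')[0].split('-')[2])  # UPENN-GBM-00428_11 -> 428
    return ['!mv {0}{1}/{1}_{2}.nii.gz {3}{4}_{5}.nii.gz'.format(
                src_root, patient_dir, modality, target_dir, just_id, channel)
            for modality, channel in MODALITY_MAP.items()]

for cmd in mv_commands_for_patient('UPENN-GBM-00428_11', 'src/', 'dst/'):
    print(cmd)
```

Keeping the mapping in one place means that if a fifth sequence were ever added, only the dictionary would change.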
Test MRI Scans
- Now do the same for the test MRI scans
# Set up source and target directories for test files
mri_img_src_dir = '/content/drive/MyDrive/TCIA/UPENN-GBM/images_structural/'
ts_img_target_dir = '/content/drive/MyDrive/TCIA/nnUNet/nnUNet_raw/Dataset501_Glioblastoma/imagesTs/'
# Loop through the test patient list and generate the move commands
# Each patient has their own folder under "images_structural", named like UPENN-GBM-00428_11
for patient in test_segm_list:
    patient_id = patient.split('_')[0]                # UPENN-GBM-00428_11_segm.nii.gz -> UPENN-GBM-00428
    patient_dir = patient_id + '_11'                  # Add the "_11" to get the MRI folder name
    patient_dir_path = mri_img_src_dir + patient_dir  # Full path of the images source directory
    just_id = int(patient_id.split('-')[2])           # Get the patient ID # and drop the leading zeroes
    print('!mv ' + patient_dir_path + '/' + patient_dir + '_T1.nii.gz ' + ts_img_target_dir + str(just_id) + '_0000.nii.gz')
    print('!mv ' + patient_dir_path + '/' + patient_dir + '_T1GD.nii.gz ' + ts_img_target_dir + str(just_id) + '_0001.nii.gz')
    print('!mv ' + patient_dir_path + '/' + patient_dir + '_T2.nii.gz ' + ts_img_target_dir + str(just_id) + '_0002.nii.gz')
    print('!mv ' + patient_dir_path + '/' + patient_dir + '_FLAIR.nii.gz ' + ts_img_target_dir + str(just_id) + '_0003.nii.gz')
- Copy the move commands into a new cell and run them.
Phew! We have finally set up the dataset in the right format and moved it into the right folders. Now the fun begins: we can start running nnUNet to pre-process, train, and predict tumor segmentations. More in the next article. Please follow along and let me know if you have any questions or run into issues setting this up.