Deep Learning based medical image segmentation — Part 3 — Google Colab setup
In our previous post, we downloaded the UPenn Glioblastoma MRI dataset from TCIA and uploaded the files to our Google Drive. We will now set up the folder structure required to train our model.
In the images_structural folder, you can see that there are 671 folders belonging to 630 patients; some patients have a second folder with a _21 suffix, indicating a follow-up scan. Each folder has 4 MRI files, one for each sequence — T1, T1-GD, T2, and FLAIR.
Our goal is to train a deep learning model that can segment the Glioblastoma tumor regions. We will compare the segmentation generated by our model to the ground truth segmentation that was created by expert radiologists. However, only 147 of the 630 patients have the expert ground truth images in the images_segm folder. We’ll need to address this prior to model training.
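If you want to verify these counts yourself once your Drive is mounted (mounting is covered below), a quick Python sketch like the following works. The paths under MyDrive/TCIA and the .nii.gz file extension are assumptions based on how we uploaded the dataset in the previous post, so adjust them to match your own Drive layout.
import os
# Assumed upload location from the previous post; adjust to your own Drive layout
base = "/content/drive/MyDrive/TCIA"
structural_dir = os.path.join(base, "images_structural")  # one folder per scan
segm_dir = os.path.join(base, "images_segm")              # expert ground truth files
scan_folders = [d for d in os.listdir(structural_dir)
                if os.path.isdir(os.path.join(structural_dir, d))]
segm_files = [f for f in os.listdir(segm_dir) if f.endswith(".nii.gz")]
print("Scan folders:", len(scan_folders))       # expect 671
print("Ground truth files:", len(segm_files))   # expect 147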
We are going to use nnUNet to perform the automated segmentation.
nnU-Net Overview
From the nnUNet website — “nnU-Net is a semantic segmentation method that automatically adapts to a given dataset. It will analyze the provided training cases and automatically configure a matching U-Net-based segmentation pipeline. No expertise required on your end! You can simply train the models and use them for your application.”
I highly recommend you read through the nnUNet website in detail to get an idea of its scope and capabilities. Here is an overview of the nnUNet folder configuration under my projects folder -
projects/nnUNet/nnUNet_raw/
├── Dataset501_Glioblastoma
├── Dataset505_Heart
├── Dataset510_Meningioma
- Each Dataset folder in turn contains the following sub-folders and files:
Dataset501_Glioblastoma/
├── dataset.json
├── imagesTr
├── (imagesTs)
└── labelsTr
- imagesTr contains the images belonging to the training cases. nnU-Net will perform pipeline configuration, training with cross-validation, postprocessing selection, and ensembling on this data.
- imagesTs (optional) contains the images that belong to the test cases. nnU-Net will not use this folder or the images inside to train, but this is a handy place to store your test images.
- labelsTr contains the ground truth segmentation maps for the training cases. Do not include labels for the test dataset here, or you will run into issues.
- dataset.json contains metadata of the dataset.
- All images, including label files, MUST be in the NIfTI format (.nii.gz)
- Each patient may have multiple MRI sequences (T1, T1-CE/GD, T2, FLAIR)
- The label files must contain segmentation maps with consecutive integer values, starting with 0 (0, 1, 2, 3, … n), where 0 is considered background.
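To make these requirements concrete, here is a minimal sketch of what a nnU-Net v2 dataset.json for this dataset could look like. The channel order, label names, and training count below are placeholders for illustration; we will generate the real file in the next post. Per the nnU-Net v2 documentation, each training image is named {CASE}_{XXXX}.nii.gz, where the four-digit XXXX is the channel index from channel_names, and the matching label file is simply {CASE}.nii.gz.
import json
# Sketch of a nnU-Net v2 dataset.json; the values are placeholders for illustration
dataset_json = {
    "channel_names": {       # one entry per input MRI sequence
        "0": "T1",
        "1": "T1GD",
        "2": "T2",
        "3": "FLAIR"
    },
    "labels": {              # consecutive integers, 0 is background
        "background": 0,
        "tumor": 1
    },
    "numTraining": 0,        # placeholder; set to the actual number of training cases
    "file_ending": ".nii.gz"
}
print(json.dumps(dataset_json, indent=2))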
Setting up the folders in Google Drive
Thankfully, Google Colab runs on Linux virtual machines (VMs), so we can run Linux commands to make things much easier. You do this by prefixing a command with an exclamation mark (!) in a Colab cell, which tells Colab to run it as an operating system command. We will run all our commands through the Colab interface when working with Google Drive folders. The Dataset ID I have chosen for this project is 501.
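For example, a couple of harmless commands confirm that we really are on a Linux VM:
!cat /etc/os-release   # shows the Ubuntu release the Colab VM is running
!pwd                   # the default working directory is /content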
1. Open a new Colab notebook — 00_t501_glio_folder_setup.ipynb
2. Mount your Google Drive. This will give you access to your “MyDrive” folder:
from google.colab import drive
drive.mount('/content/drive')
3. You can expand the File icon in the left frame of Colab to see what has been mounted.
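You can also confirm the mount from the command line. The TCIA folder below is where we uploaded the dataset in the previous post, so adjust the path if yours differs.
!ls /content/drive/MyDrive/TCIA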
Here’s the high-level folder structure I want to create. Version 2 of nnUNet has greatly simplified the folder structure.
MyDrive/TCIA/nnUNet                  # Main nnUNet folder
|-- nnUNet_raw                       # Raw files
|   |-- Dataset501_Glioblastoma
|       |-- imagesTr
|       |-- imagesTs
|       |-- labelsTr
|
|-- nnUNet_preprocessed              # Raw files processed before training
|
|-- nnUNet_results                   # Training plans and results
    |-- inference                    # Model inference output
    |-- postprocessed                # Final postprocessed segmentation
4. Run the “mkdir” Linux commands in your Colab cell to create this folder structure. The “-p” argument creates any missing parent folders and doesn’t complain if a folder already exists, which is nice. Remember, these are OS commands, so prefix them with an !
!mkdir -p /content/drive/MyDrive/TCIA/nnUNet/nnUNet_raw
!mkdir -p /content/drive/MyDrive/TCIA/nnUNet/nnUNet_raw/Dataset501_Glioblastoma
!mkdir -p /content/drive/MyDrive/TCIA/nnUNet/nnUNet_raw/Dataset501_Glioblastoma/imagesTr
!mkdir -p /content/drive/MyDrive/TCIA/nnUNet/nnUNet_raw/Dataset501_Glioblastoma/imagesTs
!mkdir -p /content/drive/MyDrive/TCIA/nnUNet/nnUNet_raw/Dataset501_Glioblastoma/labelsTr
!mkdir -p /content/drive/MyDrive/TCIA/nnUNet/nnUNet_preprocessed
!mkdir -p /content/drive/MyDrive/TCIA/nnUNet/nnUNet_results
!mkdir -p /content/drive/MyDrive/TCIA/nnUNet/nnUNet_results/inference
!mkdir -p /content/drive/MyDrive/TCIA/nnUNet/nnUNet_results/postprocessed
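If you prefer to stay in Python, the same structure can be created with pathlib; this is just an equivalent sketch of the mkdir commands above, not an additional step.
from pathlib import Path
base = Path("/content/drive/MyDrive/TCIA/nnUNet")
for sub in [
    "nnUNet_raw/Dataset501_Glioblastoma/imagesTr",
    "nnUNet_raw/Dataset501_Glioblastoma/imagesTs",
    "nnUNet_raw/Dataset501_Glioblastoma/labelsTr",
    "nnUNet_preprocessed",
    "nnUNet_results/inference",
    "nnUNet_results/postprocessed",
]:
    (base / sub).mkdir(parents=True, exist_ok=True)  # same behaviour as mkdir -p
Either way, a quick !find /content/drive/MyDrive/TCIA/nnUNet -type d should list every folder we just created.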
And here’s how that looks in my Google Drive. You may have to refresh the page to see the new folders.
We have now set up the basic folder structure required for nnUNet to run on our Glioblastoma dataset. Before we proceed to the next step, we need to make some big-picture decisions.
Train-Test Split — Unfortunately, only 147 patients out of the 630 have manually segmented ground truth files for us to compare with. We can take two approaches to our model development -
- Split the 147 patients into an 80:20 train:test proportion, or
- Set aside a small test cohort (10 patients is as good a number as any) and train the model on the remaining patients, which provides more training data for the model
In this project, we will split the 147 patients into training and test subsets, since the goal is to show you how to build the model; a minimal sketch of such a split follows below. You can try either approach, or both, and compare segmentation accuracy.
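As a preview of the first option, here is a minimal sketch of an 80:20 random split over the cases that have ground truth files. The images_segm path, the file naming, and the seed are assumptions; we will do the actual split in the next post.
import os, random
# Assumed path to the expert segmentations on Drive; adjust if yours differs
segm_dir = "/content/drive/MyDrive/TCIA/images_segm"
case_ids = sorted(f.replace(".nii.gz", "") for f in os.listdir(segm_dir) if f.endswith(".nii.gz"))
random.seed(42)                      # fixed seed so the split is reproducible
random.shuffle(case_ids)
n_train = int(0.8 * len(case_ids))   # 80:20 split over 147 cases -> 117 train, 30 test
train_ids, test_ids = case_ids[:n_train], case_ids[n_train:]
print(len(train_ids), "training cases,", len(test_ids), "test cases")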
Validation subset — nnUNet uses a 5-fold cross-validation technique, which means we don’t need a separate validation dataset.
In our next post, we will see how to split the image dataset randomly into training and test cohorts and create the metadata file, dataset.json, for nnUNet to begin its processing. I hope you find this interesting and follow along.