This is just a way to abstract the data, so that we can use the same function calls inside the neural network code regardless of the input data type. It's supposed to help parallelise preprocessing and training. In my experience, however, it is better to do the preprocessing once, save the preprocessed data, and use that directly: on-the-fly preprocessing is slow.
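For example, the preprocess-once approach can be as simple as dumping tensors to disk with torch.save and loading them back before training; preprocess() and raw_data below are placeholders for your own pipeline:
import torch

# Run once, offline:
X, y = preprocess(raw_data)  # placeholder for your own preprocessing pipeline
torch.save({"X": X, "y": y}, "preprocessed.pt")

# Then, in the training script:
data = torch.load("preprocessed.pt")
X, y = data["X"], data["y"]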
How to get the data in the training/test methods
Let's start from the end.
Iterate over the data:
for batch_number, (X, y) in enumerate(dataloader):
    ...
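In a full training loop, one pass over the dataloader typically looks like the sketch below; model, loss_fn and optimizer are assumed to be defined elsewhere:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

for batch_number, (X, y) in enumerate(dataloader):
    X, y = X.to(device), y.to(device)  # move the batch to the GPU if there is one
    prediction = model(X)              # model, loss_fn, optimizer defined elsewhere
    loss = loss_fn(prediction, y)
    optimizer.zero_grad()              # reset gradients from the previous batch
    loss.backward()                    # backpropagate
    optimizer.step()                   # update the weights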
Create a custom Dataset:
If we can generate the data with a function {python} generate_data() that returns the tuple (input_data, label):
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, size):
        self.size = size  # how many samples the dataset reports

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        # generate a fresh sample on every access
        input_data, label = generate_data()
        return input_data, label
my_dataloader = DataLoader(CustomDataset(size=1000), batch_size=32, shuffle=True)
However, I still recommend generating X and y yourself: it is simply easier to handle, and not every library is able to utilise Datasets.
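One way to do that, sketched here with the same generate_data() purely for illustration, is to build the full tensors up front and wrap them in a TensorDataset:
import torch
from torch.utils.data import TensorDataset, DataLoader

# generate the whole dataset once, instead of per item
pairs = [generate_data() for _ in range(1000)]
X = torch.stack([torch.as_tensor(input_data) for input_data, _ in pairs])
y = torch.as_tensor([label for _, label in pairs])

my_dataloader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)
Here X and y are ordinary tensors, so they can also be passed straight to libraries that do not understand Datasets.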
If we have a list of file paths:
# these files provide both the image and its label
filepaths = ["file_1.nii.gz", "file_2.nii.gz", ...]
You will want to divide the files into training, validation and testing sets. You then initialise one dataset and one dataloader for each.
Dividing:
from sklearn.model_selection import train_test_split

VAL_SPLIT_PERCENTAGE = 0.15   # example values; adjust to your needs
TEST_SPLIT_PERCENTAGE = 0.15

filepaths_train, filepaths_non_train = train_test_split(filepaths, test_size=VAL_SPLIT_PERCENTAGE + TEST_SPLIT_PERCENTAGE)
filepaths_validation, filepaths_test = train_test_split(filepaths_non_train, test_size=TEST_SPLIT_PERCENTAGE / (TEST_SPLIT_PERCENTAGE + VAL_SPLIT_PERCENTAGE))
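As a sanity check: with, say, 1000 file paths and the 70/15/15 split above, the three lists should come out at roughly 700, 150 and 150 entries:
print(len(filepaths_train), len(filepaths_validation), len(filepaths_test))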
import nibabel as nib
import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, filepaths):
        self.filepaths = filepaths

    def __len__(self):
        return len(self.filepaths)

    def __getitem__(self, idx):
        # load the file for this index
        filepath = self.filepaths[idx]
        data = nib.load(filepath)
        image = data.get_fdata()  # the image volume as a NumPy array
        # how you extract the label depends on how your files store it
        # (e.g. a header extension or a companion segmentation file)
        label = ...
        # any on-the-fly transformations go here
        ...
        # don't forget to convert to torch tensors in the right shape,
        # e.g. image = torch.from_numpy(image).float()
        return image, label
training_dataloader = DataLoader(CustomDataset(filepaths_train), batch_size=32, shuffle=True)
validation_dataloader = DataLoader(CustomDataset(filepaths_validation), batch_size=8, shuffle=False)
test_dataloader = DataLoader(CustomDataset(filepaths_test), batch_size=8, shuffle=False)
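If you do keep the loading on the fly, DataLoader's num_workers argument at least runs it in parallel background processes, which mitigates the slowness mentioned at the top; 4 here is just an example value:
training_dataloader = DataLoader(
    CustomDataset(filepaths_train),
    batch_size=32,
    shuffle=True,
    num_workers=4,  # prefetch batches in 4 background processes
)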