This is just a way to abstract the data, so that we can use the same function calls inside the neural network code regardless of the input data type. It's supposed to help parallelise preprocessing and training. In my experience, however, it is better to do the preprocessing once, save the preprocessed data, and use that directly: on-the-fly preprocessing is slow.
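For example, the preprocess-once approach can be as simple as dumping tensors to disk with torch.save and loading them back before training; preprocess() and raw_data below are placeholders for your own pipeline:
import torch

# Run once, offline:
X, y = preprocess(raw_data)  # placeholder for your own preprocessing pipeline
torch.save({"X": X, "y": y}, "preprocessed.pt")

# Then, in the training script:
data = torch.load("preprocessed.pt")
X, y = data["X"], data["y"]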
How to get the data in the training/test methods
Let's start from the end.
Iterate over the data:
for batch_number, (X, y) in enumerate(dataloader):
    ...
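In a full training loop, one pass over the dataloader typically looks like the sketch below; model, loss_fn and optimizer are assumed to be defined elsewhere:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

for batch_number, (X, y) in enumerate(dataloader):
    X, y = X.to(device), y.to(device)  # move the batch to the GPU if there is one
    prediction = model(X)              # model, loss_fn, optimizer defined elsewhere
    loss = loss_fn(prediction, y)
    optimizer.zero_grad()              # reset gradients from the previous batch
    loss.backward()                    # backpropagate
    optimizer.step()                   # update the weights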
Create a custom Dataset:
If we can generate the data with a function {python} generate_data() that returns the tuple (input_data, label):
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, size):
        self.size = size  # how many samples the dataset reports

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        # generate a fresh sample on every access
        input_data, label = generate_data()
        return input_data, label
my_dataloader = DataLoader(CustomDataset(size=1000), batch_size=32, shuffle=True)
However, I still recommend generating X and y yourself: it is simply easier to handle, and not every library is able to utilise Datasets.
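One way to do that, sketched here with the same generate_data() purely for illustration, is to build the full tensors up front and wrap them in a TensorDataset:
import torch
from torch.utils.data import TensorDataset, DataLoader

# generate the whole dataset once, instead of per item
pairs = [generate_data() for _ in range(1000)]
X = torch.stack([torch.as_tensor(input_data) for input_data, _ in pairs])
y = torch.as_tensor([label for _, label in pairs])

my_dataloader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)
Here X and y are ordinary tensors, so they can also be passed straight to libraries that do not understand Datasets.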
If we have a list of file paths:
# these files provide both the image and its label
filepaths = ["file_1.nii.gz", "file_2.nii.gz", ...]
You will want to divide the files into training, validation and testing sets. You then initialise one dataset and one dataloader for each.
Dividing:
from sklearn.model_selection import train_test_split

VAL_SPLIT_PERCENTAGE = 0.15   # example values; adjust to your needs
TEST_SPLIT_PERCENTAGE = 0.15

filepaths_train, filepaths_non_train = train_test_split(filepaths, test_size=VAL_SPLIT_PERCENTAGE + TEST_SPLIT_PERCENTAGE)
filepaths_validation, filepaths_test = train_test_split(filepaths_non_train, test_size=TEST_SPLIT_PERCENTAGE / (TEST_SPLIT_PERCENTAGE + VAL_SPLIT_PERCENTAGE))
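As a sanity check: with, say, 1000 file paths and the 70/15/15 split above, the three lists should come out at roughly 700, 150 and 150 entries:
print(len(filepaths_train), len(filepaths_validation), len(filepaths_test))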
import nibabel as nib
import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, filepaths):
        self.filepaths = filepaths

    def __len__(self):
        return len(self.filepaths)

    def __getitem__(self, idx):
        # load the file for this index
        filepath = self.filepaths[idx]
        data = nib.load(filepath)
        image = data.get_fdata()  # the image volume as a NumPy array
        # how you extract the label depends on how your files store it
        # (e.g. a header extension or a companion segmentation file)
        label = ...
        # any on-the-fly transformations go here
        ...
        # don't forget to convert to torch tensors in the right shape,
        # e.g. image = torch.from_numpy(image).float()
        return image, label
training_dataloader = DataLoader(CustomDataset(filepaths_train), batch_size=32, shuffle=True)
validation_dataloader = DataLoader(CustomDataset(filepaths_validation), batch_size=8, shuffle=False)
test_dataloader = DataLoader(CustomDataset(filepaths_test), batch_size=8, shuffle=False)
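If you do keep the loading on the fly, DataLoader's num_workers argument at least runs it in parallel background processes, which mitigates the slowness mentioned at the top; 4 here is just an example value:
training_dataloader = DataLoader(
    CustomDataset(filepaths_train),
    batch_size=32,
    shuffle=True,
    num_workers=4,  # prefetch batches in 4 background processes
)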