Select the device on which training runs:
import torch

# define which device is used for training
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

model.to(device)  # do this later if the model is only defined later
torch.set_default_device(device)
print(f"Using {device} device. Every tensor created will be on {device} by default.")
MPS device: enables high-performance training on the GPU for macOS devices via the Metal programming framework. In practice this means Apple Silicon (M1 or newer) or another Metal-capable GPU.
Create tensors directly on the device (or set the default device for all torch tensors, as above):
torch.tensor(array, device=device)
instead of
torch.tensor(array).to(device)
If possible, move the entire dataset onto the GPU. GPUs have a lot of VRAM, and most datasets fit there comfortably.
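A minimal sketch of keeping the whole dataset in VRAM and slicing batches from it; X and y are assumed to be NumPy arrays (or lists) that fit into GPU memory:

import torch

features = torch.as_tensor(X, dtype=torch.float32, device=device)  # one-time host-to-GPU copy
labels = torch.as_tensor(y, dtype=torch.long, device=device)

batch_size = 64
for start in range(0, len(features), batch_size):
    xb = features[start:start + batch_size]  # already on the GPU, no per-step transfer
    yb = labels[start:start + batch_size]
    # ... forward/backward pass here

Compared to a DataLoader that copies each batch from CPU to GPU, this avoids the per-step transfer entirely.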
Where to place the .to(device)?
It is common practice that the model only takes care of the math and nothing else: the input is a tensor on the right device, and the output is a tensor as well.
The .to(device) calls therefore belong in the training and test loops, respectively (see the sketch below). Is it worth writing your own helper function for this, like Dominik did? No.
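A minimal sketch of a training loop with the device transfer inside the loop; model, train_loader, loss_fn, and optimizer are assumed to already exist:

model.to(device)
model.train()
for xb, yb in train_loader:
    xb, yb = xb.to(device), yb.to(device)  # the loop, not the model, handles the transfer
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    optimizer.step()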
You are coding models; there is no reason to create or run torch tensors anywhere other than on the device. Put
{python} torch.set_default_device(device)
right at the start of the file/main function.
What if I need to work with the model outputs?
This is debugging or showcase code, where interpretability and development speed matter more than performance, so any potential speedup from using the GPU is no longer worth it.
This should happen after the model has finished training, so the code has a first, "normal" part and then an analysis part, possibly even in the same file.
model.cpu()
torch.set_default_device("cpu")
# move any data you might need to the CPU as well,
# and don't forget to detach it from the computation graph
with torch.no_grad():
    model_output = model(some_input)
model_output = model_output.detach().cpu()  # detach() returns a new tensor, so reassign it
Is it pretty? No. But it is not something you should ever do outside of research or debugging functions, so I consider it okay.
Early stopping
Consider implementing early stopping (with patience) early on to reduce overall training time. Be aware that your performance statistics will then be less reliable, and you should consider using a separate test set.
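A minimal sketch of early stopping with patience; max_epochs, train_one_epoch, and evaluate (returning a validation loss) are assumed to exist in your own code:

best_val_loss = float("inf")
patience = 5
epochs_without_improvement = 0

for epoch in range(max_epochs):
    train_one_epoch(model)
    val_loss = evaluate(model)  # validation loss, not the final test metric
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping early after epoch {epoch}")
            break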
Utilize all GPUs:
One option is Ray (e.g. Ray Train on a Ray cluster). It has the advantage that it mostly only adds code and shouldn't require modifying the existing code.
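A rough sketch of wrapping an existing training loop with Ray Train, assuming the Ray 2.x API (TorchTrainer, ScalingConfig, prepare_model, prepare_data_loader); build_model and build_dataloader are hypothetical stand-ins for your own code, and the details should be checked against the Ray docs:

import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model, prepare_data_loader

def train_loop_per_worker(config):
    model = prepare_model(build_model())              # moves the model to its device and wraps it for DDP
    loader = prepare_data_loader(build_dataloader())  # adds a distributed sampler, moves batches to the device
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(config["epochs"]):
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            optimizer.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 10},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),  # one worker per GPU
)
trainer.fit()

The existing training loop stays essentially unchanged inside train_loop_per_worker; Ray handles process launching, device placement, and data sharding around it.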