Official docs: https://docs.ray.io/en/latest/index.html
Ray is a framework for scaling machine learning applications across multiple devices/GPUs. I am using it because I can run my training method on a Ray cluster without having to modify it (with some small exceptions).
Implementation
If you think that you're leaking memory somewhere, you can set {python}@ray.remote(num_gpus=1/max_processes_per_gpu, max_calls=1) so that your workers get destroyed (and the OS cleans up potential memory leaks) after each task run. Better would obviously be to not leak memory in the first place.
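As a minimal sketch (using the same hypothetical max_processes_per_gpu and train_model as in the full example below), the decorated task would look like this:

```python
# sketch: same training task as in the full example, but with max_calls=1,
# so the worker process exits after every call and the OS reclaims any
# memory it may have leaked
@ray.remote(num_gpus=1 / max_processes_per_gpu, max_calls=1)
def ray_training_task(config):
    return train_model(config)
```

The full example below omits max_calls: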
```python
import ray

# tweak this parameter depending on how much GPU memory you have and how much each run requires
max_processes_per_gpu = 6

@ray.remote(num_gpus=1 / max_processes_per_gpu)
def ray_training_task(config):
    # train_model does not require any Ray-specific code and can still be run sequentially
    return train_model(config)

def run():
    ray.shutdown()  # stop any previous Ray session before starting a fresh one
    ray.init()
    X, y = get_dataset()
    config = {
        "num_epochs": EPOCHS,
        "X": X,
        "y": y,
    }
    # here you could pass different parameters per run if you want
    performance_metric_refs = [ray_training_task.remote(config) for _ in range(100)]
    performance_metrics = ray.get(performance_metric_refs)
    train_losses_averaged, val_losses_averaged, val_accuracies_aggregated = evaluate_performance_metrics(performance_metrics)
```
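If each run should use different hyperparameters, the list comprehension can build one config per task instead of reusing the same one. A small sketch, where learning_rate is a hypothetical key that train_model would have to read from the config:

```python
# hypothetical sweep: one task per candidate learning rate instead of
# 100 identical runs (train_model is assumed to use config["learning_rate"])
configs = [{**config, "learning_rate": lr} for lr in (1e-2, 1e-3, 1e-4)]
performance_metric_refs = [ray_training_task.remote(c) for c in configs]
performance_metrics = ray.get(performance_metric_refs)
```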
This is a very basic implementation. X and y should probably be shared between processes using {python}X_ref = ray.put(X), so that the dataset is stored in Ray's object store once rather than being serialized with every task call.
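A minimal sketch of that, assuming train_model is changed to accept the arrays as separate arguments instead of reading them out of config (Ray automatically dereferences ObjectRefs passed as top-level task arguments, but not refs nested inside a dict):

```python
import ray

ray.init()

X, y = get_dataset()

# store the dataset in the object store once instead of serializing it
# with every single task submission
X_ref = ray.put(X)
y_ref = ray.put(y)

@ray.remote(num_gpus=1 / max_processes_per_gpu)
def ray_training_task(config, X, y):
    # X and y arrive as plain arrays here: Ray resolves the top-level
    # ObjectRef arguments before calling the function
    return train_model(config, X, y)

config = {"num_epochs": EPOCHS}
performance_metric_refs = [ray_training_task.remote(config, X_ref, y_ref) for _ in range(100)]
performance_metrics = ray.get(performance_metric_refs)
```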