Official docs: https://docs.ray.io/en/latest/index.html
Ray is a framework for scaling machine learning applications across multiple devices/GPUs. I am using it because I can run my training method on a Ray cluster without having to modify it (with some small exceptions).
Implementation
If you think that you're leaking memory somewhere, you can set {python}@ray.remote(num_gpus=1/max_processes_per_gpu, max_calls=1) so that your workers get destroyed (and the OS cleans up potential memory leaks) after each task run. Better would obviously be to not leak memory in the first place.
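As a minimal sketch (using the same hypothetical max_processes_per_gpu and train_model as in the full example below), the decorated task would look like this:

```python
# sketch: same training task as in the full example, but with max_calls=1,
# so the worker process exits after every call and the OS reclaims any
# memory it may have leaked
@ray.remote(num_gpus=1 / max_processes_per_gpu, max_calls=1)
def ray_training_task(config):
    return train_model(config)
```

The full example below omits max_calls: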
```python
import ray

# tweak this parameter depending on how much GPU memory you have and how much each run requires
max_processes_per_gpu = 6

@ray.remote(num_gpus=1 / max_processes_per_gpu)
def ray_training_task(config):
    # train_model does not require any Ray-specific code and can still be run sequentially
    return train_model(config)

def run():
    ray.shutdown()  # stop any previous Ray session before starting a fresh one
    ray.init()
    X, y = get_dataset()
    config = {
        "num_epochs": EPOCHS,
        "X": X,
        "y": y,
    }
    # here you could pass different parameters per run if you want
    performance_metric_refs = [ray_training_task.remote(config) for _ in range(100)]
    performance_metrics = ray.get(performance_metric_refs)
    train_losses_averaged, val_losses_averaged, val_accuracies_aggregated = evaluate_performance_metrics(performance_metrics)
```
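If each run should use different hyperparameters, the list comprehension can build one config per task instead of reusing the same one. A small sketch, where learning_rate is a hypothetical key that train_model would have to read from the config:

```python
# hypothetical sweep: one task per candidate learning rate instead of
# 100 identical runs (train_model is assumed to use config["learning_rate"])
configs = [{**config, "learning_rate": lr} for lr in (1e-2, 1e-3, 1e-4)]
performance_metric_refs = [ray_training_task.remote(c) for c in configs]
performance_metrics = ray.get(performance_metric_refs)
```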
This is a very basic implementation. X and y should probably be shared between processes using {python}X_ref = ray.put(X), so that the dataset is stored in Ray's object store once rather than being serialized with every task call.
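A minimal sketch of that, assuming train_model is changed to accept the arrays as separate arguments instead of reading them out of config (Ray automatically dereferences ObjectRefs passed as top-level task arguments, but not refs nested inside a dict):

```python
import ray

ray.init()

X, y = get_dataset()

# store the dataset in the object store once instead of serializing it
# with every single task submission
X_ref = ray.put(X)
y_ref = ray.put(y)

@ray.remote(num_gpus=1 / max_processes_per_gpu)
def ray_training_task(config, X, y):
    # X and y arrive as plain arrays here: Ray resolves the top-level
    # ObjectRef arguments before calling the function
    return train_model(config, X, y)

config = {"num_epochs": EPOCHS}
performance_metric_refs = [ray_training_task.remote(config, X_ref, y_ref) for _ in range(100)]
performance_metrics = ray.get(performance_metric_refs)
```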