
Hi, I have been trying to run the challenge script on GPU. I am using 1 V100 GPU with 16 GB of memory, have set CUDA_VISIBLE_DEVICES=0, and have set `devices = 'cuda'` in the [script](https://github.com/FETS-AI/Challenge/blob/main/Task_1/FeTS_Challenge.py#L536). However, I keep encountering the following error; any idea what could be going wrong? Thanks!

Note: the script runs fine for `small_split.csv` but **NOT** for `partitioning_1.csv`.

```
Traceback (most recent call last):
  File "FeTS_Challenge.py", line 568, in <module>
    restore_from_checkpoint_folder = restore_from_checkpoint_folder)
  File "/root/github/Challenge/Task_1/fets_challenge/experiment.py", line 286, in run_challenge_experiment
    task_runner = copy(plan).get_task_runner(collaborator_data_loaders[col])
  File "/root/setup/envs/venv/lib/python3.7/site-packages/openfl/federated/plan/plan.py", line 389, in get_task_runner
    self.runner_ = Plan.build(**defaults)
  File "/root/setup/envs/venv/lib/python3.7/site-packages/openfl/federated/plan/plan.py", line 182, in build
    instance = getattr(module, class_name)(**settings)
  File "/root/setup/envs/venv/lib/python3.7/site-packages/openfl/federated/task/runner_fets_challenge.py", line 43, in __init__
    model, optimizer, train_loader, val_loader, scheduler, params = create_pytorch_objects(fets_config_dict, train_csv=train_csv, val_csv=val_csv, device=device)
  File "/root/setup/envs/venv/lib/python3.7/site-packages/GANDLF/compute/generic.py", line 55, in create_pytorch_objects
    ) = get_class_imbalance_weights(parameters["training_data"], parameters)
  File "/root/setup/envs/venv/lib/python3.7/site-packages/GANDLF/utils/tensor.py", line 357, in get_class_imbalance_weights
    loader_type="penalty",
  File "/root/setup/envs/venv/lib/python3.7/site-packages/GANDLF/data/ImagesFromDataFrame.py", line 200, in ImagesFromDataFrame
    subject.load()
  File "/root/setup/envs/venv/lib/python3.7/site-packages/torchio/data/subject.py", line 368, in load
    image.load()
  File "/root/setup/envs/venv/lib/python3.7/site-packages/torchio/data/image.py", line 498, in load
    tensor = torch.cat(tensors)
RuntimeError: [enforce fail at CPUAllocator.cpp:67] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 17856000 bytes. Error code 12 (Cannot allocate memory)
```

I do see the following in the log:

```
Device requested via CUDA_VISIBLE_DEVICES: 0
Total number of CUDA devices: 1
Device finally used: 0
Sending model to aforementioned device
Memory Total : 15.8 GB, Allocated: 0.3 GB, Cached: 0.3 GB
Device - Current: 0 Count: 1 Name: Tesla V100-PCIE-16GB Availability: True
```

Created by ambrish
Thanks! I was using 120 GB of CPU RAM, which it seems was insufficient. 150 GB appears to be adequate.
This is a CPU memory (RAM) error. I get the same issue with 32 GB of RAM, and it cannot be solved even by increasing the hard-disk space available for virtual paging. I can run it with 128 GB of RAM; I am not sure whether 64 GB is enough.
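Since the allocation that fails here happens while TorchIO loads every subject into CPU RAM (note the `DefaultCPUAllocator` in the traceback, not a CUDA allocator), a quick pre-flight check can flag an under-provisioned machine before the experiment starts. Below is a minimal sketch using only the Python standard library on Linux; the 128 GiB threshold is an assumption taken from the numbers reported in this thread, not an official requirement:

```python
import os

def total_ram_gb():
    """Return total physical RAM in GiB (Linux/POSIX only)."""
    page_size = os.sysconf("SC_PAGE_SIZE")   # bytes per memory page
    num_pages = os.sysconf("SC_PHYS_PAGES")  # total physical pages
    return page_size * num_pages / (1024 ** 3)

# Assumed threshold based on the reports in this thread
# (120 GB failed, ~128-150 GB worked for partitioning_1.csv).
REQUIRED_GB = 128

ram = total_ram_gb()
print(f"Total RAM: {ram:.1f} GiB")
if ram < REQUIRED_GB:
    print(f"Warning: less than {REQUIRED_GB} GiB of RAM detected; "
          "the full partitioning CSVs may fail with "
          "'Cannot allocate memory' (errno 12).")
```

Running this before launching `FeTS_Challenge.py` gives an early warning instead of a crash deep inside the data-loading stack.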
