Argos-train can't find NVIDIA driver on GPU

Hello,

Argos-train does not find the NVIDIA driver installed on my GPU system and returns the following error message when run:

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

Moreover, argos-train gives a list index out of range error message at the end of a session:

IndexError: list index out of range

The output of the lsmod | grep nvidia command confirms that the NVIDIA driver is installed on my GPU PC. However, I cannot tell whether this IndexError is caused by the NVIDIA RuntimeError or not.

Any help or hint toward solving this issue would be very much appreciated.


I’m not sure what the issue could be. What is the output of nvidia-smi?

What’s the stack trace for the list out of range error?

Thank you for your reply, argosopentech. Here is the output of nvidia-smi:

$ nvidia-smi
Wed Mar 8 08:19:04 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 30%   28C    P8    10W / 220W |     10MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

As you can see, it finds no running process, even though conda is installed (but not anaconda) on my machine. Could this be part of the problem?

The stack trace of the out of range error is:

"
Traceback (most recent call last):
  File "/home/argosopentech/env/bin/onmt_train", line 33, in <module>
    sys.exit(load_entry_point('OpenNMT-py', 'console_scripts', 'onmt_train')())
  File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 172, in main
    train(opt)
  File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 157, in train
    train_process(opt, device_id=0)
  File "/home/argosopentech/OpenNMT-py/onmt/train_single.py", line 64, in main
    configure_process(opt, device_id)
  File "/home/argosopentech/OpenNMT-py/onmt/train_single.py", line 19, in configure_process
    torch.cuda.set_device(device_id)
  File "/home/argosopentech/env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
  File "/home/argosopentech/env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
Traceback (most recent call last):
  File "/home/argosopentech/env/bin/argos-train", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/argosopentech/argos-train/bin/argos-train", line 20, in <module>
    train.train(from_code, to_code, from_name, to_name, version, package_version, argos_version, data_exists, epochs_count)
  File "/home/argosopentech/argos-train/argostrain/train.py", line 173, in train
    str(opennmt_checkpoints[-2].f),
IndexError: list index out of range
"
Does this help?


It looks like the out of range error happens because the training fails, so opennmt_checkpoints has fewer than two entries when argos-train indexes [-2]. The Nvidia driver issue is the root cause.
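As a rough illustration of why a failed training run ends in this IndexError, here is a minimal sketch. The function name pick_checkpoint and the plain list of strings are hypothetical stand-ins, not the actual argos-train code; the only detail taken from the stack trace is the [-2] index on the checkpoint list.

```python
def pick_checkpoint(checkpoints):
    """Return the second-to-last checkpoint, or fail with a clearer message.

    Hypothetical helper for illustration only; argos-train indexes its
    checkpoint list directly with [-2], as the stack trace shows.
    """
    if len(checkpoints) < 2:
        raise RuntimeError(
            f"expected at least 2 checkpoints, found {len(checkpoints)}; "
            "training probably failed earlier"
        )
    return checkpoints[-2]


if __name__ == "__main__":
    # Unguarded indexing on an empty list reproduces the reported error type.
    try:
        [][-2]
    except IndexError as exc:
        print("unguarded:", exc)

    # The guarded version points back at the earlier training failure instead.
    try:
        pick_checkpoint([])
    except RuntimeError as exc:
        print("guarded:", exc)
```

With a guard like this, the final error would point at the earlier training failure rather than at the list index.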

I’m not sure what the problem with recognizing the Nvidia drivers is. You can try posting on the OpenNMT forum or the PyTorch Discourse.
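Before posting there, it may help to collect what PyTorch itself reports from inside the same Python environment that runs argos-train. The sketch below is only a suggestion, and it does not identify the cause on its own. The function name collect_gpu_report is hypothetical; the torch calls it uses (torch.version.cuda, torch.cuda.is_available, torch.cuda.device_count) are standard PyTorch.

```python
import os


def collect_gpu_report():
    """Gather what PyTorch reports about CUDA; does not fail if torch is absent."""
    # CUDA_VISIBLE_DEVICES can hide GPUs from a process if it is set.
    report = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        report["torch_installed"] = False
        return report
    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # CUDA version this PyTorch build was compiled for; None on CPU-only builds.
    report["torch_built_for_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in collect_gpu_report().items():
        print(f"{key}: {value}")
```

Pasting this output next to the nvidia-smi output gives the other forums both the system view and the PyTorch view of the GPU.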

Thank you for your feedback. I also thought at first that the NVIDIA driver was the cause of the problem, but as I already mentioned in my first message, the driver is correctly installed on my machine. This can be checked with the lsmod | grep nvidia command, which gives the following output:

"
$ lsmod | grep nvidia
nvidia_uvm           1347584  0
nvidia_drm             73728  2
nvidia_modeset       1146880  1 nvidia_drm
nvidia              40857600  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        200704  1 nvidia_drm
drm                   581632  6 drm_kms_helper,nvidia,nvidia_drm
"
I will also ask on the OpenNMT and PyTorch forums, as you suggested. Thank you again for your kind assistance; I am closing this issue in the meantime.
