Monitoring
Overview
Before starting with the remote monitoring features, we highly recommend going over the local monitoring guide, which is available here.
Genv remote features allow users and system administrators to provision GPU resources across multiple machines.
Genv remote monitoring features allow users and system administrators to monitor resources and their usage across multiple machines using the command genv remote monitor.
Quick start
This is a guide to get started with remote monitoring features in Genv.
Prerequisites
First, you will need to install Genv on your local machine with the required packages for monitoring:
pip install genv[monitor]
Note
If you have already installed Genv without the packages required for monitoring, install them with:
pip install prometheus-client
Next, you will have to configure SSH access to remote GPU machines.
Go over the remote installation overview and understand how Genv remote features work. It is recommended to install Genv on the remote machines and configure their SSH daemons. However, you can also use Genv remote monitoring to monitor GPU machines without Genv installed. This allows system administrators to examine overall cluster utilization very easily.
This guide uses two remote machines as an example: gpu-server-1 and gpu-server-2.
Make sure you have SSH access to all remote hosts and that the SSH configuration is set properly. You can verify that using a command similar to this:
ssh gpu-server-1 echo "hello from \$(hostname)"
Warning
It is important that you verify the SSH access.
If you cannot access one of the remote hosts using a command similar to the one above, genv remote commands will not work properly.
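With more than a couple of hosts, the manual check above gets tedious. The following is a minimal sketch of a loop that runs the same kind of check for every host in a comma-separated list. The function name check_hosts is hypothetical and not part of Genv; it relies only on the standard OpenSSH options BatchMode and ConnectTimeout.

```shell
# check_hosts: verify SSH access to every host in a comma-separated list.
# Prints one status line per host and returns non-zero if any host failed.
# (Hypothetical helper for illustration; not part of Genv.)
check_hosts() {
  failed=0
  # Split the comma-separated list; host names contain no spaces.
  for host in $(printf '%s' "$1" | tr ',' ' '); do
    # BatchMode=yes makes ssh fail instead of prompting for a password;
    # ConnectTimeout=5 bounds the wait for an unreachable host.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true >/dev/null 2>&1; then
      echo "ok: $host"
    else
      echo "FAILED: $host"
      failed=1
    fi
  done
  return "$failed"
}

# Example (host names taken from this guide):
#   check_hosts gpu-server-1,gpu-server-2
```

Because BatchMode disables password prompts, a host that is only reachable with an interactive password is reported as failed. This is intentional in the sketch, on the assumption that non-interactive commands need key-based access.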
Running the monitoring service
Now, start the monitoring service using the following command:
genv remote -H gpu-server-1,gpu-server-2 monitor
Note
genv remote monitor runs in the foreground until it receives a Ctrl+C.
Therefore, you will need to keep the terminal open while monitoring the system.
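The -H option in the command above takes a comma-separated list of hosts. For longer lists, one option is to keep the host names in a file and build the list from it. The sketch below assumes a hypothetical file with one host name per line; the helper join_hosts is likewise illustrative and not part of Genv.

```shell
# join_hosts: print the host names listed in a file (one per line)
# as a single comma-separated string, skipping blank lines and comments.
# (Hypothetical helper for illustration; not part of Genv.)
join_hosts() {
  grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | paste -s -d, -
}

# Example, with a hypothetical hosts.txt:
#   genv remote -H "$(join_hosts hosts.txt)" monitor
```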
Prometheus
The Prometheus instructions are similar to the local monitoring instructions. Follow them here.
Now, you can open your browser at http://localhost:9090 and access Genv metrics from all remote hosts.
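If the page does not load, it can help to check from the terminal whether the Prometheus server itself is up. The sketch below queries the readiness endpoint of Prometheus with curl. The function name prometheus_ready is hypothetical, and the default URL assumes Prometheus listens on localhost:9090 as in this guide. It only confirms that the server is ready, not that Genv metrics are being scraped.

```shell
# prometheus_ready: succeed if the Prometheus server at the given base URL
# reports that it is ready to serve queries.
# (Hypothetical helper for illustration; not part of Genv.)
prometheus_ready() {
  base_url="${1:-http://localhost:9090}"
  # -f: non-zero exit on HTTP errors; -sS: quiet, but still print errors.
  curl -fsS "$base_url/-/ready" >/dev/null
}

# Example:
#   prometheus_ready && echo "Prometheus is up"
```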
Grafana
The Grafana instructions are similar to the local monitoring instructions. Follow them here.
Now, you can open your browser at http://localhost:3000 and see the Genv dashboard with metrics from all remote hosts.
You should now see a dashboard similar to the following:
Running as a daemon
The instructions to run Genv remote monitoring as a daemon are similar to the local monitoring instructions. Follow them here.
Make sure you run it on a machine that does not shut down, restart or hibernate. A personal laptop is not a good choice.
Note
Make sure you use genv remote monitor commands and not the local monitoring ones.