Monitoring

Overview

Before using the remote monitoring features, we highly recommend reading the local monitoring guide, which is available here.

Genv remote features allow users and system administrators to provision GPU resources across multiple machines.

Genv remote monitoring features allow users and system administrators to monitor resources and usage across multiple machines using the command genv remote monitor.

Figure: Genv remote monitoring overview

Quick start

This is a guide to get started with remote monitoring features in Genv.

Prerequisites

First, you will need to install Genv on your local machine with the required packages for monitoring:

pip install genv[monitor]

Note

If you have already installed Genv without the packages required for monitoring, install them with:

pip install prometheus-client

Next, you will have to configure SSH access to the remote GPU machines.

Read the remote installation overview to understand how Genv remote features work. It is recommended to install Genv on the remote machines and configure their SSH daemons. However, you can also use Genv remote monitoring to monitor GPU machines that do not have Genv installed. This lets system administrators examine overall cluster utilization easily.

This guide uses two remote machines as an example: gpu-server-1 and gpu-server-2.

Make sure you have SSH access to all remote hosts and that the SSH configuration is set properly. You can verify that using a command similar to this:

ssh gpu-server-1 echo "hello from \$(hostname)"

Warning

It is important that you verify the SSH access. If even one of the remote hosts cannot be reached using a command similar to the one above, genv remote commands will not work properly.
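With more than a couple of hosts, the check above can be wrapped in a loop. The following is a minimal sketch, not part of Genv: the function name check_ssh_hosts and the SSH_BIN variable are assumptions made for this example. It relies on the standard OpenSSH options BatchMode (fail instead of prompting for a password) and ConnectTimeout.

```shell
# Hypothetical helper (not part of Genv): report which hosts accept SSH.
# SSH_BIN is an assumption of this sketch; it lets you swap the ssh binary.
check_ssh_hosts() {
    failed=0
    for host in "$@"; do
        # BatchMode=yes fails instead of prompting for a password;
        # ConnectTimeout=5 gives up after five seconds.
        if "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 \
            "$host" true >/dev/null 2>&1; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done
    # Non-zero exit status if at least one host could not be reached.
    return "$failed"
}
```

For the two example hosts in this guide, you would run:

check_ssh_hosts gpu-server-1 gpu-server-2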

Running the monitoring service

Now, start the monitoring service using the following command:

genv remote -H gpu-server-1,gpu-server-2 monitor

Note

genv remote monitor runs in the foreground until it receives Ctrl+C. Therefore, you will need to keep the terminal open while monitoring the system.
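The -H argument in the command above takes a comma-separated list of hosts. If you keep your host names in a text file, one per line, a small helper can build that list for you. This is only a sketch: the function name join_hosts and the file name hosts.txt are assumptions made for this example, and Genv may offer its own way to do this.

```shell
# Hypothetical helper (not part of Genv): join host names from a file,
# one per line, into a comma-separated list. Blank lines and lines
# starting with '#' are skipped.
join_hosts() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | paste -sd, -
}
```

With a hypothetical hosts.txt that lists gpu-server-1 and gpu-server-2, the monitoring service could then be started with:

genv remote -H "$(join_hosts hosts.txt)" monitor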

Prometheus

The Prometheus instructions are similar to the local monitoring instructions. Follow them here.

Now, you can open your browser at http://localhost:9090 and access Genv metrics from all remote hosts.

Grafana

The Grafana instructions are similar to the local monitoring instructions. Follow them here.

Now, you can open your browser at http://localhost:3000 and see the Genv dashboard with metrics from all remote hosts.

You should now see a dashboard similar to the following:

Figure: Genv monitoring dashboard

Running as a daemon

The instructions to run Genv remote monitoring as a daemon are similar to the local monitoring instructions. Follow them here.

Make sure you are running on a machine that does not shut down, restart, or hibernate. A personal laptop is not a good choice.

Note

Make sure you use genv remote monitor commands and not the local monitoring ones.

Reference

The Genv remote monitoring service exports the same metrics as the local monitoring service with the additional label hostname.

You can check out all the available metrics here.
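Because every exported metric carries the hostname label, you can narrow any query to a single machine with a label matcher. The sketch below builds such a matcher and sends it to the Prometheus HTTP API at the address used earlier in this guide. The function names and the PROM_URL variable are assumptions made for this example. The selector deliberately names no metric, so it returns every series Prometheus holds for that host.

```shell
# Build a PromQL selector that matches every series labeled with the given host.
hostname_selector() {
    printf '{hostname="%s"}' "$1"
}

# Run an instant query against the Prometheus HTTP API.
# PROM_URL is an assumption of this sketch; it defaults to the address
# used in the Prometheus section of this guide.
query_host_metrics() {
    curl -s -G "${PROM_URL:-http://localhost:9090}/api/v1/query" \
        --data-urlencode "query=$(hostname_selector "$1")"
}
```

For example, the following prints the JSON response with the current samples for gpu-server-1, provided Prometheus is running:

query_host_metrics gpu-server-1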