Monitoring

Note

This article covers monitoring the local machine only. Monitoring multiple machines is covered in a separate article.

Overview

Genv supports system and resource monitoring using Prometheus and Grafana.

[Figure: Genv monitoring overview]

This is done by the Genv monitoring service, which collects metrics about the system and its resources and exports them in Prometheus format.
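To make "Prometheus format" concrete, the following sketch renders a single sample in the Prometheus text exposition format using only the Python standard library. The metric name and label are taken from the reference at the end of this page; the helper names and the sample value are hypothetical and are not part of Genv, which presumably relies on the prometheus-client package for this.

```python
def escape_label_value(value: str) -> str:
    """Escape a label value as required by the Prometheus text format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: dict, value: float) -> str:
    """Render one sample line such as metric{label="x"} 1.0 (hypothetical helper)."""
    if labels:
        pairs = ",".join(f'{k}="{escape_label_value(str(v))}"' for k, v in labels.items())
        return f"{name}{{{pairs}}} {float(value)}"
    return f"{name} {float(value)}"


# Illustrative value only; real values come from the monitoring service.
line = format_sample("genv_device_temperature", {"index": 0}, 55)
print(line)
```

Each line that Prometheus scrapes has this shape: a metric name, an optional set of labels in braces, and a numeric value.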

The monitoring service provides default configuration files for Prometheus and Grafana, as well as a default Grafana dashboard, so everything works out of the box.

Quick start

This guide shows how to get started with the monitoring features in Genv.

Prerequisites

First, you will need to install the prometheus-client PyPI package:

pip install prometheus-client

Note

This package is installed automatically when installing Genv with pip install genv[monitor].

Running the monitoring service

Now, start the monitoring service using the following command:

genv monitor

Note

genv monitor runs in the foreground until it receives Ctrl+C. Therefore, you will need to keep the terminal open while monitoring the system.

Prometheus

First, download the Prometheus precompiled binaries.

Then, open another terminal, extract the archive and enter the extracted directory:

tar xvfz prometheus-*.tar.gz
cd prometheus-*/

The Genv monitoring service publishes a configuration file for Prometheus. By default, it is published at /var/tmp/genv/metrics/prometheus/prometheus.yml.

You can see its contents using cat:

$ cat /var/tmp/genv/metrics/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
- job_name: genv
  static_configs:
    - targets: ['localhost:8000']

This tells Prometheus to scrape the Genv exporter at localhost:8000, which is served by the monitoring service started earlier with genv monitor.
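As a quick sanity check before starting Prometheus, you could fetch the exporter output yourself. The sketch below assumes the exporter serves the text format at http://localhost:8000/metrics: the port comes from the configuration above, while the /metrics path is the Prometheus default scrape path and is an assumption here. The function names are hypothetical.

```python
import urllib.request


def fetch_metrics(url: str = "http://localhost:8000/metrics", timeout: float = 5.0) -> str:
    """Return the raw exposition text served by the exporter (URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def metric_names(text: str) -> set:
    """Collect the metric names from exposition text, skipping comments and blanks."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The name ends at the first '{' (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


if __name__ == "__main__":
    try:
        print(sorted(metric_names(fetch_metrics())))
    except OSError as error:  # e.g. connection refused when genv monitor is not running
        print(f"Could not reach the exporter: {error}")
```

If the monitoring service is running, the printed names should correspond to the metrics listed in the reference at the end of this page.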

Now, let’s run Prometheus and specify the configuration file path:

./prometheus --config.file=/var/tmp/genv/metrics/prometheus/prometheus.yml

Now, you can open your browser at http://localhost:9090 and access Genv metrics.
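Besides the browser UI, Prometheus exposes an HTTP API that can be handy for scripting. The sketch below builds an instant-query URL for the /api/v1/query endpoint and extracts values from a response of the vector shape. The helper names are hypothetical, and the response literal is an illustrative example rather than real output.

```python
import json
import urllib.parse
import urllib.request


def query_url(expr: str, base: str = "http://localhost:9090") -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def vector_values(payload: dict) -> list:
    """Return (labels, value) pairs from an instant-query vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return [(item["metric"], float(item["value"][1])) for item in payload["data"]["result"]]


def run_query(expr: str) -> list:
    """Send the query to a running Prometheus server and parse the result."""
    with urllib.request.urlopen(query_url(expr), timeout=5.0) as response:
        return vector_values(json.load(response))


# Illustrative response only; a real one comes from run_query("genv_environments_total").
example = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "genv_environments_total", "job": "genv"},
             "value": [1700000000.0, "3"]}
        ],
    },
}
print(vector_values(example))
```

Note that Prometheus returns sample values as strings, which is why the sketch converts them with float.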

Grafana

First, open another terminal and download and extract the Grafana precompiled binaries. Then, enter the directory:

cd grafana-*/

The Genv monitoring service publishes a configuration file for Grafana. By default, it is published at /var/tmp/genv/metrics/grafana/grafana.ini.

You can see its contents using cat:

$ cat /var/tmp/genv/metrics/grafana/grafana.ini
[auth.anonymous]
enabled = true
org_name = Main Org.
org_role = Viewer

[paths]
provisioning = /var/tmp/genv/metrics/grafana/provisioning

[dashboards]
default_home_dashboard_path=/var/tmp/genv/metrics/grafana/dashboards/overview.json

This tells Grafana where to find its data sources and dashboards, and sets the default home dashboard. It also enables anonymous access with the Viewer role.
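To see exactly what this file sets, the INI content can be parsed with Python's standard configparser module. The sketch below embeds the content shown above as a string so that it runs without Genv installed; reading the real file from its default path is left as a comment.

```python
import configparser

GRAFANA_INI = """\
[auth.anonymous]
enabled = true
org_name = Main Org.
org_role = Viewer

[paths]
provisioning = /var/tmp/genv/metrics/grafana/provisioning

[dashboards]
default_home_dashboard_path=/var/tmp/genv/metrics/grafana/dashboards/overview.json
"""

config = configparser.ConfigParser()
config.read_string(GRAFANA_INI)
# For the real file: config.read("/var/tmp/genv/metrics/grafana/grafana.ini")

anonymous = config["auth.anonymous"]
print("anonymous access:", anonymous.getboolean("enabled"), "as", anonymous["org_role"])
print("provisioning dir:", config["paths"]["provisioning"])
print("home dashboard:  ", config["dashboards"]["default_home_dashboard_path"])
```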

As mentioned before, the Genv monitoring service also provides a Prometheus data source as well as dashboards. You can see the contents of /var/tmp/genv/metrics/grafana using find:

$ find /var/tmp/genv/metrics/grafana
/var/tmp/genv/metrics/grafana
/var/tmp/genv/metrics/grafana/dashboards
/var/tmp/genv/metrics/grafana/dashboards/overview.json
/var/tmp/genv/metrics/grafana/provisioning
/var/tmp/genv/metrics/grafana/provisioning/datasources
/var/tmp/genv/metrics/grafana/provisioning/datasources/default.yml
/var/tmp/genv/metrics/grafana/provisioning/dashboards
/var/tmp/genv/metrics/grafana/provisioning/dashboards/default.yml
/var/tmp/genv/metrics/grafana/grafana.ini

Now, let’s run Grafana and specify the configuration file path:

./bin/grafana-server --config /var/tmp/genv/metrics/grafana/grafana.ini web

Now, open your browser at http://localhost:3000. You should see a Genv dashboard similar to the following:

[Figure: Genv monitoring dashboard]

Permissions

The monitoring service needs to read the environment variables of processes in order to determine their Genv environment identifier.

Linux users usually can’t read the environment variables of other users’ processes. Therefore, you will probably need to run genv monitor with sudo, using a command similar to the following:

sudo genv monitor ...
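The permission issue can be reproduced with a short Linux-only sketch that reads a process's environment from /proc/<pid>/environ. Whether the monitoring service uses this exact mechanism is not stated here, and the variable name GENV_ENVIRONMENT_ID is an assumption made for illustration.

```python
import os
from typing import Dict, Optional


def parse_environ(raw: bytes) -> Dict[str, str]:
    """Parse the NUL-separated KEY=VALUE entries of /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep:  # skip empty or malformed entries
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def environment_id(pid: int, variable: str = "GENV_ENVIRONMENT_ID") -> Optional[str]:
    """Return the variable's value for a process, or None if unset or unreadable."""
    try:
        with open(f"/proc/{pid}/environ", "rb") as f:
            return parse_environ(f.read()).get(variable)
    except PermissionError:
        # Typical for processes owned by another user unless running as root.
        return None
    except (FileNotFoundError, ProcessLookupError):
        # The process exited, or /proc is not available on this system.
        return None


# Reading our own process always works; other users' processes usually do not.
print(environment_id(os.getpid()))
```

Run as a regular user, a function like environment_id returns None for processes owned by other users, which is why elevated privileges are usually needed.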

Running as a daemon

As noted in the quick start, genv monitor runs in the foreground until it receives Ctrl+C, so the terminal must stay open while monitoring the system.

When monitoring a GPU machine or a cluster of GPU machines, you may want to run the monitoring service for long periods of time, such as days or even weeks. To do so, the service should not be attached to a specific terminal session, so that it keeps running after the session exits.

We recommend using tmux for this.

Here is an example of how to use tmux for running genv monitor in the background.

Create a new tmux session and name it genv-monitor with the command:

tmux new -s genv-monitor

Run genv monitor inside:

genv monitor

Detach from the session by pressing Ctrl-b and then d.

Then, you can reattach after some time with the command:

tmux attach -t genv-monitor

Reference

=======================================  ================  =============================================
Metric                                   Labels            Description
=======================================  ================  =============================================
genv_is_installed                        -                 Genv installation status
genv_device_temperature                  index             Device temperature in degrees Celsius
genv_device_utilization                  index             Device utilization
genv_device_memory_used_bytes            index             Device used memory in bytes
genv_device_memory_total_bytes           index             Device total memory in bytes
genv_environments_total                  -                 Number of active environments
genv_processes_total                     -                 Number of running processes
genv_attached_devices_total              -                 Number of attached devices
genv_users_total                         -                 Number of active users
genv_environment_processes_total         eid               Number of running processes in an environment
genv_environment_attached_devices_total  eid               Number of attached devices of an environment
genv_process_devices_total               pid, eid          Number of devices used by a process
genv_process_used_gpu_memory_bytes       pid, eid, device  GPU memory used by a process, in bytes
genv_user_environments_total             username          Number of active environments of a user
genv_user_processes_total                username          Number of running processes of a user
genv_user_attached_devices_total         username          Number of attached devices of a user
=======================================  ================  =============================================
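As an example of combining these metrics, the sketch below computes the fraction of memory in use per device from genv_device_memory_used_bytes and genv_device_memory_total_bytes. The sample lines are illustrative values written in the Prometheus text format, not real output, and the parsing helper is a hypothetical simplification that handles only a single index label.

```python
import re

# Illustrative samples only; real values come from the monitoring service.
SAMPLE = """\
genv_device_memory_used_bytes{index="0"} 4294967296.0
genv_device_memory_total_bytes{index="0"} 17179869184.0
genv_device_memory_used_bytes{index="1"} 0.0
genv_device_memory_total_bytes{index="1"} 17179869184.0
"""

# Matches lines of the form: name{index="N"} value
LINE = re.compile(r'^(\w+)\{index="(\d+)"\} (\S+)$')


def per_device(text: str, metric: str) -> dict:
    """Map device index to value for one metric (single 'index' label only)."""
    values = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match and match.group(1) == metric:
            values[int(match.group(2))] = float(match.group(3))
    return values


used = per_device(SAMPLE, "genv_device_memory_used_bytes")
total = per_device(SAMPLE, "genv_device_memory_total_bytes")
# Skip devices whose total is missing or zero to avoid division errors.
usage = {index: used[index] / total[index] for index in used if total.get(index)}
print(usage)
```

In Prometheus itself, the same ratio can be expressed directly as the query genv_device_memory_used_bytes / genv_device_memory_total_bytes, since both metrics carry the same index label.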