Enforcement¶
Contents
Overview¶
Genv provisions the GPU resources on a machine by attaching devices to environments.
It takes into account the available hardware as well as the environment configuration. In the future, the usage in real time will also be taken into account for optimal decisions.
The provisioned resources are then bookkept in state files for as long as the environments are active or the resources are released, by detaching devices for example.
After provisioning the resources, Genv helps environments to use only their granted resources by manipulating the shell environment and by executing shims.
However, all processes still technically have access to all devices and all their resources such as GPU memory, because these are bare-metal processes and not containers.
Genv supports enforcement features to allow users and system administrators to ensure that only the resources provisioned by Genv are being used by processes and environments using the command genv enforce
.
Note that to use enforcement capabilities on a machine that multiple users are sharing, you will need to run genv enforce
commands using sudo
.
Quick start¶
This is a guide to get started with enforcement features in Genv.
Follow this tutorial and run the commands on a GPU machine with Genv installed.
You will need a GPU consuming application.
This could be Python code that uses TensorFlow, PyTorch, or any other application that you have.
We will be using a CUDA application called quickstart
.
We will go over two enforcement rules in this tutorial.
Non-environment Processes¶
Open a terminal on your GPU machine and run the application in the background:
$ ./quickstart &
[1] 63600
We can make sure that the application is indeed using GPU resources with nvidia-smi
:
$ nvidia-smi
...
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 63600 C ./quickstart 7295MiB |
+-----------------------------------------------------------------------------+
Now, we will run a genv enforce
command and ask it to terminate processes that are not running within environments.
$ genv enforce --interval 0 --non-env-processes
Process 63600 is not running in a GPU environment
Terminating process 63600 from environment N/A that is running on GPU(s) 0
[1]+ Terminated ./quickstart
You can see that the process was terminated.
You can also verify it by running nvidia-smi
once again.
If you will activate an environment with attached devices (you can run genv activate --gpus 1
) and rerun all steps, you will see that now the process does not get terminated.
Max Devices per User¶
Activate an environment and attach a device to it:
$ genv activate --gpus 1
Let’s verify that the environment is active and is attached to a device using genv devices
:
$ genv devices
ID ENV ID ENV NAME ATTACHED
0 67609 42 seconds ago
Now run the application in the background as before:
$ ./quickstart &
[1] 67708
Now, we will run another genv enforce
command.
To start, we will allow each user to use one device.
$ genv enforce --interval 0 --max-devices-per-user 1
We can see that nothing happened as we are using a single device which is allowed.
Let’s rerun the command, but this time allow zero devices to be used:
$ genv enforce --interval 0 --max-devices-per-user 0
User raz is using 1 devices which is 1 more than the maximum allowed
Terminating process 67708 from environment 67609 that is running on GPU(s) 0
Detaching environment 67609 of user raz from device 0
[1]+ Terminated ./quickstart
You can see that the process was terminated and the environment was detached from the device.
To make sure the devices was detached from the environment, you can run genv devices
:
$ genv devices
ID ENV ID ENV NAME ATTACHED
0
Permissions¶
As described later on in the architecture section, Genv both detaches environments from devices and terminates running processes.
Detaching environments from devices is done by modifying the devices.json
file.
Because of how Genv creates these state files, all Linux users have access to modify them, therefore all Linux users have permissions to detach environments of any Linux user.
On the other hand, Linux users usually can’t terminate processes of other users or query their environment variables.
Therefore, you will probably need to execute the genv enforce
commands using sudo
with a command similar to the following:
sudo genv enforce ...
Architecture¶
The command genv enforce
acts as a foreground daemon, that is running in a while-loop and executes an enforcement cycle every once in a while.
Every cycle, Genv takes a snapshot of all the provisioned resources as well as the running GPU compute processes in real time by executing nvidia-smi
commands.
Then, Genv goes over the snapshot and runs different enforcement rules. Every enforcement rule checks if it is violated.
After running all enforcement rules, Genv combines all the conclusions and continues to the execution phase.
In the execution phase, it terminates running processes and detaches environments from devices according to the findings. Running processes from environments on the devices that are being detached are also terminated.
Note
Genv enforcer terminates only the GPU processes.
This means that IDEs (e.g. Visual Studio Code, PyCharm, etc.) and terminals will not be terminated, but the running task processes such as python
processes or Jupyter kernels.
Running as a daemon¶
genv enforce
acts as a foreground daemon and runs until a Ctrl+C
is received.
Therefore, you will need to keep the terminal running while enforcing the system.
To do so, the Genv enforcement daemon should not be attached to a specific terminal session, so that it would continue running when the session exits.
We recommend to use tmux for this.
Here is an example of how to use tmux
for running genv enforce
in the background.
Create a new tmux session and name it genv-enforce
with the command:
tmux new -s genv-enforce
Run genv enforce
inside:
sudo genv enforce
Detach from the session with Ctrl-b
+ d
.
Then, you can reattach after some time with the command:
tmux attach -t genv-enforce
Enforcement Rules¶
Enforcement rules are controlled using flags and arguments to genv enforce
.
You can also run genv enforce --help
to see all other supported flags and features.
Non-environment Processes¶
Use the flag --non-env-processes
to terminate running processes that access a GPU and are not running within an environment.
This is mostly used for ensuring that no one runs GPU applications that are not managed by Genv on a machine.
This ensures that Genv is the only way that GPU resources are being provisioned in the system.
Environment Devices¶
Enabled by default.
Use the flag --env-devices
to terminate processes that are using devices which are not attached to their environments.
Environment Memory Capacity¶
Enabled by default.
Use the flag --env-memory
to terminate processes from environments that exceed their memory capacity.
Processes are terminated only from devices on which the environment is exceeding its memory capacity. Not all processes are terminated, but a greedy algorithm terminates processes until enough memory was freed so that the environment does not exceed the capacity any more.
Max Devices per User¶
Use the flag --max-devices-per-user <value>
to control how many devices each Linux user can access.
If a Linux user is using more devices than the specified value, some of his or her environments would get detached to free up resources. Processes from the detached environments that are running on the detached devices would get terminated.
This flag could be used combined with --max-devices-for-user <value>
to specify user-specific values.
For example, by passing --max-devices-per-user 1 --max-devices-for-user john=3 paul=2
, you enforce all users to use a single device at most, except for John and Paul which are allowed to use 3 and 2 respectively.