Sunday, December 6, 2015

HOW-TO: Control fan speed of multiple headless GPUs

Hi all,
    Deep learning is all the rage now, so it is no surprise to have two or more GPUs in one machine. The trouble is that heat accumulates between neighboring GPUs. The default fan speed is quite low (22%), so it needs to be raised. In this post, I will show you how to increase the fan speed of multiple GPUs. All you need to do is modify /etc/X11/xorg.conf. I have successfully raised the fan speed of my four GPUs with this method.
    The principle is very simple. A GPU needs an X screen attached to it before its fan speed can be changed through Coolbits [1]. First, we use nvidia-settings to attach a second monitor to each GPU. The first monitor is always attached to the first GPU for display. The second monitor is instead attached to the second GPU to create another X screen. See the image below for how to do it through the GUI.
    The steps are as follows. First, use sudo privileges to generate the X configuration file (only if you do not already have one).
sudo nvidia-xconfig
Next, open the X configuration file /etc/X11/xorg.conf, create an X screen for each GPU, and add the Coolbits option to each X screen. An X screen takes a monitor and a device (GPU) as parameters, so you need to replicate the Monitor sections, the Device sections, and their corresponding Screen sections. The Device sections differ only in their BusID, which you can check with the nvidia-smi command. Note that nvidia-smi prints bus IDs in hexadecimal, while xorg.conf expects decimal. Finally, add each X screen to ServerLayout.
# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 384.69  (buildmeister@swio-display-x86-rhel47-06)  Wed Aug 16 20:57:01 PDT 2017

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0" 0 0
    Screen      1  "Screen1" RightOf "Screen0"
    Screen      2  "Screen2" RightOf "Screen1"
    Screen      3  "Screen3" RightOf "Screen2"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection
Section "InputDevice"
    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Monitor"
    Identifier     "Monitor1"
    VendorName     "Unknown"
    ModelName      "DELL U2410"
    HorizSync       0.0 - 0.0
    VertRefresh     0.0
EndSection

Section "Monitor"
    Identifier     "Monitor2"
    VendorName     "Unknown"
    ModelName      "DELL U2410"
    HorizSync       0.0 - 0.0
    VertRefresh     0.0
EndSection

Section "Monitor"
    Identifier     "Monitor3"
    VendorName     "Unknown"
    ModelName      "DELL U2410"
    HorizSync       0.0 - 0.0
    VertRefresh     0.0
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "PCI:5:0:0"
EndSection

Section "Device"
    Identifier     "Device1"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "PCI:6:0:0"
EndSection

Section "Device"
    Identifier     "Device2"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "PCI:9:0:0"
EndSection

Section "Device"
    Identifier     "Device3"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "PCI:10:0:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "Coolbits" "4"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

Section "Screen"
    Identifier     "Screen1"
    Device         "Device1"
    Monitor        "Monitor1"
    DefaultDepth    24
    Option         "Coolbits" "4"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

Section "Screen"
    Identifier     "Screen2"
    Device         "Device2"
    Monitor        "Monitor2"
    DefaultDepth    24
    Option         "Coolbits" "4"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

Section "Screen"
    Identifier     "Screen3"
    Device         "Device3"
    Monitor        "Monitor3"
    DefaultDepth    24
    Option         "Coolbits" "4"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection
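
The per-GPU sections above are repetitive, so they can be generated instead of typed by hand. The following is a minimal sketch, not a tested tool: the function names to_xorg_busid and emit_gpu_sections are hypothetical, and it assumes nvidia-smi reports bus IDs in the form domain:bus:device.function in hexadecimal (for example 00000000:0A:00.0), which have to be converted to the decimal form PCI:10:0:0 used in xorg.conf.

```shell
#!/bin/sh
# Convert an nvidia-smi PCI bus ID (hex, e.g. 00000000:0A:00.0)
# into the decimal form used by xorg.conf (e.g. PCI:10:0:0).
to_xorg_busid() {
    id=${1#*:}        # drop the PCI domain -> 0A:00.0
    bus=${id%%:*}     # bus                 -> 0A
    rest=${id#*:}     # device.function     -> 00.0
    dev=${rest%%.*}   # device              -> 00
    fn=${rest#*.}     # function            -> 0
    printf 'PCI:%d:%d:%d\n' "$((0x$bus))" "$((0x$dev))" "$((0x$fn))"
}

# Print Monitor, Device and Screen sections for GPU number $1 with BusID $2.
# The layout mirrors the hand-written sections in this post.
emit_gpu_sections() {
    cat <<EOF
Section "Monitor"
    Identifier     "Monitor$1"
    VendorName     "Unknown"
    ModelName      "Unknown"
EndSection

Section "Device"
    Identifier     "Device$1"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "$2"
EndSection

Section "Screen"
    Identifier     "Screen$1"
    Device         "Device$1"
    Monitor        "Monitor$1"
    DefaultDepth    24
    Option         "Coolbits" "4"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

EOF
}

# Only runs where the NVIDIA driver is installed: one block per detected GPU.
if command -v nvidia-smi >/dev/null 2>&1; then
    i=0
    nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader |
    while read -r raw; do
        emit_gpu_sections "$i" "$(to_xorg_busid "$raw")"
        i=$((i + 1))
    done
fi
```

The output is meant to be reviewed and pasted into /etc/X11/xorg.conf by hand. The ServerLayout section still needs one Screen line per GPU.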

Finally, log out and log in again, and voila, you can now change your fan speed. Save a copy of the X configuration for future use:
sudo cp /etc/X11/xorg.conf /etc/X11/xorg.conf-backup-coolbit
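
Before logging out, it can be worth checking that no Screen section was left without the Coolbits option. Here is a rough sketch of such a check. The function name screens_missing_coolbits and the XORG_CONF variable are made up for this example, and it assumes the keywords are capitalized exactly as in the file above.

```shell
#!/bin/sh
# Print the Identifier of every Screen section that has no Coolbits option.
screens_missing_coolbits() {
    awk '
        /^[ \t]*#/                   { next }                    # skip comment lines
        /^[ \t]*Section[ \t]+"Screen"/ { in_screen = 1; id = ""; cool = 0; next }
        in_screen && /^[ \t]*Identifier/ { id = $2; gsub(/"/, "", id) }
        in_screen && /"Coolbits"/      { cool = 1 }
        in_screen && /^[ \t]*EndSection/ { if (!cool) print id; in_screen = 0 }
    ' "$1"
}

# Check the live configuration if it is readable.
conf=${XORG_CONF:-/etc/X11/xorg.conf}
if [ -r "$conf" ]; then
    missing=$(screens_missing_coolbits "$conf")
    if [ -n "$missing" ]; then
        echo "Screens without Coolbits: $missing"
    fi
fi
```

An empty result means every Screen section carries the option. It does not verify that the value is "4" or that the driver accepted it.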

    Next, we set the fan speed of the GPUs through a script [2]. For example, to raise the fan speed of two GPUs to 95%, the bash script is as simple as this:
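#!/bin/bash

# Enable manual fan control on each GPU, then set each fan's target speed in percent.
# The arguments are quoted so the shell never treats the brackets as glob patterns.
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=95" -a "[gpu:1]/GPUFanControlState=1" -a "[fan:1]/GPUTargetFanSpeed=95"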
    Save it as set_gpu_fan.sh and add it to the startup programs on Ubuntu, so that the script is executed automatically when you log in to your workstation and the X session starts. Alternatively, you can enable persistence mode on the GPUs by adding the following command to root's crontab. The GPU state (including fan speed) will then be preserved until reboot.
sudo crontab -e
Then add the following line to the crontab:
@reboot nvidia-smi --persistence-mode=ENABLED
Finally, note that the method above only works when you have as many monitors as GPUs. The principle is the same for more than two GPUs: you just set up /etc/X11/xorg.conf to have an X screen for each GPU, and then you can manually adjust the fan speed of each GPU. The drawback of this method is that you need to log in after the workstation restarts, but it works for me on a commodity workstation used for research in a lab.
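
For more than two GPUs, the fan script can build its argument list in a loop instead of hard-coding each index. This is only a sketch under the same assumption as the two-GPU script, namely one fan per GPU with fan index equal to GPU index. The function name set_fans and the DRY_RUN variable are hypothetical.

```shell
#!/bin/sh
# Set the fans of the first $1 GPUs to $2 percent.
# With DRY_RUN set to a non-empty value, print the command instead of running it.
set_fans() {
    n=$1
    speed=$2
    i=0
    set --    # reuse the positional parameters as the argument list
    while [ "$i" -lt "$n" ]; do
        set -- "$@" -a "[gpu:$i]/GPUFanControlState=1" -a "[fan:$i]/GPUTargetFanSpeed=$speed"
        i=$((i + 1))
    done
    ${DRY_RUN:+echo} nvidia-settings "$@"
}
```

Calling set_fans 4 95 inside an X session would then cover four GPUs. Setting DRY_RUN=1 first prints the resulting nvidia-settings command so it can be checked before it is run.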