How to control thermal and performance settings for multiple Nvidia GPUs on Ubuntu Linux


Such compute.

very co$t

Wow

The price of a pair of GTX 1060 GPUs has gone up about 50% since I built my deep-learning rig a few weeks ago, and that's if you can even find them in stock. There's been a wee bit of a gold rush surrounding cryptocurrencies lately as many new miners set up systems. Ultimately, I think this will benefit more than just crypto, as the demand for fast and efficient cards pushes graphics card makers to innovate, just as high-performance computing for scientific purposes has traditionally piggy-backed on demand for better gaming hardware. It also means increased awareness and adoption of cryptography and cryptocurrencies, which I consider a good thing as it should help stabilize the ecosystem.

In any case, whether your aim is mining ether or back-propagation, you may want to get as much performance as possible out of the GPUs you do have during the current shortage. This means tuning each card for performance, efficiency, or both. For Nvidia cards on Ubuntu this comes with a slight difficulty in that normally you can only tune a GPU that is running a display, but with a few tricks it's possible to overclock multiple GPUs without hooking up a monitor to each one. This took me a while to figure out, so I thought it might be helpful to others.

Note: changing the cool-bits flag lets you bypass thermal safeguards, may affect your warranty, etc., so be conservative in your changes, monitor GPU temperatures, and watch for errors. I typically run the fans at a higher intensity than they would normally operate and keep the temperature well below 70 °C.
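Here is a minimal sketch of how that temperature watch could be scripted. It assumes nvidia-smi is on your PATH and that its CSV output looks like "0, 45" (GPU index, temperature). The helper name check_gpu_temps and the default limit of 70 are my own choices for illustration, not anything the driver provides.

```shell
#!/bin/sh
# Read "index, temperature" CSV lines on stdin and print a warning for
# every GPU at or above the limit (default: 70 C).
# Exit status is 1 if any GPU reached the limit, 0 otherwise.
# check_gpu_temps is a hypothetical helper name.
check_gpu_temps() {
    limit=${1:-70}
    awk -F', *' -v limit="$limit" '
        BEGIN { hot = 0 }
        $2 + 0 >= limit {
            printf "GPU %s is at %s C (limit %s C)\n", $1, $2, limit
            hot = 1
        }
        END { exit hot }
    '
}

# Typical use on a machine with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader \
#       | check_gpu_temps 70
```

Because the function only reads from stdin, you can try it without a GPU by piping in a couple of made-up lines with printf.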

In short, it was the order that mattered. Setting cool-bits before editing the config file, with or without the flag to allow empty configurations, always left me with control over just one GPU after rebooting :-/ Instead I had to first modify the config file, and only then allow empty configurations and change the cool-bits flag. I'll assume you've got your drivers set up and your cards are working, so all you have left is to gain control through nvidia-settings. Here's what worked for me:

First, edit your Xorg config file. Duplicate the monitor/device/screen declarations while incrementing the names, e.g. "Device0" becomes "Device1", and give each device the BusID of its own card. Do this once for each of your GPUs; I have two cards, so I ended up with two screen/monitor/device entries. The text of my config file is at the end of this post.

sudo nano /etc/X11/xorg.conf

(Replace nano with your text editor of choice.)
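Each Device section needs the BusID of its card. lspci prints PCI addresses in hexadecimal ("0a:00.0"), while xorg.conf expects decimal ("PCI:10:0:0"), so a small conversion helps once you go past bus 9. This is only a sketch; pci_to_busid and list_nvidia_busids are names I made up, and the grep patterns assume lspci describes your cards as "VGA compatible controller" or "3D controller".

```shell
#!/bin/sh
# Convert an lspci-style address (hexadecimal "bus:device.function",
# optionally prefixed with the "0000:" domain) into the decimal
# "PCI:bus:device:function" form used on the BusID line in xorg.conf.
# Plain POSIX sh has no local variables, so these names are global.
pci_to_busid() {
    addr=${1#0000:}      # drop a leading PCI domain if present
    bus=${addr%%:*}      # text before the first ':'
    devfn=${addr#*:}     # "device.function"
    dev=${devfn%%.*}
    fn=${devfn#*.}
    printf 'PCI:%d:%d:%d\n' "$((0x$bus))" "$((0x$dev))" "$((0x$fn))"
}

# Print a BusID for every NVIDIA display device that lspci reports.
list_nvidia_busids() {
    lspci | grep -i 'nvidia' | grep -Ei 'vga|3d controller' |
    while read -r addr desc; do
        pci_to_busid "$addr"
    done
}

# Example: pci_to_busid 0a:00.0
```

It is also worth copying /etc/X11/xorg.conf somewhere safe before editing it, so you can get back to a working display if X refuses to start.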

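Then set the --cool-bits flag and allow empty configurations. The -a flag is short for --enable-all-gpus. Setting cool-bits to 28 also lets you change GPU voltages, which I don't currently use or recommend. A value of 12 (fan control plus clock offsets) should also work for our needs. A value of 5 should be enough if you only want fan control, since as far as I can tell from the driver README its lowest bit only affects older, pre-Fermi GPUs.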


sudo nvidia-xconfig -a --cool-bits=28 --allow-empty-initial-configuration
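The cool-bits number is a bit mask, so you can work out the value you need by adding up the features you want. The sketch below uses the bit meanings as I understand them from the NVIDIA driver README (4 = manual fan control, 8 = clock offsets on the PowerMizer page, 16 = overvoltage); check the README for your driver version before relying on them. coolbits_for is a hypothetical helper, not part of any NVIDIA tool.

```shell
#!/bin/sh
# Combine the Coolbits bits for the requested features and print the total.
#   fan      (4)  manual fan control
#   clocks   (8)  clock offsets on the PowerMizer page
#   voltage (16)  overvoltage control
# Bit values are an assumption taken from the NVIDIA driver README.
coolbits_for() {
    total=0
    for feature in "$@"; do
        case $feature in
            fan)     total=$((total | 4)) ;;
            clocks)  total=$((total | 8)) ;;
            voltage) total=$((total | 16)) ;;
            *) echo "unknown feature: $feature" >&2; return 1 ;;
        esac
    done
    echo "$total"
}

# Example: coolbits_for fan clocks
```

With this, "fan clocks" gives 12 and "fan clocks voltage" gives 28, matching the values discussed above.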

And that's it. You should be able to reboot and start over/underclocking your GPUs. Check in on the temperature and power usage with nvidia-smi on the command line. You can modify the thermal and performance ("PowerMizer") settings with the GUI by just typing nvidia-settings, or you can use commands like these:


nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffset[3]=400"
nvidia-settings -a "[gpu:0]/GPUGraphicsClockOffset[3]=40"
nvidia-settings -a "[gpu:0]/GPUFanControlState=1"
nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=65"

Or, to go back to normal:

nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffset[3]=0"
nvidia-settings -a "[gpu:0]/GPUGraphicsClockOffset[3]=0"
nvidia-settings -a "[gpu:0]/GPUFanControlState=0"
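The commands above only touch gpu:0. For several cards you can generate the same set for each index. The sketch below just prints the commands so you can read them before running anything. print_gpu_settings is a made-up helper, and it assumes that fan index i belongs to GPU i, which may not hold on cards with more than one fan controller.

```shell
#!/bin/sh
# Print (do not run) the nvidia-settings commands for GPUs 0..N-1.
# Arguments: number of GPUs, memory offset, core clock offset, fan speed (%).
# Assumption: fan index i belongs to GPU i.
print_gpu_settings() {
    count=$1
    mem=$2
    core=$3
    fan=$4
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'nvidia-settings -a "[gpu:%d]/GPUMemoryTransferRateOffset[3]=%d"\n' "$i" "$mem"
        printf 'nvidia-settings -a "[gpu:%d]/GPUGraphicsClockOffset[3]=%d"\n' "$i" "$core"
        printf 'nvidia-settings -a "[gpu:%d]/GPUFanControlState=1"\n' "$i"
        printf 'nvidia-settings -a "[fan:%d]/GPUTargetFanSpeed=%d"\n' "$i" "$fan"
        i=$((i + 1))
    done
}

# Example: print_gpu_settings 2 400 40 65
```

Once the printed commands look right, piping the output to sh applies them. The settings are quoted because square brackets are glob characters in most shells.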

Tune the settings in small increments and just change one setting at a time until you get closer to optimizing your chosen metric, then adjust the next setting and repeat as necessary ("walking the settings"). There are many more, much better guides out there for the actual overclocking for performance or underclocking for efficiency, and I suggest you check them out. Hopefully one or two other people had the same problem as I did with the order of setting cool-bits and modifying the xorg.conf file and this short post will be useful to some fellow human, somewhere, sometime. Thanks!

I am using Ubuntu 16.04 with version 375.66 of the Nvidia drivers.

xorg.conf example:

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 375.66  (buildmeister@swio-display-x86-rhel47-06)  Mon May  1 15:45:32 PDT 2017

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"
    Screen      1  "Screen1" RightOf "Screen0"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Monitor"
    Identifier     "Monitor1"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "GeForce GTX 1060 6GB"
    BusID          "PCI:1:0:0"
EndSection

Section "Device"
    Identifier     "Device1"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "GeForce GTX 1060 6GB"
    BusID          "PCI:2:0:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "AllowEmptyInitialConfiguration" "True"
    Option         "Coolbits" "28"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

Section "Screen"
    Identifier     "Screen1"
    Device         "Device1"
    Monitor        "Monitor1"
    DefaultDepth    24
    Option         "AllowEmptyInitialConfiguration" "True"
    Option         "Coolbits" "28"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection
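Since the whole problem was ending up with control over only one GPU, it can help to confirm after a reboot that every Screen section still carries a Coolbits option. This is a rough check under the assumption that the file is laid out like the example above. check_coolbits is a hypothetical helper name, and the match is case-sensitive, so it expects the spelling "Coolbits" that nvidia-xconfig writes.

```shell
#!/bin/sh
# Report how many Section "Screen" blocks exist in an X config file and
# how many of them lack an Option "Coolbits" line.
# Exit status is 0 only if at least one Screen exists and none lack it.
check_coolbits() {
    conf=${1:-/etc/X11/xorg.conf}
    awk '
        /^[ \t]*Section[ \t]+"Screen"/ { in_screen = 1; found = 0; screens++ }
        in_screen && /^[ \t]*Option[ \t]+"Coolbits"/ { found = 1 }
        in_screen && /^[ \t]*EndSection/ {
            if (!found) missing++
            in_screen = 0
        }
        END {
            printf "%d Screen section(s), %d without Coolbits\n", screens, missing
            exit ((screens > 0 && missing == 0) ? 0 : 1)
        }
    ' "$conf"
}

# Example: check_coolbits /etc/X11/xorg.conf
```

If the count of Screen sections is lower than the number of cards, or any section is missing the option, the config was probably regenerated or edited in the wrong order.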


Not working on Nvidia 390.x drivers

I tried your instructions (sudo nvidia-xconfig...) to set cool-bits to 28 and then rebooted. I think the machine then rebooted itself a second time.
When I then tried setting the fan speed, I still got the same error ("Unknown error").
My guess is that when it rebooted the second time (not sure why or how that happened), it recreated the xorg.conf file and overwrote the Coolbits value of 28 that I had added. I know the value had been written, because I checked the updated .conf file right after using the CLI to add it.

Looks like Nvidia changed or disabled Coolbits updates to the .conf file in their newer drivers.
