Monitoring a GPU cluster

Question

I run 10 servers on Ubuntu 14.04 x64. Each server has several Nvidia GPUs. I am looking for a monitoring program that lets me view the GPU usage of all servers at a glance.

Franck Dernoncourt · Answer

You can use the Ganglia monitoring software (free, open source). It has a number of user-contributed Gmond Python DSO metric modules, including a GPU Nvidia module ( /ganglia/gmond_python_modules/gpu/nvidia/ ).

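Its architecture is typical of cluster monitoring software:

(image source)

Installation is straightforward (about 30 minutes without rushing), except for the GPU Nvidia module, which lacks clear documentation. (I am still stuck on it.)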

To install Ganglia, proceed as follows. On the server:

    sudo apt-get install -y ganglia-monitor rrdtool gmetad ganglia-webfrontend

Choose Yes each time you are asked a question about Apache.

First, configure the Ganglia server, i.e. gmetad:

    sudo cp /etc/ganglia-webfrontend/apache.conf /etc/apache2/sites-enabled/ganglia.conf
    sudo nano /etc/ganglia/gmetad.conf

In gmetad.conf, make the following changes.

Replace:

    data_source "my cluster" localhost

with (where 192.168.10.22 is the server's IP):

    data_source "my cluster" 50 192.168.10.22:8649

This means that Ganglia should listen on port 8649 (Ganglia's default port). Make sure that this IP and port are reachable from the Ganglia clients running on the machines you plan to monitor.
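If you want to check reachability from a client before going further, here is a minimal sketch. The function name port_is_reachable is my own, and it assumes the server accepts TCP connections on 8649 (gmond's default tcp_accept_channel). It tests TCP only; the metrics themselves are sent over UDP, which this does not verify.

```python
import socket


def port_is_reachable(host, port=8649, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only proves that TCP traffic gets through,
    not that UDP metric packets do.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Running port_is_reachable("192.168.10.22") on a client returns False if a firewall or routing problem blocks the connection.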

You can now start the Ganglia server:

    sudo /etc/init.d/gmetad restart
    sudo /etc/init.d/apache2 restart

You can then access the web interface at http://192.168.10.22/ganglia/ (where 192.168.10.22 is the server's IP).

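Second, configure the Ganglia client (i.e. gmond), either on the same machine or on another machine:

    sudo apt-get install -y ganglia-monitor
    sudo nano /etc/ganglia/gmond.conf

In gmond.conf, make the following changes so that the Ganglia client, i.e. gmond, points to the server.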

Replace:

    cluster {
      name = "unspecified"
      owner = "unspecified"
      latlong = "unspecified"
      url = "unspecified"
    }

with:

    cluster {
      name = "my cluster"
      owner = "unspecified"
      latlong = "unspecified"
      url = "unspecified"
    }

Replace:

    udp_send_channel {
      mcast_join = 239.2.11.71
      port = 8649
      ttl = 1
    }

with:

    udp_send_channel {
      # mcast_join = 239.2.11.71
      host = 192.168.10.22
      port = 8649
      ttl = 1
    }

Replace:

    udp_recv_channel {
      mcast_join = 239.2.11.71
      port = 8649
      bind = 239.2.11.71
    }

with:

    udp_recv_channel {
      # mcast_join = 239.2.11.71
      port = 8649
      # bind = 239.2.11.71
    }

You can now start the Ganglia client:

    sudo /etc/init.d/ganglia-monitor restart

It should appear within 30 seconds in the Ganglia web interface on the server (i.e. http://192.168.10.22/ganglia/ ).

Since the gmond.conf file is the same for all clients, you can add Ganglia monitoring to a new machine in a few seconds:

    sudo apt-get install -y ganglia-monitor
    wget http://somewebsite/gmond.conf # this gmond.conf is configured so that it points to the right ganglia server, as described above
    sudo cp -f gmond.conf /etc/ganglia/gmond.conf
    sudo /etc/init.d/ganglia-monitor restart
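If you would rather generate the edited sections than hand-edit them, something like this sketch could work. The template and the function name render_unicast_sections are my own invention, and it only produces the three sections changed above; the rest of the stock gmond.conf still has to be kept as shipped.

```python
# Sketch only: renders the three gmond.conf sections edited in this answer.
# Doubled braces are literal braces for str.format.
GMOND_UNICAST_TEMPLATE = """\
cluster {{
  name = "{cluster_name}"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}}

udp_send_channel {{
  # mcast_join = 239.2.11.71
  host = {server_ip}
  port = {port}
  ttl = 1
}}

udp_recv_channel {{
  # mcast_join = 239.2.11.71
  port = {port}
  # bind = 239.2.11.71
}}
"""


def render_unicast_sections(cluster_name, server_ip, port=8649):
    """Fill in the cluster name, server IP and port for the edited sections."""
    return GMOND_UNICAST_TEMPLATE.format(
        cluster_name=cluster_name, server_ip=server_ip, port=port
    )
```

For the setup in this answer you would call render_unicast_sections("my cluster", "192.168.10.22") and splice the result into the file in place of the default sections.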

I used several guides for the steps above.

Here is a bash script to start or restart gmond on all the servers you want to monitor.

deploy.sh:

    #!/usr/bin/env bash

    # Some useful resources:
    # while read ip user pass; do: http://unix.stackexchange.com/questions/92664/how-to-deploy-programs-on-multiple-machines
    # -o StrictHostKeyChecking=no: http://askubuntu.com/questions/180860/regarding-host-key-verification-failed
    # -T: http://stackoverflow.com/questions/21659637/how-to-fix-sudo-no-tty-present-and-no-askpass-program-specified-error
    # echo $pass |: http://stackoverflow.com/questions/11955298/use-sudo-with-password-as-parameter
    # reading on fd 3: http://stackoverflow.com/questions/36805184/why-is-this-while-loop-not-looping

    # Read "ip user password" triples from servers.txt on file descriptor 3,
    # so that ssh cannot swallow the loop's input from stdin.
    while read ip user pass <&3; do
      echo "$ip"
      sshpass -p "$pass" ssh "$user@$ip" -o StrictHostKeyChecking=no -T "
      echo $pass | sudo -S sudo /etc/init.d/ganglia-monitor restart
      "
      echo 'done'
    done 3<servers.txt

servers.txt:

    53.12.45.74 my_username my_password
    54.12.45.74 my_username my_password
    57.12.45.74 my_username my_password
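If you end up driving the deployment from Python instead of bash, the same file can be read with a few lines. This is only a sketch; parse_servers is a name I made up, and it assumes one whitespace-separated "ip user password" triple per line, with the password taking the rest of the line (as bash's read does).

```python
def parse_servers(text):
    """Parse 'ip user password' lines into (ip, user, password) tuples.

    Blank lines are skipped. The password is everything after the second
    field, mirroring how `read ip user pass` assigns the remainder.
    """
    servers = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.strip().split(None, 2)
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(
                "line {0}: expected 'ip user password'".format(line_number)
            )
        servers.append(tuple(fields))
    return servers
```

Keeping plain-text passwords in a file is as risky here as in the bash version, so restrict the file's permissions if you use it.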

Screenshot of the main page of the web interface:

https://www.safaribooksonline.com/library/view/monitoring-with-ganglia/9781449330637/ch04.html gives an overview of the Ganglia web interface.

cas · Answer

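munin has at least one plugin for monitoring nvidia GPUs (it uses the nvidia-smi utility to gather its data).

You could set up a munin server (perhaps on one of the GPU servers, or on the head node of your cluster), and then install the munin-node client and the nvidia plugin (plus whatever other plugins you might be interested in) on each of your GPU servers.

That would let you look in detail at the munin data for each server, or see an overview of the nvidia data for all servers. This includes graphs of how values such as GPU temperature change over time.

Otherwise, you could write a script that uses ssh (or pdsh) to run the nvidia-smi utility on each server, extract the data you want, and present it in whatever format you want.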
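A rough sketch of that last option, assuming Python 3, key-based ssh login to each host, and GPUs that report all four queried fields (some cards print "[Not Supported]" for utilization, which this does not handle). The function names and the dictionary keys are my own choices.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed.
QUERY = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        index, util, used, total = (field.strip() for field in record)
        rows.append({
            "index": int(index),
            "util_pct": int(util),
            "used_mib": int(used),
            "total_mib": int(total),
        })
    return rows


def query_host(host):
    """Run nvidia-smi on `host` over ssh and return the parsed rows."""
    command = [
        "ssh", host,
        "nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader,nounits",
    ]
    output = subprocess.check_output(command, universal_newlines=True)
    return parse_gpu_csv(output)
```

Looping query_host over your list of servers and printing the rows gives a crude one-shot overview; it has no history or graphs, which is what munin or Ganglia add.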

Patwie · Answer

Or simply use

https://github.com/PatWie/cluster-smi

It works exactly like nvidia-smi in your terminal, but gathers the information of all nodes across your cluster that are running cluster-smi-node. The output looks like this:

    +---------+------------------------+---------------------+----------+----------+
    | Node    | Gpu                    | Memory-Usage        | Mem-Util | GPU-Util |
    +---------+------------------------+---------------------+----------+----------+
    | node-00 | 0: TITAN Xp            |  3857MiB / 12189MiB | 31%      | 0%       |
    |         | 1: TITAN Xp            | 11689MiB / 12189MiB | 95%      | 0%       |
    |         | 2: TITAN Xp            | 10787MiB / 12189MiB | 88%      | 0%       |
    |         | 3: TITAN Xp            | 10965MiB / 12189MiB | 89%      | 100%     |
    +---------+------------------------+---------------------+----------+----------+
    | node-01 | 0: TITAN Xp            | 11667MiB / 12189MiB | 95%      | 100%     |
    |         | 1: TITAN Xp            | 11667MiB / 12189MiB | 95%      | 96%      |
    |         | 2: TITAN Xp            |  8497MiB / 12189MiB | 69%      | 100%     |
    |         | 3: TITAN Xp            |  8499MiB / 12189MiB | 69%      | 98%      |
    +---------+------------------------+---------------------+----------+----------+
    | node-02 | 0: GeForce GTX 1080 Ti |  1447MiB / 11172MiB | 12%      | 8%       |
    |         | 1: GeForce GTX 1080 Ti |  1453MiB / 11172MiB | 13%      | 99%      |
    |         | 2: GeForce GTX 1080 Ti |  1673MiB / 11172MiB | 14%      | 0%       |
    |         | 3: GeForce GTX 1080 Ti |  6812MiB / 11172MiB | 60%      | 36%      |
    +---------+------------------------+---------------------+----------+----------+

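when using 3 nodes.

It uses NVML to read these values directly, for efficiency. I suggest not parsing the output of nvidia-smi, as proposed in the other answers. Furthermore, you can track this information from cluster-smi using Python + ZMQ.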
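To give an idea of what reading values through NVML looks like, here is a small sketch using the pynvml Python bindings. This is not cluster-smi's code, and the ZMQ transport is not shown. It assumes the pynvml package and an Nvidia driver are installed; the function names and dictionary keys are mine.

```python
def memory_utilization_percent(used_mib, total_mib):
    """Memory in use as a whole-number percentage, rounded down."""
    return 100 * used_mib // total_mib


def read_local_gpus():
    """Read per-GPU memory and utilization on this machine through NVML."""
    import pynvml  # imported lazily so the helper above works without it

    pynvml.nvmlInit()
    try:
        gpus = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)  # values in bytes
            rates = pynvml.nvmlDeviceGetUtilizationRates(handle)
            used_mib = memory.used // (1024 * 1024)
            total_mib = memory.total // (1024 * 1024)
            gpus.append({
                "index": index,
                "used_mib": used_mib,
                "total_mib": total_mib,
                "mem_util_pct": memory_utilization_percent(used_mib, total_mib),
                "gpu_util_pct": rates.gpu,
            })
        return gpus
    finally:
        # Always release NVML, even if a query fails.
        pynvml.nvmlShutdown()
```

Rounding down happens to reproduce the Mem-Util column in the sample output (3857 of 12189 MiB gives 31), though I have not checked how cluster-smi itself computes it.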

Franck Dernoncourt · Answer

As cas said, I could write my own tool, so here it is (not polished at all, but it works).

Client side (i.e. the GPU node)

gpu_monitoring.sh (assuming that the IP of the server serving the monitoring web page is 128.52.200.39):

    while true; do
        nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv >> gpu_utilization.log
        python gpu_monitoring.py
        sshpass -p 'my_password' scp -o StrictHostKeyChecking=no ./gpu_utilization_100.png my_username@128.52.200.39:/var/www/html/gpu_utilization_100_server1.png
        sshpass -p 'my_password' scp -o StrictHostKeyChecking=no ./gpu_utilization_10000.png my_username@128.52.200.39:/var/www/html/gpu_utilization_10000_server1.png
        sleep 10
    done

gpu_monitoring.py:

    '''
    Monitor GPU use
    '''

    from __future__ import print_function
    from __future__ import division
    import numpy as np
    import matplotlib
    matplotlib.use('Agg')  # http://stackoverflow.com/questions/2801882/generating-a-png-with-matplotlib-when-display-is-undefined
    import matplotlib.pyplot as plt
    import time
    import datetime


    def get_current_milliseconds():
        '''
        http://stackoverflow.com/questions/5998245/get-current-time-in-milliseconds-in-python
        '''
        return int(round(time.time() * 1000))


    def get_current_time_in_seconds():
        '''
        http://stackoverflow.com/questions/415511/how-to-get-current-time-in-python
        '''
        return time.strftime("%Y-%m-%d_%H-%M-%S", time.gmtime())


    def get_current_time_in_miliseconds():
        '''
        http://stackoverflow.com/questions/5998245/get-current-time-in-milliseconds-in-python
        '''
        return get_current_time_in_seconds() + '-' + str(datetime.datetime.now().microsecond)


    def generate_plot(gpu_log_filepath, max_history_size, graph_filepath):
        '''
        Plot the last max_history_size samples of the nvidia-smi CSV log.
        '''
        # Get data
        history_size = 0
        number_of_gpus = -1
        gpu_utilization = []
        gpu_utilization_one_timestep = []
        with open(gpu_log_filepath) as log_file:
            lines = log_file.readlines()
        # Read the log from the end: http://stackoverflow.com/questions/2301789/read-a-file-in-reverse-order-using-python
        for line_number, line in enumerate(reversed(lines)):
            if history_size > max_history_size:
                break
            line = line.split(',')
            # A CSV header line ('utilization.gpu [%], ...') marks the start of one timestep.
            if line[0].startswith('util') or len(gpu_utilization_one_timestep) == number_of_gpus:
                if number_of_gpus == -1 and len(gpu_utilization_one_timestep) > 0:
                    number_of_gpus = len(gpu_utilization_one_timestep)
                if len(gpu_utilization_one_timestep) == number_of_gpus:
                    # Reversed because we read the log from bottom to top, so the GPU order is reversed.
                    gpu_utilization.append(list(reversed(gpu_utilization_one_timestep)))
                    history_size += 1
                # Otherwise the timestep is incomplete and is dropped.
                gpu_utilization_one_timestep = []
            if line[0].startswith('util'):
                continue
            try:
                current_gpu_utilization = int(line[0].strip().replace(' %', ''))
            except ValueError:
                print('line: {0}'.format(line))
                print('line_number: {0}'.format(line_number))
                raise
            gpu_utilization_one_timestep.append(current_gpu_utilization)

        # Plot graph
        # We read the log backward; reverse again to get chronological order.
        gpu_utilization = np.array(list(reversed(gpu_utilization)))
        fig = plt.figure(1)
        ax = fig.add_subplot(111)
        ax.plot(range(gpu_utilization.shape[0]), gpu_utilization)
        ax.set_title('GPU utilization over time ({0})'.format(get_current_time_in_miliseconds()))
        ax.set_xlabel('Time')
        ax.set_ylabel('GPU utilization (%)')
        gpu_utilization_mean_per_gpu = np.mean(gpu_utilization, axis=0)
        ax.legend(
            ['GPU {0} (avg {1})'.format(gpu_number, np.round(gpu_utilization_mean, 1))
             for gpu_number, gpu_utilization_mean
             in zip(range(gpu_utilization.shape[1]), gpu_utilization_mean_per_gpu)],
            loc='center right', bbox_to_anchor=(1.45, 0.5))
        plt.savefig(graph_filepath, dpi=300, format='png', bbox_inches='tight')
        plt.close()


    def main():
        '''
        This is the main function
        '''
        # Parameters
        gpu_log_filepath = 'gpu_utilization.log'
        max_history_sizes = [100, 10000]
        for max_history_size in max_history_sizes:
            # File names must match the ones gpu_monitoring.sh copies with scp.
            graph_filepath = 'gpu_utilization_{0}.png'.format(max_history_size)
            generate_plot(gpu_log_filepath, max_history_size, graph_filepath)


    if __name__ == "__main__":
        main()
        #cProfile.run('main()') # if you want to do some profiling

Server side (i.e. the web server)

gpu.html:

    <!DOCTYPE html>
    <html>
    <body>

    <h2>gpu_utilization_server1.png</h2>

    <img src="gpu_utilization_100_server1.png" alt="GPU utilization, last 100 samples" style="height:508px;"><img src="gpu_utilization_10000_server1.png" alt="GPU utilization, last 10000 samples" style="height:508px;">

    </body>
    </html>