Running an LSTM on multiple GPUs gives "Input and hidden tensors are not at the same device"

Question

I'm trying to train an LSTM layer in PyTorch on 4 GPUs. During initialization, I added a .cuda() call to move the hidden state to the GPU. However, when I run the code on multiple GPUs, I get the following runtime error:

RuntimeError: Input and hidden tensors are not at the same device

I tried to fix the problem by calling .cuda() in the forward function, as follows:

self.hidden = (self.hidden[0].type(torch.FloatTensor).cuda(), self.hidden[1].type(torch.FloatTensor).cuda())

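This line seems to solve the problem, but it raises a concern: will the updated hidden state be seen on the other GPUs? Should I move the vectors back to the CPU at the end of the forward function for each batch, or is there another way to solve the problem?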

sayan · Answer

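When you call .cuda() on a tensor without arguments, PyTorch moves it to the current CUDA device, which by default is GPU 0. With data parallelism (nn.DataParallel), the model is replicated onto every GPU and each replica receives its slice of the input batch on its own GPU, while the hidden state you moved in __init__ stays on GPU 0. The replicas running on the other GPUs therefore see input and hidden tensors on different devices, which causes the runtime error you are seeing.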
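For the hidden-state part of the question, one common way to avoid the mismatch is to stop storing a pre-moved state on the module and instead create the initial state inside forward() on the same device as the input. Below is a minimal sketch under that assumption: DeviceSafeLSTM and the sizes are hypothetical, and the state is zero-initialized for every batch rather than carried over between batches.

```python
import torch
import torch.nn as nn


class DeviceSafeLSTM(nn.Module):
    """Hypothetical example module: the initial hidden state is created in
    forward() on the input's device, so every DataParallel replica gets a
    state that lives on its own GPU (or on the CPU when no GPU is used)."""

    def __init__(self, input_size: int, hidden_size: int, num_layers: int = 1):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size = x.size(0)
        # new_zeros allocates on x.device with x.dtype, instead of .cuda(),
        # which would always target the current default CUDA device.
        # The state shape is (num_layers, batch, hidden) even with batch_first=True.
        h0 = x.new_zeros(self.num_layers, batch_size, self.hidden_size)
        c0 = x.new_zeros(self.num_layers, batch_size, self.hidden_size)
        output, _ = self.lstm(x, (h0, c0))
        return output


model = DeviceSafeLSTM(input_size=8, hidden_size=16)
if torch.cuda.is_available():
    # Replicate across all visible GPUs; each replica builds its own h0/c0.
    model = nn.DataParallel(model.cuda())
    inputs = torch.randn(4, 5, 8).cuda()
else:
    inputs = torch.randn(4, 5, 8)

out = model(inputs)
print(out.shape)  # torch.Size([4, 5, 16])
```

Because nn.LSTM already defaults to a zero state on the input's device when no state is passed, calling self.lstm(x) without (h0, c0) is equivalent here; building the state explicitly only matters if you want to customize it. Also note that nn.DataParallel creates fresh replicas on every forward call and discards them afterwards, so anything a replica assigns to self.hidden inside forward() is not copied back to the original module; carrying a hidden state across batches through a module attribute does not work under DataParallel.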

The correct way to implement data parallelism for a recurrent neural network is as follows:

```python
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class MyModule(nn.Module):
    # ... __init__, other methods, etc.

    # padded_input is of shape [B x T x *] (batch_first mode) and contains
    # the sequences sorted by lengths
    #   B is the batch size
    #   T is max sequence length
    def forward(self, padded_input, input_lengths):
        total_length = padded_input.size(1)  # get the max sequence length
        packed_input = pack_padded_sequence(padded_input, input_lengths,
                                            batch_first=True)
        packed_output, _ = self.my_lstm(packed_input)
        output, _ = pad_packed_sequence(packed_output, batch_first=True,
                                        total_length=total_length)
        return output


m = MyModule().cuda()
dp_m = nn.DataParallel(m)
```
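The total_length argument is what makes this pattern work under DataParallel: each replica only sees part of the batch, and pad_packed_sequence by default pads only up to the longest sequence in that part, so different replicas could return outputs with different time dimensions that cannot be gathered back into one tensor. The CPU-only sketch below simulates one replica's chunk to show the difference; the sizes and lengths are hypothetical.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Hypothetical sizes, chosen only for illustration.
B, T, FEATURES, HIDDEN = 2, 6, 3, 4

lstm = nn.LSTM(FEATURES, HIDDEN, batch_first=True)

# Pretend this is the chunk of a batch that one DataParallel replica receives:
# the full batch was padded to T = 6 steps, but the longest sequence in this
# chunk has only 4 real steps.
padded_chunk = torch.randn(B, T, FEATURES)
chunk_lengths = [4, 2]  # sorted in decreasing order, as required by default

packed = pack_padded_sequence(padded_chunk, chunk_lengths, batch_first=True)
packed_out, _ = lstm(packed)

# Without total_length, the output is padded only up to max(chunk_lengths).
short_out, _ = pad_packed_sequence(packed_out, batch_first=True)
# With total_length, it is padded back to the original T, so every replica's
# output has the same time dimension and the results can be concatenated.
full_out, _ = pad_packed_sequence(packed_out, batch_first=True, total_length=T)

print(short_out.shape)  # torch.Size([2, 4, 4])
print(full_out.shape)   # torch.Size([2, 6, 4])
```

The extra time steps in full_out are filled with the padding value (0 by default), so the real outputs are identical in both versions; only the length of the time dimension changes.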

Also, in a multi-GPU setup, you should set the CUDA_VISIBLE_DEVICES environment variable to the GPUs you want the process to use.
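A minimal sketch of setting the variable from Python, assuming the four GPUs from the question are numbered 0 to 3 (adjust the list for your machine):

```python
import os

# Restrict this process to GPUs 0-3. The variable is read when CUDA is
# initialized, so it must be set before any CUDA call; setting it before
# importing torch is the safest option.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

import torch  # noqa: E402  (imported after setting the environment variable)

# Only the listed GPUs are visible, renumbered inside this process as
# cuda:0 .. cuda:N-1; on a machine without GPUs this list is empty.
device_ids = list(range(torch.cuda.device_count()))
print(f"visible GPU ids: {device_ids}")

# nn.DataParallel(model) uses all visible GPUs by default, which here is the
# same as passing device_ids=device_ids explicitly.
```

From the shell, the equivalent is to prefix the launch command, for example CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py, where train.py stands in for your own training script.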
