Sending blocks of a 2D array in C using MPI

Question

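How do you send blocks of a 2D array to different processors? Suppose the 2D array is 400x400 and I want to send blocks of size 100x100 to different processors. The idea is that each processor performs its computation on a separate block and sends its result back to the first processor for the final result.
I am using MPI in C programs.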

Jonathan Dursi · Answer

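Let me start by saying that you generally don't really want to do this: scatter and gather huge chunks of data from some "master" process. Normally you want each task to be chugging away at its own piece of the puzzle, and you should aim never to have one processor need a "global view" of the whole data; as soon as you require that, you limit scalability and the problem size. If you are doing this for I/O (one process reads the data, scatters it, then gathers it back for writing), you will eventually want to look into MPI-IO.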

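To answer your question, though: MPI has very nice ways of pulling arbitrary data out of memory and scattering/gathering it to and from a set of processors. Unfortunately, that requires a fair number of MPI concepts: MPI types, extents, and collective operations. Many of the basic ideas are discussed in the answer to this question: "MPI_Type_create_subarray and MPI_Gather".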

Update: in the cold light of day, this is a lot of code and not a lot of explanation. So let me expand a little.

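Consider a 1d global array of integers that task 0 holds and that you want to distribute to a number of MPI tasks, so that each gets a piece in its local array. Say you have 4 tasks, and the global array is _[01234567]_. You could have task 0 send four messages (including one to itself) to distribute it, and when it is time to reassemble, receive four messages and bundle them back together; but that obviously gets very time consuming with large numbers of processes. There are optimized routines for these sorts of operations: the scatter/gather operations. So in this 1d case you would do something like this: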

    int global[8];        /* only task 0 has this */
    int local[2];         /* everyone has this */
    const int root = 0;   /* the processor with the initial global data */

    if (rank == root) {
        for (int i=0; i<8; i++)   /* fill all 8 entries */
            global[i] = i;
    }

    MPI_Scatter(global, 2, MPI_INT,      /* send everyone 2 ints from global */
                local,  2, MPI_INT,      /* each proc receives 2 ints into local */
                root, MPI_COMM_WORLD);   /* sending process is root, all procs in */
                                         /* MPI_COMM_WORLD participate */

After this, the processors' data would look like

    task 0:  local:[01]  global: [01234567]
    task 1:  local:[23]  global: [garbage-]
    task 2:  local:[45]  global: [garbage-]
    task 3:  local:[67]  global: [garbage-]

That is, the scatter operation takes the global array and sends contiguous 2-int chunks to all the processors.

To reassemble the array, we use the MPI_Gather() operation, which works exactly the same way but in reverse:

    for (int i=0; i<2; i++)
        local[i] = local[i] + rank;

    MPI_Gather(local,  2, MPI_INT,      /* everyone sends 2 ints from local */
               global, 2, MPI_INT,      /* root receives 2 ints from each proc into global */
               root, MPI_COMM_WORLD);   /* recv'ing process is root, all procs in */
                                        /* MPI_COMM_WORLD participate */

And now the data would look like

    task 0:  local:[01]  global: [0134679a]
    task 1:  local:[34]  global: [garbage-]
    task 2:  local:[67]  global: [garbage-]
    task 3:  local:[9a]  global: [garbage-]

Gather brings all the data back. Here "a" stands for 10, because I didn't think my formatting through carefully enough when starting this example.

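What happens if the number of processes doesn't evenly divide the number of data points, and we need to send a different number of items to each process? Then you need a generalized version of scatter, MPI_Scatterv(), which lets you specify the count for each processor and the displacement, that is, where in the global array that piece of data starts. So let's say you had an array of 9 characters, _[abcdefghi]_, and you were going to assign every process two characters except the last, which gets three. Then you would need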

    char global[9];                /* only task 0 has this */
    char local[3]={'-','-','-'};   /* everyone has this */
    int  mynum;                    /* how many items */
    const int root = 0;            /* the processor with the initial global data */

    if (rank == 0) {
        for (int i=0; i<9; i++)    /* fill all 9 characters */
            global[i] = 'a'+i;
    }

    int counts[4] = {2,2,2,3};     /* how many pieces of data everyone has */
    mynum = counts[rank];
    int displs[4] = {0,2,4,6};     /* the starting point of everyone's data */
                                   /* in the global array */

    MPI_Scatterv(global, counts, displs,   /* proc i gets counts[i] pts from displs[i] */
                 MPI_CHAR,
                 local, mynum, MPI_CHAR,   /* I'm receiving mynum MPI_CHARs into local */
                 root, MPI_COMM_WORLD);

Now the data looks like

    task 0:  local:[ab-]  global: [abcdefghi]
    task 1:  local:[cd-]  global: [garbage--]
    task 2:  local:[ef-]  global: [garbage--]
    task 3:  local:[ghi]  global: [garbage--]

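You have now used scatterv to distribute irregular amounts of data. The displacement in each case is 2*rank from the start of the array (measured in characters; the displacement is in units of the type being sent for a scatter or received for a gather, and is not generally in bytes), and the counts are {2,2,2,3}. If it had been the first processor that we wanted to have 3 characters, we would have set counts={3,2,2,2} and the displacements would have been {0,3,5,7}. Gatherv again works exactly the same way but in reverse; the counts and displs arrays stay the same.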

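Now, for 2D, this is a bit trickier. If we want to send 2d sub-blocks of a 2d array, the data we are sending is no longer contiguous. If we are sending (say) 3x3 sub-blocks of a 6x6 array to 4 processors, the data we are sending has holes in it: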

    2D Array

    ---------
    |000|111|
    |000|111|
    |000|111|
    |---+---|
    |222|333|
    |222|333|
    |222|333|
    ---------

    Actual layout in memory

    [000111000111000111222333222333222333]

(Note that all high-performance computing comes down to understanding the layout of data in memory.)

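If we want to send the data marked "1" to task 1, we need to skip three values, send three values, skip three values, send three values, skip three values, send three values. A second complication is where the subregions stop and start; note that region "1" does not start where region "0" stops. After the last element of region "0", the next location in memory is partway through region "1".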

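Let's tackle the first layout problem first: how to pull out just the data we want to send. We could always copy all of the region "0" data into another, contiguous array and send that; if we planned it carefully enough, we could even do it in such a way that we could call _MPI_Scatter_ on the result. But we would rather not have to transpose our entire main data structure that way.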

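So far, all the MPI data types we have used are simple ones; MPI_INT specifies (say) 4 bytes in a row. However, MPI lets you create your own data types that describe arbitrarily complex data layouts in memory. And this case, rectangular subregions of an array, is common enough that there is a specific call for it. For the 2-dimensional case described above,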

    MPI_Datatype newtype;
    int sizes[2]    = {6,6};  /* size of global array */
    int subsizes[2] = {3,3};  /* size of sub-region */
    int starts[2]   = {0,0};  /* let's say we're looking at region "0",
                                 which begins at index [0,0] */

    MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_INT, &newtype);
    MPI_Type_commit(&newtype);

This creates a type which picks out just region "0" from the global array; we could now send just that piece of data to another processor

    MPI_Send(&(global[0][0]), 1, newtype, dest, tag, MPI_COMM_WORLD);  /* region "0" */

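and the receiving process could receive it into a local array. Note that the receiving process, if it is only receiving into a 3x3 array, can *not* describe what it is receiving as a type of newtype; that no longer describes its memory layout. Instead, it is just receiving a block of 3*3 = 9 integers: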

    MPI_Recv(&(local[0][0]), 3*3, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

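Note that we can do this for the other subregions too, either by creating a different type (with a different starts array) for each of the other blocks, or just by sending from the starting point of the particular block: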

    MPI_Send(&(global[0][3]), 1, newtype, dest, tag, MPI_COMM_WORLD);  /* region "1" */
    MPI_Send(&(global[3][0]), 1, newtype, dest, tag, MPI_COMM_WORLD);  /* region "2" */
    MPI_Send(&(global[3][3]), 1, newtype, dest, tag, MPI_COMM_WORLD);  /* region "3" */

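Finally, note that we require global and local to be contiguous chunks of memory here; that is, &(global[0][0]) and &(local[0][0]) (or, equivalently, _*global_ and _*local_) point to contiguous 6*6 and 3*3 chunks of memory. That is not guaranteed by the usual way of allocating dynamic multi-d arrays. How to do it is shown below.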

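Now that we understand how to specify subregions, there is only one more thing to discuss before using scatter/gather operations, and that is the "size" of these types. We could not just use MPI_Scatter() (or even scatterv) with these types yet, because of their extent. The data of one block spans 15 integers from its first element to its last, and MPI gives a subarray type the extent of the full array it was cut from, 36 integers here. Either way, where the type ends does not line up with where the next block begins, so a plain scatter would pick the wrong place to start sending data to the next processor.

Of course, we could use MPI_Scatterv() and specify the displacements ourselves, but the displacements are in units of the send type's extent, so that does not help either. The blocks start at offsets of (0,3,18,21) integers from the start of the global array, and with an extent that large those displacements cannot be expressed as integer multiples of it at all.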

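To deal with this, MPI lets you set the extent of a type for the purposes of these calculations. It does not truncate the type; the extent is only used to work out where the next element starts, given the previous one. For types like these with holes in them, it is frequently handy to set the extent to something smaller than the distance in memory to the actual end of the type.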

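We can set the extent to anything that is convenient for us. We could just make the extent 1 integer and then give the displacements in units of integers. In this case, though, I like to set the extent to 3 integers (the size of a sub-row); that way, block "1" starts immediately after block "0", and block "3" starts immediately after block "2". Unfortunately, it does not work quite as nicely when jumping from block "1" to block "2", but that can't be helped.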

So to scatter the sub-blocks in this case, we would do the following:

    MPI_Datatype type, resizedtype;
    int sizes[2]    = {6,6};  /* size of global array */
    int subsizes[2] = {3,3};  /* size of sub-region */
    int starts[2]   = {0,0};  /* let's say we're looking at region "0",
                                 which begins at index [0,0] */

    /* as before */
    MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_INT, &type);
    /* change the extent of the type */
    MPI_Type_create_resized(type, 0, 3*sizeof(int), &resizedtype);
    MPI_Type_commit(&resizedtype);

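Here we have created the same block type as before, but resized it; we have not changed where the type "starts" (the 0), but we have changed where it "ends" (3 ints). We did not mention this before, but _MPI_Type_commit_ is required before a type can be used; you only need to commit the final type you actually use, though, not any intermediate steps. Use _MPI_Type_free_ to free the type when you are done.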

So now, finally, we can scatterv the blocks. The data manipulations above are a little complicated, but once they are done, the scatterv looks just like before:

    int counts[4] = {1,1,1,1};   /* how many pieces of data everyone has, in units of blocks */
    int displs[4] = {0,1,6,7};   /* the starting point of everyone's data */
                                 /* in the global array, in block extents */

    MPI_Scatterv(global, counts, displs,   /* proc i gets counts[i] types from displs[i] */
                 resizedtype,
                 local, 3*3, MPI_INT,      /* I'm receiving 3*3 MPI_INTs into local */
                 root, MPI_COMM_WORLD);

And now we are done, after a little tour of scatter, gather, and MPI derived types.

Example code that shows both the gather and the scatter operation, using character arrays, follows. Running the program:

    $ mpirun -n 4 ./gathervarray
    Global array is:
    0123456789
    3456789012
    6789012345
    9012345678
    2345678901
    5678901234
    8901234567
    1234567890
    4567890123
    7890123456
    Local process on rank 0 is:
    |01234|
    |34567|
    |67890|
    |90123|
    |23456|
    Local process on rank 1 is:
    |56789|
    |89012|
    |12345|
    |45678|
    |78901|
    Local process on rank 2 is:
    |56789|
    |89012|
    |12345|
    |45678|
    |78901|
    Local process on rank 3 is:
    |01234|
    |34567|
    |67890|
    |90123|
    |23456|
    Processed grid:
    AAAAABBBBB
    AAAAABBBBB
    AAAAABBBBB
    AAAAABBBBB
    AAAAABBBBB
    CCCCCDDDDD
    CCCCCDDDDD
    CCCCCDDDDD
    CCCCCDDDDD
    CCCCCDDDDD

And the code follows.

    #include <stdio.h>
    #include <math.h>
    #include <stdlib.h>
    #include "mpi.h"

    int malloc2dchar(char ***array, int n, int m) {

        /* allocate the n*m contiguous items */
        char *p = (char *)malloc(n*m*sizeof(char));
        if (!p) return -1;

        /* allocate the row pointers into the memory */
        (*array) = (char **)malloc(n*sizeof(char*));
        if (!(*array)) {
            free(p);
            return -1;
        }

        /* set up the pointers into the contiguous memory */
        for (int i=0; i<n; i++)
            (*array)[i] = &(p[i*m]);

        return 0;
    }

    int free2dchar(char ***array) {
        /* free the memory - the first element of the array is at the start */
        free(&((*array)[0][0]));

        /* free the pointers into the memory */
        free(*array);

        return 0;
    }

    int main(int argc, char **argv) {
        char **global, **local;
        const int gridsize=10;     // size of grid
        const int procgridsize=2;  // size of process grid
        int rank, size;            // rank of current process and no. of processes

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* the process grid is procgridsize x procgridsize, so np must be its square */
        if (size != procgridsize*procgridsize) {
            fprintf(stderr,"%s: Only works with np=%d for now\n", argv[0],
                    procgridsize*procgridsize);
            MPI_Abort(MPI_COMM_WORLD,1);
        }

        if (rank == 0) {
            /* fill in the array, and print it */
            malloc2dchar(&global, gridsize, gridsize);
            for (int i=0; i<gridsize; i++) {
                for (int j=0; j<gridsize; j++)
                    global[i][j] = '0'+(3*i+j)%10;
            }

            printf("Global array is:\n");
            for (int i=0; i<gridsize; i++) {
                for (int j=0; j<gridsize; j++)
                    putchar(global[i][j]);

                printf("\n");
            }
        }

        /* create the local array which we'll process */
        malloc2dchar(&local, gridsize/procgridsize, gridsize/procgridsize);

        /* create a datatype to describe the subarrays of the global array */
        int sizes[2]    = {gridsize, gridsize};                           /* global size */
        int subsizes[2] = {gridsize/procgridsize, gridsize/procgridsize}; /* local size */
        int starts[2]   = {0,0};                                          /* where this one starts */
        MPI_Datatype type, subarrtype;
        MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_CHAR, &type);
        /* extent = one sub-row, so displacements count sub-rows */
        MPI_Type_create_resized(type, 0, gridsize/procgridsize*sizeof(char), &subarrtype);
        MPI_Type_commit(&subarrtype);

        char *globalptr=NULL;
        if (rank == 0) globalptr = &(global[0][0]);

        /* scatter the array to all processors */
        int sendcounts[procgridsize*procgridsize];
        int displs[procgridsize*procgridsize];

        if (rank == 0) {
            for (int i=0; i<procgridsize*procgridsize; i++) sendcounts[i] = 1;
            int disp = 0;
            for (int i=0; i<procgridsize; i++) {
                for (int j=0; j<procgridsize; j++) {
                    displs[i*procgridsize+j] = disp;
                    disp += 1;
                }
                /* skip the remaining rows of this row of blocks */
                disp += ((gridsize/procgridsize)-1)*procgridsize;
            }
        }

        MPI_Scatterv(globalptr, sendcounts, displs, subarrtype, &(local[0][0]),
                     gridsize*gridsize/(procgridsize*procgridsize), MPI_CHAR,
                     0, MPI_COMM_WORLD);

        /* now all processors print their local data: */
        for (int p=0; p<size; p++) {
            if (rank == p) {
                printf("Local process on rank %d is:\n", rank);
                for (int i=0; i<gridsize/procgridsize; i++) {
                    putchar('|');
                    for (int j=0; j<gridsize/procgridsize; j++) {
                        putchar(local[i][j]);
                    }
                    printf("|\n");
                }
            }
            MPI_Barrier(MPI_COMM_WORLD);
        }

        /* now each processor has its local array, and can process it */
        for (int i=0; i<gridsize/procgridsize; i++) {
            for (int j=0; j<gridsize/procgridsize; j++) {
                local[i][j] = 'A' + rank;
            }
        }

        /* it all goes back to process 0 */
        MPI_Gatherv(&(local[0][0]), gridsize*gridsize/(procgridsize*procgridsize), MPI_CHAR,
                    globalptr, sendcounts, displs, subarrtype,
                    0, MPI_COMM_WORLD);

        /* don't need the local data anymore */
        free2dchar(&local);

        /* or the MPI data type */
        MPI_Type_free(&subarrtype);

        if (rank == 0) {
            printf("Processed grid:\n");
            for (int i=0; i<gridsize; i++) {
                for (int j=0; j<gridsize; j++) {
                    putchar(global[i][j]);
                }
                printf("\n");
            }

            free2dchar(&global);
        }

        MPI_Finalize();

        return 0;
    }

gsamaras · Answer

I just found it easier to check it this way.

    #include <stdio.h>
    #include <math.h>
    #include <stdlib.h>
    #include "mpi.h"

    /*
     This is a version with integers, rather than char arrays, presented in this
     very good answer: http://stackoverflow.com/a/9271753/2411320
     It will initialize the 2D array, scatter it, increase every value by 1 and then gather it back.
    */

    int malloc2D(int ***array, int n, int m) {
        int i;
        /* allocate the n*m contiguous items */
        int *p = malloc(n*m*sizeof(int));
        if (!p) return -1;

        /* allocate the row pointers into the memory */
        (*array) = malloc(n*sizeof(int*));
        if (!(*array)) {
            free(p);
            return -1;
        }

        /* set up the pointers into the contiguous memory */
        for (i=0; i<n; i++)
            (*array)[i] = &(p[i*m]);

        return 0;
    }

    int free2D(int ***array) {
        /* free the memory - the first element of the array is at the start */
        free(&((*array)[0][0]));

        /* free the pointers into the memory */
        free(*array);

        return 0;
    }

    int main(int argc, char **argv) {
        int **global, **local;
        const int gridsize=4;      // size of grid
        const int procgridsize=2;  // size of process grid
        int rank, size;            // rank of current process and no. of processes
        int i, j, p;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* the process grid is procgridsize x procgridsize, so np must be its square */
        if (size != procgridsize*procgridsize) {
            fprintf(stderr,"%s: Only works with np=%d for now\n", argv[0],
                    procgridsize*procgridsize);
            MPI_Abort(MPI_COMM_WORLD,1);
        }

        if (rank == 0) {
            /* fill in the array, and print it */
            malloc2D(&global, gridsize, gridsize);
            int counter = 0;
            for (i=0; i<gridsize; i++) {
                for (j=0; j<gridsize; j++)
                    global[i][j] = ++counter;
            }

            printf("Global array is:\n");
            for (i=0; i<gridsize; i++) {
                for (j=0; j<gridsize; j++) {
                    printf("%2d ", global[i][j]);
                }
                printf("\n");
            }
        }

        /* create the local array which we'll process */
        malloc2D(&local, gridsize/procgridsize, gridsize/procgridsize);

        /* create a datatype to describe the subarrays of the global array */
        int sizes[2]    = {gridsize, gridsize};                           /* global size */
        int subsizes[2] = {gridsize/procgridsize, gridsize/procgridsize}; /* local size */
        int starts[2]   = {0,0};                                          /* where this one starts */
        MPI_Datatype type, subarrtype;
        MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_INT, &type);
        /* extent = one sub-row, so displacements count sub-rows */
        MPI_Type_create_resized(type, 0, gridsize/procgridsize*sizeof(int), &subarrtype);
        MPI_Type_commit(&subarrtype);

        int *globalptr=NULL;
        if (rank == 0)
            globalptr = &(global[0][0]);

        /* scatter the array to all processors */
        int sendcounts[procgridsize*procgridsize];
        int displs[procgridsize*procgridsize];

        if (rank == 0) {
            for (i=0; i<procgridsize*procgridsize; i++)
                sendcounts[i] = 1;
            int disp = 0;
            for (i=0; i<procgridsize; i++) {
                for (j=0; j<procgridsize; j++) {
                    displs[i*procgridsize+j] = disp;
                    disp += 1;
                }
                /* skip the remaining rows of this row of blocks */
                disp += ((gridsize/procgridsize)-1)*procgridsize;
            }
        }

        MPI_Scatterv(globalptr, sendcounts, displs, subarrtype, &(local[0][0]),
                     gridsize*gridsize/(procgridsize*procgridsize), MPI_INT,
                     0, MPI_COMM_WORLD);

        /* now all processors print their local data: */
        for (p=0; p<size; p++) {
            if (rank == p) {
                printf("Local process on rank %d is:\n", rank);
                for (i=0; i<gridsize/procgridsize; i++) {
                    putchar('|');
                    for (j=0; j<gridsize/procgridsize; j++) {
                        printf("%2d ", local[i][j]);
                    }
                    printf("|\n");
                }
            }
            MPI_Barrier(MPI_COMM_WORLD);
        }

        /* now each processor has its local array, and can process it */
        for (i=0; i<gridsize/procgridsize; i++) {
            for (j=0; j<gridsize/procgridsize; j++) {
                local[i][j] += 1;  // increase the value by one
            }
        }

        /* it all goes back to process 0 */
        MPI_Gatherv(&(local[0][0]), gridsize*gridsize/(procgridsize*procgridsize), MPI_INT,
                    globalptr, sendcounts, displs, subarrtype,
                    0, MPI_COMM_WORLD);

        /* don't need the local data anymore */
        free2D(&local);

        /* or the MPI data type */
        MPI_Type_free(&subarrtype);

        if (rank == 0) {
            printf("Processed grid:\n");
            for (i=0; i<gridsize; i++) {
                for (j=0; j<gridsize; j++) {
                    printf("%2d ", global[i][j]);
                }
                printf("\n");
            }

            free2D(&global);
        }

        MPI_Finalize();

        return 0;
    }

Output:

    linux16:>mpicc -o main main.c
    linux16:>mpiexec -n 4 main
    Global array is:
     1  2  3  4
     5  6  7  8
     9 10 11 12
    13 14 15 16
    Local process on rank 0 is:
    | 1  2 |
    | 5  6 |
    Local process on rank 1 is:
    | 3  4 |
    | 7  8 |
    Local process on rank 2 is:
    | 9 10 |
    |13 14 |
    Local process on rank 3 is:
    |11 12 |
    |15 16 |
    Processed grid:
     2  3  4  5
     6  7  8  9
    10 11 12 13
    14 15 16 17