Use numpy array in shared memory for multiprocessing

Question

I would like to use a numpy array in shared memory with the multiprocessing module. The difficulty is using it like a numpy array, and not just as a ctypes array.

    from multiprocessing import Process, Array
    import numpy as np

    def f(a):
        a[0] = -a[0]

    if __name__ == '__main__':
        # Create the array
        N = int(10)
        # the original post used scipy.rand, an alias of numpy.random.rand
        unshared_arr = np.random.rand(N)
        arr = Array('d', unshared_arr)
        print("Originally, the first two elements of arr = %s" % (arr[:2]))

        # Create, start, and finish the child process
        p = Process(target=f, args=(arr,))
        p.start()
        p.join()

        # Print out the changed values
        print("Now, the first two elements of arr = %s" % arr[:2])

This produces output such as:

    Originally, the first two elements of arr = [0.3518653236697369, 0.517794725524976]
    Now, the first two elements of arr = [-0.3518653236697369, 0.517794725524976]

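The array can be accessed in a ctypes manner, e.g. arr[i] makes sense. However, it is not a numpy array, so I cannot perform operations such as -1*arr or arr.sum(). I suppose a solution would be to convert the ctypes array into a numpy array. However (besides not being able to make this work), I don't believe it would still be shared.

It seems there should be a standard solution to what has to be a common problem.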

jfs · Answer

To add to @unutbu's (no longer available) and @Henry Gomersall's answers: you can use shared_arr.get_lock() to synchronize access when needed:

    shared_arr = mp.Array(ctypes.c_double, N)
    # ...
    def f(i):  # i could be anything numpy accepts as an index, such as another numpy array
        with shared_arr.get_lock():  # synchronize access
            arr = np.frombuffer(shared_arr.get_obj())  # no data copying
            arr[i] = -arr[i]

Example

    import ctypes
    import logging
    import multiprocessing as mp

    from contextlib import closing

    import numpy as np

    info = mp.get_logger().info

    def main():
        logger = mp.log_to_stderr()
        logger.setLevel(logging.INFO)

        # create shared array
        N, M = 100, 11
        shared_arr = mp.Array(ctypes.c_double, N)
        arr = tonumpyarray(shared_arr)

        # fill with random values
        arr[:] = np.random.uniform(size=N)
        arr_orig = arr.copy()

        # write to arr from different processes
        with closing(mp.Pool(initializer=init, initargs=(shared_arr,))) as p:
            # many processes access the same slice
            stop_f = N // 10
            p.map_async(f, [slice(stop_f)]*M)

            # many processes access different slices of the same array
            assert M % 2  # odd
            step = N // 10
            p.map_async(g, [slice(i, i + step) for i in range(stop_f, N, step)])
        p.join()
        assert np.allclose(((-1)**M)*tonumpyarray(shared_arr), arr_orig)

    def init(shared_arr_):
        global shared_arr
        shared_arr = shared_arr_  # must be inherited, not passed as an argument

    def tonumpyarray(mp_arr):
        return np.frombuffer(mp_arr.get_obj())

    def f(i):
        """synchronized."""
        with shared_arr.get_lock():  # synchronize access
            g(i)

    def g(i):
        """no synchronization."""
        info("start %s" % (i,))
        arr = tonumpyarray(shared_arr)
        arr[i] = -1 * arr[i]
        info("end %s" % (i,))

    if __name__ == '__main__':
        mp.freeze_support()
        main()

If you don't need synchronized access, or you create your own locks, then mp.Array() is unnecessary. In that case you can use mp.sharedctypes.RawArray.
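As a rough illustration of the RawArray route, here is a minimal sketch that pairs a lock-free RawArray with a separately created mp.Lock. The helper names as_numpy, negate and run_demo are made up for this example and are not part of multiprocessing.

```python
import ctypes
import multiprocessing as mp

import numpy as np


def as_numpy(raw_arr):
    """View a RawArray as a NumPy array without copying (hypothetical helper)."""
    return np.frombuffer(raw_arr, dtype=np.float64)


def negate(raw_arr, lock, index):
    """Negate raw_arr[index] in place while holding our own lock."""
    with lock:
        arr = as_numpy(raw_arr)
        arr[index] = -arr[index]


def run_demo(n=10, workers=3):
    raw_arr = mp.RawArray(ctypes.c_double, n)  # shared memory, no built-in lock
    lock = mp.Lock()                           # the lock we manage ourselves
    as_numpy(raw_arr)[:] = np.arange(n, dtype=np.float64)

    # Every worker negates the same slice, so access is serialized by the lock.
    procs = [mp.Process(target=negate, args=(raw_arr, lock, slice(0, n // 2)))
             for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return as_numpy(raw_arr).copy()


if __name__ == "__main__":
    print(run_demo())
```

The RawArray and the lock are handed to each Process at creation time, which is the same "inherit, don't send later" rule noted in the example above.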

Henry Gomersall · Answer

The Array object has a get_obj() method associated with it, which returns the ctypes array that presents a buffer interface. I think the following should work...

    from multiprocessing import Process, Array
    import numpy

    def f(a):
        a[0] = -a[0]

    if __name__ == '__main__':
        # Create the array
        N = int(10)
        # the original post used scipy.rand, an alias of numpy.random.rand
        unshared_arr = numpy.random.rand(N)
        a = Array('d', unshared_arr)
        print("Originally, the first two elements of arr = %s" % (a[:2]))

        # Create, start, and finish the child process
        p = Process(target=f, args=(a,))
        p.start()
        p.join()

        # Print out the changed values
        print("Now, the first two elements of arr = %s" % a[:2])

        # b is a numpy view onto the same shared buffer as a
        b = numpy.frombuffer(a.get_obj())

        b[0] = 10.0
        print(a[0])

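When run, this prints out the first element of a now being 10.0, showing that a and b are just two views into the same memory.

To make sure it is still multiprocessor safe, I believe you will have to use the acquire and release methods that exist on the Array object, a, and its built-in lock, so that everything is accessed safely (though I'm not an expert on the multiprocessing module).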
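A minimal sketch of that idea, assuming the default synchronized wrapper returned by Array (which exposes acquire and release for its built-in lock). The function names add_one and run_demo are invented for this illustration.

```python
import multiprocessing as mp

import numpy as np


def add_one(shared, count):
    """Add 1.0 to every element `count` times, locking around each update."""
    for _ in range(count):
        shared.acquire()  # takes the Array's built-in lock
        try:
            view = np.frombuffer(shared.get_obj())  # numpy view, no copy
            view += 1.0                             # read-modify-write under the lock
        finally:
            shared.release()


def run_demo(workers=4, count=1000):
    shared = mp.Array('d', 3)  # three doubles, zero-initialized
    procs = [mp.Process(target=add_one, args=(shared, count))
             for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return shared[:]


if __name__ == "__main__":
    print(run_demo())
```

The try/finally pair does the same job as the `with shared.get_lock():` form; the explicit calls are shown only because they are the methods mentioned above.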

EelkeSpaak · Answer

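While the answers already given are good, there is a much easier solution to this problem, provided two conditions are met:

1. You are on a POSIX-compliant operating system (e.g. Linux, Mac OSX); and
2. Your child processes need read-only access to the shared array.

In this case you do not need to fiddle with explicitly making variables shared, because the child processes are created using a fork. A forked child automatically shares the parent's memory space. In the context of Python multiprocessing, this means it shares all module-level variables; note that this does not hold for arguments that you explicitly pass to your child processes or to the functions you call on a multiprocessing.Pool or the like.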

A simple example:

    import multiprocessing
    import numpy as np

    # will hold the (implicitly mem-shared) data
    data_array = None

    # child worker function
    def job_handler(num):
        # built-in id() returns unique memory ID of a variable
        return id(data_array), np.sum(data_array)

    def launch_jobs(data, num_jobs=5, num_worker=4):
        global data_array
        data_array = data

        # note: this relies on the child processes being created with fork
        pool = multiprocessing.Pool(num_worker)
        return pool.map(job_handler, range(num_jobs))

    # create some random data and execute the child jobs
    mem_ids, sumvals = zip(*launch_jobs(np.random.rand(10)))

    # this will print 'True' on POSIX OS, since the data was shared
    print(np.all(np.asarray(mem_ids) == id(data_array)))
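Since the approach depends on fork, it may be worth requesting that start method explicitly instead of relying on the platform default (recent Python versions default to other start methods on some platforms). The following is a sketch under that assumption; worker and sum_in_child are hypothetical names, and a single Process with a Queue stands in for the Pool used above.

```python
import multiprocessing as mp

import numpy as np

# Module-level variable; a forked child sees whatever it holds at fork time.
data_array = None


def worker(queue):
    """Runs in the child: reads the inherited global, never writes to it."""
    queue.put(float(np.sum(data_array)))


def sum_in_child(data):
    """Compute np.sum(data) in a forked child without passing `data` to it."""
    global data_array
    data_array = data

    # Ask for fork explicitly; this fails on platforms without fork (e.g. Windows).
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    proc = ctx.Process(target=worker, args=(queue,))
    proc.start()
    result = queue.get()  # fetch the result before joining
    proc.join()
    return result


if __name__ == "__main__":
    print(sum_in_child(np.arange(5.0)))
```

Only the queue is passed as an argument; the array itself reaches the child purely through the forked memory image, which is why the read-only condition matters.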

mat · Answer

I've written a small python module that uses POSIX shared memory to share numpy arrays between python interpreters. Maybe you will find it handy.

https://pypi.python.org/pypi/SharedArray

Here's how it works:

    import numpy as np
    import SharedArray as sa

    # Create an array in shared memory
    a = sa.create("test1", 10)

    # Attach it as a different array. This can be done from another
    # python interpreter as long as it runs on the same computer.
    b = sa.attach("test1")

    # See how they are actually sharing the same memory block
    a[0] = 42
    print(b[0])

    # Destroying a does not affect b.
    del a
    print(b[0])

    # See how "test1" is still present in shared memory even though we
    # destroyed the array a.
    sa.list()

    # Now destroy the array "test1" from memory.
    sa.delete("test1")

    # The array b is not affected, but once you destroy it then the
    # data are lost.
    print(b[0])
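For comparison only, and not as part of SharedArray: on Python 3.8+ the standard library's multiprocessing.shared_memory module supports a similar create-then-attach-by-name workflow. The sketch below assumes that module; create_shared and attach_shared are hypothetical helper names.

```python
import numpy as np
from multiprocessing import shared_memory


def create_shared(shape, dtype=np.float64, name=None):
    """Create a named shared-memory block and wrap it in a NumPy array."""
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    shm = shared_memory.SharedMemory(name=name, create=True, size=nbytes)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    arr.fill(0)
    return shm, arr


def attach_shared(name, shape, dtype=np.float64):
    """Attach to an existing block by name; shape and dtype must be known."""
    shm = shared_memory.SharedMemory(name=name)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)


if __name__ == "__main__":
    shm_a, a = create_shared((10,))
    # Attaching could equally happen in another interpreter on the same machine.
    shm_b, b = attach_shared(shm_a.name, (10,))

    a[0] = 42
    print(b[0])  # both arrays are views of the same block

    # Drop the NumPy views before closing, then remove the block itself.
    del a, b
    shm_b.close()
    shm_a.close()
    shm_a.unlink()
```

Unlike the module above, the attaching side has to be told the shape and dtype, since the block itself stores only raw bytes.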

Velimir Mlaker · Answer

You can use the sharedmem module: https://bitbucket.org/cleemesser/numpy-sharedmem

Here's your original code then, this time using shared memory that behaves like a NumPy array (note the additional last statement calling the NumPy sum() function):

    from multiprocessing import Process
    import sharedmem
    import numpy as np

    def f(a):
        a[0] = -a[0]

    if __name__ == '__main__':
        # Create the array
        N = int(10)
        # the original post used scipy.rand, an alias of numpy.random.rand
        unshared_arr = np.random.rand(N)
        arr = sharedmem.empty(N)
        arr[:] = unshared_arr.copy()
        print("Originally, the first two elements of arr = %s" % (arr[:2]))

        # Create, start, and finish the child process
        p = Process(target=f, args=(arr,))
        p.start()
        p.join()

        # Print out the changed values
        print("Now, the first two elements of arr = %s" % arr[:2])

        # Perform some NumPy operation
        print(arr.sum())