Finding the cause of a BrokenProcessPool in Python's concurrent.futures

Question

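In a nutshell

When I parallelize my code with concurrent.futures, I get a BrokenProcessPool exception. No further error is displayed. I want to find the cause of the error and am asking for ideas on how to do that.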

Full problem

I am using concurrent.futures to parallelize some code.

    from concurrent.futures import ProcessPoolExecutor

    with ProcessPoolExecutor() as pool:
        mapObj = pool.map(myMethod, args)

I end up with the following exception, and only this one:

concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore

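Unfortunately, the program is complex and the error appears only after the program has run for 30 minutes. Therefore, I cannot provide a nice minimal example.

To find the cause of the problem, I wrapped the method that I run in parallel in a try-except block: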

    def myMethod(*args):
        try:
            ...
        except Exception as e:
            print(e)

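The problem remained the same and the except block was never entered. I conclude that the exception does not come from my code.

My next step was to write a custom ProcessPoolExecutor class that is a child of the original ProcessPoolExecutor and lets me replace some methods with customized ones. I copied and pasted the original code of the method _process_worker and added some print statements.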

    def _process_worker(call_queue, result_queue):
        """Evaluates calls from call_queue and places the results in result_queue.
        ...
        """
        while True:
            call_item = call_queue.get(block=True)
            if call_item is None:
                # Wake up queue management thread
                result_queue.put(os.getpid())
                return
            try:
                r = call_item.fn(*call_item.args, **call_item.kwargs)
            except BaseException as e:
                print("??? Exception ???")  # newly added
                print(e)  # newly added
                exc = _ExceptionWithTraceback(e, e.__traceback__)
                result_queue.put(_ResultItem(call_item.work_id, exception=exc))
            else:
                result_queue.put(_ResultItem(call_item.work_id, result=r))

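Again, the except block is never entered. This was to be expected, because I had already ensured that my code does not raise an exception (and if everything worked well, the exception should be passed to the main process).

Now I am out of ideas on how to find the error. The exception is raised here: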

    def submit(self, fn, *args, **kwargs):
        with self._shutdown_lock:
            if self._broken:
                raise BrokenProcessPool('A child process terminated '
                    'abruptly, the process pool is not usable anymore')
            if self._shutdown_thread:
                raise RuntimeError('cannot schedule new futures after shutdown')

            f = _base.Future()
            w = _WorkItem(f, fn, args, kwargs)

            self._pending_work_items[self._queue_count] = w
            self._work_ids.put(self._queue_count)
            self._queue_count += 1
            # Wake up queue management thread
            self._result_queue.put(None)

            self._start_queue_management_thread()
            return f

The process pool is marked as broken here:

    def _queue_management_worker(executor_reference,
                                 processes,
                                 pending_work_items,
                                 work_ids_queue,
                                 call_queue,
                                 result_queue):
        """Manages the communication between this process and the worker processes.
        ...
        """
        executor = None

        def shutting_down():
            return _shutdown or executor is None or executor._shutdown_thread

        def shutdown_worker():
            ...

        reader = result_queue._reader

        while True:
            _add_call_item_to_queue(pending_work_items,
                                    work_ids_queue,
                                    call_queue)

            sentinels = [p.sentinel for p in processes.values()]
            assert sentinels
            ready = wait([reader] + sentinels)
            if reader in ready:
                result_item = reader.recv()
            else:
                # THIS BLOCK IS ENTERED WHEN THE ERROR OCCURS
                # Mark the process pool broken so that submits fail right now.
                executor = executor_reference()
                if executor is not None:
                    executor._broken = True
                    executor._shutdown_thread = True
                    executor = None
                # All futures in flight must be marked failed
                for work_id, work_item in pending_work_items.items():
                    work_item.future.set_exception(
                        BrokenProcessPool(
                            "A process in the process pool was "
                            "terminated abruptly while the future was "
                            "running or pending."
                        ))
                    # Delete references to object. See issue16284
                    del work_item
                pending_work_items.clear()
                # Terminate remaining workers forcibly: the queues or their
                # locks may be in a dirty state and block forever.
                for p in processes.values():
                    p.terminate()
                shutdown_worker()
                return
            ...

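It is (or seems to be) a fact that a process terminates, but I have no clue why. Are my thoughts correct so far? What are possible causes that make a process terminate without a message? (Is this even possible?) Where could I apply further diagnostics? Which questions should I ask myself in order to come closer to a solution?

I am using Python 3.5 on 64-bit Linux.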

Samufi · Answer

I think I was able to get as far as possible:

I changed the _queue_management_worker method in my modified ProcessPoolExecutor module so that the exit code of the failed process gets printed.

    def _queue_management_worker(executor_reference,
                                 processes,
                                 pending_work_items,
                                 work_ids_queue,
                                 call_queue,
                                 result_queue):
        """Manages the communication between this process and the worker processes.
        ...
        """
        executor = None

        def shutting_down():
            return _shutdown or executor is None or executor._shutdown_thread

        def shutdown_worker():
            ...

        reader = result_queue._reader

        while True:
            _add_call_item_to_queue(pending_work_items,
                                    work_ids_queue,
                                    call_queue)

            sentinels = [p.sentinel for p in processes.values()]
            assert sentinels
            ready = wait([reader] + sentinels)
            if reader in ready:
                result_item = reader.recv()
            else:
                # BLOCK INSERTED FOR DIAGNOSIS ONLY ---------
                vals = list(processes.values())
                for s in ready:
                    j = sentinels.index(s)
                    print("is_alive()", vals[j].is_alive())
                    print("exitcode", vals[j].exitcode)
                # -------------------------------------------
                # Mark the process pool broken so that submits fail right now.
                executor = executor_reference()
                if executor is not None:
                    executor._broken = True
                    executor._shutdown_thread = True
                    executor = None
                # All futures in flight must be marked failed
                for work_id, work_item in pending_work_items.items():
                    work_item.future.set_exception(
                        BrokenProcessPool(
                            "A process in the process pool was "
                            "terminated abruptly while the future was "
                            "running or pending."
                        ))
                    # Delete references to object. See issue16284
                    del work_item
                pending_work_items.clear()
                # Terminate remaining workers forcibly: the queues or their
                # locks may be in a dirty state and block forever.
                for p in processes.values():
                    p.terminate()
                shutdown_worker()
                return
            ...
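To illustrate why neither try-except block in the question ever fired, here is a minimal sketch of a child interpreter that is killed by a signal. It assumes a POSIX system; run_crashing_child and CHILD_SOURCE are hypothetical names, and the child sends SIGSEGV to itself as a stand-in for a real crash in native code. It uses subprocess rather than the process pool, but multiprocessing reports a death by signal N with the same negative exit code convention.

```python
import signal
import subprocess
import sys

# Hypothetical child program: it sends SIGSEGV to itself, standing in for a
# crash inside native code. No Python exception is raised in the child.
CHILD_SOURCE = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"


def run_crashing_child():
    """Run a child interpreter that dies from SIGSEGV; return its exit code."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # On POSIX, a negative return code -N means "terminated by signal N".
    return completed.returncode


if __name__ == "__main__":
    print("exitcode", run_crashing_child())
```

Because the operating system terminates the process, no except clause in the child gets a chance to run, and the parent only sees the exit code.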

Afterwards I looked up the meaning of the exit code:

    from multiprocessing.process import _exitcode_to_name
    print(_exitcode_to_name[my_exit_code])

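Here, my_exit_code is the exit code that was printed in the block I inserted into _queue_management_worker. In my case the code was -11, which means that I ran into a segmentation fault. Finding the reason for this issue will be a huge task, but it goes beyond the scope of this question.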
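Since _exitcode_to_name is a private name, it may not be available in every Python version. The following is a minimal sketch of the same lookup using only the public signal module (signal.Signals exists since Python 3.5). It assumes the multiprocessing convention that an exit code of -N means the process was terminated by signal N; describe_exitcode is a hypothetical helper name.

```python
import signal


def describe_exitcode(exitcode):
    """Return a readable description of a multiprocessing.Process exit code."""
    if exitcode is None:
        # multiprocessing uses None while the process has not terminated yet.
        return "still running"
    if exitcode < 0:
        # Negative values mean "terminated by signal -exitcode".
        try:
            return "killed by signal " + signal.Signals(-exitcode).name
        except ValueError:
            return "killed by unknown signal {}".format(-exitcode)
    if exitcode == 0:
        return "exited normally"
    return "exited with status {}".format(exitcode)


if __name__ == "__main__":
    print(describe_exitcode(-11))
```

On Linux, an exit code of -11 maps to SIGSEGV, which matches the segmentation fault described above.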

gowthamnvv · Answer

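If you are using macOS, there is a known issue with how some versions of macOS use forking in a way that Python does not consider fork-safe in some scenarios. A workaround that worked for me is to set the no_proxy environment variable.

Edit ~/.bash_profile and include the following (it might be better to specify a list of domains or subnets here instead of *):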

no_proxy='*'

Refresh the current context:

source ~/.bash_profile

The local versions on which I saw the issue and on which the workaround helped: Python 3.6.0 on macOS 10.14.1 and 10.13.x

Sources: Issue 30388, Issue 27126