Parallelism isn't reducing the time in dataset map

Question

The TF map function supports parallel calls, but I see no improvement when I pass num_parallel_calls to map. The run time is the same with num_parallel_calls=1 and num_parallel_calls=10. Here is a simple piece of code:

import time

import tensorflow as tf  # TensorFlow 1.x API (tf.py_func, tf.Session)

# Note: squarer() is not defined in the question.


def test_two_custom_function_parallelism(num_parallel_calls=1, batch=False,
                                         batch_size=1, repeat=1,
                                         num_iterations=10):
    tf.reset_default_graph()
    start = time.time()

    dataset_x = tf.data.Dataset.range(1000).map(
        lambda x: tf.py_func(squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    if batch:
        dataset_x = dataset_x.batch(batch_size)

    dataset_y = tf.data.Dataset.range(1000).map(
        lambda x: tf.py_func(squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    if batch:
        dataset_y = dataset_y.batch(batch_size)

    X = dataset_x.make_one_shot_iterator().get_next()
    Y = dataset_y.make_one_shot_iterator().get_next()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        i = 0
        while True:
            try:
                res = sess.run([X, Y])
                i += 1
                if i == num_iterations:
                    break
            except tf.errors.OutOfRangeError:
                # Dataset exhausted before num_iterations was reached.
                break

Here are the timings:

%timeit test_two_custom_function_parallelism(num_iterations=1000, num_parallel_calls=2, batch_size=2, batch=True)
370ms

%timeit test_two_custom_function_parallelism(num_iterations=1000, num_parallel_calls=5, batch_size=2, batch=True)
372ms

%timeit test_two_custom_function_parallelism(num_iterations=1000, num_parallel_calls=10, batch_size=2, batch=True)
384ms

I used %timeit in a Jupyter notebook. What am I doing wrong?

mrry · Answer

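The problem here is that the only operation in the Dataset.map() function is a tf.py_func() op. This op calls back into the local Python interpreter to run a function in the same process. Increasing num_parallel_calls increases the number of TensorFlow threads that try to call back into Python concurrently. However, Python has something called the "Global Interpreter Lock" (GIL), which prevents multiple threads from executing Python code at the same time. As a result, all but one of these parallel calls will be blocked, waiting to acquire the Global Interpreter Lock, and there will be almost no parallel speedup (and perhaps even a slight slowdown).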
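
The effect can be reproduced without TensorFlow. The following is a minimal sketch using plain Python threads, assuming a standard GIL-enabled CPython build; count_down and timed_run are illustrative names, not part of the question's code. Each thread does a fixed amount of pure-Python work, so adding threads adds total work that still executes one thread at a time.

```python
import threading
import time


def count_down(n):
    """Pure-Python CPU-bound work: the running thread holds the GIL."""
    while n > 0:
        n -= 1
    return n


def timed_run(num_threads, work_per_thread=2_000_000):
    """Run count_down in num_threads threads and return elapsed seconds."""
    threads = [
        threading.Thread(target=count_down, args=(work_per_thread,))
        for _ in range(num_threads)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    one = timed_run(1)
    four = timed_run(4)
    # On a GIL-enabled build, `four` is typically close to 4 * `one`:
    # the four threads effectively ran one after another.
    print(f"1 thread: {one:.2f}s, 4 threads: {four:.2f}s")
```

If the threads could truly run in parallel, the two timings would be roughly equal on a machine with four or more cores.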

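Your code example didn't include the definition of the squarer() function, but it might be possible to replace tf.py_func() with pure TensorFlow ops, which are implemented in C++ and can execute in parallel. For example, and just guessing from the name, you could replace it with a call to tf.square(x), and you might then enjoy some parallel speedup.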
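
A rough sketch of that replacement follows. It assumes TensorFlow 2.x with eager execution (the question itself uses the 1.x session API) and assumes that squarer() does nothing more than square its input; neither assumption is confirmed by the question.

```python
import tensorflow as tf

# Assumption: squarer() only squares its argument, so tf.square can stand in.
# tf.square is a native TensorFlow op, so no call back into Python is needed.
dataset = tf.data.Dataset.range(1000).map(tf.square, num_parallel_calls=10)

# TensorFlow 2.x: datasets can be iterated directly in eager mode.
first_four = [int(value) for value in dataset.take(4).as_numpy_iterator()]
print(first_four)
```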

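Note, however, that if the amount of work in the function is small, like squaring a single integer, the speedup might not be very large. Parallel Dataset.map() is more useful for heavier operations, such as parsing a TFRecord with tf.parse_single_example() or performing image distortions as part of a data augmentation pipeline.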

quanly_mc · Answer

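The reason may be that the squaring itself takes less time than the overhead. I modified the code by adding a squarer function that takes 2 seconds. With that change, the num_parallel_calls parameter works as expected. Here is the complete code: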

import time

import tensorflow as tf  # TensorFlow 1.x API (tf.py_func, tf.Session)


def squarer(x):
    # Busy-wait for 2 seconds of wall-clock time to simulate a slow function.
    t0 = time.time()
    while time.time() - t0 < 2:
        y = x ** 2
    return y


def test_two_custom_function_parallelism(num_parallel_calls=1, batch=False,
                                         batch_size=1, repeat=1,
                                         num_iterations=10):
    tf.reset_default_graph()
    start = time.time()

    dataset_x = tf.data.Dataset.range(1000).map(
        lambda x: tf.py_func(squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    # dataset_x = dataset_x.prefetch(4)
    if batch:
        dataset_x = dataset_x.batch(batch_size)

    dataset_y = tf.data.Dataset.range(1000).map(
        lambda x: tf.py_func(squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    # dataset_y = dataset_y.prefetch(4)
    if batch:
        dataset_y = dataset_y.batch(batch_size)

    X = dataset_x.make_one_shot_iterator().get_next()
    Y = dataset_y.make_one_shot_iterator().get_next()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        i = 0
        while True:
            t0 = time.time()
            try:
                res = sess.run([X, Y])
                print(res)
                i += 1
                if i == num_iterations:
                    break
            except tf.errors.OutOfRangeError:
                print(i)
                break
            print('step elapse: %.4f' % (time.time() - t0))
    print('total time: %.4f' % (time.time() - start))


test_two_custom_function_parallelism(
    num_iterations=4, num_parallel_calls=1, batch_size=2, batch=True, repeat=10)
test_two_custom_function_parallelism(
    num_iterations=4, num_parallel_calls=10, batch_size=2, batch=True, repeat=10)

The output is:

[(array([0, 1]),), (array([0, 1]),)]
step elapse: 4.0204
[(array([4, 9]),), (array([4, 9]),)]
step elapse: 4.0836
[(array([16, 25]),), (array([16, 25]),)]
step elapse: 4.1529
[(array([36, 49]),), (array([36, 49]),)]
total time: 16.3374

[(array([0, 1]),), (array([0, 1]),)]
step elapse: 2.2139
[(array([4, 9]),), (array([4, 9]),)]
step elapse: 0.0585
[(array([16, 25]),), (array([16, 25]),)]
step elapse: 0.0469
[(array([36, 49]),), (array([36, 49]),)]
total time: 2.5317

So I am confused about the effect of the "Global Interpreter Lock" mentioned by @mrry.
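
One possible reading of these numbers, which the answer itself does not state: this squarer() loops until two seconds of wall-clock time have passed rather than doing a fixed amount of computation. Threads that take turns holding the GIL all watch the same clock, so they can all finish after roughly two seconds even though only one of them runs at any instant. The sketch below illustrates this with plain Python threads and shorter waits; wait_on_clock and run_in_threads are hypothetical helpers, not part of the answer's code.

```python
import threading
import time


def wait_on_clock(seconds):
    """Spin until `seconds` of wall-clock time have elapsed."""
    start = time.monotonic()
    while time.monotonic() - start < seconds:
        pass


def run_in_threads(num_threads, seconds):
    """Hypothetical helper: run wait_on_clock in threads, return elapsed time."""
    threads = [
        threading.Thread(target=wait_on_clock, args=(seconds,))
        for _ in range(num_threads)
    ]
    start = time.monotonic()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.monotonic() - start


elapsed = run_in_threads(num_threads=4, seconds=0.2)
# Run back to back, four waits would take about 0.8 s. In threads the total
# is usually much closer to 0.2 s, because each thread only needs the clock
# to advance, not exclusive use of the interpreter.
print(f"4 threads, 0.2 s each: {elapsed:.2f}s")
```

Under this reading, the result is compatible with the GIL explanation: a function that performs a fixed amount of Python computation would not show the same speedup.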

golmschenk · Answer

I set up my own version of map to get something similar to TensorFlow's Dataset.map, but one that uses multiple CPUs for py_functions.

Usage

Instead of

mapped_dataset = my_dataset.map(lambda x: tf.py_function(my_function, [x], [tf.float64]), num_parallel_calls=16)

you can get a CPU-parallel py_function version, using the code below, with

mapped_dataset = map_py_function_to_dataset(my_dataset, my_function, number_of_parallel_calls=16)

(The output types of the py_function can also be specified if they are not a single tf.float32.)

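Internally, this creates a pool of multiprocessing workers. It still uses the regular, GIL-limited TensorFlow map, but only to pass the input to a worker and to get the output back. The workers that process the data run in parallel on the CPU.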

Caveats

The function passed in needs to be picklable to work with the multiprocessing pool. This should be fine in most cases, but some closures and the like may fail. Packages like dill might loosen this restriction, but I haven't looked into that.
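
A quick way to check a candidate function before handing it to the pool is to try pickling it. The sketch below uses only the standard library; is_picklable and make_scaler are illustrative names, not part of the code in this answer.

```python
import math
import pickle


def is_picklable(obj):
    """Return True if pickle can serialize obj, as multiprocessing requires."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


def make_scaler(factor):
    """Return a closure; nested functions cannot be looked up by name."""
    def scale(value):
        return value * factor
    return scale


print(is_picklable(math.sqrt))        # importable by its module-level name
print(is_picklable(lambda x: x * 2))  # anonymous function
print(is_picklable(make_scaler(3)))   # closure over `factor`
```

Functions defined at module level are pickled by reference to their name, which is why lambdas and nested functions tend to be the cases that fail.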

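If you pass an object's method as the function, you also need to be careful about how the object is duplicated across processes (each process has its own copy of the object, so you can't rely on attributes being shared).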
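
The copying can be seen without starting any processes. In the sketch below, a pickle round trip stands in for sending a bound method to a worker; that stand-in is a simplification made for illustration, and a collections.Counter is used as the example object.

```python
import collections
import pickle

seen = collections.Counter()

# Simulate sending the bound method to a worker: pickling the method also
# pickles the object it is bound to, so the worker receives a copy.
worker_update = pickle.loads(pickle.dumps(seen.update))
worker_update(["a", "b", "a"])

print(worker_update.__self__)  # the worker-side copy was modified
print(seen)                    # the original object was not
```

Any state a method accumulates inside a worker therefore stays in that worker unless it is returned as part of the function's result.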

As long as these considerations are kept in mind, this code should work in many cases.

Code

"""
Code for TensorFlow's `Dataset` class which allows for multiprocessing in CPU map functions.
"""
import multiprocessing
import signal
from typing import Callable, Union, List

import tensorflow as tf


class PyMapper:
    """
    A class which allows for mapping a py_function to a TensorFlow dataset in parallel on CPU.
    """
    def __init__(self, map_function: Callable, number_of_parallel_calls: int):
        self.map_function = map_function
        self.number_of_parallel_calls = number_of_parallel_calls
        self.pool = multiprocessing.Pool(self.number_of_parallel_calls, self.pool_worker_initializer)

    @staticmethod
    def pool_worker_initializer():
        """
        Used to initialize each worker process.
        """
        # Corrects bug where worker instances catch and throw away keyboard interrupts.
        signal.signal(signal.SIGINT, signal.SIG_IGN)

    def send_to_map_pool(self, element_tensor):
        """
        Sends the tensor element to the pool for processing.

        :param element_tensor: The element to be processed by the pool.
        :return: The output of the map function on the element.
        """
        result = self.pool.apply_async(self.map_function, (element_tensor,))
        mapped_element = result.get()
        return mapped_element

    def map_to_dataset(self, dataset: tf.data.Dataset,
                       output_types: Union[List[tf.dtypes.DType], tf.dtypes.DType] = tf.float32):
        """
        Maps the map function to the passed dataset.

        :param dataset: The dataset to apply the map function to.
        :param output_types: The TensorFlow output types of the function to convert to.
        :return: The mapped dataset.
        """
        def map_py_function(*args):
            """A py_function wrapper for the map function."""
            return tf.py_function(self.send_to_map_pool, args, output_types)
        return dataset.map(map_py_function, self.number_of_parallel_calls)


def map_py_function_to_dataset(dataset: tf.data.Dataset, map_function: Callable, number_of_parallel_calls: int,
                               output_types: Union[List[tf.dtypes.DType], tf.dtypes.DType] = tf.float32
                               ) -> tf.data.Dataset:
    """
    A one line wrapper to allow mapping a parallel py function to a dataset.

    :param dataset: The dataset whose elements the mapping function will be applied to.
    :param map_function: The function to map to the dataset.
    :param number_of_parallel_calls: The number of parallel calls of the mapping function.
    :param output_types: The TensorFlow output types of the function to convert to.
    :return: The mapped dataset.
    """
    py_mapper = PyMapper(map_function=map_function, number_of_parallel_calls=number_of_parallel_calls)
    mapped_dataset = py_mapper.map_to_dataset(dataset=dataset, output_types=output_types)
    return mapped_dataset