階層データ：各ノードのすべての子孫のリストを効率的に作成します

Question

大きなツリーを形成する複数の親子関係を表す2列のデータセットがあります。これを使用して、各ノードのすべての子孫の更新リストを作成したいと思います。

元の入力：

 child parent 1 2010 1000 7 2100 1000 5 2110 1000 3 3000 2110 2 3011 2010 4 3033 2100 0 3102 2010 6 3111 2110

関係のグラフィカルな描写：

期待される出力：

 descendant ancestor 0 2010 1000 1 2100 1000 2 2110 1000 3 3000 1000 4 3011 1000 5 3033 1000 6 3102 1000 7 3111 1000 8 3011 2010 9 3102 2010 10 3033 2100 11 3000 2110 12 3111 2110

もともと、DataFramesで再帰的なソリューションを使用することにしました。これは意図したとおりに機能しますが、Pandasは非常に非効率的です。私の研究により、NumPy配列（または他の単純なデータ構造）を使用した実装は（数千のレコード）。

データフレームを使用したソリューション：

import pandas as pd df = pd.DataFrame( { 'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100], 'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000] }, columns=['child', 'parent'] ) def get_ancestry_dataframe_flat(df): def get_child_list(parent_id): list_of_children = list() list_of_children.append(df[df['parent'] == parent_id]['child'].values) for i, r in df[df['parent'] == parent_id].iterrows(): if r['child'] != parent_id: list_of_children.append(get_child_list(r['child'])) # flatten list list_of_children = [item for sublist in list_of_children for item in sublist] return list_of_children new_df = pd.DataFrame(columns=['descendant', 'ancestor']).astype(int) for index, row in df.iterrows(): temp_df = pd.DataFrame(columns=['descendant', 'ancestor']) temp_df['descendant'] = pd.Series(get_child_list(row['parent'])) temp_df['ancestor'] = row['parent'] new_df = new_df.append(temp_df) new_df = new_df\ .drop_duplicates()\ .sort_values(['ancestor', 'descendant'])\ .reset_index(drop=True) return new_df

この方法でpandas DataFramesを使用すると、大きなデータセットでは非常に非効率的であるため、この操作のパフォーマンスを改善する必要があります。私の理解では、これはループと再帰に適したより効率的なデータ構造を使用することで実現できるということです。

具体的には、速度の最適化を求めています。

EFT · Accepted Answer

これは、numpyを使用して、一度に1世代ずつツリーを反復する方法です。

コード：

_import numpy as np import pandas as pd # only used to return a dataframe def list_ancestors(edges): """ Take Edge list of a rooted tree as a numpy array with shape (E, 2), child nodes in edges[:, 0], parent nodes in edges[:, 1] Return pandas dataframe of all descendant/ancestor node pairs Ex: df = pd.DataFrame({'child': [200, 201, 300, 301, 302, 400], 'parent': [100, 100, 200, 200, 201, 300]}) df child parent 0 200 100 1 201 100 2 300 200 3 301 200 4 302 201 5 400 300 list_ancestors(df.values) returns descendant ancestor 0 200 100 1 201 100 2 300 200 3 300 100 4 301 200 5 301 100 6 302 201 7 302 100 8 400 300 9 400 200 10 400 100 """ ancestors = [] for ar in trace_nodes(edges): ancestors.append(np.c_[np.repeat(ar[:, 0], ar.shape[1]-1), ar[:, 1:].flatten()]) return pd.DataFrame(np.concatenate(ancestors), columns=['descendant', 'ancestor']) def trace_nodes(edges): """ Take Edge list of a rooted tree as a numpy array with shape (E, 2), child nodes in edges[:, 0], parent nodes in edges[:, 1] Yield numpy array with cross-section of tree and associated ancestor nodes Ex: df = pd.DataFrame({'child': [200, 201, 300, 301, 302, 400], 'parent': [100, 100, 200, 200, 201, 300]}) df child parent 0 200 100 1 201 100 2 300 200 3 301 200 4 302 201 5 400 300 trace_nodes(df.values) yields array([[200, 100], [201, 100]]) array([[300, 200, 100], [301, 200, 100], [302, 201, 100]]) array([[400, 300, 200, 100]]) """ mask = np.in1d(edges[:, 1], edges[:, 0]) gen_branches = edges[~mask] edges = edges[mask] yield gen_branches while edges.size != 0: mask = np.in1d(edges[:, 1], edges[:, 0]) next_gen = edges[~mask] gen_branches = numpy_col_inner_many_to_one_join(next_gen, gen_branches) edges = edges[mask] yield gen_branches def numpy_col_inner_many_to_one_join(ar1, ar2): """ Take two 2-d numpy arrays ar1 and ar2, with no duplicate values in first column of ar2 Return inner join of ar1 and ar2 on last column of ar1, first column of ar2 Ex: ar1 = np.array([[1, 2, 3], [4, 5, 3], [6, 7, 8], [9, 10, 11]]) ar2 = np.array([[ 1, 2], [ 3, 4], [ 5, 6], [ 7, 8], [ 9, 10], [11, 12]]) numpy_col_inner_many_to_one_join(ar1, ar2) returns array([[ 1, 2, 3, 4], [ 4, 5, 3, 4], [ 9, 10, 11, 12]]) """ ar1 = ar1[np.in1d(ar1[:, -1], ar2[:, 0])] ar2 = ar2[np.in1d(ar2[:, 0], ar1[:, -1])] if 'int' in ar1.dtype.name and ar1[:, -1].min() >= 0: bins = np.bincount(ar1[:, -1]) counts = bins[bins.nonzero()[0]] else: counts = np.unique(ar1[:, -1], False, False, True)[1] left = ar1[ar1[:, -1].argsort()] right = ar2[ar2[:, 0].argsort()] return np.concatenate([left[:, :-1], right[np.repeat(np.arange(right.shape[0]), counts)]], 1) _

タイミング比較：

@ taky2が提供するテストケース1と2、それぞれ、高木構造と幅の広い木構造のパフォーマンスを比較するテストケース3と4-ほとんどのユースケースは中央のどこかにあります。

_df = pd.DataFrame( { 'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100], 'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000] } ) df2 = pd.DataFrame( { 'child': [4321, 3102, 4023, 2010, 5321, 4200, 4113, 6525, 4010, 4001, 3011, 5010, 3000, 3033, 2110, 6100, 3111, 2100, 6016, 4311], 'parent': [3111, 2010, 3000, 1000, 4023, 3011, 3033, 5010, 3011, 3102, 2010, 4023, 2110, 2100, 1000, 5010, 2110, 1000, 5010, 3033] } ) df3 = pd.DataFrame(np.r_[np.c_[np.arange(1, 501), np.arange(500)], np.c_[np.arange(501, 1001), np.arange(500)]], columns=['child', 'parent']) df4 = pd.DataFrame(np.r_[np.c_[np.arange(1, 101), np.repeat(0, 100)], np.c_[np.arange(1001, 11001), np.repeat(np.arange(1, 101), 100)]], columns=['child', 'parent']) %timeit get_ancestry_dataframe_flat(df) 10 loops, best of 3: 53.4 ms per loop %timeit add_children_of_children(df) 1000 loops, best of 3: 1.13 ms per loop %timeit all_descendants_nx(df) 1000 loops, best of 3: 675 µs per loop %timeit list_ancestors(df.values) 1000 loops, best of 3: 391 µs per loop %timeit get_ancestry_dataframe_flat(df2) 10 loops, best of 3: 168 ms per loop %timeit add_children_of_children(df2) 1000 loops, best of 3: 1.8 ms per loop %timeit all_descendants_nx(df2) 1000 loops, best of 3: 1.06 ms per loop %timeit list_ancestors(df2.values) 1000 loops, best of 3: 933 µs per loop %timeit add_children_of_children(df3) 10 loops, best of 3: 156 ms per loop %timeit all_descendants_nx(df3) 1 loop, best of 3: 952 ms per loop %timeit list_ancestors(df3.values) 10 loops, best of 3: 104 ms per loop %timeit add_children_of_children(df4) 1 loop, best of 3: 503 ms per loop %timeit all_descendants_nx(df4) 1 loop, best of 3: 238 ms per loop %timeit list_ancestors(df4.values) 100 loops, best of 3: 2.96 ms per loop _

注：

_get_ancestry_dataframe_flat_は、時間とメモリの問題のため、ケース3と4でタイミングが取れません。

_add_children_of_children_は、ルートノードを内部的に識別するために変更されましたが、一意のルートを引き受けることができます。最初の行root_node = (set(dataframe.parent) - set(dataframe.child)).pop()が追加されました。

_all_descendants_nx_は、外部ネームスペースからプルする代わりに、データフレームを引数として受け入れるように変更されました。

適切な動作を示す例：

_np.all(get_ancestry_dataframe_flat(df2).sort_values(['descendant', 'ancestor'])\ .reset_index(drop=True) ==\ list_ancestors(df2.values).sort_values(['descendant', 'ancestor'])\ .reset_index(drop=True)) Out[20]: True _

Stephen Rauch · Answer

ツリーのナビゲーションを容易にするための辞書を作成するメソッドを次に示します。次に、ツリーを1回実行して、子を祖父母以上に追加します。最後に、新しいデータをデータフレームに追加します。

コード：

_def add_children_of_children(dataframe, root_node): # build a dict of lists to allow easy tree descent tree = {} for idx, (child, parent) in dataframe.iterrows(): tree.setdefault(parent, []).append(child) data = [] def descend_tree(parent): # get list of children of this parent children = tree[parent] # reverse order so that we can modify the list while looping for child in reversed(children): if child in tree: # descend tree and find children which need to be added lower_children = descend_tree(child) # add children from below to parent at this level data.extend([(c, parent) for c in lower_children]) # return lower children to parents above children.extend(lower_children) return children descend_tree(root_node) return dataframe.append( pd.DataFrame(data, columns=dataframe.columns)) _

タイミング：

テストコードには、timeitの実行から数秒で3つのテストメソッドがあります。

0.073-add_children_of_children()上から。
0.153-add_children_of_children()出力がソートされています。
3.385-元のget_ancestry_dataframe_flat() pandas実装。

したがって、ネイティブのデータ構造アプローチは、元の実装よりもかなり高速です。

テストコード：

_import pandas as pd df = pd.DataFrame( { 'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100], 'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000] }, columns=['child', 'parent'] ) def method1(): # the root node is the node which is not a child root = set(df.parent) - set(df.child) assert len(root) == 1, "Number of roots != 1 '{}'".format(root) return add_children_of_children(df, root.pop()) def method2(): dataframe = method1() names = ['ancestor', 'descendant'] rename = {o: n for o, n in Zip(dataframe.columns, reversed(names))} return dataframe.rename(columns=rename) \ .sort_values(names).reset_index(drop=True) def method3(): return get_ancestry_dataframe_flat(df) def get_ancestry_dataframe_flat(df): def get_child_list(parent_id): list_of_children = list() list_of_children.append( df[df['parent'] == parent_id]['child'].values) for i, r in df[df['parent'] == parent_id].iterrows(): if r['child'] != parent_id: list_of_children.append(get_child_list(r['child'])) # flatten list list_of_children = [ item for sublist in list_of_children for item in sublist] return list_of_children new_df = pd.DataFrame(columns=['descendant', 'ancestor']).astype(int) for index, row in df.iterrows(): temp_df = pd.DataFrame(columns=['descendant', 'ancestor']) temp_df['descendant'] = pd.Series(get_child_list(row['parent'])) temp_df['ancestor'] = row['parent'] new_df = new_df.append(temp_df) new_df = new_df\ .drop_duplicates()\ .sort_values(['ancestor', 'descendant'])\ .reset_index(drop=True) return new_df print(method2()) print(method3()) from timeit import timeit print(timeit(method1, number=50)) print(timeit(method2, number=50)) print(timeit(method3, number=50)) _

試験結果：

_ descendant ancestor 0 2010 1000 1 2100 1000 2 2110 1000 3 3000 1000 4 3011 1000 5 3033 1000 6 3102 1000 7 3111 1000 8 3011 2010 9 3102 2010 10 3033 2100 11 3000 2110 12 3111 2110 descendant ancestor 0 2010 1000 1 2100 1000 2 2110 1000 3 3000 1000 4 3011 1000 5 3033 1000 6 3102 1000 7 3111 1000 8 3011 2010 9 3102 2010 10 3033 2100 11 3000 2110 12 3111 2110 0.0737142168563 0.153700592966 3.38558308083 _

Kevin · Answer

networkx を使用するソリューションでは、ドキュメントにもっと効率的な方法があるかもしれませんが、このネストされたループはトリックを行います。

import pandas as pd from timeit import timeit df = pd.DataFrame( { 'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100], 'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000] }, columns=['child', 'parent'] )

Networkx 2.0では、 from_pandas_edgelist を使用して有向グラフを作成します。

import networkx as nx Dig = nx.from_pandas_edgelist(df, 'parent', 'child', create_using=nx.DiGraph())

各ノードの nodes および ancestors を繰り返し処理します。

for n1 in Dig.nodes(): for n2 in nx.ancestors(Dig, n1): print(n1,n2) 3000 1000 3000 2110 3011 1000 3011 2010 2100 1000 2110 1000 3111 1000 3111 2110 3033 1000 3033 2100 2010 1000 3102 1000 3102 2010

関数にラップ：

def all_descendants_nx(): Dig = nx.from_pandas_edgelist(df,'parent','child',create_using=nx.DiGraph()) return pd.DataFrame.from_records([(n1,n2) for n1 in Dig.nodes() for n2 in nx.ancestors(Dig, n1)], columns=['descendant','ancestor']) print(timeit(all_descendants_nx, number=50)) #to compare to Stephen's Nice answer 0.05033063516020775 all_descendants_nx() descendant ancestor 0 3000 1000 1 3000 2110 2 3011 1000 3 3011 2010 4 2100 1000 5 2110 1000 6 3111 1000 7 3111 2110 8 3033 1000 9 3033 2100 10 2010 1000 11 3102 1000 12 3102 2010

Vaishali · Answer

Isin（）とmapを使用する1つの方法を次に示します

df_new = df.append(df[df['parent'].isin(df['child'].values.tolist())])\ .reset_index(drop = True) df_new.loc[df_new.duplicated(), 'parent'] = df_new.loc[df_new.duplicated(), 'parent']\ .map(df.set_index('child')['parent']) df_new = df_new.sort_values('parent').reset_index(drop=True) df_new.columns = [' descendant' , 'ancestor']

あなたが得る

 descendant ancestor 0 2010 1000 1 2100 1000 2 2110 1000 3 3000 1000 4 3011 1000 5 3033 1000 6 3102 1000 7 3111 1000 8 3011 2010 9 3102 2010 10 3033 2100 11 3000 2110 12 3111 2110