Plot a dendrogram using sklearn.AgglomerativeClustering

Question

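I'm trying to build a dendrogram using the children_ attribute provided by AgglomerativeClustering, but so far I'm out of luck. I can't use scipy.cluster, since the agglomerative clustering provided in scipy lacks some options that are important to me (such as the option to specify the number of clusters). I would be really grateful for any advice out there.

    from sklearn import cluster

    clusterer = cluster.AgglomerativeClustering(n_clusters=2)
    clusterer.children_  # only available after clusterer.fit(X)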

David Diaz · Answer

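Here is a simple function for taking a hierarchical clustering model from sklearn and plotting it with the scipy dendrogram function. Graphing functions do not seem to be directly supported in sklearn. You can find an interesting discussion related to the pull request for this plot_dendrogram code snippet here.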
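A minimal sketch of such a function follows; it is not necessarily identical to the snippet discussed in the pull request. It assumes the model was fitted so that `distances_` is populated (for example with `distance_threshold=0, n_clusters=None`, or with `compute_distances=True` in scikit-learn releases that support it). The helper name `linkage_matrix` is hypothetical. The idea is to rebuild the SciPy-style linkage matrix from `children_`, `distances_`, and the number of samples under each merge.

```python
import numpy as np


def linkage_matrix(children, distances, n_samples):
    """Build a SciPy-style linkage matrix from sklearn's merge description.

    Hypothetical helper: each row is [child_a, child_b, distance, sample_count].
    """
    counts = np.zeros(children.shape[0])
    for i, merge in enumerate(children):
        current = 0
        for child in merge:
            if child < n_samples:
                current += 1  # original sample (leaf)
            else:
                current += counts[child - n_samples]  # earlier merge
        counts[i] = current
    return np.column_stack([children, distances, counts]).astype(float)


def plot_dendrogram(model, **kwargs):
    """Plot a fitted AgglomerativeClustering model; needs model.distances_."""
    # Imported here so linkage_matrix stays usable with NumPy alone.
    from scipy.cluster.hierarchy import dendrogram

    Z = linkage_matrix(model.children_, model.distances_, len(model.labels_))
    return dendrogram(Z, **kwargs)


# Toy merge tree over 4 samples, in the same layout as children_.
children = np.array([[0, 1], [2, 3], [4, 5]])
distances = np.array([0.1, 0.2, 1.5])
Z = linkage_matrix(children, distances, n_samples=4)
print(Z)
```

With a real model, something like `plot_dendrogram(AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X))` should work, where `X` is your own data.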

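I'd also clarify that the use case you describe (defining the number of clusters) is available in scipy: after performing the hierarchical clustering with scipy's linkage, you can cut the hierarchy to whatever number of clusters you want using fcluster, with the number of clusters given in the t argument and criterion='maxclust'.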
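As a small illustration of that cut, here is a sketch on toy data. The data values and the choice of `method="ward"` are assumptions made for the example; any linkage method supported by scipy could be used instead.

```python
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data: two tight pairs of points (illustrative values only).
data = [[0., 0.], [0.1, -0.1], [1., 1.], [1.1, 1.1]]

# Build the full hierarchy first ...
Z = linkage(data, method="ward")

# ... then cut it so that at most 2 flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # one cluster label per row of data
```

The same `Z` can be passed to `dendrogram`, so the plot and the flat cluster labels come from one hierarchy.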

sebastianspiegel · Answer

Use the scipy implementation of agglomerative clustering instead. Here is an example.

    from scipy.cluster.hierarchy import dendrogram, linkage

    data = [[0., 0.], [0.1, -0.1], [1., 1.], [1.1, 1.1]]

    Z = linkage(data)

    dendrogram(Z)

You can find documentation for linkage here and documentation for dendrogram here.

lucianopaz · Answer

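I came across the exact same problem some time ago. The way I managed to plot the damn dendrogram was with the software package ete3. This package can flexibly plot trees with various options. The only difficulty was converting sklearn's children_ output to the Newick tree format that ete3 can read and understand. Furthermore, I needed to compute the dendrite's span manually, because that information is not provided with children_. Here is a snippet of the code I used. It computes the Newick tree and then shows the ete3 Tree data structure. For more details on how to plot, take a look here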

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    import ete3

    def build_Newick_tree(children, n_leaves, X, leaf_labels, spanner):
        """
        build_Newick_tree(children, n_leaves, X, leaf_labels, spanner)

        Get a string representation (Newick tree) from the sklearn
        AgglomerativeClustering.fit output.

        Input:
            children: AgglomerativeClustering.children_
            n_leaves: AgglomerativeClustering.n_leaves_
            X: parameters supplied to AgglomerativeClustering.fit
            leaf_labels: The label of each parameter array in X
            spanner: Callable that computes the dendrite's span

        Output:
            ntree: A str with the Newick tree representation
        """
        # The root is the node created by the last merge.
        root = len(children) + n_leaves - 1
        return go_down_tree(children, n_leaves, X, leaf_labels, root, spanner)[0] + ';'

    def go_down_tree(children, n_leaves, X, leaf_labels, nodename, spanner):
        """
        go_down_tree(children, n_leaves, X, leaf_labels, nodename, spanner)

        Recursive function that traverses the subtree that descends from
        nodename and returns the Newick representation of the subtree.

        Input:
            children: AgglomerativeClustering.children_
            n_leaves: AgglomerativeClustering.n_leaves_
            X: parameters supplied to AgglomerativeClustering.fit
            leaf_labels: The label of each parameter array in X
            nodename: An int that is the intermediate node name whose
                children are located in children[nodename-n_leaves].
            spanner: Callable that computes the dendrite's span

        Output:
            ntree: A str with the Newick tree representation
        """
        if nodename < n_leaves:
            # Leaf: node names below n_leaves index the samples directly.
            return leaf_labels[nodename], np.array([X[nodename]])
        node_children = children[nodename - n_leaves]
        # The spanner must be passed down to the recursive calls as well.
        branch0, branch0samples = go_down_tree(children, n_leaves, X, leaf_labels, node_children[0], spanner)
        branch1, branch1samples = go_down_tree(children, n_leaves, X, leaf_labels, node_children[1], spanner)
        node = np.vstack((branch0samples, branch1samples))
        branch0span = spanner(branch0samples)
        branch1span = spanner(branch1samples)
        nodespan = spanner(node)
        branch0distance = nodespan - branch0span
        branch1distance = nodespan - branch1span
        newick = '({branch0}:{branch0distance},{branch1}:{branch1distance})'.format(
            branch0=branch0, branch0distance=branch0distance,
            branch1=branch1, branch1distance=branch1distance)
        return newick, node

    def get_cluster_spanner(aggClusterer):
        """
        spanner = get_cluster_spanner(aggClusterer)

        Input:
            aggClusterer: sklearn.cluster.AgglomerativeClustering instance

        Get a callable that computes a given cluster's span. To compute
        a cluster's span, call spanner(cluster)

        The cluster must be a 2D numpy array, where the axis=0 holds
        separate cluster members and the axis=1 holds the different
        variables.
        """
        # Older scikit-learn releases name the distance parameter `affinity`,
        # newer ones `metric`; read whichever one is set.
        metric = getattr(aggClusterer, 'metric', None)
        if metric is None:
            metric = getattr(aggClusterer, 'affinity', 'euclidean')
        if metric == 'deprecated':
            metric = 'euclidean'

        if aggClusterer.linkage == 'ward':
            if metric != 'euclidean':
                raise AttributeError('Unknown affinity attribute value {0}.'.format(metric))
            # The original used aggClusterer.pooling_func (np.mean by default),
            # which recent scikit-learn releases no longer provide.
            spanner = lambda x: np.sum((x - np.mean(x, axis=0))**2)
        elif aggClusterer.linkage in ('complete', 'average'):
            # Complete linkage takes the largest pairwise value, average the mean.
            reduce_fn = np.max if aggClusterer.linkage == 'complete' else np.mean
            if metric == 'euclidean':
                pairwise = lambda x: np.sum((x[:, None, :] - x[None, :, :])**2, axis=2)
            elif metric == 'l1' or metric == 'manhattan':
                pairwise = lambda x: np.sum(np.abs(x[:, None, :] - x[None, :, :]), axis=2)
            elif metric == 'l2':
                pairwise = lambda x: np.sqrt(np.sum((x[:, None, :] - x[None, :, :])**2, axis=2))
            elif metric == 'cosine':
                # Pairwise cosine similarity; the original summed over all axes
                # in the numerator, which collapsed it to a single scalar.
                pairwise = lambda x: (np.sum(x[:, None, :] * x[None, :, :], axis=2)
                                      / (np.sqrt(np.sum(x[:, None, :]**2, axis=2))
                                         * np.sqrt(np.sum(x[None, :, :]**2, axis=2))))
            else:
                raise AttributeError('Unknown affinity attribute value {0}.'.format(metric))
            spanner = lambda x: reduce_fn(pairwise(x))
        else:
            raise AttributeError('Unknown linkage attribute value {0}.'.format(aggClusterer.linkage))
        return spanner

    clusterer = AgglomerativeClustering(n_clusters=2, compute_full_tree=True)  # You can set compute_full_tree to 'auto', but I left it this way to get the entire tree plotted
    clusterer.fit(X)  # X for whatever you want to fit
    spanner = get_cluster_spanner(clusterer)
    newick_tree = build_Newick_tree(clusterer.children_, clusterer.n_leaves_, X, leaf_labels, spanner)  # leaf_labels is a list of labels for each entry in X
    tree = ete3.Tree(newick_tree)
    tree.show()

jagthebeetle · Answer

If you're willing to step out of Python and use the robust D3 library, it's not super difficult to use the d3.cluster() (or, I guess, d3.tree()) APIs to achieve a nice, customizable result.

See the jsfiddle for a demo.

The children_ array luckily functions easily as a JS array, and the only intermediary step is to use d3.stratify() to turn it into a hierarchical representation. Specifically, each node needs an id and a parentId:

    var N = 272;  // Your n_samples/corpus size.
    var root = d3.stratify()
      .id((d, i) => i + N)
      .parentId((d, i) => {
        var parIndex = data.findIndex(e => e.includes(i + N));
        if (parIndex < 0) {
          return; // The root should have an undefined parentId.
        }
        return parIndex + N;
      })(data); // Your children_

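You end up with at least O(n^2) behaviour here because of the findIndex line, but it probably doesn't matter until your n_samples becomes huge, in which case you could precompute a more efficient index.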
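One way such a precomputed index could look, sketched here in Python rather than JavaScript: a single pass over children_ records the parent of every node, so each later lookup is a dictionary access instead of a scan. The function name `parent_ids` and the toy array are assumptions made for illustration.

```python
import numpy as np


def parent_ids(children, n_samples):
    """Map every node id to its parent id in one pass over children_.

    Row i of children_ describes the merge that creates node i + n_samples.
    """
    parent = {}
    for row, (left, right) in enumerate(children):
        node_id = row + n_samples
        parent[int(left)] = node_id
        parent[int(right)] = node_id
    return parent


# Toy merge tree over 4 samples, in the same layout as children_.
children = np.array([[0, 1], [2, 3], [4, 5]])
parents = parent_ids(children, n_samples=4)

# The root is created by the last merge and has no entry, which matches
# the undefined parentId that d3.stratify() expects for the root.
root_id = len(children) + 4 - 1
print(parents.get(root_id))
```

The resulting mapping can be serialised (for example as JSON) and consulted from the parentId callback in place of the findIndex scan.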

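Beyond that, it's pretty much plug-and-chug use of d3.cluster(). See mbostock's canonical block or my JSFiddle.

N.B. For my use case, it sufficed merely to show non-leaf nodes; it's a bit trickier to visualise the samples/leaves, since these might not all be in the children_ array explicitly.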