Making a histogram with a Spark DataFrame column

Question

I am trying to make a histogram from a column of a DataFrame that looks like this:

DataFrame[C0: int, C1: int, ...]

How would I go about making a histogram of column C1?

Some things I have tried are

df.groupBy("C1").count().histogram()
df.C1.countByValue()

Neither works, because of a mismatch in data types.

zero323 · Answer

You can use the histogram_numeric Hive UDAF:

import random
from pyspark.sql import HiveContext  # import added; sc is an existing SparkContext
random.seed(323)
sqlContext = HiveContext(sc)
n = 3  # Number of buckets
df = sqlContext.createDataFrame(
    sc.parallelize(enumerate(random.random() for _ in range(1000))),
    ["id", "v"]
)
hists = df.selectExpr("histogram_numeric({0}, {1})".format("v", n))
hists.show(1, False)
## +------------------------------------------------------------------------------------+
## |histogram_numeric(v,3)                                                              |
## +------------------------------------------------------------------------------------+
## |[[0.2124888140177466,415.0], [0.5918851340384337,330.0], [0.8890271451209697,255.0]]|
## +------------------------------------------------------------------------------------+

You can also extract the column of interest and use the histogram method on the RDD:

df.select("v").rdd.flatMap(lambda x: x).histogram(n)
## ([0.002028109534323752,
##   0.33410233677189705,
##   0.6661765640094703,
##   0.9982507912470436],
##  [327, 326, 347])
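To make the shape of that result easier to read: the first list holds n + 1 bucket boundaries and the second holds n counts. Each bucket is open on the right, except the last one, which also includes the maximum. Below is a rough plain-Python sketch of that bucketing rule. The helper name even_histogram is made up for illustration, and this is not Spark's implementation, only the same idea on a small local list.

```python
def even_histogram(values, n):
    """Count values into n equal-width buckets spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    # n + 1 boundaries delimit n buckets
    boundaries = [lo + i * width for i in range(n)] + [hi]
    counts = [0] * n
    for v in values:
        if width > 0:
            # Buckets are [a, b) except the last, which also includes the maximum
            idx = min(int((v - lo) / width), n - 1)
        else:
            # All values identical: this sketch puts everything in the first bucket
            idx = 0
        counts[idx] += 1
    return boundaries, counts


boundaries, counts = even_histogram([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3)
print(boundaries)
print(counts)
```

The counts always add up to the number of input values, which is a handy sanity check on the real Spark output as well (327 + 326 + 347 = 1000 above).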

lanenok · Answer

What worked for me is

df.groupBy("C1").count().rdd.values().histogram()

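I found the histogram method in the pyspark.RDD class, but not in the spark.sql module, so you have to convert to an RDD first. Note that in PySpark, RDD.histogram expects a buckets argument (either a bucket count or a sorted list of boundaries), so pass one when you call it.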

Briford Wylie · Answer

The pyspark_dist_explore package that @Chris van den Berg mentioned is quite nice. If you prefer not to add an additional dependency, you can use this bit of code to plot a simple histogram.

import matplotlib.pyplot as plt
# Show histogram of the 'C1' column
bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)
# This is a bit awkward but I believe this is the correct way to do it
plt.hist(bins[:-1], bins=bins, weights=counts)
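The plt.hist call above works because each left boundary is passed as a single sample that falls into its own bin, and weights turns that one sample into the precomputed count. If that feels too indirect, the same (bins, counts) pair can be turned into bar positions directly. This is a small sketch with a made-up helper name, bar_geometry, and made-up sample numbers; it only does the arithmetic and leaves the plotting to you.

```python
def bar_geometry(bins, counts):
    """Turn histogram boundaries and counts into left edges and widths for a bar chart."""
    if len(bins) != len(counts) + 1:
        raise ValueError("expected one more boundary than counts")
    # Each bar starts at the left boundary of its bucket
    lefts = list(bins[:-1])
    # and is as wide as the distance to the next boundary
    widths = [right - left for left, right in zip(bins[:-1], bins[1:])]
    return lefts, widths


# Illustrative values only, not real Spark output
lefts, widths = bar_geometry([0.0, 0.5, 1.0, 2.0], [3, 1, 4])
print(lefts, widths)
```

With matplotlib available, the result could then be drawn with plt.bar(lefts, counts, width=widths, align="edge"), which also copes with buckets of unequal width.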

Assaf Mendelson · Answer

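Let's say your values in C1 are between 1 and 1000 and you want a histogram with 10 bins. You can do something like:

df.withColumn("bins", df.C1/100).groupBy("bins").count()

If your binning is more complex, you can write a UDF for it (and at worst, you might need to analyze the column first, e.g. with describe or some other method).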
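One thing to watch with this approach: if the division yields a fractional result (as `/` does in Python 3, and as I would expect for an integer column divided in Spark), almost every row ends up in its own "bin". Flooring the quotient fixes that. The plain-Python sketch below shows the arithmetic for values 1 to 1000; bin_index is a hypothetical helper, and the Spark equivalent would presumably be something like floor((df.C1 - 1) / 100) with pyspark.sql.functions.floor, which is not tested here.

```python
from collections import Counter


def bin_index(value, low=1, bin_width=100):
    """Map a value to a zero-based bin index using floor division."""
    return (value - low) // bin_width


values = range(1, 1001)

# Plain division keeps the fractional part, so the "bins" are all distinct
naive = Counter(v / 100 for v in values)

# Floor division collapses each run of 100 consecutive values into one bin
binned = Counter(bin_index(v) for v in values)

print(len(naive), len(binned))
```

Subtracting the lower bound first keeps 1000 in the last bin (index 9) instead of spilling into an eleventh one.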

Chris van den Berg · Answer

If you want to plot the histogram, you can use the pyspark_dist_explore package:

import matplotlib.pyplot as plt
from pyspark_dist_explore import hist
fig, ax = plt.subplots()
hist(ax, df.groupBy("C1").count().select("count"))

If you would like the data in a pandas DataFrame, you can use:

from pyspark_dist_explore import pandas_histogram
pandas_df = pandas_histogram(df.groupBy("C1").count().select("count"))