PySpark: aggregation on multiple columns

Question

I have data like the following, in a file named babynames.csv.

    year  name     percent   sex
    1880  John     0.081541  boy
    1880  William  0.080511  boy
    1880  James    0.050057  boy

I need to sort the input by year and sex, and aggregate the output as shown below (the output is to be assigned to a new RDD).

    year  sex  avg(percentage)  count(rows)
    1880  boy  0.070703         3

I'm not sure how to proceed after the following step in PySpark. I need your help with this.

    testrdd = sc.textFile("babynames.csv")
    rows = testrdd.map(lambda y: y.split(',')).filter(lambda x: "year" not in x[0])
    aggregatedoutput = ????

zero323 · Answer

Load the data:

    # Set delimiter to whatever the file actually uses
    # (the question's code splits on ',').
    df = (sqlContext.read
        .format("com.databricks.spark.csv")
        .options(inferSchema="true", delimiter=";", header="true")
        .load("babynames.csv"))

Import the required functions:

    from pyspark.sql.functions import count, avg

Group and aggregate (optionally using Column.alias):
```
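# Optionally name an output column, e.g. avg("percent").alias("avg_percent")
# (the alias name here is only illustrative).
df.groupBy("year", "sex").agg(avg("percent"), count("*"))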
```
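As a sanity check on what this grouping computes, here is a minimal plain-Python sketch (no Spark involved) that applies the same per-(year, sex) average and row count to the three sample rows from the question. The names `sample_rows` and `aggregate` are made up for illustration, and the sketch only mirrors the semantics of the aggregation; it is not how Spark executes it.

```python
from collections import defaultdict

# The three sample rows from the question: (year, name, percent, sex).
sample_rows = [
    (1880, "John", 0.081541, "boy"),
    (1880, "William", 0.080511, "boy"),
    (1880, "James", 0.050057, "boy"),
]


def aggregate(rows):
    """Return {(year, sex): (average percent, row count)}, ordered by key."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of percent, count]
    for year, _name, percent, sex in rows:
        acc = totals[(year, sex)]
        acc[0] += percent
        acc[1] += 1
    # Sorting the keys mirrors the "sort by year and sex" requirement.
    return {key: (s / n, n) for key, (s, n) in sorted(totals.items())}


result = aggregate(sample_rows)
for (year, sex), (avg_percent, n) in result.items():
    print(year, sex, round(avg_percent, 6), n)  # prints: 1880 boy 0.070703 3
```

The result matches the expected output row in the question, so the DataFrame aggregation should produce the same average and count for that group.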

Alternatively: