Join two data frames, select all columns from one and some columns from the other

Question

Say I have a Spark data frame df1 with several columns (among them the column 'id') and a data frame df2 with two columns, 'id' and 'other'.

Is there a way to replicate the following command

sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")

using only PySpark functions such as join(), select() and the like?

I have to implement this join in a function, and I don't want to be forced to take sqlContext as a function parameter.

Thanks!

Pablo Estevez · Accepted Answer

Not sure if it's the most efficient way, but this worked for me:

from pyspark.sql.functions import col
df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')).select(
    [col('a.'+xx) for xx in a.columns] + [col('b.other1'), col('b.other2')]
)

The trick is in:

[col('a.'+xx) for xx in a.columns] : all columns in a
[col('b.other1'),col('b.other2')] : some columns of b
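For reference, here is a minimal sketch of the same idea wrapped in a function, so that no sqlContext has to be passed in. It is not the answerer's exact code: the function name `join_all_left_some_right` and its parameters `key` and `right_cols` are made up for illustration, the left-hand column list is read from `df1.columns` directly, and the qualified names are passed to `select` as plain strings instead of `col(...)` objects. It assumes df1 and df2 are not derived from the same DataFrame and that column names contain no dots or other characters that would need backtick quoting.

```python
def join_all_left_some_right(df1, df2, key="id", right_cols=("other",)):
    """Inner-join df1 and df2 on `key`; keep all df1 columns plus `right_cols` of df2.

    Hypothetical helper for illustration; df1 and df2 are PySpark DataFrames.
    """
    a = df1.alias("a")
    b = df2.alias("b")
    # Qualified names such as "a.id" refer to the aliased sides of the join.
    left = [f"a.{name}" for name in df1.columns]   # every column of df1
    right = [f"b.{name}" for name in right_cols]   # only the requested df2 columns
    return a.join(b, a[key] == b[key]).select(*left, *right)
```

With the data frames from the question, `join_all_left_some_right(df1, df2)` should select the same columns as `SELECT df1.*, df2.other`.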

maxcnunes · Answer

Asterisk (*) works with alias. Example:

from pyspark.sql.functions import *
df1 = df.alias('df1')
df2 = df.alias('df2')
df1.join(df2, df1.id == df2.id).select('df1.*')

Akhilesh Bhardwaj · Answer

Without using alias:

df1.join(df2, df1.id == df2.id).select(df1["*"],df2["other"])
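The same expression can be generalized to several right-hand columns and put in a function, which is what the question asks for. The sketch below is only an illustration: the function name `join_without_alias` and the parameters `key` and `right_cols` are hypothetical, and it assumes df1 and df2 are distinct PySpark DataFrames that both contain the `key` column.

```python
def join_without_alias(df1, df2, key="id", right_cols=("other",)):
    """Inner-join on `key`, returning every df1 column plus `right_cols` from df2.

    Hypothetical helper for illustration; no SQL context is needed.
    """
    joined = df1.join(df2, df1[key] == df2[key])
    # df1["*"] expands to all df1 columns; df2[name] picks a single df2 column.
    return joined.select(df1["*"], *[df2[name] for name in right_cols])
```

Because the columns are referenced through the original DataFrame objects, no alias strings are involved.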

Katya Handler · Answer

Here is a solution that does not require a SQL context, but maintains the metadata of a DataFrame.

a = sc.parallelize([['a', 'foo'], ['b', 'hem'], ['c', 'haw']]).toDF(['a_id', 'extra'])
b = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']]).toDF(["other", "b_id"])
c = a.join(b, a.a_id == b.b_id)

Then, c.show() yields:

+----+-----+-----+----+
|a_id|extra|other|b_id|
+----+-----+-----+----+
|   a|  foo|   p1|   a|
|   b|  hem|   p2|   b|
|   c|  haw|   p3|   c|
+----+-----+-----+----+

Selvaraj S. · Answer

Drop the duplicate b_id:

c = a.join(b, a.a_id == b.b_id).drop(b.b_id)
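When both DataFrames use the same key name, as with `id` in the question, another option (a different technique from the `drop` call above) is to pass the column name to `join`, which keeps a single copy of the key. A minimal sketch, where the function name `join_on_shared_key` and its parameters are hypothetical:

```python
def join_on_shared_key(df1, df2, key="id", right_cols=("other",)):
    """Inner-join on a key column that has the same name in both DataFrames.

    Hypothetical helper for illustration; df1 and df2 are PySpark DataFrames.
    """
    # Trim df2 to the key plus the wanted columns before joining.
    right = df2.select(key, *right_cols)
    # Passing the name (not an expression) yields one `key` column in the result.
    return df1.join(right, on=key, how="inner")
```

Note that the key column comes first in the result, so the column order can differ from `df1.*` when `id` is not the first column of df1.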

Erica · Answer

You could simply make the join and then select the wanted columns: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe%20join#pyspark.sql.DataFrame.join

filip stepniak · Answer

I got a "not found" error using the suggested code:

from pyspark.sql.functions import col
df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')).select(
    [col('a.'+xx) for xx in a.columns] + [col('b.other1'), col('b.other2')]
)

I changed a.columns to df1.columns and it worked.