How to change column values in Spark SQL

Question

In SQL, you can easily update column values with UPDATE. For example, suppose I have a table (Student) like this:

    student_id, grade, new_student_id
    123         B      234
    555         A      null

    UPDATE Student
    SET student_id = new_student_id
    WHERE new_student_id IS NOT NULL

How can I do the same thing in Spark using Spark SQL (PySpark)?

Alex · Answer

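You can use withColumn to overwrite the existing new_student_id column: each row keeps its original new_student_id value when that value is not null, and otherwise takes the value from the student_id column.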

    from pyspark.sql.functions import col, when

    # Create sample data (assumes an active SparkContext `sc`, as in the PySpark shell)
    students = sc.parallelize([(123, 'B', 234), (555, 'A', None)]).toDF(
        ['student_id', 'grade', 'new_student_id'])

    # Use withColumn to fall back to student_id when new_student_id is not populated
    cleaned = students.withColumn(
        "new_student_id",
        when(col("new_student_id").isNull(), col("student_id"))
        .otherwise(col("new_student_id")))

    cleaned.show()

Using the sample data as input:

    +----------+-----+--------------+
    |student_id|grade|new_student_id|
    +----------+-----+--------------+
    |       123|    B|           234|
    |       555|    A|          null|
    +----------+-----+--------------+

The output data looks like this:

    +----------+-----+--------------+
    |student_id|grade|new_student_id|
    +----------+-----+--------------+
    |       123|    B|           234|
    |       555|    A|           555|
    +----------+-----+--------------+