Providing a schema while reading a CSV file as a DataFrame

Question

I am trying to read a CSV file into a DataFrame. I know the file, so I already know what the DataFrame's schema should be. I am using the spark-csv package to read the file, and I am trying to specify the schema as shown below.

val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("schema","project: string ,article: string ,requests: integer ,bytes_served: long")
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

However, when I check the schema of the resulting DataFrame, it appears to have used its own schema. Am I doing something wrong? How can I make Spark pick up the schema I specified?

> pagecount.printSchema
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)

Arunakiran Nulu · Answer

Try the code below. You don't need to specify the schema: if you set inferSchema to true, Spark should derive it from the CSV file.

val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

If you want to specify the schema manually, pass it through the reader's .schema(...) method rather than as an option:

import org.apache.spark.sql.types._

val customSchema = StructType(Array(
  StructField("project", StringType, true),
  StructField("article", StringType, true),
  StructField("requests", IntegerType, true),
  StructField("bytes_served", DoubleType, true))
)

val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("header", "true")
  .schema(customSchema)
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

Alberto Castelo Becerra · Answer

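In my analysis I am using the solution provided by Arunakiran Nulu (see the code). It assigns the correct types to the columns, but every returned value is null. Previously I tried .option("inferSchema", "true"), which returned the correct values in the DataFrame (although with different types).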

val customSchema = StructType(Array(
  StructField("numicu", StringType, true),
  StructField("fecha_solicitud", TimestampType, true),
  StructField("codtecnica", StringType, true),
  StructField("tecnica", StringType, true),
  StructField("finexploracion", TimestampType, true),
  StructField("ultimavalidacioninforme", TimestampType, true),
  StructField("validador", StringType, true)))

val df_explo = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
  .schema(customSchema)
  .load(filename)

Result:

root
 |-- numicu: string (nullable = true)
 |-- fecha_solicitud: timestamp (nullable = true)
 |-- codtecnica: string (nullable = true)
 |-- tecnica: string (nullable = true)
 |-- finexploracion: timestamp (nullable = true)
 |-- ultimavalidacioninforme: timestamp (nullable = true)
 |-- validador: string (nullable = true)

And the table looks like this:

|numicu|fecha_solicitud|codtecnica|tecnica|finexploracion|ultimavalidacioninforme|validador|
+------+---------------+----------+-------+--------------+-----------------------+---------+
|  null|           null|      null|   null|          null|                   null|     null|
|  null|           null|      null|   null|          null|                   null|     null|
|  null|           null|      null|   null|          null|                   null|     null|
|  null|           null|      null|   null|          null|                   null|     null|

X.X · Answer

Thanks to the answer by @Nulu, this works in PySpark with minimal adjustments:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

customSchema = StructType([
    StructField("project", StringType(), True),
    StructField("article", StringType(), True),
    StructField("requests", IntegerType(), True),
    StructField("bytes_served", DoubleType(), True)])

# The reader lives on the SQLContext (or SparkSession), not on the SparkContext
pagecount = (sqlContext.read.format("com.databricks.spark.csv")
    .option("delimiter", " ")
    .option("quote", "")
    .option("header", "false")
    .schema(customSchema)
    .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000"))

user3008410 · Answer

For anyone interested in doing this in Python, here is a working version.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

customSchema = StructType([
    StructField("IDGC", StringType(), True),
    StructField("SEARCHNAME", StringType(), True),
    StructField("PRICE", DoubleType(), True)
])

productDF = spark.read.load('/home/ForTesting/testProduct.csv', format="csv", header="true", sep='|', schema=customSchema)

testProduct.csv

ID|SEARCHNAME|PRICE
6607|EFKTON75LIN|890.88
6612|EFKTON100HEN|55.66

Hope this helps.

Charlie 木匠 · Answer

Here is a complete demo of how to use a custom schema.

Shell code:

echo "
Slingo, iOS
Slingo, Android
" > game.csv

Scala code:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{asc, desc}

val customSchema = StructType(Array(
  StructField("game_id", StringType, true),
  StructField("os_id", StringType, true)
))

val csv_df = spark.read.format("csv").schema(customSchema).load("game.csv")
csv_df.show

csv_df.orderBy(asc("game_id"), desc("os_id")).show

csv_df.createOrReplaceTempView("game_view")
val sort_df = spark.sql("select * from game_view order by game_id, os_id desc")
sort_df.show

Nilesh Shinde · Answer

This is one option for passing column names to the DataFrame while loading a CSV. Note that it uses pandas rather than Spark.

import pandas

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv("C:/Users/NS00606317/Downloads/Iris.csv", names=names, header=0)
print(dataset.head(10))

Output:

    sepal-length  sepal-width  petal-length  petal-width        class
1            5.1          3.5           1.4          0.2  Iris-setosa
2            4.9          3.0           1.4          0.2  Iris-setosa
3            4.7          3.2           1.3          0.2  Iris-setosa
4            4.6          3.1           1.5          0.2  Iris-setosa
5            5.0          3.6           1.4          0.2  Iris-setosa
6            5.4          3.9           1.7          0.4  Iris-setosa
7            4.6          3.4           1.4          0.3  Iris-setosa
8            5.0          3.4           1.5          0.2  Iris-setosa
9            4.4          2.9           1.4          0.2  Iris-setosa
10           4.9          3.1           1.5          0.1  Iris-setosa

dalwinder singh · Answer

// Import libraries
import java.io.StringReader
import au.com.bytecode.opencsv.CSVReader

// File name
val train_csv = "/Path/train.csv"

// Read as a text file
val train_rdd = sc.textFile(train_csv)

// Use a StringReader so that CSVReader can split each line into fields
val full_train_data = train_rdd.map { line =>
  val csvReader = new CSVReader(new StringReader(line))
  csvReader.readNext()
}

// Type alias to keep the case class short
type s = String

// Case class that serves as the schema
case class trainSchema(Loan_ID: s, Gender: s, Married: s, Dependents: s, Education: s,
  Self_Employed: s, ApplicantIncome: s, CoapplicantIncome: s, LoanAmount: s,
  Loan_Amount_Term: s, Credit_History: s, Property_Area: s, Loan_Status: s)

// Build a DataFrame with the custom schema, skipping the header line in the first partition
// (toDF needs the SQLContext/SparkSession implicits, which spark-shell imports automatically)
val full_train_data_with_schema = full_train_data.mapPartitionsWithIndex { (idx, itr) =>
  val rows = if (idx == 0) itr.drop(1) else itr
  rows.map(x => trainSchema(x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8), x(9), x(10), x(11), x(12)))
}.toDF

YOLO · Answer

In PySpark 2.4 and later, you can simply use the header parameter to pick up the correct header.

data = spark.read.csv('data.csv', header=True)

Similarly, if you are using Scala, you can use the header option as well.

Suresh Chaganti · Answer

Schema definition as a simple string

With date and timestamp columns

Create the data file from a terminal or shell

echo "
2019-07-02 22:11:11.000999, 01/01/2019, Suresh, abc
2019-01-02 22:11:11.000001, 01/01/2020, Aadi, xyz
" > data.csv

Define the schema as a string

user_schema = 'timesta TIMESTAMP, date DATE, first_name STRING, last_name STRING'

Read the data

df = spark.read.csv(path='data.csv', schema=user_schema, sep=',', dateFormat='MM/dd/yyyy', timestampFormat='yyyy-MM-dd HH:mm:ss.SSSSSS')

df.show(10, False)

+-----------------------+----------+----------+---------+
|timesta                |date      |first_name|last_name|
+-----------------------+----------+----------+---------+
|2019-07-02 22:11:11.999|2019-01-01| Suresh   | abc     |
|2019-01-02 22:11:11.001|2020-01-01| Aadi     | xyz     |
+-----------------------+----------+----------+---------+
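To make the two format patterns concrete, here is a small plain-Python sketch (not Spark code) that parses the first data row with `datetime.strptime`. The `strptime` patterns are my own translation of the Spark patterns and should be treated as an assumption.

```python
from datetime import datetime

# strptime equivalents (an assumption) of the Spark patterns used in the read call
DATE_FORMAT = "%m/%d/%Y"                    # dateFormat='MM/dd/yyyy'
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"   # timestampFormat='yyyy-MM-dd HH:mm:ss.SSSSSS'

# First data row of data.csv, including the surrounding spaces
line = " 2019-07-02 22:11:11.000999, 01/01/2019, Suresh, abc "
fields = [field.strip() for field in line.split(",")]

ts = datetime.strptime(fields[0], TIMESTAMP_FORMAT)
d = datetime.strptime(fields[1], DATE_FORMAT).date()

print(ts.isoformat(sep=" "))  # 2019-07-02 22:11:11.000999
print(d.isoformat())          # 2019-01-01
```

Here the six fractional digits are read as microseconds, whereas the Spark output in this answer shows `.999`. How the fraction is interpreted appears to depend on the date parser of the Spark version in use, so it is worth checking the result on your own cluster.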

Note that explicitly defining the schema, instead of letting Spark infer it, also improves Spark's read performance.
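The reason is that inference needs an extra pass over the data before any typed row can be produced. The sketch below illustrates the idea in plain Python; it is a simplified illustration, not Spark's actual inference logic. The helper names (`infer_type`, `infer_schema`, `read_with_schema`) are hypothetical, and the sample reuses the testProduct.csv rows from the earlier answer.

```python
import csv
import io

def infer_type(values):
    """Pick the narrowest of int, double, string that fits every value (simplified)."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "string"

def infer_schema(text, sep="|"):
    """Extra pass: every row is read just to decide the column types."""
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    header, body = rows[0], rows[1:]
    columns = list(zip(*body))
    return [(name, infer_type(column)) for name, column in zip(header, columns)]

def read_with_schema(text, casters, sep="|"):
    """Single pass: types are known up front, so rows are converted as they stream by."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    next(reader)  # skip the header line
    for row in reader:
        yield tuple(cast(value) for cast, value in zip(casters, row))

sample = "ID|SEARCHNAME|PRICE\n6607|EFKTON75LIN|890.88\n6612|EFKTON100HEN|55.66\n"

# With inference, the whole input is scanned before the types are known
print(infer_schema(sample))

# With a declared schema (string, string, double), conversion starts immediately
declared = [str, str, float]
print(list(read_with_schema(sample, declared)))
```

Inference also guessed an integer type for the ID column, while the earlier answer declared it as a string. An explicit schema avoids that kind of surprise as well as the extra scan.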