pyspark.pandas.
read_spark_io
Load a DataFrame from a Spark data source.
Path to the data source.
Specifies the output data source format. Some common ones are:
‘delta’
‘parquet’
‘orc’
‘json’
‘csv’
Input schema. If none, Spark tries to infer the schema automatically. The schema can either be a Spark StructType, or a DDL-formatted string like col0 INT, col1 DOUBLE.
Index column of table in Spark.
All other options passed directly into Spark’s data source.
See also
DataFrame.to_spark_io
DataFrame.read_table
DataFrame.read_delta
DataFrame.read_parquet
Examples
>>> ps.range(1).to_spark_io('%s/read_spark_io/data.parquet' % path) >>> ps.read_spark_io( ... '%s/read_spark_io/data.parquet' % path, format='parquet', schema='id long') id 0 0
>>> ps.range(10, 15, num_partitions=1).to_spark_io('%s/read_spark_io/data.json' % path, ... format='json', lineSep='__') >>> ps.read_spark_io( ... '%s/read_spark_io/data.json' % path, format='json', schema='id long', lineSep='__') id 0 10 1 11 2 12 3 13 4 14
You can preserve the index in the roundtrip as below.
>>> ps.range(10, 15, num_partitions=1).to_spark_io('%s/read_spark_io/data.orc' % path, ... format='orc', index_col="index") >>> ps.read_spark_io( ... path=r'%s/read_spark_io/data.orc' % path, format="orc", index_col="index") ... id index 0 10 1 11 2 12 3 13 4 14