Writing a Spark `DataFrame` into a Greenplum Database table loads each `Row` in the `DataFrame` into the table. You can use the Spark Scala API or the `spark-shell` interactive shell to write Spark data to a Greenplum Database table that you created with the `CREATE TABLE` SQL command.
The Connector provides a Spark data source optimized for writing Spark data into Greenplum Database. To write to a Greenplum Database table, you must identify the Connector data source name and provide write options for the export.
A Spark data source provides an access point to structured data. Spark provides several pre-defined data sources to support specific file types and databases. You specify a Spark data source using either its fully qualified name or its short name.
The Connector exposes a Spark data source named `greenplum` to transfer data between Spark and Greenplum Database. The Connector supports specifying the data source only with this short name.

Use the `.format(datasource: String)` Scala method to identify the data source. You must provide the Connector data source short name `greenplum` to the `.format()` method. For example:

```scala
dfToWrite.write.format("greenplum")
```
The `greenplum` data source supports the write options identified in the table below. An option is required unless otherwise specified.
Option Key | Value Description
---|---
`url` | The JDBC connection string URL; see Constructing the Greenplum Database JDBC URL.
`dbschema` | The name of the Greenplum Database schema in which `dbtable` resides. Optional; the default schema is the schema named `public`.
`dbtable` | The name of the Greenplum Database table. The Connector creates this table in the schema named `dbschema` if it does not exist.
`driver` | The fully qualified class path of the custom JDBC driver. Optional; specify only when using a custom JDBC driver.
`user` | The Greenplum Database user/role name.
`password` | The Greenplum Database password for the user. Optional; you can omit the password if Greenplum Database is configured to not require a password for the specified user, or if you use Kerberos authentication and provide the required authentication properties in the JDBC connection string URL.
`truncate` | The table overwrite mode. Governs the table creation actions of the Connector when you specify `SaveMode.Overwrite` and the target Greenplum Database table exists. Optional; the default value is `false`, in which case the Connector drops and then re-creates the target table before it writes any data. When `true`, the Connector truncates the target table before writing data.
`distributedBy` | The distribution column(s) of the Greenplum table. Governs the table creation action of the Connector when the target Greenplum Database table does not exist, or when you specify `SaveMode.Overwrite` on a write operation and `truncate` is `false`. Optional; the Connector (re)creates the table with random distribution by default. When you provide one or more `distributedBy` columns, the Connector (re)creates the table with a `DISTRIBUTED BY` clause that specifies these column names.
When you write a Spark `DataFrame` to a Greenplum Database table, you identify the Connector data source, provide the write options, and invoke the `DataFrameWriter.save()` method on the Spark `DataFrame` that you want to write. For example:
```scala
// toDF() on a local Seq requires the SparkSession implicits
// (the spark value is pre-defined in spark-shell)
import spark.implicits._

val gscWriteOptionMap = Map(
      "url" -> "jdbc:postgresql://gpdb-master:5432/testdb",
      "user" -> "bill",
      "password" -> "changeme",
      "dbschema" -> "myschema",
      "dbtable" -> "table2"
)

// generate some data to write
val data = Seq(1, 2, 3)
val dfToWrite = data.toDF()

dfToWrite.write.format("greenplum")
  .options(gscWriteOptionMap)
  .save()
```
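The options map can also include any of the optional keys from the table above. The following sketch (not part of the original example) adds the `distributedBy` key so that a table the Connector creates is hash-distributed on a `DataFrame` column; the table name `table2_dist` is illustrative, and `value` is the default name that `toDF()` assigns to the single column of `dfToWrite`:

```scala
// Sketch only: reuse the placeholder connection settings from gscWriteOptionMap
// and add the optional distributedBy key. "table2_dist" is an illustrative
// table name; "value" is the default column name toDF() assigned above.
val gscWriteOptionMapWithDist = gscWriteOptionMap +
  ("dbtable" -> "table2_dist") +
  ("distributedBy" -> "value")

dfToWrite.write.format("greenplum")
  .options(gscWriteOptionMapWithDist)
  .save()
```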
You can specify the behavior of the Connector when the Greenplum Database table already exists, or when the table already contains data. You provide this information with the `DataFrameWriter.mode(SaveMode savemode)` method.

The Connector supports the following Spark `SaveMode` settings:
SaveMode | Behavior
---|---
`ErrorIfExists` | The Connector returns an error if the Greenplum Database table already exists. This is the default value.
`Append` | The Connector appends the Spark data to data that may already exist in the Greenplum Database table.
`Ignore` | If the Greenplum Database table already exists, the Connector ignores the write request; it neither writes data to the table nor does it disturb the existing data.
`Overwrite` | If the Greenplum Database table already exists, the `truncate` option value governs whether the Connector drops and re-creates the target table (the default, `truncate` is `false`), or truncates the target table (`truncate` is `true`) before writing the data. Note: The Connector cannot truncate a Greenplum Database table if it includes any foreign key constraints.
You must enter `import org.apache.spark.sql.SaveMode` in your Spark Scala application or the `spark-shell` before you can use a `SaveMode` setting. For example, to specify the `Append` mode on a write in a Scala application:
```scala
import org.apache.spark.sql.SaveMode

dfToWrite.write.format("greenplum")
  .options(gscWriteOptionMap)
  .mode(SaveMode.Append)
  .save()
```
To specify the `Append` mode on a write in a PySpark application, provide the mode text string:

```
.mode("Append")
```
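The `truncate` option described earlier takes effect only with `SaveMode.Overwrite`. As a sketch (assuming the `gscWriteOptionMap` and `dfToWrite` values defined above), you could truncate, rather than drop and re-create, an existing target table like this:

```scala
import org.apache.spark.sql.SaveMode

// Sketch: add the optional truncate key so that SaveMode.Overwrite truncates
// the existing target table instead of dropping and re-creating it.
dfToWrite.write.format("greenplum")
  .options(gscWriteOptionMap + ("truncate" -> "true"))
  .mode(SaveMode.Overwrite)
  .save()
```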
When you call `save()` on a Spark `DataFrame` to write to a Greenplum Database table, the Connector either saves the entire `DataFrame` to the table, or halts the write operation.

Warning: If the Connector encounters an error during `.save()`, the target Greenplum table may be left in an inconsistent state.
When you write Spark data to a Greenplum Database table, the Connector creates the table for you if it does not already exist, or when you specify `SaveMode.Overwrite` and the `truncate` write option is set to `false`. Alternatively, you can choose to pre-create the target table.
When the Connector creates the Greenplum Database table for you, the Connector:

- Creates the Greenplum Database table with random distribution unless you specify one or more distribution columns with the `distributedBy` write option. You can specify a single distribution column name, or a comma-separated list of column names.
- Creates the Greenplum table using the column names defined in the Spark `DataFrame` (for example, `"flt_month"`). Note that the Spark column name reflects any transformations you may have performed on the Spark data. For example, if you used the `avg()` method to average the data in a column named `depdelayminutes`, the Spark column name is literally `avg(depdelayminutes)`. You can use the `.as()` method to provide a different name for the column, for example `.avg("depdelayminutes").as("depdelaymins")` (see the sketch following this list).
- Specifies the column names in double-quotes. This renders the column names case-sensitive to Greenplum Database.
- Creates the Greenplum table column with the `NOT NULL` clause when the Spark `DataFrame` column is not `nullable`.
- Creates the table columns in the order that they are defined in the Spark `DataFrame`.
- Maps the data type of each Spark column to the equivalent Greenplum data type.
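The following sketch illustrates the column naming behavior described above; the `flightsDf` `DataFrame` and its `origin` and `depdelayminutes` columns are hypothetical:

```scala
import org.apache.spark.sql.functions.avg

// Hypothetical DataFrame: flightsDf with columns "origin" and "depdelayminutes".
// Without .as(), the averaged column would be named "avg(depdelayminutes)".
val avgDf = flightsDf
  .groupBy("origin")
  .agg(avg("depdelayminutes").as("depdelaymins"))

// If the Connector creates the target table, its columns are "origin" and
// "depdelaymins", in DataFrame order, double-quoted and therefore case-sensitive.
avgDf.write.format("greenplum")
  .options(gscWriteOptionMap)
  .save()
```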
If you choose to pre-create the target Greenplum Database table, take the following into consideration:

- Create the table with column names that match the column names of the Spark `DataFrame` that you write. Keep in mind that the Connector treats column names as case-sensitive (you can inspect the `DataFrame` schema as shown in the sketch following this list).
- Create each Greenplum Database table column with a data type that is equivalent to that of the corresponding Spark `DataFrame` column. Refer to Spark to Greenplum Data Type Mapping for data type mapping information.
- When a Spark `DataFrame` column is not nullable, include the `NOT NULL` clause when you create the associated Greenplum Database table column.
- You can pre-create the Greenplum table with fewer columns than the Spark `DataFrame`, or with a different column ordering. The Greenplum table must not include columns that are not present in the Spark `DataFrame`.
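To see the column names, order, data types, and nullability that a pre-created table must be compatible with, you can inspect the `DataFrame` schema; a minimal sketch using the `dfToWrite` example from above:

```scala
// Print the schema; each field shows its name, Spark data type, and whether
// it is nullable (non-nullable fields call for NOT NULL table columns).
dfToWrite.printSchema()
// For the Seq(1, 2, 3).toDF() example, the output looks roughly like:
// root
//  |-- value: integer (nullable = false)
```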