Writing a Spark `DataFrame` into a Greenplum Database table loads each `Row` in the `DataFrame` into the table. You can use the Spark Scala API or the `spark-shell` interactive shell to write Spark data to a Greenplum Database table that you created with the `CREATE TABLE` SQL command.
The Connector provides a Spark data source optimized for writing Spark data into Greenplum Database. To write to a Greenplum Database table, you must identify the Connector data source name and provide write options for the export.
A Spark data source provides an access point to structured data. Spark provides several pre-defined data sources to support specific file types and databases. You specify a Spark data source using either its fully qualified name or its short name.
The Connector exposes a Spark data source named `greenplum` to transfer data between Spark and Greenplum Database. The Connector supports specifying the data source only with this short name.

Use the `.format(datasource: String)` Scala method to identify the data source. You must provide the Connector data source short name `greenplum` to the `.format()` method. For example:

```scala
dfToWrite.write.format("greenplum")
```
The `greenplum` data source supports the write options identified in the table below. An option is required unless otherwise specified.
Option Key | Value Description
---|---
`url` | The JDBC connection string URL; see Constructing the Greenplum Database JDBC URL.
`dbschema` | The name of the Greenplum Database schema in which `dbtable` resides. Optional; the default schema is the schema named `public`.
`dbtable` | The name of the Greenplum Database table. The Connector creates this table in the schema named `dbschema` if it does not exist.
`driver` | The fully qualified class path of the custom JDBC driver. Optional; specify only when using a custom JDBC driver.
`user` | The Greenplum Database user/role name.
`password` | The Greenplum Database password for the user. Optional; you can omit the password if Greenplum Database is configured to not require a password for the specified user, or if you use Kerberos authentication and provide the required authentication properties in the JDBC connection string URL.
`truncate` | The table overwrite mode. Governs the table creation actions of the Connector when you specify `SaveMode.Overwrite` and the target Greenplum Database table exists. Optional; the default value is `false`, in which case the Connector drops and then re-creates the target table before it writes any data. When `true`, the Connector truncates the target table before writing data.
`distributedBy` | The distribution column(s) of the Greenplum table. Governs the table creation action of the Connector when the target Greenplum Database table does not exist, or when you specify `SaveMode.Overwrite` on a write operation and `truncate` is `false`. Optional; the Connector (re)creates the table with random distribution by default. When you provide one or more `distributedBy` columns, the Connector (re)creates the table with a `DISTRIBUTED BY` clause that specifies these column names.
When you write a Spark `DataFrame` to a Greenplum Database table, you identify the Connector data source, provide the write options, and invoke the `DataFrameWriter.save()` method on the Spark `DataFrame` that you want to write. For example:
```scala
// toDF() on a local Seq requires the SparkSession implicits
// (the spark value is pre-defined in spark-shell)
import spark.implicits._

val gscWriteOptionMap = Map(
      "url" -> "jdbc:postgresql://gpdb-master:5432/testdb",
      "user" -> "bill",
      "password" -> "changeme",
      "dbschema" -> "myschema",
      "dbtable" -> "table2"
)

// generate some data to write
val data = Seq(1, 2, 3)
val dfToWrite = data.toDF()

dfToWrite.write.format("greenplum")
  .options(gscWriteOptionMap)
  .save()
```
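The options map can also include any of the optional keys from the table above. The following sketch (not part of the original example) adds the `distributedBy` key so that a table the Connector creates is hash-distributed on a `DataFrame` column; the table name `table2_dist` is illustrative, and `value` is the default name that `toDF()` assigns to the single column of `dfToWrite`:

```scala
// Sketch only: reuse the placeholder connection settings from gscWriteOptionMap
// and add the optional distributedBy key. "table2_dist" is an illustrative
// table name; "value" is the default column name toDF() assigned above.
val gscWriteOptionMapWithDist = gscWriteOptionMap +
  ("dbtable" -> "table2_dist") +
  ("distributedBy" -> "value")

dfToWrite.write.format("greenplum")
  .options(gscWriteOptionMapWithDist)
  .save()
```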
You can specify the behavior of the Connector when the Greenplum Database table already exists, or when the table already contains data. You provide this information with the `DataFrameWriter.mode(SaveMode savemode)` method.

The Connector supports the following Spark `SaveMode` settings:
SaveMode | Behavior
---|---
`ErrorIfExists` | The Connector returns an error if the Greenplum Database table already exists. This is the default value.
`Append` | The Connector appends the Spark data to data that may already exist in the Greenplum Database table.
`Ignore` | If the Greenplum Database table already exists, the Connector ignores the write request; it neither writes data to the table nor does it disturb the existing data.
`Overwrite` | If the Greenplum Database table already exists, the `truncate` option value governs whether the Connector drops and re-creates the target table (the default, `truncate` is `false`), or truncates the target table (`truncate` is `true`) before writing the data. Note: The Connector cannot truncate a Greenplum Database table if it includes any foreign key constraints.
You must enter `import org.apache.spark.sql.SaveMode` in your Spark Scala application or the `spark-shell` before you can use a `SaveMode` setting. For example, to specify the `Append` mode on a write in a Scala application:
```scala
import org.apache.spark.sql.SaveMode

dfToWrite.write.format("greenplum")
  .options(gscWriteOptionMap)
  .mode(SaveMode.Append)
  .save()
```
To specify the `Append` mode on a write in a PySpark application, provide the mode text string:

```
.mode("Append")
```
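The `truncate` option described earlier takes effect only with `SaveMode.Overwrite`. As a sketch (assuming the `gscWriteOptionMap` and `dfToWrite` values defined above), you could truncate, rather than drop and re-create, an existing target table like this:

```scala
import org.apache.spark.sql.SaveMode

// Sketch: add the optional truncate key so that SaveMode.Overwrite truncates
// the existing target table instead of dropping and re-creating it.
dfToWrite.write.format("greenplum")
  .options(gscWriteOptionMap + ("truncate" -> "true"))
  .mode(SaveMode.Overwrite)
  .save()
```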
When you call `save()` on a Spark `DataFrame` to write to a Greenplum Database table, the Connector either saves the entire `DataFrame` to the table, or halts the write operation.

Warning: If the Connector encounters an error during `.save()`, the target Greenplum table may be left in an inconsistent state.
When you write Spark data to a Greenplum Database table, the Connector creates the table for you if it does not already exist, or when you specify `SaveMode.Overwrite` and the `truncate` write option is set to `false`. Alternatively, you can choose to pre-create the target table.
When the Connector creates the Greenplum Database table for you, the Connector:

- Creates the Greenplum Database table with random distribution unless you specify one or more distribution columns with the `distributedBy` write option. You can specify a single distribution column name, or a comma-separated list of column names.
- Creates the Greenplum table using the column names defined in the Spark `DataFrame` (for example, `"flt_month"`). Note that the Spark column name reflects any transformations you may have performed on the Spark data. For example, if you used the `avg()` method to average the data in a column named `depdelayminutes`, the Spark column name is literally `avg(depdelayminutes)`. You can use the `.as()` method to provide a different name for the column, for example `.avg("depdelayminutes").as("depdelaymins")` (see the sketch following this list).
- Specifies the column names in double-quotes. This renders the column names case-sensitive to Greenplum Database.
- Creates the Greenplum table column with the `NOT NULL` clause when the Spark `DataFrame` column is not `nullable`.
- Creates the table columns in the order that they are defined in the Spark `DataFrame`.
- Maps the data type of each Spark column to the equivalent Greenplum data type.
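The following sketch illustrates the column naming behavior described above; the `flightsDf` `DataFrame` and its `origin` and `depdelayminutes` columns are hypothetical:

```scala
import org.apache.spark.sql.functions.avg

// Hypothetical DataFrame: flightsDf with columns "origin" and "depdelayminutes".
// Without .as(), the averaged column would be named "avg(depdelayminutes)".
val avgDf = flightsDf
  .groupBy("origin")
  .agg(avg("depdelayminutes").as("depdelaymins"))

// If the Connector creates the target table, its columns are "origin" and
// "depdelaymins", in DataFrame order, double-quoted and therefore case-sensitive.
avgDf.write.format("greenplum")
  .options(gscWriteOptionMap)
  .save()
```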
If you choose to pre-create the target Greenplum Database table, take the following into consideration:

- Create the table with column names that match the column names of the Spark `DataFrame` that you write. Keep in mind that the Connector treats column names as case-sensitive (you can inspect the `DataFrame` schema as shown in the sketch following this list).
- Create each Greenplum Database table column with a data type that is equivalent to that of the corresponding Spark `DataFrame` column. Refer to Spark to Greenplum Data Type Mapping for data type mapping information.
- When a Spark `DataFrame` column is not nullable, include the `NOT NULL` clause when you create the associated Greenplum Database table column.
- You can pre-create the Greenplum table with fewer columns than the Spark `DataFrame`, or with a different column ordering. The Greenplum table must not include columns that are not present in the Spark `DataFrame`.
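To see the column names, order, data types, and nullability that a pre-created table must be compatible with, you can inspect the `DataFrame` schema; a minimal sketch using the `dfToWrite` example from above:

```scala
// Print the schema; each field shows its name, Spark data type, and whether
// it is nullable (non-nullable fields call for NOT NULL table columns).
dfToWrite.printSchema()
// For the Seq(1, 2, 3).toDF() example, the output looks roughly like:
// root
//  |-- value: integer (nullable = false)
```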