Reading a Greenplum Database table into Spark loads all of the table rows into a Spark DataFrame. You can use the Spark Scala API or the spark-shell interactive shell to read a Greenplum Database table or view that you created with the CREATE TABLE or CREATE VIEW SQL command.

The Connector provides a Spark data source optimized for reading Greenplum Database data into Spark. To read a Greenplum Database table into Spark, you must identify the Connector data source name and provide read options for the import.

Connector Data Source

A Spark data source provides an access point to structured data. Spark provides several pre-defined data sources to support specific file types and databases. You specify a Spark data source using either its fully qualified name or its short name.

The Connector exposes a Spark data source named greenplum to transfer data between Spark and Greenplum Database. The Connector supports specifying the data source only with this short name.

Use the .format(datasource: String) Scala method to identify the data source. You must provide the Connector data source short name greenplum to the .format() method. For example:

spark.read.format("greenplum")

Connector Read Options

The greenplum data source supports the read options described below. An option is required unless otherwise specified.

  • url: The JDBC connection string URL; see Constructing the Greenplum Database JDBC URL.
  • dbschema: The name of the Greenplum Database schema in which dbtable resides. Optional; the default is the schema named public.
  • dbtable: The name of the Greenplum Database table. This table must reside in the Greenplum Database schema identified in the dbschema option value.
  • driver: The fully qualified class path of the custom JDBC driver. Optional; specify only when using a custom JDBC driver.
  • user: The Greenplum Database user/role name.
  • password: The Greenplum Database password for the user. Optional; you can omit the password if Greenplum Database is configured to not require a password for the specified user, or if you use Kerberos authentication and provide the required authentication properties in the JDBC connection string URL.
  • partitionColumn: The name of the Greenplum Database table column to use for Spark partitioning. The column must be of a Greenplum Database data type identified in the Numeric Types section of the PostgreSQL documentation. Optional; the default partition column is the internal Greenplum Database table column named gp_segment_id.
  • partitions: The number of Spark partitions. Optional; valid to specify only when partitionColumn is not gp_segment_id. The default value is the number of primary segments in the Greenplum Database cluster.
  • gpdb.matchDistributionPolicy: The external table distribution policy matching mode. This option governs how the Connector sets the distribution policy on the external tables that it creates to write to Spark. The default value is false; with the default, the Connector does not match the external table distribution policy with that of the source Greenplum table. Refer to About the External Table Distribution Policy and Data Motion for more information about this setting and the implications of turning it on.

For more information about Connector options, including specifying the options to a data source, refer to About Connector Options.

Reading Greenplum Data

When you read a Greenplum Database table into Spark, you identify the Connector data source, provide the read options, and invoke the DataFrameReader.load() method. For example:

val gscReadOptionMap = Map(
      "url" -> "jdbc:postgresql://gpdb-master:5432/testdb",
      "user" -> "bill",
      "password" -> "changeme",
      "dbschema" -> "myschema",
      "dbtable" -> "table1",
      "partitionColumn" -> "id"
)

val gpdf = spark.read.format("greenplum")
      .options(gscReadOptionMap)
      .load()
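
Alternatively, you can supply each read option individually with the DataFrameReader .option() method instead of passing an options map. The following sketch is equivalent to the example above and uses the same placeholder connection values:

// Supply each read option individually instead of passing an options Map.
val gpdf = spark.read.format("greenplum")
      .option("url", "jdbc:postgresql://gpdb-master:5432/testdb")
      .option("user", "bill")
      .option("password", "changeme")
      .option("dbschema", "myschema")
      .option("dbtable", "table1")
      .option("partitionColumn", "id")
      .load()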

The Connector maps the data type of each Greenplum Database table column to a Spark data type. This mapping is described in Greenplum to Spark Data Type Mapping.
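
To inspect the Spark data types that the Connector assigned, you can print the schema of the returned DataFrame. A minimal sketch, assuming the gpdf DataFrame from the example above; the column names and types shown in the comments are hypothetical:

// Print the Spark schema of the DataFrame returned by .load().
gpdf.printSchema()
// Example output; actual columns and types depend on your Greenplum table:
// root
//  |-- id: integer (nullable = true)
//  |-- name: string (nullable = true)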

The .load() method returns a DataFrame. A DataFrame is a Dataset of rows, i.e. a Dataset[Row].

Note that the .load() operation does not initiate the movement of data from Greenplum Database to Spark. Spark employs lazy evaluation for transformations; it does not compute the results until the application performs an action on the DataFrame, such as displaying or filtering the data or counting the number of rows.
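
For example, in the following sketch no data moves from Greenplum Database to Spark until the .count() action runs; the id column used in the filter is hypothetical:

// Defining a transformation does not transfer any data.
val bigIdRows = gpdf.filter("id > 100")

// The count() action triggers evaluation; the Connector reads the
// Greenplum Database data at this point.
val numRows = bigIdRows.count()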

Actions and transformations that you can perform on the returned DataFrame include:

  • Viewing the contents of the table with .show()
  • Counting the number of rows with .count()
  • Filtering the data using .filter()
  • Grouping and ordering the data using .groupBy() and .orderBy()

Refer to the Spark Dataset API docs for additional information about this class and the other actions and transformations you can perform.
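
For example, a brief sketch of these actions and transformations, assuming the gpdf DataFrame from above and two hypothetical columns named id and name:

// View the first rows of the table.
gpdf.show()

// Count the number of rows.
gpdf.count()

// Filter the data on a column value.
gpdf.filter("id > 100").show()

// Group the rows by a column and order the aggregated result.
gpdf.groupBy("name").count().orderBy("name").show()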

Note: By default, Spark recomputes a transformed DataFrame each time you run an action on it. If you have a large data set on which you want to perform multiple transformations, you may choose to keep the DataFrame in memory for performance reasons. You can use the Dataset.persist() method for this purpose. Keep in mind that there are memory implications to persisting large data sets.
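
A minimal sketch of persisting a DataFrame before running multiple actions on it; the filter expression and column name are hypothetical:

import org.apache.spark.storage.StorageLevel

// Keep the transformed DataFrame in memory, spilling to disk if necessary,
// so that subsequent actions do not recompute it.
val filteredDF = gpdf.filter("id > 100").persist(StorageLevel.MEMORY_AND_DISK)

filteredDF.count()
filteredDF.show()

// Release the cached data when you no longer need it.
filteredDF.unpersist()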

Using the .greenplum() Shortcut Method

As an alternative to specifying the data source name, the Connector provides a spark.read.greenplum() shortcut method to read data from Greenplum Database into Spark:

spark.read.greenplum(jdbcUrl: String, tableName: String, properties: Properties)

When you use spark.read.greenplum(), you provide the JDBC URL and the Greenplum Database table name as arguments to the method. You specify the remaining options as a set of java.util.Properties. For example:

import java.util.Properties

val url = "jdbc:postgresql://gpmaster.domain:15432/tutorial"
val tblname = "avgdelay"

// Specify the remaining read options as Properties.
val jprops = new Properties()
jprops.put("user", "user2")
jprops.put("password", "changeme")
jprops.put("partitionColumn", "airlineid")

val gpdf = spark.read.greenplum(url, tblname, jprops)