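Before using the Connector, ensure that you can identify the Greenplum Database master hostname and port, the name of the database to which you want to connect, and the user name and password that you will use for the connection.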
The Connector is available as a separate download for Greenplum Database 5.x or 6.x from the Broadcom Support Portal.
Download the Connector package by navigating to the Broadcom Support Portal and locating Greenplum Spark Connector under the desired Greenplum release.
Note: For more information about download prerequisites, troubleshooting, and instructions, see Download Broadcom products and software.
The format of the Connector download file name is greenplum-connector-apache-spark-scala_<scala-version>-<gsc-version>.tar.gz. For example:
greenplum-connector-apache-spark-scala_2.12-2.2.0.tar.gz
The versions of Scala and Spark that you are developing for determine the package that you download:
Spark Version | Scala Version | Connector Package File |
---|---|---|
2.3.x, 2.4.x | 2.11 | greenplum-connector-apache-spark-scala_2.11-2.2.0.tar.gz |
3.0.x, 3.1.x, 3.2.x | 2.12 | greenplum-connector-apache-spark-scala_2.12-2.2.0.tar.gz |
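The version mapping in the table can be expressed as a small lookup. The following is a minimal sketch, not part of the Connector: the ConnectorPackage object and its method names are hypothetical, and the mapping covers only the Spark releases listed above for Connector version 2.2.0.

```scala
object ConnectorPackage {
  // Connector version used in the examples on this page.
  val ConnectorVersion = "2.2.0"

  // Map a Spark version string such as "3.1.2" to the Scala version listed
  // in the table; None means the table does not cover that Spark release.
  def scalaVersionFor(sparkVersion: String): Option[String] =
    sparkVersion.split('.').take(2).mkString(".") match {
      case "2.3" | "2.4"         => Some("2.11")
      case "3.0" | "3.1" | "3.2" => Some("2.12")
      case _                     => None
    }

  // Build the download file name from the documented pattern
  // greenplum-connector-apache-spark-scala_<scala-version>-<gsc-version>.tar.gz
  def packageFileName(scalaVersion: String, gscVersion: String = ConnectorVersion): String =
    s"greenplum-connector-apache-spark-scala_${scalaVersion}-${gscVersion}.tar.gz"
}
```

For example, ConnectorPackage.scalaVersionFor("3.1.2").map(v => ConnectorPackage.packageFileName(v)) yields the Scala 2.12 package name, and an unlisted Spark release yields None rather than a guessed file name.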
Follow the instructions in Verifying the VMware Greenplum Software Download in the Greenplum Database documentation to verify the integrity of the Greenplum Spark Connector software.
The Connector download package includes the Connector JAR file and the product open source license. Extract the download package:
user@spark-node$ tar zxf greenplum-connector-apache-spark-scala_2.12-2.2.0.tar.gz
This command extracts the license text file and the JAR file named greenplum-connector-apache-spark-scala_2.12-2.2.0.jar into the current working directory.
Make note of the directory to which the Connector JAR file was extracted.
You can run Spark interactively through spark-shell, a modified version of the Scala shell. Refer to the spark-shell Spark documentation for detailed information on using this command.
To try out the Connector, run the spark-shell command providing a --jars option that identifies the file system path to the Connector JAR file. For example:
user@spark-node$ export GSC_JAR=/path/to/greenplum-connector-apache-spark-scala_2.12-2.2.0.jar
user@spark-node$ spark-shell --jars $GSC_JAR
< ... spark-shell startup output messages ... >
scala>
When you run spark-shell, you enter the scala> interactive subsystem. A SparkSession is instantiated for you and accessible via the spark local variable:
scala> println(spark)
org.apache.spark.sql.SparkSession@4113d9ab
Your SparkSession provides the entry point methods that you will use to transfer data between Spark and Greenplum Database.
If you are writing a stand-alone Spark application, you will bundle the Connector along with your other application dependencies into an "uber" JAR. The Spark Self-Contained Applications and Bundling Your Application's Dependencies documentation identifies additional considerations for stand-alone Spark application development.
You can use the spark-submit command to launch a Spark application assembled with the Connector. You can also run the spark-submit command providing a --jars option that identifies the file system path to the Connector JAR file. The spark-submit Spark documentation describes using this command.
The Connector uses a JDBC connection to communicate with the Greenplum Database master node. The PostgreSQL JDBC driver version 42.4.3 is bundled with the Connector JAR file, so you do not need to manage this dependency. You may also use a custom JDBC driver with the Connector.
You must provide a JDBC connection string URL when you use the Connector to transfer data between Greenplum Database and Spark. This URL must include the Greenplum Database master hostname and port, as well as the name of the database to which you want to connect.
Parameter Name | Description |
---|---|
<master> | Hostname or IP address of the Greenplum Database master node. |
<port> | The port on which the Greenplum Database server process is listening. Optional, default is 5432. |
<database_name> | The Greenplum database to which you want to connect. |
Note: The Connector requires that other connection options, including user name and password, be provided separately.
The JDBC connection string URL format for the default Connector JDBC driver is:
jdbc:postgresql://<master>[:<port>]/<database_name>
For example:
jdbc:postgresql://gpdb-master:5432/testdb
The syntax and semantics of the default JDBC connection string URL are governed by the PostgreSQL JDBC driver. For additional information about this syntax, refer to Connecting to the Database in the PostgreSQL JDBC documentation.
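If you assemble the URL in application code, a small helper keeps the optional port handling in one place. This is a minimal sketch under the format shown above; the GreenplumJdbcUrl object and its build method are hypothetical names, not Connector APIs.

```scala
object GreenplumJdbcUrl {
  // Compose jdbc:postgresql://<master>[:<port>]/<database_name>.
  // When port is None, the port is omitted from the URL and the
  // default (5432) applies.
  def build(master: String, database: String, port: Option[Int] = None): String = {
    require(master.nonEmpty, "master hostname must not be empty")
    require(database.nonEmpty, "database name must not be empty")
    val portPart = port.map(p => ":" + p).getOrElse("")
    "jdbc:postgresql://" + master + portPart + "/" + database
  }
}
```

Calling GreenplumJdbcUrl.build("gpdb-master", "testdb", Some(5432)) produces the example URL shown above. The user name and password are deliberately not part of the URL, because the Connector expects them to be provided separately.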
The Connector also supports using a custom JDBC driver. To use a custom Greenplum Database JDBC driver, you must:
Construct a JDBC connection string URL for your custom driver that includes the Greenplum Database master hostname and port and the name of the database to which you want to connect.
Provide the JAR file for the custom JDBC driver, for example by including a --jars <custom-jdbc-driver>.jar option on your spark-shell or spark-submit command line that identifies the full path to the custom JDBC driver JAR file.
You must also identify the fully qualified Java class name of the JDBC driver in a Connector option (described in About Connector Options).
Starting with Connector version 2.1.0 (bundled PostgreSQL JDBC driver version 42.2.14), you can set Greenplum Database session parameters in the JDBC connection string URL using the JDBC options property.
For example, to turn on the optimizer and optimizer_trace_fallback configuration parameters for the session, add the following text to the JDBC URL:
?options=-c%20optimizer=on%20-c%20optimizer_trace_fallback=on
Refer to the Connecting to the Database topic in the PostgreSQL JDBC Driver documentation for more information about constructing JDBC connection string URLs.
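The options text follows a regular pattern: one -c name=value pair per session parameter, with each space written as %20. The following sketch generates that text from a list of settings. The SessionOptions object is a hypothetical helper, and it assumes that parameter names and values contain only characters that are safe to place in a URL without further encoding.

```scala
object SessionOptions {
  // Render settings as the JDBC "options" property: "-c name=value" pairs
  // separated by spaces, with every space written as %20.
  // Assumption: names and values need no additional URL encoding.
  def queryString(settings: Seq[(String, String)]): String =
    if (settings.isEmpty) ""
    else {
      val raw = settings.map { case (name, value) => "-c " + name + "=" + value }.mkString(" ")
      "?options=" + raw.replace(" ", "%20")
    }
}
```

For the two parameters in the example above, SessionOptions.queryString(Seq("optimizer" -> "on", "optimizer_trace_fallback" -> "on")) returns the same text, which you can append to a URL such as jdbc:postgresql://gpdb-master:5432/testdb.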
The Connector uses HikariCP to pool JDBC connections for each Spark application. The Connector creates a new Hikari connection pool for each unique combination of JDBC connection string URL, username, and password.
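The pooling rule described above can be pictured as a registry keyed by the three connection values. The sketch below is only an illustration of that rule, not the Connector's implementation: PoolRegistry, PoolKey, and Pool are hypothetical names, and Pool is an empty stand-in for a real connection pool.

```scala
import scala.collection.mutable

object PoolRegistry {
  // The three values that together identify a pool.
  final case class PoolKey(url: String, user: String, password: String)

  // Hypothetical stand-in for a real connection pool.
  final class Pool(val key: PoolKey)

  private val pools = mutable.Map.empty[PoolKey, Pool]

  // Return the existing pool for this combination, or create one on first use.
  def poolFor(url: String, user: String, password: String): Pool = synchronized {
    val key = PoolKey(url, user, password)
    pools.getOrElseUpdate(key, new Pool(key))
  }

  // Number of distinct pools created so far.
  def poolCount: Int = synchronized(pools.size)
}
```

Two requests with the same URL, user name, and password share one pool, while changing any of the three values results in a second pool. This is why an application that connects as several users can hold more open connections than a single pool's maximum size suggests.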
You can use Connector options to configure connection pool options that bound the number of open JDBC connections between a Spark application and the Greenplum Database server. Setting connection pool options in your Spark application is described in Specifying Connection Pool Options.
Take into consideration both the performance required for your Spark application and the desired resource impact on the Greenplum Database cluster when you change connection pool configuration:
Decreasing the maximum size of the connection pool bounds the number of open connections to the Greenplum Database server. Setting this option too low may decrease the parallelism of your Spark application.
The default minimum number of idle connections in the pool is zero (0). If you increase this value, be aware that the Connector maintains at least that number of open connections, and that some or all of the connections may be idle.
If you decrease the idle timeout for connections in the pool, the Connector closes idle connections sooner. Very short timeout values may defeat the purpose of connection pooling.
The Greenplum Database max_connections server configuration parameter identifies the maximum number of concurrent open connections that are permitted to the database server. Each running Spark application that uses the Connector owns a number of open connections on the Greenplum Database server. Greenplum Database Connection Errors provides troubleshooting information should you encounter Greenplum Database connection errors in your Spark application.
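As a rough planning aid, you can compare a worst-case connection count against max_connections. The sketch below uses a deliberately simple, assumed model: every pool grows to its configured maximum size, and connections opened by other database clients are ignored. ConnectionBudget and AppPools are hypothetical names, not Connector options.

```scala
object ConnectionBudget {
  // Hypothetical description of one running Spark application: how many
  // distinct pools it creates and the maximum size configured for each pool.
  final case class AppPools(name: String, poolCount: Int, maxPoolSize: Int)

  // Worst-case number of connections the applications could hold open at once,
  // assuming every pool reaches its maximum size.
  def worstCase(apps: Seq[AppPools]): Int =
    apps.map(a => a.poolCount * a.maxPoolSize).sum

  // True when the worst case fits within the server's max_connections value.
  def fits(apps: Seq[AppPools], maxConnections: Int): Boolean =
    worstCase(apps) <= maxConnections
}
```

If the check fails, lowering the maximum pool size or the number of distinct connection combinations in your applications reduces the worst case, at the possible cost of parallelism.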
The Connector pushes evaluation of the following Spark filters down to Greenplum Database:
SQL-like Filter Syntax | Spark Class Name |
---|---|
column_name = 'some_value' | EqualTo, EqualNullSafe |
column_name < 'some_value' | LessThan |
column_name > 'some_value' | GreaterThan |
column_name <= 'some_value' | LessThanOrEqual |
column_name >= 'some_value' | GreaterThanOrEqual |
column_name IS NULL | IsNull |
column_name IS NOT NULL | IsNotNull |
column_name LIKE 'some_value%' | StringStartsWith |
column_name LIKE '%some_value' | StringEndsWith |
column_name LIKE '%some_value%' | StringContains |
column_name IN (...) | In |
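To make the correspondence in the table concrete, the following sketch renders filter objects as SQL predicates. It is an illustration only and does not show how the Connector performs pushdown: the case classes are simplified stand-ins named after the Spark filter classes, EqualNullSafe is omitted, and FilterSketch, toSql, and whereClause are hypothetical names.

```scala
object FilterSketch {
  // Simplified stand-ins named after the Spark filter classes in the table.
  sealed trait Filter
  final case class EqualTo(column: String, value: Any) extends Filter
  final case class LessThan(column: String, value: Any) extends Filter
  final case class GreaterThan(column: String, value: Any) extends Filter
  final case class LessThanOrEqual(column: String, value: Any) extends Filter
  final case class GreaterThanOrEqual(column: String, value: Any) extends Filter
  final case class IsNull(column: String) extends Filter
  final case class IsNotNull(column: String) extends Filter
  final case class StringStartsWith(column: String, prefix: String) extends Filter
  final case class StringEndsWith(column: String, suffix: String) extends Filter
  final case class StringContains(column: String, fragment: String) extends Filter
  final case class In(column: String, values: Seq[Any]) extends Filter

  // Quote strings as SQL literals (doubling embedded single quotes);
  // render any other value with toString.
  private def literal(v: Any): String = v match {
    case s: String => "'" + s.replace("'", "''") + "'"
    case other     => other.toString
  }

  // Render one filter as a SQL predicate. LIKE wildcards that appear inside
  // the supplied value are not escaped in this sketch.
  def toSql(f: Filter): String = f match {
    case EqualTo(c, v)            => c + " = " + literal(v)
    case LessThan(c, v)           => c + " < " + literal(v)
    case GreaterThan(c, v)        => c + " > " + literal(v)
    case LessThanOrEqual(c, v)    => c + " <= " + literal(v)
    case GreaterThanOrEqual(c, v) => c + " >= " + literal(v)
    case IsNull(c)                => c + " IS NULL"
    case IsNotNull(c)             => c + " IS NOT NULL"
    case StringStartsWith(c, p)   => c + " LIKE " + literal(p + "%")
    case StringEndsWith(c, x)     => c + " LIKE " + literal("%" + x)
    case StringContains(c, x)     => c + " LIKE " + literal("%" + x + "%")
    case In(c, vs)                => c + " IN (" + vs.map(literal).mkString(", ") + ")"
  }

  // Combine several filters into one condition joined with AND.
  def whereClause(filters: Seq[Filter]): String =
    filters.map(toSql).mkString(" AND ")
}
```

In a Spark application you do not construct these objects yourself. You write DataFrame filter expressions, and Spark hands the supported ones to the data source, which is what allows the evaluation to happen in Greenplum Database rather than in Spark.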