The VMware Greenplum Connector for Apache Spark provides high-speed, parallel data transfer between Greenplum Database and an Apache Spark cluster, using Spark's Scala API for programmatic access (including the spark-shell REPL).

Refer to the VMware Greenplum documentation for detailed information about Greenplum Database.

See the Apache Spark documentation for information about Apache Spark version 3.2.4.

Supported Platforms

The following table identifies the supported component versions for Connector 2.x releases:

| Connector Version | Greenplum Version | Spark Version | Scala Version | PostgreSQL JDBC Driver Version |
| --- | --- | --- | --- | --- |
| 2.3.1, 2.3.0 | 5.x, 6.x, 7.x | 2.3.x, 2.4.x (Scala 2.11); 3.0.x, 3.1.x, 3.2.x (Scala 2.12) | 2.11, 2.12 | 42.7.2 (2.3.1); 42.4.3 (2.3.0) |
| 2.2.0 | 5.x, 6.x | 2.3.x, 2.4.x (Scala 2.11); 3.0.x, 3.1.x, 3.2.x (Scala 2.12) | 2.11, 2.12 | 42.4.3 |
| 2.1.4 | 5.x, 6.x | 2.3.x, 2.4.x (Scala 2.11); 2.4.x, 3.0.x (Scala 2.12) | 2.11, 2.12 | 42.4.3 |
| 2.1.3 | 5.x, 6.x | 2.3.x, 2.4.x (Scala 2.11); 2.4.x, 3.0.x (Scala 2.12) | 2.11, 2.12 | 42.4.1 |
| 2.1.2, 2.1.1 | 5.x, 6.x | 2.3.x, 2.4.x (Scala 2.11); 2.4.x, 3.0.x (Scala 2.12) | 2.11, 2.12 | 42.3.3 |
| 2.1.0, 2.0 | 5.x, 6.x | 2.3.x, 2.4.x (Scala 2.11); 2.4.x, 3.0.x (Scala 2.12) | 2.11, 2.12 | 42.2.14 |

The Connector is certified against the Greenplum, Spark, and Scala versions listed above. The Connector is bundled with, and certified against, the listed PostgreSQL JDBC driver version.

Release 2.3.1

Released: March 1, 2024

2.3.1 includes a change and a bug fix.

Changed Features

2.3.1 includes this change:

  • The Connector is now bundled with the PostgreSQL JDBC driver version 42.7.2.

Resolved Issues

The following issue was resolved in version 2.3.1:

Bug ID Summary
CVE‑2024‑1597 Updates the postgresql JDBC JAR file to version 42.7.2.

Release 2.3.0

Released: January 26, 2024

2.3.0 includes new and changed features and bug fixes.

New and Changed Features

2.3.0 includes these new and changed features:

  • The Connector now supports VMware Greenplum 7.
  • The Connector now supports setting VMware Greenplum configuration parameters as DataFrame options: supply an option whose name is the configuration parameter name prefixed with gpdb.guc., and whose value is the desired parameter value. Specifying VMware Greenplum Configuration Parameters provides more information about this feature.
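For example, a minimal read in spark-shell that sets the Greenplum statement_mem server configuration parameter for the operation might look like the following sketch; the connection values, schema, and table name are placeholders, not values from this documentation:

```scala
// Hypothetical connection settings -- substitute your own Greenplum
// host, database, schema, table, and credentials.
val gscReadOptionMap = Map(
  "url"      -> "jdbc:postgresql://gpmaster.example.com:5432/testdb",
  "user"     -> "gpuser",
  "password" -> "changeme",
  "dbschema" -> "faa",
  "dbtable"  -> "otp_c",
  // Greenplum configuration parameter name prefixed with gpdb.guc.
  "gpdb.guc.statement_mem" -> "250MB"
)

// Run in spark-shell with the Connector JAR on the classpath.
val gpdf = spark.read.format("greenplum")
  .options(gscReadOptionMap)
  .load()
```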

Resolved Issues

The following issues were resolved in version 2.3.0:

Bug ID Summary
33100 Resolves an issue that occurred when a Spark DataFrame query used aggregates or projections that did not include the source table distribution columns while gpdb.matchDistributionPolicy was set to true.

Release 2.2.0

Released: September 18, 2023

2.2.0 includes new and changed features and bug fixes.

New and Changed Features

2.2.0 includes these new and changed features:

  • The Connector is certified against the Scala, Spark, and JDBC driver versions identified in Supported Platforms above.

  • A new gpdb.matchDistributionPolicy DataFrame option is provided to direct the Connector to match the distribution policy of the source Greenplum table when it creates external tables to transfer data from Greenplum to Spark. About the External Table Distribution Policy and Data Motion provides more information about this setting and the implications of turning it on.

    Caution

    Do not set gpdb.matchDistributionPolicy to true if your Spark DataFrame query uses aggregates or projections that do not include the source table distribution columns.

  • Version 2.2.0 adds Beta support for Spark clusters deployed on Kubernetes. Refer to Configuring the Connector When Spark is Deployed in Kubernetes (Beta) for detailed information about configuring the Connector for Spark k8s deployments. You must use the Scala 2.12 Connector download when Spark is deployed in Kubernetes.
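As a sketch, a read that opts in to the new gpdb.matchDistributionPolicy behavior might look like this in spark-shell; the connection values are placeholders, and the caution above about distribution columns applies:

```scala
// Hypothetical connection settings -- substitute your own values.
val gscReadOptionMap = Map(
  "url"      -> "jdbc:postgresql://gpmaster.example.com:5432/testdb",
  "user"     -> "gpuser",
  "password" -> "changeme",
  "dbschema" -> "faa",
  "dbtable"  -> "otp_c",
  // Match the source table's distribution policy. Leave this unset
  // (default false) if the DataFrame query aggregates or projects
  // away the source table's distribution columns.
  "gpdb.matchDistributionPolicy" -> "true"
)

// Run in spark-shell with the Connector JAR on the classpath.
val gpdf = spark.read.format("greenplum")
  .options(gscReadOptionMap)
  .load()
```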

Resolved Issues

The following issues were resolved in version 2.2.0:

Bug ID Summary
32699 Resolves an issue where spiking CPU conditions were observed during data transfer to Spark due to excessive data motion by Greenplum Database before the transfer. The Connector now exposes a gpdb.matchDistributionPolicy read option that you can use to direct it to match the external table distribution policy with that of the source Greenplum table, which should minimize the pre-transfer data motion.

Release 2.1.4

Released: December 14, 2022

2.1.4 includes a change and a bug fix.

Changed Features

2.1.4 includes this change:

  • The Connector is now bundled with the PostgreSQL JDBC driver version 42.4.3.

Resolved Issues

The following issue was resolved in version 2.1.4:

Bug ID Summary
CVE‑2022‑41946 Updates the postgresql JDBC JAR file to version 42.4.3.

Release 2.1.3

Released: October 18, 2022

2.1.3 includes a change and bug fixes.

Changed Features

2.1.3 includes this change:

  • The Connector is now bundled with the PostgreSQL JDBC driver version 42.4.1.

Resolved Issues

The following issues were resolved in version 2.1.3:

Bug ID Summary
CVE‑2022‑31197 Updates the postgresql JDBC JAR file to version 42.4.1.
32449 Resolves an issue where the Connector, when writing to Greenplum Database, did not close JDBC connections. The Connector now closes JDBC connections when it writes to Greenplum Database.

Release 2.1.2

Released: July 11, 2022

2.1.2 includes changes and bug fixes.

Changed Features

2.1.2 includes these changes:

  • To resolve issue 32232, the Connector now allows you to configure HikariCP connection pool options as described in Specifying Connection Pool Options.
  • To resolve issue 30887, the Connector now supports numeric-type partition columns when reading from Greenplum Database.
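A sketch that combines both changes follows; the connection values are hypothetical, and the pool.* option names shown are assumptions to verify against Specifying Connection Pool Options:

```scala
// Hypothetical read options. The partitionColumn names a numeric-type
// column (supported as of 2.1.2); the pool.* entries illustrate
// HikariCP connection pool tuning and should be checked against the
// Specifying Connection Pool Options topic.
val gscReadOptionMap = Map(
  "url"             -> "jdbc:postgresql://gpmaster.example.com:5432/testdb",
  "user"            -> "gpuser",
  "password"        -> "changeme",
  "dbschema"        -> "faa",
  "dbtable"         -> "otp_c",
  // Numeric-type partition column for parallel reads
  "partitionColumn" -> "flight_id",
  "partitions"      -> "8",
  // Illustrative HikariCP pool settings (assumed option names)
  "pool.maxSize"    -> "40",
  "pool.timeoutMs"  -> "10000"
)
```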

Resolved Issues

The following issues were resolved in version 2.1.2:

Bug ID Summary
32232 Resolves an issue where no connections were available due to request time outs. The Connector now exposes the configuration of HikariCP connection pool properties.
32232 Resolves an issue where the Connector initiated two different transactions to write to Greenplum, and in some cases these transactions were executed on different connections. The Connector now specifies autocommit=false on the connection, and uses a single transaction and a single connection to write from Spark to Greenplum.
30887 Resolves an issue where the Connector did not support a sufficient set of data types for partition columns. The Connector now supports numeric-type partition columns.

Release 2.1.1

Released: May 4, 2022

2.1.1 includes a change and bug fixes.

Changed Features

2.1.1 includes this change:

  • The Connector is now bundled with the PostgreSQL JDBC driver version 42.3.3.

Resolved Issues

The following issues were resolved in version 2.1.1:

Bug ID Summary
CVE‑2022‑21724 Updates the postgresql JDBC JAR file to version 42.3.3.
32201 Resolves an issue where the Connector, when reading from Greenplum Database, dropped a data row when the first column started with the # character. The Connector now configures the dependent univocity CSV parser to deactivate comment line processing.
32186 Resolves an issue where the Connector returned a NullPointerException when a distinct() operation was applied to a DataFrame before writing from Spark to Greenplum Database.

Release 2.1.0

Released: November 24, 2020

2.1.0 includes new and changed features and bug fixes.

New and Changed Features

2.1.0 includes this new and changed feature:

The Connector now uses external temporary tables when it transfers data between Greenplum and Spark. Benefits include the following:

  • Greenplum Database external temporary tables are created and reside in their own schema; the Greenplum user reading the data is no longer required to have CREATE privileges on the schema in which the accessed Greenplum table resides.
  • Greenplum Database removes external temporary tables when the session is over; manual clean-up of orphaned external tables is no longer required. (Cleaning Up Orphaned Greenplum External Tables in previous versions of the documentation describes this now-unnecessary procedure.)
  • The Connector reuses external temporary tables; it creates fewer tables and has less of an impact on Greenplum Database catalog bloat.

Resolved Issues

The following issues were resolved in version 2.1.0:

Bug ID Summary
31083 Resolves an issue where the Connector failed to read data from Greenplum Database when the partitionColumn was gp_segment_id and mirroring was enabled in the Greenplum cluster.
31075 The developer had no way to specify the schema in which the Connector created its external tables; the Connector always created external tables in the same schema as the Greenplum table. An undesirable side effect of this behavior was that the Greenplum user reading a table was required to have CREATE privilege on the schema in which the table resided. This issue is resolved; the Connector now uses external temporary tables when it accesses Greenplum tables, and these temporary tables reside in a special, separate Greenplum Database schema.

Release 2.0

Released: September 30, 2020

2.0.0 includes new and changed features and bug fixes.

New and Changed Features

2.0.0 includes these new and changed features:

  • The Connector is certified against the Scala, Spark, and JDBC driver versions identified in Supported Platforms above.

  • The Connector is now bundled with the PostgreSQL JDBC driver version 42.2.14.

  • The Connector package that you download from Broadcom Support Portal is now a .tar.gz file that includes the product open source license and the Connector JAR file. The naming format of the file is greenplum-connector-apache-spark-scala_<scala-version>-<gsc-version>.tar.gz.

    For example:

    • greenplum-connector-apache-spark-scala_2.11-2.0.0.tar.gz
    • greenplum-connector-apache-spark-scala_2.12-2.0.0.tar.gz
  • The default gpfdist server connection activity timeout changes from 30 seconds to 5 minutes.

  • A new server.timeout option is provided that a developer can use to specify the gpfdist server connection activity timeout.

  • The Connector improves read performance from Greenplum Database by using the internal Greenplum table column named gp_segment_id as the default partitionColumn when the developer does not specify this option.
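For instance, a write that overrides the new 5-minute default via server.timeout might be sketched as follows in spark-shell, where dataFrame is an existing Spark DataFrame; the connection values are placeholders, and the millisecond unit for server.timeout is an assumption to verify against the Connector option reference:

```scala
// Hypothetical write options -- substitute your own values.
val gscWriteOptionMap = Map(
  "url"            -> "jdbc:postgresql://gpmaster.example.com:5432/testdb",
  "user"           -> "gpuser",
  "password"       -> "changeme",
  "dbschema"       -> "faa",
  "dbtable"        -> "otp_c_copy",
  // Override the default 5-minute gpfdist connection activity timeout
  // (value assumed to be in milliseconds)
  "server.timeout" -> "600000"
)

// Run in spark-shell with the Connector JAR on the classpath.
dataFrame.write.format("greenplum")
  .options(gscWriteOptionMap)
  .mode("append")
  .save()
```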

Resolved Issues

The following issues were resolved in version 2.0.0:

Bug ID Summary
30731 Resolved an issue where the Connector timed out with a serialization exception when writing aggregated results to Greenplum Database. The Connector now exposes the server.timeout option to specify the gpfdist "no activity" timeout, and sets the default timeout to 5 minutes.
174495848 Resolved an issue where predicate pushdown was not working correctly because the Connector did not use parentheses to join the predicates together when it constructed the filter string.

Removed Features

Version 2.x of the Connector removes:

  • Support for Greenplum Database 4.x.
  • The connector.port option (deprecated in 1.6).
  • The partitionsPerSegment option (deprecated in 1.5).

Beta Features

Version 2.x includes this Beta feature:

  • The Connector adds support for Spark clusters deployed on Kubernetes (introduced in 2.2.0).

Known Issues and Limitations

Known issues and limitations related to the 2.x releases of the Connector include the following:

  • When you set gpdb.matchDistributionPolicy to true and the Spark DataFrame query uses aggregates or projections that do not include the source table distribution columns, the Connector may fail to create a temporary external table and return the error column "<column-name>" named in DISTRIBUTED BY clause does not exist. (Introduced in 2.2.0.)

    Workaround: Do not set the gpdb.matchDistributionPolicy option (default value is false) on the DataFrame.

  • (Resolved in 2.1.0) The Connector cannot use gp_segment_id as the partitionColumn (the default) when reading data from Greenplum Database and mirroring is enabled in the Greenplum cluster.

  • (Beta Kubernetes support introduced in 2.2.0) The Connector does not support reading from or writing to Greenplum Database when your Spark cluster is deployed on Kubernetes.

  • The Connector supports basic data types such as Float, Integer, String, and Date/Time. It does not yet support more complex types. See Greenplum Database ↔ Spark Data Type Mapping for additional information.
