The VMware Tanzu Greenplum Connector for Apache Spark provides high-speed, parallel data transfer between Greenplum Database and an Apache Spark cluster using Spark's Scala API for programmatic access (including the spark-shell REPL).
Refer to the VMware Tanzu Greenplum documentation for detailed information about Greenplum Database.
See the Apache Spark documentation for information about Apache Spark version 2.4.
The following table identifies the supported component versions for the VMware Tanzu Greenplum Connector for Apache Spark 2.x:
Connector Version | Greenplum Version | Spark Version | Scala Version | PostgreSQL JDBC Driver Version |
---|---|---|---|---|
2.1.4 | 5.x, 6.x | 2.3.x, 2.4.x | 2.11 | 42.4.3 |
2.1.4 | 5.x, 6.x | 2.4.x, 3.0.x | 2.12 | 42.4.3 |
2.1.3 | 5.x, 6.x | 2.3.x, 2.4.x | 2.11 | 42.4.1 |
2.1.3 | 5.x, 6.x | 2.4.x, 3.0.x | 2.12 | 42.4.1 |
2.1.2, 2.1.1 | 5.x, 6.x | 2.3.x, 2.4.x | 2.11 | 42.3.3 |
2.1.2, 2.1.1 | 5.x, 6.x | 2.4.x, 3.0.x | 2.12 | 42.3.3 |
2.1.0, 2.0 | 5.x, 6.x | 2.3.x, 2.4.x | 2.11 | 42.2.14 |
2.1.0, 2.0 | 5.x, 6.x | 2.4.x, 3.0.x | 2.12 | 42.2.14 |
The Connector is certified against the Greenplum, Spark, and Scala versions listed above. The Connector is bundled with, and certified against, the listed PostgreSQL JDBC driver version.
Released: December 14, 2022
VMware Tanzu Greenplum Connector for Apache Spark 2.1.4 includes a change and a bug fix.
VMware Tanzu Greenplum Connector for Apache Spark 2.1.4 includes this change:
The Connector is now bundled with the PostgreSQL JDBC driver version 42.4.3.
The following issue was resolved in version 2.1.4:
Bug ID | Summary |
---|---|
CVE‑2022‑41946 | Updates the postgresql JDBC JAR file to version 42.4.3. |
Released: October 18, 2022
VMware Tanzu Greenplum Connector for Apache Spark 2.1.3 includes a change and bug fixes.
VMware Tanzu Greenplum Connector for Apache Spark 2.1.3 includes this change:
The Connector is now bundled with the PostgreSQL JDBC driver version 42.4.1.
The following issues were resolved in version 2.1.3:
Bug ID | Summary |
---|---|
CVE‑2022‑31197 | Updates the postgresql JDBC JAR file to version 42.4.1. |
32449 | Resolves an issue where the Connector, when writing to Greenplum Database, did not close JDBC connections. The connector now closes JDBC connections when it writes to Greenplum Database. |
Released: July 11, 2022
VMware Tanzu Greenplum Connector for Apache Spark 2.1.2 includes changes and bug fixes.
VMware Tanzu Greenplum Connector for Apache Spark 2.1.2 includes these changes:
32232: The Connector now allows you to configure HikariCP connection pool options as described in Specifying Connection Pool Options.
30887: The Connector now supports numeric-type partition columns when reading from Greenplum Database.
The following issues were resolved in version 2.1.2:
Bug ID | Summary |
---|---|
32232 | Resolves an issue where no connections were available due to request timeouts. The Connector now exposes the configuration of HikariCP connection pool properties. |
32232 | Resolves an issue where the Connector initiated two different transactions to write to Greenplum, and in some cases these transactions were executed on different connections. The Connector now specifies autocommit=false on the connection, and uses a single transaction and a single connection to write from Spark to Greenplum. |
30887 | Resolves an issue where the Connector did not support a sufficient set of data types for partition columns. The Connector now supports numeric-type partition columns. |
Released: May 4, 2022
VMware Tanzu Greenplum Connector for Apache Spark 2.1.1 includes a change and bug fixes.
VMware Tanzu Greenplum Connector for Apache Spark 2.1.1 includes this change:
The Connector is now bundled with the PostgreSQL JDBC driver version 42.3.3.
The following issues were resolved in version 2.1.1:
Bug ID | Summary |
---|---|
CVE‑2022‑21724 | Updates the postgresql JDBC JAR file to version 42.3.3. |
32201 | Resolves an issue where the Connector, when reading from Greenplum Database, dropped a data row when the first column started with the # character. The Connector now configures the dependent Univocity CSV parser to deactivate comment line processing. |
32186 | Resolves an issue where the Connector returned a NullPointerException when a distinct() operation was applied to a DataFrame before writing from Spark to Greenplum Database. |
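The row-dropping behavior described for 32201 can be illustrated with a simplified stand-in for a CSV reader. This is only a sketch: CommentLineDemo and readRows are hypothetical names, and the code does not use the Univocity parser API. It shows why a data row whose first column starts with # disappears when comment line processing is active, and survives when it is deactivated.

```scala
object CommentLineDemo {
  // Simplified line reader (not the Univocity API): when a comment character
  // is configured, any line starting with it is skipped before splitting.
  def readRows(lines: Seq[String], commentChar: Option[Char]): Seq[Seq[String]] =
    lines
      .filterNot(line => commentChar.exists(c => line.startsWith(c.toString)))
      .map(_.split(",", -1).toSeq)

  def main(args: Array[String]): Unit = {
    val lines = Seq("#42,widget", "7,gadget")
    // Comment processing active: the first data row is silently dropped.
    println(readRows(lines, Some('#')).length)
    // Comment processing deactivated: both rows are returned.
    println(readRows(lines, None).length)
  }
}
```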
Released: November 24, 2020
VMware Tanzu Greenplum Connector for Apache Spark 2.1.0 includes new and changed features and bug fixes.
VMware Tanzu Greenplum Connector for Apache Spark 2.1.0 includes this new and changed feature:
The Connector now uses external temporary tables when it loads data between Greenplum and Spark. Benefits include the following:
A Greenplum user who reads a table no longer requires CREATE privileges on the schema in which the accessed Greenplum table resides.
The following issues were resolved in VMware Tanzu Greenplum Connector for Apache Spark version 2.1.0:
Bug ID | Summary |
---|---|
31083 | Resolves an issue where the Connector failed to read data from Greenplum Database when the partitionColumn was gp_segment_id and mirroring was enabled in the Greenplum cluster. |
31075 | The developer had no way to specify the schema in which the Connector created its external tables; the Connector always created external tables in the same schema as the Greenplum table. An undesirable side effect of this behaviour was that the Greenplum user reading a table was required to have CREATE privilege on the schema in which the table resided. This issue is resolved; the Connector now uses external temporary tables when it accesses Greenplum tables, and these temporary tables reside in a special, separate Greenplum Database schema. |
Released: September 30, 2020
VMware Tanzu Greenplum Connector for Apache Spark 2.0.0 includes new and changed features and bug fixes.
VMware Tanzu Greenplum Connector for Apache Spark 2.0.0 includes these new and changed features:
The Connector is certified against the Scala, Spark, and JDBC driver versions identified in Supported Platforms above.
The Connector is now bundled with the PostgreSQL JDBC driver version 42.2.14.
The Connector package that you download from Tanzu Network is now a .tar.gz file that includes the product open source license and the Connector JAR file. The naming format of the file is greenplum-connector-apache-spark-scala_<scala-version>-<gsc-version>.tar.gz.
For example:
greenplum-connector-apache-spark-scala_2.11-2.0.0.tar.gz
greenplum-connector-apache-spark-scala_2.12-2.0.0.tar.gz
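The naming format can be expressed as a small helper, for example when scripting downloads for several Scala versions. This is a minimal sketch: ConnectorPackage and packageFileName are hypothetical names and are not part of the Connector.

```scala
object ConnectorPackage {
  // Hypothetical helper: fills in the naming format
  // greenplum-connector-apache-spark-scala_<scala-version>-<gsc-version>.tar.gz
  def packageFileName(scalaVersion: String, gscVersion: String): String =
    s"greenplum-connector-apache-spark-scala_${scalaVersion}-${gscVersion}.tar.gz"

  def main(args: Array[String]): Unit = {
    // Prints the file names for the two Scala builds of Connector 2.0.0
    Seq("2.11", "2.12").foreach(v => println(packageFileName(v, "2.0.0")))
  }
}
```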
The default gpfdist server connection activity timeout changes from 30 seconds to 5 minutes.
A new server.timeout option is provided that a developer can use to specify the gpfdist server connection activity timeout.
The Connector improves read performance from Greenplum Database by using the internal Greenplum table column named gp_segment_id as the default partitionColumn when the developer does not specify this option.
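The two options named above, server.timeout and partitionColumn, might be set through an option map like the one sketched here. Only those two option names come from these release notes. The other keys (url, user, password, dbschema, dbtable) and the greenplum data source name are assumptions based on typical Connector usage, every value is a placeholder, and GreenplumReadOptions is a hypothetical name. Check the Connector options reference for the units that server.timeout accepts.

```scala
object GreenplumReadOptions {
  // Read options that rely on the 2.0.0 defaults: partitionColumn is not set,
  // so the Connector would use gp_segment_id, and server.timeout is not set,
  // so the 5 minute default would apply. All values are placeholders.
  val defaults: Map[String, String] = Map(
    "url"      -> "jdbc:postgresql://gpmaster.example.com:5432/testdb",
    "user"     -> "gpadmin",
    "password" -> "changeme",
    "dbschema" -> "public",
    "dbtable"  -> "orders"
  )

  // Same read with an explicit partition column and gpfdist timeout override.
  def withOverrides(partitionColumn: String, serverTimeout: String): Map[String, String] =
    defaults ++ Map(
      "partitionColumn" -> partitionColumn,
      "server.timeout"  -> serverTimeout
    )
}

// With Spark and the Connector JAR on the classpath, the map would typically
// be passed to the reader (assumed usage, not shown in these notes):
//   spark.read.format("greenplum").options(GreenplumReadOptions.defaults).load()
```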
The following issues were resolved in VMware Tanzu Greenplum Connector for Apache Spark version 2.0.0:
Bug ID | Summary |
---|---|
30731 | Resolved an issue where the Connector timed out with a serialization exception when writing aggregated results to Greenplum Database. The Connector now exposes the server.timeout option to specify the gpfdist "no activity" timeout, and sets the default timeout to 5 minutes. |
174495848 | Resolved an issue where predicate pushdown was not working correctly because the Connector did not use parentheses to join the predicates together when it constructed the filter string. |
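The predicate pushdown issue (174495848) comes down to operator precedence: in SQL, AND binds more tightly than OR, so predicates joined without grouping can change meaning. The sketch below illustrates the general technique only. The Connector's actual filter-building code is not shown in these notes, and FilterString, joinNaive, and joinGrouped are hypothetical names.

```scala
object FilterString {
  // Joins predicates with AND and no grouping: the failure pattern.
  def joinNaive(predicates: Seq[String]): String =
    predicates.mkString(" AND ")

  // Wraps each predicate in parentheses before joining with AND.
  def joinGrouped(predicates: Seq[String]): String =
    predicates.map(p => s"($p)").mkString(" AND ")

  def main(args: Array[String]): Unit = {
    val predicates = Seq("region = 'EU' OR region = 'US'", "amount > 100")
    // prints: region = 'EU' OR region = 'US' AND amount > 100
    // which SQL evaluates as region = 'EU' OR (region = 'US' AND amount > 100)
    println(joinNaive(predicates))
    // prints: (region = 'EU' OR region = 'US') AND (amount > 100)
    println(joinGrouped(predicates))
  }
}
```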
The VMware Tanzu Greenplum Connector for Apache Spark version 2.x removes the following options:
connector.port (deprecated in 1.6).
partitionsPerSegment (deprecated in 1.5).
Known issues and limitations related to the 2.x release of the VMware Tanzu Greenplum Connector for Apache Spark include the following:
In versions earlier than 2.1.0, the Connector fails to read data when it uses gp_segment_id as the partitionColumn (the default) when reading data from Greenplum Database and mirroring is enabled in the Greenplum cluster. This issue (31083) is resolved in version 2.1.0.