Follow these guidelines to ensure that the VMware Tanzu Greenplum Connector for Apache Spark works and performs optimally in your environment.
Before installing and using the Connector, ensure that you meet the following prerequisites:
Refer to the Hardware Provisioning Memory discussion in the Spark documentation for Spark cluster node memory configuration considerations.
The Greenplum Database master host port number (<port-num>) is configurable; the default is 5432. The Spark driver and worker nodes communicate with the Greenplum Database master through this port. Ensure that TCP port <port-num> on the Greenplum Database master host is open and accessible to the Spark driver and all Spark worker nodes.
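One way to confirm that the master port is reachable is to attempt a plain TCP connection from the Spark driver and from each Spark worker node. The following is a minimal sketch, not part of the Connector: the function name port_is_reachable and the host name in the example comment are hypothetical, and a successful connection shows only that the port is open, not that database authentication will succeed.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example call (hypothetical host name; 5432 is the default master port):
# port_is_reachable("gpmaster.example.com", 5432)
```

Run the check on the driver host and on every worker host, because firewall rules can differ between nodes.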
The Connector uses the Greenplum Database gpfdist parallel file server to transfer data between Greenplum Database segment hosts and Spark worker nodes. By default, the Connector starts the gpfdist server process using the IP address of the Spark worker node and defers port number selection to the operating system. You can instead configure the address and port number that the gpfdist server uses on the Spark worker. Refer to Configuring the Connector Server Address for information about configuring these options.
Ensure that all ports on the Spark worker nodes in the range [1024-65535], or the ports that you configure, are accessible from every Greenplum Database segment host.
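The wide default port range follows from leaving port selection to the operating system: the port is not known until the server process starts. The sketch below illustrates that general mechanism with a plain TCP listener. It is an illustration only, not the Connector's or gpfdist's actual code, and the function name listen_on_os_selected_port is hypothetical.

```python
import socket


def listen_on_os_selected_port(bind_address: str = "0.0.0.0"):
    """Open a TCP listener on a port chosen by the OS and return (socket, port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 asks the operating system to pick any free port.
    listener.bind((bind_address, 0))
    listener.listen(1)
    # getsockname() reports the port that was actually assigned.
    port = listener.getsockname()[1]
    return listener, port
```

Because the assigned port can differ on every run, a firewall between the segment hosts and the Spark workers must either allow the whole range or be paired with an explicitly configured port.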