Follow these guidelines to ensure that the Connector works and performs optimally in your environment.

Prerequisites

Before installing and using the Connector, ensure that you meet the following prerequisites:

  • You have administrative access to a running Greenplum Database cluster.

  • You have access to a running Spark cluster.

  • Network connectivity exists between the Greenplum Database master node and the Spark driver and every Spark worker node.

  • Network connectivity exists between every Spark worker node and every Greenplum Database segment host.

    • When Spark is deployed in a Kubernetes cluster, the Ingress load balancer (ServiceType=LoadBalancer) must be routable from every Greenplum Database segment host.

Memory Requirements

Refer to the Hardware Provisioning Memory discussion in the Spark documentation for Spark cluster node memory configuration considerations.

Network Port Requirements

The Greenplum Database master host port number (<port-num>) is configurable. The default master host port is 5432. The Connector utilizes the Greenplum Database master port for Spark driver and worker node communication to the Greenplum Database master. Ensure that TCP port <port-num> on the Greenplum Database master host is open and accessible to the Spark driver and all Spark worker nodes.
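One way to confirm that the master port is reachable is to attempt a plain TCP connection from the Spark driver and from each Spark worker node. The following is a minimal sketch, not part of the Connector; the function name `can_connect` and the host name in the usage comment are hypothetical, and the default port assumes the master listens on 5432.

```python
import socket


def can_connect(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only checks that the port accepts connections,
    not that a Greenplum Database master is actually listening there.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example (hypothetical host name), run on the Spark driver and each worker:
#   can_connect("gpmaster.example.com", 5432)
```

A `False` result usually points to a firewall rule, a wrong <port-num>, or a host name that does not resolve from that node.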

In each Spark executor process, the Connector runs an embedded Jetty HTTP server that uses Greenplum Database external tables and the gpfdist parallel file server to transfer data between Greenplum Database segment hosts and Spark worker nodes.

The Connector specifies the location of the Spark executor in the external table LOCATION clause. By default, the Connector starts the gpfdist server process using the IP address of the Spark worker node and defers port number selection to the operating system. This default gpfdist server addressing behavior may not meet your needs if the hosts in your Spark cluster are configured with multiple network interfaces, or when the Spark cluster is deployed in Kubernetes. Configuring the Connector provides detailed configuration information for these deployment scenarios.
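To illustrate what deferring port selection to the operating system means in general, the sketch below binds a listening socket to port 0, which asks the OS to assign a free port from its ephemeral range. This shows the underlying OS mechanism only; it is an assumption that the Connector behaves comparably, and the function name `pick_os_assigned_port` is hypothetical.

```python
import socket
from typing import Tuple


def pick_os_assigned_port(bind_ip: str = "127.0.0.1") -> Tuple[socket.socket, int]:
    """Bind a listening TCP socket to port 0 and return it with the assigned port.

    Port 0 tells the operating system to choose any free ephemeral port.
    The caller is responsible for closing the returned socket.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((bind_ip, 0))
    sock.listen()
    assigned_port = sock.getsockname()[1]
    return sock, assigned_port
```

Because the assigned port is not known in advance, every port the OS might choose has to be reachable from the Greenplum Database segment hosts, unless a fixed port is configured.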

Port requirements:

  • The ephemeral port range on Linux systems is [32768-60999]. When the Connector is configured to select a random gpfdist port (the default), ensure that all ports in this range are accessible from every Greenplum Database segment host.
  • If you directly specify the gpfdist port, ensure that it is accessible from every Greenplum Database segment host.
  • When Spark is deployed in Kubernetes, ensure that port 80 (or 443, if TLS-secured) is accessible from every Greenplum Database segment host.
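The ephemeral port range listed above is the common Linux default, but a host can be configured with a different range. As a rough sketch, the following reads the range from `/proc/sys/net/ipv4/ip_local_port_range` on a Spark worker node so that firewall rules can be matched against the actual values. The function names are hypothetical, and the sketch assumes a Linux host.

```python
from typing import Tuple

# Linux exposes the ephemeral port range as two whitespace-separated integers.
PORT_RANGE_PATH = "/proc/sys/net/ipv4/ip_local_port_range"


def parse_port_range(text: str) -> Tuple[int, int]:
    """Parse 'low high' (as found in ip_local_port_range) into a tuple."""
    low, high = (int(field) for field in text.split())
    if low > high:
        raise ValueError(f"invalid port range: {low}-{high}")
    return low, high


def read_local_port_range(path: str = PORT_RANGE_PATH) -> Tuple[int, int]:
    """Read the ephemeral port range configured on this Linux host."""
    with open(path) as f:
        return parse_port_range(f.read())


def port_in_range(port: int, port_range: Tuple[int, int]) -> bool:
    """Return True if port falls inside the inclusive range."""
    low, high = port_range
    return low <= port <= high
```

If you directly specify the gpfdist port, `port_in_range` can help check whether the chosen port overlaps the ephemeral range on a given worker.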