Enabling Connector Logging

VMware Greenplum Connector for Apache Spark logging is governed by the logging configuration defined by the Spark application that is running with the Connector JAR file.

Spark uses log4j for logging. The default Spark log file directory is $SPARK_HOME/logs. The default Spark logging configuration file is $SPARK_HOME/conf/log4j.properties. This file specifies the default logging in effect for applications running on that Spark cluster node, including spark-shell. A Spark application may run with its own log4j.properties configuration file. Settings in this logging configuration file may identify an application-specific log file location.

To enable more verbose Connector logging to the console, add the following setting to the log4j.properties file in use by the Spark application:

log4j.logger.io.pivotal.greenplum.spark=DEBUG

You can also configure a Spark application to write Connector log messages to a separate file. For example, to configure the Connector to log to a file named /tmp/log/greenplum-spark.log, add the following text to your Spark application's log4j.properties file:

log4j.logger.io.pivotal.greenplum.spark=DEBUG, gscfile
log4j.appender.gscfile=org.apache.log4j.FileAppender
log4j.appender.gscfile.file=/tmp/log/greenplum-spark.log
log4j.appender.gscfile.append=true
log4j.appender.gscfile.layout=org.apache.log4j.PatternLayout
log4j.appender.gscfile.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

Note: When enabling Connector logging, ensure that you create or update the log4j.properties file on all Spark driver and executor nodes.
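As one way to apply the console setting consistently, the following sketch defines a small shell helper that sets the Connector logger to DEBUG in a given log4j.properties file. It replaces any earlier setting for that logger, so repeated runs do not add duplicate lines. The function name enable_connector_debug is hypothetical. The sketch assumes settings are written as key=value with no spaces around the equals sign, and it overwrites a setting that also names an appender (such as gscfile). Copying the file to each driver and executor node is left to whatever distribution mechanism your cluster already uses.

```shell
# Hypothetical helper: set the Connector logger to DEBUG in a log4j.properties file.
# Usage: enable_connector_debug /path/to/log4j.properties
enable_connector_debug() {
  local props="$1"
  local key='log4j.logger.io.pivotal.greenplum.spark'
  local tmp
  tmp="$(mktemp)" || return 1

  # Copy every existing line except an earlier setting for the Connector logger.
  if [ -f "$props" ]; then
    awk -v k="$key=" 'index($0, k) != 1' "$props" > "$tmp"
  fi

  # Append the DEBUG setting, then write the result back in place.
  printf '%s=DEBUG\n' "$key" >> "$tmp"
  cat "$tmp" > "$props"
  rm -f "$tmp"
}
```

For example, enable_connector_debug "$SPARK_HOME/conf/log4j.properties" updates the default Spark logging configuration file on the node where you run it.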

Examining Spark Log Files

Spark generates driver and executor log files. Log files associated with the executor processes of a Spark application using the Connector will include information and errors related to data loading and RDD transformations.

Common Errors

Port Errors

The Connector utilizes TCP ports for Greenplum Database to Spark cluster node communications and data transfer. Port-related errors you may encounter include:

Error Message: java.lang.RuntimeException: <port-number> is not a valid port number.

  • Cause: The most likely cause of this error is that a port number that you specified in server.port is outside the valid range (operating system-specific, but typically [1024-65535]).
  • Solution: Specify server.port port value(s) that are within the supported range.

Error Message: java.lang.RuntimeException: Unable to start GpfdistService on any of ports=<list-of-port-numbers>

  • Cause: The most likely cause of this error is that the port number(s) that you specified in server.port are already in use. This situation prevents the Spark worker from using the port to receive data from Greenplum Database.
  • Solution: Try specifying a different set of port numbers in server.port.
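Before choosing server.port values, it can help to check candidate ports on a Spark worker node. The sketch below is one possible pre-flight check, not part of the Connector: the function names are hypothetical, the 1024-65535 range is the typical range mentioned above rather than a guaranteed limit, and the in-use test relies on bash's /dev/tcp redirection and only detects listeners reachable on 127.0.0.1.

```shell
# Hypothetical pre-flight checks for candidate server.port values.

# Succeeds when the argument is a whole number in the typical range 1024-65535.
port_in_range() {
  case "$1" in
    '' | *[!0-9]*) return 1 ;;   # empty, or contains a non-digit
  esac
  [ "${#1}" -le 5 ] && [ "$1" -ge 1024 ] && [ "$1" -le 65535 ]
}

# Succeeds when something on this host accepts TCP connections on the port.
# Uses bash's /dev/tcp redirection, so run it with bash.
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Print a one-line report for each port given as an argument.
check_ports() {
  local port
  for port in "$@"; do
    if ! port_in_range "$port"; then
      echo "$port: outside 1024-65535"
    elif port_in_use "$port"; then
      echo "$port: already in use"
    else
      echo "$port: looks available"
    fi
  done
}
```

Run check_ports with the port numbers you intend to use on each worker node. A port that looks available at check time can still be taken before the Spark application starts.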

Memory Errors

If your Spark application encounters Java memory errors when using the Connector, consider increasing the partitions read option value. Increasing the number of Spark partitions may decrease the memory requirements per partition.

You may choose to specify --driver-memory and --executor-memory options with your spark-shell or spark-submit command to configure specific driver and executor memory allocations. See spark-shell --help or spark-submit --help and Submitting Applications in the Spark documentation.
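A minimal sketch of passing both memory options on the command line follows. The wrapper name submit_with_memory is hypothetical; it only forwards the two values to spark-submit ahead of whatever other arguments you supply, and it assumes spark-submit is on your PATH.

```shell
# Hypothetical wrapper: submit an application with explicit driver and executor memory.
# Usage: submit_with_memory <driver-memory> <executor-memory> <other spark-submit args...>
submit_with_memory() {
  local driver_mem="$1" executor_mem="$2"
  shift 2
  spark-submit \
    --driver-memory "$driver_mem" \
    --executor-memory "$executor_mem" \
    "$@"
}
```

For example, submit_with_memory 4g 8g my-app.jar requests 4 GB for the driver and 8 GB for each executor. The sizes and the JAR name are placeholders; suitable values depend on your data and cluster.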

For additional information regarding memory considerations in your Spark application, refer to Memory Tuning in the Spark documentation and Apache Spark JIRA SPARK-6235.

Greenplum Database Connection Errors

A Spark application may encounter "connection limit exceeded" errors when the number of open connections to the Greenplum Database server approaches its configured maximum limit (max_connections).

The Greenplum Database pg_stat_activity view provides information about current database activity. To help troubleshoot connection-related errors, run the following Greenplum Database commands and queries to determine the number and source of open connections to Greenplum Database.

  • Display the max_connections setting for the Greenplum Database server:

    postgres=# show max_connections;
     max_connections
    -----------------
     250
    (1 row)
    
  • Display the number of open connections to the Greenplum Database server:

    postgres=# SELECT count(*) FROM pg_stat_activity;
    
  • View the number of connections to a specific database or from a specific user:

    postgres=# SELECT count(*) FROM pg_stat_activity WHERE datname='tutorial';
    postgres=# SELECT count(*) FROM pg_stat_activity WHERE usename='user1';
    
  • Display idle and active query counts in the Greenplum Database cluster:

    postgres=# SELECT count(*) FROM pg_stat_activity WHERE current_query='<IDLE>';
    postgres=# SELECT count(*) FROM pg_stat_activity WHERE current_query!='<IDLE>';
    
  • View the database name, user name, client address, client port, and current query for each open connection to the Greenplum Database server:

    postgres=# SELECT datname, usename, client_addr, client_port, current_query FROM pg_stat_activity;
    

If you identify a Spark application using the Connector as the source of too many open connections, adjust the connection pool configuration options appropriately. Refer to JDBC Connection Pooling for additional information about connection pool configuration options.
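The first two checks above can be combined to see how close the server is to its limit. The following sketch is one way to do that from the shell; the function name connection_headroom is hypothetical, and it assumes psql is on your PATH and that connection settings such as host and user come from your environment.

```shell
# Hypothetical helper: report max_connections, open connections, and the difference.
# Usage: connection_headroom [database]   (database defaults to postgres)
connection_headroom() {
  local db="${1:-postgres}" max open
  # -A (unaligned) and -t (tuples only) make psql print just the value.
  max="$(psql -At -d "$db" -c 'SHOW max_connections;')" || return 1
  open="$(psql -At -d "$db" -c 'SELECT count(*) FROM pg_stat_activity;')" || return 1
  echo "max_connections=$max open=$open remaining=$((max - open))"
}
```

The count includes the connection that psql itself opens to run the query.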

Greenplum Database Data Length Errors

A Spark application running against Greenplum Database 5.x may encounter a "data line too long" error when the size of a data record that the Connector writes to Greenplum Database exceeds the gp_max_csv_line_length server configuration parameter setting. The default gp_max_csv_line_length is 1 MB. To increase this value for your Greenplum Database cluster, you must log in to the master host and reconfigure this server parameter.

For example, to increase the value to 5 MB, run the following gpconfig command and reload the Greenplum Database configuration:

gpadmin@gpmaster$ gpconfig -c gp_max_csv_line_length -v 5242880
gpadmin@gpmaster$ gpstop -u
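The value passed to gpconfig is a byte count: 5 MB is 5 x 1024 x 1024 = 5242880 bytes. If you need a different size, a small helper such as the following (the name mb_to_bytes is hypothetical) performs the same conversion:

```shell
# Hypothetical helper: convert a whole number of megabytes to bytes.
mb_to_bytes() {
  echo "$(( $1 * 1024 * 1024 ))"
}

mb_to_bytes 5   # prints 5242880
```

After reloading the configuration, you can run gpconfig -s gp_max_csv_line_length to display the value currently in effect.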

Working with Unsupported Data Types

The Connector supports the data types identified in the Greenplum Database ↔ Spark Data Type Mapping topic. Because the Connector does not implicitly cast to type string, when you access a column defined with an unsupported data type, the Connector returns an error.

You have two options to access a Greenplum Database table that includes a data type that is not supported by the Connector:

  • Use a VIEW; this option does not duplicate the table data, but it does require that you specify a partitionColumn.
  • Use an UNLOGGED TABLE; while this option duplicates the table data, you are not required to specify a partitionColumn because the Connector can implicitly use the gp_segment_id.

Reading from a View

To access a Greenplum Database table that includes a data type that the Connector does not support, you can create and read from a VIEW whose definition contains an explicit type cast for each column defined with an unsupported type.

For example, suppose the Greenplum Database table that you want to access is defined as follows:

CREATE TABLE netinfo (
    source      inet,
    source_port integer,
    packets     integer,
    prot        smallint
) DISTRIBUTED BY (source);

The Connector does not support the inet data type. If you read this table with the Connector, it returns:

java.lang.IllegalArgumentException: Unsupported type inet

Now, define a Greenplum Database view that explicitly casts the unsupported inet column to text:

CREATE VIEW netinfo_view AS
  SELECT gp_segment_id, CAST(source AS text), source_port, packets, prot
  FROM netinfo;

Notice that the view definition includes the built-in Greenplum gp_segment_id column of the source table; the Connector uses this column as the partitionColumn when none is specified. If you omit gp_segment_id from the view definition, you must explicitly specify the partitionColumn. To mitigate any performance impacts, be sure to choose a column where the data is evenly distributed amongst the Greenplum Database segments.

When you construct the Connector options, provide the view name in the dbtable property mapping:

.option("dbtable", "netinfo_view")

You can use the Connector to access the view in the same manner as you would access a table.

Reading from an Unlogged Table

Alternatively, you can create and read from an UNLOGGED TABLE whose definition contains an explicit type cast for each column defined with an unsupported type. (Refer to the CREATE TABLE AS reference page in the Greenplum Database documentation for more information about unlogged tables.)

Given the netinfo table definition above, you can define a Greenplum Database unlogged table that explicitly casts the unsupported inet column to text:

CREATE UNLOGGED TABLE netinfo_unloggedtbl AS
  SELECT CAST(source AS text), source_port, packets, prot
  FROM netinfo;

When you construct the Connector options, provide the unlogged table name in the dbtable property mapping:

.option("dbtable", "netinfo_unloggedtbl")

If you do not specify a partitionColumn, the Connector uses the gp_segment_id.

You can use the Connector to access the unlogged table in the same manner as you would access a regular table.

Manually Removing Routes from the Kubernetes Ingress

The routes (paths) that the Connector adds to the Ingress when Spark is deployed in a Kubernetes cluster include the Spark application identifier. If the application does not exit cleanly (for example, due to an uncaught runtime exception), the Connector might not have removed the paths from the Ingress, and you must manually remove them.

You can use the kubectl CLI to identify Ingress rules that are safe to manually remove as follows:

  1. Run the kubectl describe ingress command to view the Ingress configuration. For example, to describe the Ingress named gsc-gpfdist-nginx-ingress:

    $ kubectl describe ingress gsc-gpfdist-nginx-ingress
    
  2. Examine the Rules output. Search for the Paths that match the Spark application identifier. Rules that are associated with removed services show (<none>) in the Backends column. Example output:

    Rules:
     Host                   Path Backends
     ----                   ---- --------
     host.domain.io
                            / web:8080 (172.17.0.4:8080)
                            /spark-e3a3780182c14e25ab6f573a62e868b3/exec/1    greenplum-smoke-test-ac3aa37b83ea4144-exec-1-gpfdist-svc:6543 (<none>)
                            /spark-e3a3780182c14e25ab6f573a62e868b3/exec/2    greenplum-smoke-test-ac3aa37b83ea4144-exec-2-gpfdist-svc:6543 (<none>)
    
  3. Open a visual editor to edit the Ingress and delete all of the rules associated with the Spark application identifier. For example:

    $ kubectl edit ingress gsc-gpfdist-nginx-ingress
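To make step 2 less error-prone, you can filter the describe output instead of scanning it by eye. The sketch below is one way to do that: the function name stale_ingress_paths is hypothetical, and it assumes the Rules output has the layout shown in the example above, with the path in the first column and the backend endpoints in the last column. It only lists candidate paths; it does not modify the Ingress.

```shell
# Hypothetical filter: read `kubectl describe ingress` output on stdin and print
# the paths for one Spark application identifier whose backend shows (<none>).
# Usage: kubectl describe ingress <name> | stale_ingress_paths <spark-app-id>
stale_ingress_paths() {
  local app_id="$1"
  # Keep lines whose first field starts with /<app_id>/ and whose last field is (<none>).
  awk -v id="$app_id" 'index($1, "/" id "/") == 1 && $NF == "(<none>)" { print $1 }'
}
```

For example:

    $ kubectl describe ingress gsc-gpfdist-nginx-ingress | stale_ingress_paths spark-e3a3780182c14e25ab6f573a62e868b3

The paths that this prints are the ones to delete when you edit the Ingress in step 3.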
    