The Greenplum streaming server (GPSS) manages communication and data transfer between a client (for example, the VMware Tanzu Greenplum Connector for Apache NiFi) and VMware Tanzu Greenplum. You must configure and start a GPSS instance before you use the service to load data into Tanzu Greenplum.
The Tanzu Greenplum streaming server gpss and gpsscli command line utilities are automatically installed with Tanzu Greenplum version 5.16 and later.
Before you start a GPSS server instance, ensure that you complete the prerequisite tasks described in this topic, including registering the GPSS extension and constructing the server configuration file. If you are using the gpsscli client utility, ensure that you run the command on a host that has connectivity to the GPSS server instance.
The Tanzu Greenplum and the Tanzu Greenplum streaming server download packages install the GPSS extension. This extension must be registered in each database in which Greenplum users use GPSS to write data to Greenplum tables.
GPSS automatically registers its extension in a database the first time a Greenplum superuser or the database owner initiates a load job. You must manually register the extension in a database if non-privileged Greenplum users will be the first or only users of GPSS in that database.
Perform the following procedure as a Tanzu Greenplum superuser or the database owner to manually register the GPSS extension:
1. Open a new terminal window, log in to the Tanzu Greenplum coordinator host as the gpadmin administrative user, and set up the Greenplum environment. For example:

$ ssh gpadmin@gpcoord
gpadmin@gpcoord$ . /usr/local/greenplum-db/greenplum_path.sh

2. Start the psql subsystem, connecting to a database in which you want to register the GPSS extension. For example:

gpcoord$ psql -d testdb

3. Enter the following command to register the extension:

testdb=# CREATE EXTENSION gpss;

4. Perform steps 2 and 3 for each database in which the Tanzu Greenplum streaming server will write client data.
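If you want to verify the registration, the psql \dx meta-command lists the extensions installed in the database; for example:

testdb=# \dx gpss

The gpss extension appears in the output if registration succeeded.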
You configure an invocation of the Tanzu Greenplum streaming server via a JSON-formatted configuration file. This file includes properties that identify the listen address of the GPSS service and an optional debug server port number, as well as the gpfdist service host, bind address, and port number. You can also specify encryption options, configure a password shadow encode/decode key, and control whether GPSS reuses external tables.
The contents of a sample GPSS JSON configuration file named gpsscfg1.json follow:
{
    "ListenAddress": {
        "Host": "",
        "Port": 5019,
        "DebugPort": 9998
    },
    "Gpfdist": {
        "Host": "",
        "Port": 8319,
        "ReuseTables": false
    },
    "Shadow": {
        "Key": "a_very_secret_key"
    }
}
Refer to the gpss.json reference page for detailed information about the GPSS configuration file format and the configuration properties that the utility supports.
You may find a quick start guide and a sample GPSS configuration file under the $GPHOME/docs/cli_help/gpss directory.
Note: If your Kafka or Tanzu Greenplum clusters are using Kerberos authentication or SSL encryption, see Configuring the Tanzu Greenplum Streaming Server for Encryption and Authentication.
Refer to Configuring the Tanzu Greenplum Streaming Server for Client-to-Server Authentication for information about configuring client authentication for GPSS.
You use the gpss utility to start an instance of the Greenplum streaming server on the local host. When you run the command, you provide the name of the configuration file that defines the properties of the GPSS and gpfdist service instances. You can also specify the name of a directory to which gpss writes server and progress log files. For example, to start a GPSS instance specifying a log directory named gpsslogs relative to the current working directory:
$ gpss --config gpsscfg1.json --log-dir ./gpsslogs
By default, GPSS waits 10 seconds to establish a connection to Tanzu Greenplum. If GPSS does not establish the connection in that time, it displays a timeout error and the gpss command returns. You can change the timeout value by setting the GPDB_CONNECT_TIMEOUT environment variable before or when you start GPSS. For example, to set the timeout to 30 seconds, start the GPSS instance as follows:
$ GPDB_CONNECT_TIMEOUT=30 gpss --config gpsscfg1.json --log-dir ./gpsslogs
The default mode of operation for gpss is to wait for, and then consume, job requests and data from a client. When run in this mode, gpss waits indefinitely. You can interrupt and exit the command with Control-c. You may also choose to run gpss in the background (&). In both cases, gpss writes server log and status messages to stdout.
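For example, a minimal sketch of running the server in the background, redirecting its stdout messages to a file (the gpss_stdout.log file name is an arbitrary choice for this example):

$ nohup gpss --config gpsscfg1.json --log-dir ./gpsslogs > gpss_stdout.log 2>&1 &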
Note: gpss keeps track of the status of each client job in memory. When you stop a GPSS server instance that did not specify a JobStore setting in its server configuration file, you lose all registered jobs. You must re-submit any previously-submitted jobs that you require after you restart the server instance; gpss will resume a job from the last load offset.
Refer to the gpss reference page for additional information about this command.
The gpss, gpsscli, and gpkafka commands each write log messages to stdout (front-end) and to a log file (back-end). These messages provide useful information about GPSS command processing and any errors that it encounters.
GPSS supports the following log levels, listed in order from most to least severe:
| Level | Description |
|---|---|
| fatal | Logs conditions that prevent GPSS from functioning, such as being unable to listen on the configured port. |
| error | Logs job failure messages. |
| warn or warning | Logs messages that contain information that requires the user's attention, such as the use of deprecated features or notification of a job retry in progress. |
| info | Logs messages that contain information about GPSS actions, including job status changes and requests between the GPSS client and server. |
| debug | Logs more detailed and more verbose messages that may aid in troubleshooting. |
The default log level for command front-end messages to stdout is info. The default log level for back-end messages that the commands write to the log file is debug.
You can change the front-end or back-end log level by specifying a Logging block in the gpss.json GPSS server configuration file and setting the appropriate property:
"Logging": {
"Backendlevel": "<level>",
"Frontendlevel": "<level>"
}
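For example, a Logging block that keeps verbose debug messages in the log file while limiting stdout output to warnings and above (an illustrative combination of the levels listed in the table above, not a recommendation):

"Logging": {
    "Backendlevel": "debug",
    "Frontendlevel": "warn"
}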
The default format for command front-end messages that GPSS writes to stdout uses spaces between fields. You can provide options to commands to instruct GPSS to write the front-end messages in CSV format, or to use color in the messages. The default format for back-end messages that GPSS writes to the log file is CSV.
If you specify the -l or --log-dir option when you start gpss or run a gpsscli subcommand, GPSS writes log messages to a file in the directory that you specify. If you do not provide this option, GPSS writes log messages to a file in the $HOME/gpAdminLogs directory.
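For example, assuming the gpsslogs directory used earlier in this topic, a gpsscli subcommand can direct its log messages there as follows (gpsscli list is used only as an illustrative subcommand):

$ gpsscli list --log-dir ./gpsslogs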
By default, GPSS writes server log messages to a file with the following naming format:
gpss_<timestamp-with-millis>.log
Where <timestamp-with-millis> identifies the date and time (with milliseconds) that the log file was created. This timestamp reflects the day/time that you started the gpss server instance, or the day/time that the log was rotated for that server instance (see Rotating the GPSS Server Log File below).
GPSS writes client log messages to a file with the following naming format, where <date> identifies the date that you ran the command:
gpsscli_<date>.log
GPSS writes progress messages for each Kafka job to a separate file in the server log directory. Progress logs are written to a file with this naming format:
progress_<jobname>_<jobid>_<date>.log
<jobname> and <jobid> (max 8 characters each) identify the name and the identifier of the GPSS job, and <date> identifies the date that you ran the command.
Example GPSS log file names:
gpss_23-04-27_151722.950.log
gpsscli_230427.log
progress_jobk2_d577cf37_20200803.log
You can set the Logging:SplitByJob property in the gpss.json server configuration file to direct GPSS to generate per-run server log files for each job:
"Logging": {
"SplitByJob": "<level>"
}
The only valid <level> is StartTime.
When you specify "SplitByJob": "StartTime"
in the server configuration file, GPSS creates, for every job, a new log file in the server log directory each time the job is started (gpsscli start
) or loaded (gpsscli load
). GPSS creates the log file regardless of the success or failure of the start or load job operation.
When you set SplitByJob, the server log file name also includes the job name: gpss_<jobname>_<timestamp-with-millis>.log. For example:
gpss_nightly_load-23-05-11_104800.123.log
When the job is stopped (gpsscli stop), GPSS logs to the most recently created log file for the specified job.
If the log file for a gpss server instance grows too large, you may choose to archive the current log and start fresh with an empty log file.
There are two ways to rotate GPSS server logs:
You can set the Logging:Rotate property in the gpss.json server configuration file to direct GPSS to automatically rotate the server log files:
"Logging": {
"Rotate": "<policy_period>"
}
Valid <policy_period>s are daily and hourly.
When you specify a Logging:Rotate property setting in the server configuration file, GPSS automatically rotates the server log file for you at the end of the policy period (hour, day) that you specify. If you stop the server instance, a new invocation restarts the time period.
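For example, a Logging block that combines daily rotation with the per-job log file split described earlier (an illustrative combination; choose the policy period that matches your retention needs):

"Logging": {
    "Rotate": "daily",
    "SplitByJob": "StartTime"
}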
To prompt GPSS to rotate the server log file on-demand, you must:
1. Rename the existing log file. For example:

gpadmin@gpcoord$ mv <logdir>/gpss_<timestampmillis>.log <logdir>/gpss_<timestampmillis>.log.1

2. Send the SIGUSR2 signal to the gpss server process. You can obtain the process id of a GPSS instance by running the ps command. For example:

gpadmin@gpcoord$ ps -ef | grep gpss
gpadmin@gpcoord$ kill -SIGUSR2 <gpss_pid>

Note: There may be more than one gpss server process running on the system. Be sure to send the signal to the desired process.
When it receives the signal, GPSS emits a log message that identifies the time at which it reset the log file. For example:
... -[INFO]:- gpss log file rotate at 20230411:20:59:36.093
You can configure and manage GPSS server log file rotation with the Linux logrotate utility.
This sample logrotate configuration rotates and compresses the log file of each gpss server instance running on the system weekly, or when the file reaches 10MB in size. It operates on all-in-one server log files that are written to the default location:
/home/gpadmin/gpAdminLogs/gpss_*.log {
    rotate 5
    weekly
    maxsize 10M
    postrotate
        pkill -SIGUSR2 gpss
    endscript
    compress
}

(maxsize rotates the log at the weekly interval or as soon as the file exceeds 10MB, whichever comes first; the size directive alone would ignore the weekly setting.)
If this configuration is specified in a file named gpss_rotate.conf residing in the current working directory, you integrate with the Linux logrotate system with the following command:
$ logrotate -s status -d gpss_rotate.conf
You may choose to create a cron job to run this command daily, as sketched below.
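For example, a minimal sketch of such a crontab entry, running the command every day at 2:00 AM (the absolute paths to the logrotate binary, status file, and configuration file are assumptions; substitute your own locations):

0 2 * * * /usr/sbin/logrotate -s /home/gpadmin/logrotate.status /home/gpadmin/gpss_rotate.conf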
GPSS provides out-of-the-box integration with the Prometheus open-source system monitoring and alerting toolkit. Refer to Enabling Prometheus Metrics Collection for information about enabling and using this integration.
When you submit a GPSS job, you provide a name/identifier for the job. If you do not specify a job name, GPSS assigns and returns the base name of the load configuration file as the job name. You use this name to manage the job throughout its lifecycle.
GPSS uses a data source-specific combination of properties specified in a load configuration file to internally identify a job. For example, when it loads from a Kafka data source, GPSS uses the Kafka topic name and the target Tanzu Greenplum database, schema, and table names for internal job identification. GPSS creates internal and external tables for each job that are keyed off of these properties; these tables keep track of the progress of the load operation. GPSS considers any load configuration file submitted with the same values for these job-identifying properties to be the same internal job.
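For example, in the following fragment of a Kafka load configuration file (an illustrative sketch; a complete file requires additional properties such as the input COLUMNS and FORMAT), the job-identifying values are the Kafka topic topic1 and the target testdb.public.mytable:

DATABASE: testdb
USER: testuser
HOST: gpcoord
PORT: 5432
KAFKA:
  INPUT:
    SOURCE:
      BROKERS: kafkahost:9092
      TOPIC: topic1
  OUTPUT:
    SCHEMA: public
    TABLE: mytable

Any load configuration file that resolves to these same values identifies the same internal job, regardless of its file name.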
A gpss server instance keeps track of the status of each client job in memory. By default, this information is invocation-specific. When you stop the server instance, you must re-submit any job that you want to run when you next restart the instance.
You can configure the JobStore property block in the gpss.json server configuration file to instruct GPSS to retain job and job status information across invocations:
"JobStore": {
"File": {
"Directory": "<jobstore_dir>"
}
}
When you specify a JobStore:File:Directory property setting in the server configuration file, the GPSS server instance keeps track of, and writes job information to, the directory that you specify. If you stop the server instance, a new invocation will restore the jobs that were in progress when it last exited, loading the jobs in memory and restoring their last known state.
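For example, a JobStore block that persists job state to a gpssjobs directory relative to the server's working directory (the directory name is an assumption for this example):

"JobStore": {
    "File": {
        "Directory": "./gpssjobs"
    }
}

After restarting the server instance, you can confirm that the jobs were restored, for example by listing them with gpsscli list.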
When you use GPSS to load data into Tanzu Greenplum, you specify the Greenplum user/role password in the PASSWORD: property setting of a YAML-format load configuration file; see gpsscli.yaml.
You specify the Greenplum password in clear text. If your security requirements do not permit this, you can configure GPSS to encode and decode a shadow password string that the GPSS client and server use when communicating the Greenplum password.
Note: GPSS supports shadowing the Greenplum password only on load jobs that you submit and manage with the gpsscli subcommands. GPSS does not support shadowed passwords on load jobs that you submit with gpkafka load.
When you use this GPSS feature:
1. (Optional) You configure a Shadow:Key in the gpss.json configuration file that you specify when you start the GPSS instance. For example:
...
},
"Shadow": {
    "Key": "a_very_secret_key"
}
...
2. You run the gpsscli shadow command on the ETL system to interactively generate the shadowed password. For example:
$ gpsscli shadow --config gpss.json
please input your password
changemeCHANGEMEchangeme
"SHADOW:ERTBKXDWLAJHUF5UOGJY34QTXIBNYP4ULTWVHIUZIF4UYFPRIJVA"
You can automate this step using a command similar to the following:
$ echo changemeCHANGEMEchangeme | gpsscli shadow --config gpss.json | tail -1
"SHADOW:ERTBKXDWLAJHUF5UOGJY34QTXIBNYP4ULTWVHIUZIF4UYFPRIJVA"
If you do not specify the --config gpss.json option, or this configuration file does not include a Shadow:Key setting, GPSS uses its default key to generate the shadow password string.
3. You specify the shadow password string returned by gpsscli shadow in the PASSWORD: property setting of a gpsscli.yaml load configuration file. For example:
DATABASE: testdb
USER: testuser
PASSWORD: "SHADOW:ERTBKXDWLAJHUF5UOGJY34QTXIBNYP4ULTWVHIUZIF4UYFPRIJVA"
...
Always quote the complete shadow password string.
4. You provide the load configuration file as an option to gpsscli submit or gpsscli load when you submit the job.
5. The GPSS instance servicing the job uses its Shadow:Key, or the default key, to decode the shadowed password string specified in PASSWORD:, and connects to Tanzu Greenplum.
When you specify a DebugPort in the gpss.json configuration file, or when you specify the --debug-port option to the gpss command, GPSS starts a debug server on the local host. This server makes available additional debug information about the running GPSS instance, including the call stack and performance statistics.
You can use the curl command to view the types of information available (HTML output):
$ curl http://127.0.0.1:9998/debug/pprof/ > debug_info.html
Or, use curl to view specific information:
$ curl http://127.0.0.1:9998/debug/pprof/heap?debug=1 > debug_heap
$ curl http://127.0.0.1:9998/debug/pprof/goroutine?debug=1 > debug_goroutine
$ curl http://127.0.0.1:9998/debug/pprof/block?debug=1 > debug_block
The commands above gather information and write the heap, the call stack, and locking information each to a text file in the current working directory.
The GPSS debug server can also provide CPU profiling data. The following command gathers 10 seconds of CPU profile data and writes it in binary format to the output file named debug_cpu_profile:
$ curl http://127.0.0.1:9998/debug/pprof/profile?seconds=10 > debug_cpu_profile
You may be asked to send the binary output file to support. Alternatively, you can run the Go Profiling Tool on the file to parse and graph the results:
$ go tool pprof debug_cpu_profile
(pprof) web
The web command creates a graph of the profile data in SVG format, and opens the graph in a web browser.
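If the host has no web browser available, the interactive top command prints the functions that consumed the most CPU time as text instead:

$ go tool pprof debug_cpu_profile
(pprof) top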