Configuring the Connector

You use the Apache NiFi user interface to configure a dataflow that uses the VMware Greenplum Connector for Apache NiFi PutGreenplumRecord processor to load record-oriented data from any source into Greenplum Database.

The PutGreenplumRecord processor accepts record-based FlowFiles, sending the data to the Greenplum Streaming Server to write to Greenplum Database. When you configure the processor, you must identify the type and instance of the RecordReader that corresponds to the format of the data contained in incoming FlowFiles, the Greenplum connection specifics, and the Greenplum schema and table.

The default load mode for the Connector is to insert data into Greenplum. You can configure the processor to merge or update data instead, and configuration properties for field-to-column translation and mapping allow you to further specify these operations.

You configure the PutGreenplumRecord processor via the Configure Processor dialog. This dialog includes SETTINGS, SCHEDULING, PROPERTIES, and COMMENTS tabs.

PutGreenplumRecord Settings

The SETTINGS tab specifies FlowFile routing and timeouts for the processor. You can also use this tab to change the name of the processor and activate/deactivate the processor.

The Settings Tab documentation in the Apache NiFi User Guide describes the configuration options on this tab.

PutGreenplumRecord Schedule

The SCHEDULING tab specifies the scheduling strategy, run schedule, and concurrency options for the processor.

When you set Concurrent Tasks to a value greater than one, the processor runs with the specified number of threads. The single PutGreenplumRecord processor instance processes multiple FlowFiles concurrently, each managed by its own session.

The Scheduling Tab documentation in the Apache NiFi User Guide describes the configuration properties on this tab.

PutGreenplumRecord Properties

The PROPERTIES tab of the Configure Processor dialog identifies the PutGreenplumRecord processor configuration properties.

The Connector uses default values for many of the PutGreenplumRecord properties. You must set the Record Reader, Greenplum Adapter, and Table Name property values.

The PutGreenplumRecord processor configuration properties are listed and further described in the table and topics below:

Property Name	Description	Default Value
Record Reader	The controller service that deserializes the input FlowFile. Required.
Greenplum Adapter	The controller service that identifies and manages the Greenplum Database and Greenplum Streaming Server connection parameters. Required.
Schema Name	The name of the Greenplum Database schema in which the target table resides. Required.	public
Table Name	The name of the target Greenplum table in which to load the data. Required.
Operation Type	The type of load operation: INSERT, UPDATE, or MERGE. Required.	INSERT
Match Columns	The Greenplum table columns to match with the FlowFile record data. Required for the UPDATE and MERGE operation types.
Translate Field Names	Boolean value that specifies whether the Connector translates input FlowFile field names to Greenplum table column names. When `true`, the Connector uses case-insensitive matching and ignores underscores. When `false`, the Connector does not translate, and field and column names must match exactly.	true
Unmatched Field Behavior	Specifies the Connector’s behavior when an incoming FlowFile record has a field that does not map to a column in the Greenplum table.	Ignore Unmatched Fields
Unmatched Column Behavior	Specifies the Connector’s behavior when an incoming FlowFile record does not have a field mapping for every one of the Greenplum table columns.	Fail on Unmatched Columns
Rollback On Failure	Boolean value that specifies whether or not the Connector should roll back when it encounters an error processing a FlowFile.	false
Maximum Record Batch Size	Specifies the maximum number of records in each batch of data that the Connector will write to Greenplum. The Connector stores the batch in memory until it reaches this size.	0 (write all records in a single transaction)

About the Insert, Merge, and Update Properties

The Connector supports inserting, merging, and updating records from a FlowFile into a Greenplum Database table. You use the Operation Type property to specify the load mode:

Mode	Description
INSERT	Insert records as new rows into the Greenplum table (the default mode).
MERGE	Use `Match Columns` to match records to existing table rows, and update these rows with the data from the records. A record with no matching database row is inserted into the Greenplum table as a new row.
UPDATE	Use `Match Columns` to match records to existing table rows, and update these Greenplum table rows with data from the records.
Use operation.type Attribute	Obtain the load mode from an `operation.type` attribute in the FlowFile.

When Operation Type is UPDATE or MERGE, you must specify one or more Match Columns, a comma-separated list of column names that uniquely identifies a row in the Greenplum table. The Connector ignores the Match Columns property when the Operation Type is INSERT.
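The difference between the three load modes can be illustrated with a small in-memory model. The following Python sketch is not Connector code; the function `apply_records` and its row representation are hypothetical, and the sketch only mirrors the behavior described in the table above, including the rule that Match Columns is required for UPDATE and MERGE and ignored for INSERT.

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]


def apply_records(table: List[Row], records: Sequence[Row],
                  operation: str, match_columns: Sequence[str] = ()) -> List[Row]:
    """Return a copy of `table` after applying `records` (hypothetical model)."""
    op = operation.upper()
    if op not in ("INSERT", "UPDATE", "MERGE"):
        raise ValueError(f"unsupported operation type: {operation}")
    if op != "INSERT" and not match_columns:
        raise ValueError("Match Columns is required for UPDATE and MERGE")

    result = [dict(row) for row in table]  # leave the input table untouched
    for record in records:
        if op == "INSERT":
            # Match Columns is ignored: every record becomes a new row.
            result.append(dict(record))
            continue
        key = tuple(record[c] for c in match_columns)
        matched = [row for row in result
                   if tuple(row.get(c) for c in match_columns) == key]
        for row in matched:
            row.update(record)  # overwrite the row with the record's values
        if not matched and op == "MERGE":
            # MERGE inserts records that match no row; UPDATE skips them.
            result.append(dict(record))
    return result


table = [{"id": 1, "name": "a"}]
records = [{"id": 1, "name": "b"}, {"id": 2, "name": "c"}]
print(apply_records(table, records, "UPDATE", ["id"]))
print(apply_records(table, records, "MERGE", ["id"]))
```

In this model, UPDATE changes only the row with `id` 1, while MERGE also adds the record with `id` 2 as a new row.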

Specifying Field and Column Mapping Properties

The Connector exposes properties that allow you to choose how you want the Connector to map FlowFile record fields to Greenplum Database table columns.

The Translate Field Names property is a boolean value that specifies whether the Connector translates field names in the FlowFile record into column names in the Greenplum table. The default value is true; the processor uses case-insensitive matching and ignores underscores when it translates field names into column names. When the value is false, the FlowFile field names must match the Greenplum table column names exactly, or the column value will not be updated.
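As a rough illustration of what case-insensitive, underscore-ignoring matching means, the Python sketch below maps field names to column names. It assumes the translation amounts to lowercasing both names and removing underscores before comparing them; the Connector's actual algorithm is not documented here, and the helper names `normalize` and `map_fields_to_columns` are hypothetical.

```python
from typing import Dict, Iterable, Optional


def normalize(name: str) -> str:
    """Assumed translation rule: drop underscores and ignore case."""
    return name.replace("_", "").lower()


def map_fields_to_columns(fields: Iterable[str], columns: Iterable[str],
                          translate: bool = True) -> Dict[str, Optional[str]]:
    """Map each field to its matching column, or to None when nothing matches."""
    columns = list(columns)
    if translate:
        lookup = {normalize(c): c for c in columns}
        return {f: lookup.get(normalize(f)) for f in fields}
    exact = set(columns)  # no translation: names must match exactly
    return {f: (f if f in exact else None) for f in fields}


columns = ["first_name", "lastname", "zip"]
print(map_fields_to_columns(["FirstName", "LAST_NAME", "age"], columns))
print(map_fields_to_columns(["first_name", "FirstName"], columns, translate=False))
```

Under this assumption, `FirstName` and `LAST_NAME` map to `first_name` and `lastname` when translation is on, whereas only the exactly matching `first_name` maps when it is off.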

When an incoming FlowFile record has a field that does not map to any of the columns in the Greenplum table, set the Unmatched Field Behavior property to specify how the Connector should handle the situation:

Ignore Unmatched Fields - (the default) The Connector ignores any field in the FlowFile record that cannot be mapped to a column in the Greenplum table.
Fail on Unmatched Fields - The Connector routes the FlowFile to the failure relationship when the record has any field that cannot be mapped to a column in the table.

If an incoming FlowFile record does not have a field mapping for every one of the columns in the Greenplum table, set the Unmatched Column Behavior property to specify how the Connector should handle the situation:

Ignore Unmatched Columns - The Connector assumes that a column in the table that does not have a matching field in the record is not required.
Warn on Unmatched Columns - The Connector assumes that a column in the table that does not have a matching field in the record is not required, and the Connector logs a warning.
Fail on Unmatched Columns - (the default) A flow fails when a column exists in the table and there is no matching field in the record. The Connector also logs an error.
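The two unmatched-name properties can be thought of as independent checks on the field-to-column mapping. The Python sketch below models them under simplifying assumptions: names are compared exactly (translation is left out), the option values are abbreviated to `"ignore"`, `"warn"`, and `"fail"`, and `check_mapping` and `MappingError` are hypothetical names rather than part of the Connector.

```python
import logging
from typing import Iterable, List

logger = logging.getLogger("unmatched_sketch")


class MappingError(Exception):
    """Stands in for routing the FlowFile to the failure relationship."""


def check_mapping(record_fields: Iterable[str], table_columns: Iterable[str],
                  unmatched_field: str = "ignore",
                  unmatched_column: str = "fail") -> List[str]:
    """Return the columns that will be written, or raise MappingError."""
    fields = set(record_fields)
    columns = list(table_columns)
    extra_fields = sorted(fields - set(columns))
    missing_columns = [c for c in columns if c not in fields]

    if extra_fields and unmatched_field == "fail":
        raise MappingError(f"fields with no column: {extra_fields}")
    if missing_columns:
        if unmatched_column == "fail":
            logger.error("columns with no field: %s", missing_columns)
            raise MappingError(f"columns with no field: {missing_columns}")
        if unmatched_column == "warn":
            logger.warning("columns with no field: %s", missing_columns)
    # Unmatched fields are dropped; unmatched columns are simply not written.
    return [c for c in columns if c in fields]


print(check_mapping(["id", "name", "extra"], ["id", "name"]))
print(check_mapping(["id"], ["id", "name"], unmatched_column="warn"))
```

The defaults in the sketch follow the documented defaults: unmatched fields are ignored, and an unmatched column causes a failure.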

Specifying Failure Rollback Behavior

The Connector distinguishes between the transient and the non-recoverable errors that it encounters. Transient errors are those for which a later retry may succeed, such as a failed connection attempt to Greenplum Database. Conversely, a non-recoverable error, such as a FlowFile that contains bad input data, continues to fail when retried.

The Connector applies success or failure at the FlowFile level. That is, the Connector considers a write operation successful if all records in a single FlowFile are written to the Greenplum Database table with no errors. If a single record in the FlowFile fails to write for some reason (say the data is malformed), none of the records in the FlowFile are written to Greenplum, and the Connector considers the operation failed.

Rollback On Failure is a boolean property that specifies whether or not the Connector rolls back the NiFi session when it encounters a failure processing a FlowFile.

The default Rollback On Failure setting is false. When the Connector encounters an error while processing a FlowFile, the FlowFile is routed to the failure or retry relationship based on the error type, and the processor continues processing the next FlowFile.

When Rollback On Failure is true, the Connector:

Stops further processing a FlowFile when it encounters an error,
Rolls back the NiFi session; this penalizes the FlowFile and returns it to the incoming queue, and
Continues processing the next FlowFile.

The rolled back FlowFile may be processed repeatedly by the Connector until it is processed successfully or removed by other means.

Be sure to set an adequate Yield Duration on the SETTINGS tab so that the processor does not retry too frequently.
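The outcomes described in this section can be summarized as a small decision function. The Python sketch below is a model only: the names `ErrorKind` and `disposition` are hypothetical, and it assumes that transient errors go to the retry relationship and non-recoverable errors go to the failure relationship, which the text above implies but does not state explicitly.

```python
from enum import Enum


class ErrorKind(Enum):
    NONE = "none"
    TRANSIENT = "transient"              # for example, a failed connection attempt
    NON_RECOVERABLE = "non_recoverable"  # for example, malformed input data


def disposition(error: ErrorKind, rollback_on_failure: bool) -> str:
    """Return where a FlowFile ends up after one processing attempt."""
    if error is ErrorKind.NONE:
        return "success"
    if rollback_on_failure:
        # The session is rolled back: the FlowFile is penalized and requeued.
        return "rollback"
    # Assumed mapping of error type to relationship.
    return "retry" if error is ErrorKind.TRANSIENT else "failure"


for kind in ErrorKind:
    for rollback in (False, True):
        print(kind.name, rollback, disposition(kind, rollback))
```

Note that in this model a non-recoverable error combined with rollback keeps returning the same FlowFile to the queue, which is why a FlowFile with bad data may need to be removed by other means.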

Choosing a Maximum Record Batch Size

For each FlowFile it receives, the Connector:

Opens and prepares the table for writing,
Performs one or more writes, and
Closes/commits the write.

The maximum number of records in a write call that the Connector makes to the Greenplum Streaming Server is determined by the Maximum Record Batch Size that you specify for the processor.

The default value is zero (0), which means that there is no limit on the batch size: the Connector accumulates all FlowFile content in memory before it writes to Greenplum in a single transaction.

Note: This default behavior may be inefficient for FlowFiles that consist of large and/or many records.
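To see how a batch size limit bounds memory use, consider the following Python sketch of record batching. It is an illustration rather than the Connector's implementation; the generator `batches` is a hypothetical name, and it treats a size of zero as "no limit", matching the default described above.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(records: Iterable[T], max_batch_size: int = 0) -> Iterator[List[T]]:
    """Yield lists of at most max_batch_size records; 0 means one single batch."""
    if max_batch_size < 0:
        raise ValueError("max_batch_size must be zero or positive")
    batch: List[T] = []
    for record in records:
        batch.append(record)
        if max_batch_size and len(batch) >= max_batch_size:
            yield batch  # one write call per full batch
            batch = []
    if batch:
        yield batch      # final partial batch, or everything when the size is 0


print(list(batches(range(5), 2)))  # [[0, 1], [2, 3], [4]]
print(list(batches(range(5))))     # [[0, 1, 2, 3, 4]]
```

With a positive limit, at most that many records are held in memory at a time; with the default of zero, the whole FlowFile is buffered before the single write.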