You use the Apache NiFi user interface to configure a dataflow that uses the VMware Greenplum Connector for Apache NiFi PutGreenplumRecord processor to load record-oriented data from any source into Greenplum Database. You perform several tasks when you use the Connector to load data into a Greenplum table, from satisfying the prerequisites through configuring the PutGreenplumRecord processor and initiating the dataflow; these tasks are described in the sections that follow.

Before you set up a dataflow using the Connector, ensure that:
You can identify the host name or IP address and port number of a running Greenplum Streaming Server instance to which the Connector will direct load requests. Alternatively, you can configure and start a new streaming server instance as described in Configuring and Managing the Streaming Server in the Greenplum Streaming Server documentation.
You specify "ReuseTables": false in the gpss.json server configuration file that you use to start the Greenplum Streaming Server.
Be sure to note the host and port number of the Greenplum Streaming Server instance.
You can identify the Greenplum Database master host name and port number, the name of the Greenplum database, and the name and password of the Greenplum user/role that the Connector uses to load the data.
The PutGreenplumRecord processor uses a controller service to manage the connection to Greenplum Database.

In the Prerequisites section above, you identified the Greenplum Database master host and port number, the Greenplum Streaming Server host and port number, the Greenplum database, and the Greenplum user/role name and password. You create and configure an instance of the GreenplumGPSSAdapter controller service for each unique combination of these property settings, as described in About the Greenplum Adapter.
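The Greenplum user/role that you identified in the prerequisites must exist before the Connector can connect. If you need to create one, a minimal sketch follows; the role name, password, and privileges shown are assumptions for illustration only, not requirements stated by this guide:

    -- Illustrative role for the Connector; adjust the name, password, and
    -- privileges to your site's requirements.
    CREATE ROLE nifi_loader WITH LOGIN PASSWORD 'changeme';
    GRANT ALL ON SCHEMA public TO nifi_loader;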
The PutGreenplumRecord processor can accept record-oriented data in the Avro, CSV, JSON, Parquet, and XML formats, among others. The processor uses a Record Reader controller service to parse and deserialize data in the incoming FlowFiles. The format of the data inside the FlowFile informs both your choice of Record Reader and the definition of the Greenplum Database table that the Connector loads data into.
The Record Reader requires the schema of the input data in order to parse and deserialize it.
NiFi data records are described by a schema. The schema defines the names and types of the fields in the input data records.
A NiFi schema definition is specified in Avro format. The example Avro schema below specifies a record with three fields, and identifies the names and data types of the fields:
    {
      "name": "datatypes_record",
      "type": "record",
      "fields": [
        { "name": "lastname", "type": ["string", "null"] },
        { "name": "age", "type": ["int", "null"] },
        { "name": "birthdate", "type": {"type": "int", "logicalType": "date"} }
      ]
    }
When you configure a Record Reader, you must specify the origin of the schema. The schema may be embedded in the data itself, referenced by name from a schema registry, provided directly in the reader configuration, or inferred from the data.
Once you identify and specify the schema of the incoming FlowFiles, you have the information that you need to select and configure the Record Reader and create the Greenplum Database table.
The PutGreenplumRecord processor uses a controller service to parse and deserialize incoming data in FlowFiles. You choose and configure a Record Reader type and instance that corresponds to the format of the data in the incoming FlowFiles, as described in About the Record Reader. You must specify the origin of the schema when you configure the reader controller service.
You specify the name of the target Greenplum table when you configure the PutGreenplumRecord processor. You must create this table before you initiate a NiFi dataflow; the Connector does not create the table for you.
The data types that you specify for the target Greenplum Database table columns must match the data types of the input FlowFile record fields. You can reference the schema of the input records, or the data itself, to identify each field's type and definition.
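For example, a table that matches the example Avro schema shown earlier might be created as follows; the schema and table names used here (public, datatypes_record) are illustrative, not requirements:

    -- Column names and types correspond to the lastname, age, and birthdate
    -- fields in the example Avro schema; the Avro date logical type maps to
    -- the Greenplum date type.
    CREATE TABLE public.datatypes_record (
        lastname  text,
        age       integer,
        birthdate date
    );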
The column names that you specify when you create the Greenplum table must match the input FlowFile field names, with a few caveats. When you configure the PutGreenplumRecord processor, you specify if and how you want the Connector to translate the field names to Greenplum table column names. Specifying Field and Column Name Mappings describes the available field-to-column translation configuration options for the Connector.
The Greenplum table may be the target of INSERT, UPDATE, or MERGE operations. To update or merge data in a Greenplum table, you must be able to identify a set of table columns that uniquely identifies a row in the table. The Connector uses these Match Columns to locate existing table rows. About the Insert, Merge, and Update Properties further describes these operations and configuration properties.
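As a rough illustration of the semantics, assume the illustrative datatypes_record table above, with lastname configured as the Match Column and age as the column being updated. An UPDATE operation then behaves conceptually like the following SQL; this is a sketch of the effect, not the statements that the Connector actually issues:

    -- Set age from the incoming record on every row whose lastname matches
    -- the incoming record's lastname value (literal values shown for clarity).
    UPDATE public.datatypes_record
    SET    age = 42
    WHERE  lastname = 'Smith';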
When you build a dataflow, you drag components from the NiFi toolbar to the canvas, configure the components, and then connect them together.
When you build a dataflow using the PutGreenplumRecord processor, you configure it as described above and in Configuring the Connector. Remember to define the success, failure, and retry relationships for the FlowFiles processed by the PutGreenplumRecord processor, or auto-terminate them.
A component that you add to a dataflow is in the stopped state. You must start all linked components to initiate the flow.
Starting a component may itself trigger a dataflow. In other cases, an external event triggers the flow, such as a new file appearing in a directory or a new message emitted by an external data source.
When a load operation succeeds, the PutGreenplumRecord processor emits a SEND provenance event. These events are displayed in the global or processor-specific Data Provenance dialog.
You can detect load operation errors in one or more ways; for example, you can examine messages written to the $NIFI_HOME/logs/nifi-app.log file.

You can also query the target Greenplum table to verify the load.
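For example, assuming the illustrative datatypes_record table created earlier, a quick verification might look like this:

    -- Confirm that rows arrived, then spot-check the most recently dated rows.
    SELECT count(*) FROM public.datatypes_record;
    SELECT * FROM public.datatypes_record ORDER BY birthdate DESC LIMIT 10;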