You can use PXF to access Azure Data Lake, Azure Blob Storage, and Google Cloud Storage object stores. This topic describes how to configure the PXF connectors to these external data sources.
If you do not plan to use these PXF object store connectors, then you do not need to perform this procedure.
To access data in an object store, you must provide a server location and client credentials. When you configure a PXF object store connector, you add at least one named PXF server configuration for the connector as described in Configuring PXF Servers.
PXF provides a template configuration file for each object store connector. These template files are located in the <PXF_INSTALL_DIR>/templates/ directory.
The template configuration file for Azure Blob Storage is <PXF_INSTALL_DIR>/templates/wasbs-site.xml. When you configure an Azure Blob Storage server, you must provide the following server configuration properties and replace the template value with your account name:
Property | Description | Value |
---|---|---|
fs.adl.oauth2.access.token.provider.type | The token type. | Must specify ClientCredential. |
fs.azure.account.key.<YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME>.blob.core.windows.net | The Azure account key. | Your account key. In the property name, replace <YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME> with your account name. |
fs.AbstractFileSystem.wasbs.impl | The file system class name. | Must specify org.apache.hadoop.fs.azure.Wasbs. |
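As an illustration only, a server configuration file built from the properties in the table above might look like the following sketch. The account name myaccount and the value YOUR_AZURE_BLOB_STORAGE_ACCOUNT_KEY are hypothetical placeholders, not values taken from the template; substitute your own.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.adl.oauth2.access.token.provider.type</name>
        <value>ClientCredential</value>
    </property>
    <property>
        <!-- "myaccount" is a hypothetical storage account name; use your own -->
        <name>fs.azure.account.key.myaccount.blob.core.windows.net</name>
        <!-- placeholder; replace with your Azure account key -->
        <value>YOUR_AZURE_BLOB_STORAGE_ACCOUNT_KEY</value>
    </property>
    <property>
        <name>fs.AbstractFileSystem.wasbs.impl</name>
        <value>org.apache.hadoop.fs.azure.Wasbs</value>
    </property>
</configuration>
```

Note that the account name appears in the property name, while the account key is the property value.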
The template configuration file for Azure Data Lake is <PXF_INSTALL_DIR>/templates/adl-site.xml. When you configure an Azure Data Lake server, you must provide the following server configuration properties and replace the template values with your credentials:
Property | Description | Value |
---|---|---|
fs.adl.oauth2.access.token.provider.type | The type of token. | Must specify ClientCredential. |
fs.adl.oauth2.refresh.url | The Azure endpoint to which to connect. | Your refresh URL. |
fs.adl.oauth2.client.id | The Azure account client ID. | Your client ID (UUID). |
fs.adl.oauth2.credential | The password for the Azure account client ID. | Your password. |
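As an illustration only, an Azure Data Lake server configuration file that sets the four properties in the table above might look like the following sketch. The values YOUR_ADL_REFRESH_URL, YOUR_ADL_CLIENT_ID, and YOUR_ADL_CREDENTIAL are placeholders assumed for this example; replace each with the credentials for your own Azure account.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.adl.oauth2.access.token.provider.type</name>
        <value>ClientCredential</value>
    </property>
    <property>
        <name>fs.adl.oauth2.refresh.url</name>
        <!-- placeholder; replace with your refresh URL -->
        <value>YOUR_ADL_REFRESH_URL</value>
    </property>
    <property>
        <name>fs.adl.oauth2.client.id</name>
        <!-- placeholder; replace with your client ID (UUID) -->
        <value>YOUR_ADL_CLIENT_ID</value>
    </property>
    <property>
        <name>fs.adl.oauth2.credential</name>
        <!-- placeholder; replace with the password for the client ID -->
        <value>YOUR_ADL_CREDENTIAL</value>
    </property>
</configuration>
```

Because this file holds a password in clear text, you may want to restrict read access to the server configuration directory.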
The template configuration file for Google Cloud Storage is <PXF_INSTALL_DIR>/templates/gs-site.xml. When you configure a Google Cloud Storage server, you must provide the following server configuration properties and replace the template values with your credentials:
Property | Description | Value |
---|---|---|
google.cloud.auth.service.account.enable | Enable service account authorization. | Must specify true. |
google.cloud.auth.service.account.json.keyfile | The Google Storage key file. | Path to your key file. |
fs.AbstractFileSystem.gs.impl | The file system class name. | Must specify com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS. |
In this procedure, you name and add a PXF server configuration in the $PXF_BASE/servers directory on the Greenplum Database master host for the Google Cloud Storage (GCS) connector. You then use the pxf cluster sync command to sync the server configuration(s) to the Greenplum Database cluster.
Log in to your Greenplum Database master host:
$ ssh gpadmin@<gpmaster>
Choose a name for the server. You will provide this name to end users who need to reference files in the object store.
Create the $PXF_BASE/servers/<server_name> directory. For example, use the following command to create a server configuration for a Google Cloud Storage server named gs_public:
gpadmin@gpmaster$ mkdir $PXF_BASE/servers/gs_public
Copy the PXF template file for GCS to the server configuration directory. For example:
gpadmin@gpmaster$ cp <PXF_INSTALL_DIR>/templates/gs-site.xml $PXF_BASE/servers/gs_public/
Open the template server configuration file in the editor of your choice, and provide appropriate property values for your environment. For example, if your Google Cloud Storage key file is located in /home/gpadmin/keys/gcs-account.key.json:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>google.cloud.auth.service.account.enable</name>
        <value>true</value>
    </property>
    <property>
        <name>google.cloud.auth.service.account.json.keyfile</name>
        <value>/home/gpadmin/keys/gcs-account.key.json</value>
    </property>
    <property>
        <name>fs.AbstractFileSystem.gs.impl</name>
        <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    </property>
</configuration>
Save your changes and exit the editor.
Use the pxf cluster sync command to copy the new server configurations to the Greenplum Database cluster:
gpadmin@gpmaster$ pxf cluster sync