Use the PXF HDFS Connector to read JSON-format data. This section describes how to use PXF to access JSON data in HDFS, including how to create and query an external table that references a JSON file in the HDFS data store.
Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read data from HDFS.
JSON is a text-based data-interchange format. JSON data is typically stored in a file with a .json
suffix.
A .json
file will contain a collection of objects. A JSON object is a collection of unordered name/value pairs. A value can be a string, a number, true, false, null, or an object or an array. You can define nested JSON objects and arrays.
Sample JSON data file content:
{
"created_at":"MonSep3004:04:53+00002013",
"id_str":"384529256681725952",
"user": {
"id":31424214,
"location":"COLUMBUS"
},
"coordinates":{
"type":"Point",
"values":[
13,
99
]
}
}
In the sample above, user
is an object composed of fields named id
and location
. To specify the nested fields in the user
object as Greenplum Database external table columns, use .
projection:
user.id
user.location
coordinates
is an object composed of a text field named type
and an array of integers named values
.
To retrieve all of the elements of the values
array in a single column, define the corresponding Greenplum Database external table column as type TEXT[]
.
"coordinates.values" TEXT[]
Refer to Introducing JSON for detailed information on JSON syntax.
To represent JSON data in Greenplum Database, map data values that use a primitive data type to Greenplum Database columns of the same type. JSON supports complex data types including projections and arrays. Use N-level projection to map members of nested objects and arrays to primitive data types.
The following table summarizes external mapping rules for JSON data.
Table 1. JSON MappingJSON Data Type | PXF/Greenplum Data Type |
---|---|
Primitive type (integer, float, string, boolean, null) | Use the corresponding Greenplum Database built-in data type; see Greenplum Database Data Types. |
Array | Use TEXT[] to retrieve the JSON array as a Greenplum text array. |
Object | Use dot . notation to specify each level of projection (nesting) to a member of a primitive or Array type. |
PXF supports two data read modes. The default mode expects one full JSON record per line. PXF also supports a read mode operating on JSON records that span multiple lines.
In upcoming examples, you will use both read modes to operate on a sample data set. The schema of the sample data set defines objects with the following member names and value data types:
The single-JSON-record-per-line data set follows:
{"created_at":"FriJun0722:45:03+00002013","id_str":"343136551322136576","user":{"id":395504494,"location":"NearCornwall"},"coordinates":{"type":"Point","values": [ 6, 50 ]}},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547115253761","user":{"id":26643566,"location":"Austin,Texas"}, "coordinates": null},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547136233472","user":{"id":287819058,"location":""}, "coordinates": null}
This is the data set for the multi-line JSON record data set:
{
"root":[
{
"record_obj":{
"created_at":"MonSep3004:04:53+00002013",
"id_str":"384529256681725952",
"user":{
"id":31424214,
"location":"COLUMBUS"
},
"coordinates":null
},
"record_obj":{
"created_at":"MonSep3004:04:54+00002013",
"id_str":"384529260872228864",
"user":{
"id":67600981,
"location":"KryberWorld"
},
"coordinates":{
"type":"Point",
"values":[
8,
52
]
}
}
}
]
}
You will create JSON files for the sample data sets and add them to HDFS in the next section.
The PXF HDFS connector reads native JSON stored in HDFS. Before you can use Greenplum Database to query JSON format data, the data must reside in your HDFS data store.
Copy and paste the single line JSON record sample data set above to a file named singleline.json
. Similarly, copy and paste the multi-line JSON record data set to a file named multiline.json
.
Note: Ensure that there are no blank lines in your JSON files.
Copy the JSON data files that you just created to your HDFS data store. Create the /data/pxf_examples
directory if you did not do so in a previous exercise. For example:
$ hdfs dfs -mkdir /data/pxf_examples
$ hdfs dfs -put singleline.json /data/pxf_examples
$ hdfs dfs -put multiline.json /data/pxf_examples
Once the data is loaded to HDFS, you can use Greenplum Database and PXF to query and analyze the JSON data.
Use the hdfs:json
profile to read JSON-format files from HDFS. The following syntax creates a Greenplum Database readable external table that references such a file:
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>?PROFILE=hdfs:json[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
The specific keywords and values used in the Greenplum Database CREATE EXTERNAL TABLE command are described in the table below.
Keyword | Value |
---|---|
<path‑to‑hdfs‑file> | The path to the directory or file in the HDFS data store. When the <server_name> configuration includes a pxf.fs.basePath property setting, PXF considers <path‑to‑hdfs‑file> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. <path‑to‑hdfs‑file> must not specify a relative path nor include the dollar sign ($ ) character. |
PROFILE | The PROFILE keyword must specify hdfs:json . |
SERVER=<server_name> | The named server configuration that PXF uses to access the data. PXF uses the default server if not specified. |
<custom‑option> | <custom-option>s are discussed below. |
FORMAT ‘CUSTOM’ | Use FORMAT 'CUSTOM' with the hdfs:json profile. The CUSTOM FORMAT requires that you specify (FORMATTER='pxfwritable_import') . |
PXF supports single- and multi- line JSON records. When you want to read multi-line JSON records, you must provide an IDENTIFIER
<custom-option> and value. Use this <custom-option> to identify the name of a field whose parent JSON object you want to be returned as individual tuples.
The hdfs:json
profile supports the following <custom-option>s:
Option Keyword | Syntax, Example(s) | Description |
---|---|---|
IDENTIFIER | &IDENTIFIER=<value> &IDENTIFIER=created_at |
You must include the IDENTIFIER keyword and <value> in the LOCATION string only when you are accessing JSON data comprised of multi-line records. Use the <value> to identify the name of the field whose parent JSON object you want to be returned as individual tuples. |
SPLIT_BY_FILE | &SPLIT_BY_FILE=<boolean> |
Specify how PXF splits the data in <path-to-hdfs-file>. The default value is false , PXF creates multiple splits for each file that it will process in parallel. When set to true , PXF creates and processes a single split per file. |
IGNORE_MISSING_PATH | &IGNORE_MISSING_PATH=<boolean> |
Specify the action to take when <path-to-hdfs-file> is missing or invalid. The default value is false , PXF returns an error in this situation. When the value is true , PXF ignores missing path errors and returns an empty fragment. |
IDENTIFIER
, there is a possibility that PXF could return incorrect results. Should you need to, you can work around this edge case by compressing the JSON file, and having PXF read the compressed file.
Use the following CREATE EXTERNAL TABLE SQL command to create a readable external table that references the single-line-per-record JSON data file and uses the PXF default server.
CREATE EXTERNAL TABLE singleline_json_tbl(
created_at TEXT,
id_str TEXT,
"user.id" INTEGER,
"user.location" TEXT,
"coordinates.values" TEXT[]
)
LOCATION('pxf://data/pxf_examples/singleline.json?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
Notice the use of .
projection to access the nested fields in the user
and coordinates
objects.
To query the JSON data in the external table:
SELECT * FROM singleline_json_tbl;
To access specific elements of the coordinates.values
array, you can specify the array subscript number in square brackets:
SELECT "coordinates.values"[1], "coordinates.values"[2] FROM singleline_json_tbl;
To access the array elements as some type other than TEXT
, you can either cast the whole column:
SELECT "coordinates.values"::int[] FROM singleline_json_tbl;
or cast specific elements:
SELECT "coordinates.values"[1]::int, "coordinates.values"[2]::float FROM singleline_json_tbl;
The SQL command to create a readable external table from the multi-line-per-record JSON file is very similar to that of the single line data set above. You must additionally specify the LOCATION
clause IDENTIFIER
keyword and an associated value when you want to read multi-line JSON records. For example:
CREATE EXTERNAL TABLE multiline_json_tbl(
created_at TEXT,
id_str TEXT,
"user.id" INTEGER,
"user.location" TEXT,
"coordinates.values" TEXT[]
)
LOCATION('pxf://data/pxf_examples/multiline.json?PROFILE=hdfs:json&IDENTIFIER=created_at')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
created_at
identifies the member name of the first field in the JSON record record_obj
in the sample data schema.
To query the JSON data in this external table:
SELECT * FROM multiline_json_tbl;
Starting in version 6.2.0, PXF supports reading a JSON array into a TEXT[]
column. PXF still supports the old methods of using array element projection or a single text-type column to read a JSON array. These access methods are described here.
PXF supports accessing specific elements of a JSON array using the syntax [n]
in the table definition to identify the specific element.
CREATE EXTERNAL TABLE singleline_json_tbl_aep(
created_at TEXT,
id_str TEXT,
"user.id" INTEGER,
"user.location" TEXT,
"coordinates.values[0]" INTEGER,
"coordinates.values[1]" INTEGER
)
LOCATION('pxf://data/pxf_examples/singleline.json?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
Note: When you use this method to identify specific array elements, PXF provides only those values to Greenplum Database, not the whole JSON array.
If your existing external table definition uses array element projection and you want to read the array into a TEXT[]
column, you can use the ALTER EXTERNAL TABLE
command to update the table definition. For example:
ALTER EXTERNAL TABLE singleline_json_tbl_aep DROP COLUMN "coordinates.values[0]", DROP COLUMN "coordinates.values[1]", ADD COLUMN "coordinates.values" TEXT[];
If you choose to alter the external table definition in this manner, be sure to update any existing queries on the external table to account for the changes to column name and type.
PXF supports accessing all of the elements within an array as a single string containing the serialized JSON array by defining the corresponding Greenplum table column with one of the following data types: TEXT
, VARCHAR
, or BPCHAR
.
CREATE EXTERNAL TABLE singleline_json_tbl_stc(
created_at TEXT,
id_str TEXT,
"user.id" INTEGER,
"user.location" TEXT,
"coordinates.values" TEXT
)
LOCATION('pxf://data/pxf_examples/singleline.json?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
If you retrieve the JSON array in a single text-type column and wish to convert the JSON array serialized as TEXT
back into a native Greenplum array type, you can use the example query below:
SELECT user.id,
ARRAY(SELECT json_array_elements_text(coordinates.values::json))::int[] AS coords
FROM singleline_json_tbl_stc;
Note: This conversion is possible only when you are using PXF with Greenplum Database 6.x; the function json_array_elements_text()
is not available in Greenplum 5.x.
If your external table definition uses a single text-type column for a JSON array and you want to read the array into a TEXT[]
column, you can use the ALTER EXTERNAL TABLE
command to update the table definition. For example:
ALTER EXTERNAL TABLE singleline_json_tbl_stc ALTER COLUMN "coordinates.values" TYPE TEXT[];
If you choose to alter the external table definition in this manner, be sure to update any existing queries on the external table to account for the change in column type.