Use the PXF HDFS Connector to read JSON-format data. This section describes how to use PXF to access JSON data in HDFS, including how to create and query an external table that references a JSON file in the HDFS data store.

Prerequisites

Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read data from HDFS.

Working with JSON Data

JSON is a text-based data-interchange format. JSON data is typically stored in a file with a .json suffix.

A .json file contains a collection of objects. A JSON object is an unordered collection of name/value pairs. A value can be a string, a number, true, false, null, an object, or an array. You can define nested JSON objects and arrays.

Sample JSON data file content:

  {
    "created_at":"MonSep3004:04:53+00002013",
    "id_str":"384529256681725952",
    "user": {
      "id":31424214,
      "location":"COLUMBUS"
    },
    "coordinates":{
      "type":"Point",
      "values":[
         13,
         99
      ]
    }
  }

In the sample above, user is an object composed of fields named id and location. To specify the nested fields in the user object as Greenplum Database external table columns, use dot (.) projection:

user.id
user.location

coordinates is an object composed of a text field named type and an array of integers named values.

To retrieve all of the elements of the values array in a single column, define the corresponding Greenplum Database external table column as type TEXT[].

"coordinates.values" TEXT[]

Refer to Introducing JSON for detailed information on JSON syntax.

JSON to Greenplum Database Data Type Mapping

To represent JSON data in Greenplum Database, map data values that use a primitive data type to Greenplum Database columns of the same type. JSON also supports the complex data types object and array. Use N-level dot (.) projection to map the members of nested objects and arrays to primitive data types.

The following table summarizes external mapping rules for JSON data.

Table 1. JSON Mapping

JSON Data Type | PXF/Greenplum Data Type
Primitive type (integer, float, string, boolean, null) | Use the corresponding Greenplum Database built-in data type; see Greenplum Database Data Types.
Array | Use TEXT[] to retrieve the JSON array as a Greenplum text array.
Object | Use dot (.) notation to specify each level of projection (nesting) to a member of a primitive or array type.

JSON Data Read Modes

PXF supports two data read modes. The default mode expects one full JSON record per line. PXF also supports a read mode operating on JSON records that span multiple lines.

In upcoming examples, you will use both read modes to operate on a sample data set. The schema of the sample data set defines objects with the following member names and value data types:

  • “created_at” - text
  • “id_str” - text
  • “user” - object
    • “id” - integer
    • “location” - text
  • “coordinates” - object (optional)
    • “type” - text
    • “values” - array
      • [0] - integer
      • [1] - integer

The single-JSON-record-per-line data set follows:

{"created_at":"FriJun0722:45:03+00002013","id_str":"343136551322136576","user":{"id":395504494,"location":"NearCornwall"},"coordinates":{"type":"Point","values": [ 6, 50 ]}},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547115253761","user":{"id":26643566,"location":"Austin,Texas"}, "coordinates": null},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547136233472","user":{"id":287819058,"location":""}, "coordinates": null}

The multi-line JSON record data set follows:

{
  "root":[
    {
      "record_obj":{
        "created_at":"MonSep3004:04:53+00002013",
        "id_str":"384529256681725952",
        "user":{
          "id":31424214,
          "location":"COLUMBUS"
        },
        "coordinates":null
      },
      "record_obj":{
        "created_at":"MonSep3004:04:54+00002013",
        "id_str":"384529260872228864",
        "user":{
          "id":67600981,
          "location":"KryberWorld"
        },
        "coordinates":{
          "type":"Point",
          "values":[
             8,
             52
          ]
        }
      }
    }
  ]
}

You will create JSON files for the sample data sets and add them to HDFS in the next section.

Loading the Sample JSON Data to HDFS

The PXF HDFS connector reads native JSON stored in HDFS. Before you can use Greenplum Database to query JSON format data, the data must reside in your HDFS data store.

Copy and paste the single-line JSON record sample data set above into a file named singleline.json. Similarly, copy and paste the multi-line JSON record data set into a file named multiline.json.

Note: Ensure that there are no blank lines in your JSON files.

Copy the JSON data files that you just created to your HDFS data store. Create the /data/pxf_examples directory if you did not do so in a previous exercise. For example:

$ hdfs dfs -mkdir /data/pxf_examples
$ hdfs dfs -put singleline.json /data/pxf_examples
$ hdfs dfs -put multiline.json /data/pxf_examples

Once the data is loaded to HDFS, you can use Greenplum Database and PXF to query and analyze the JSON data.

Creating the External Table

Use the hdfs:json profile to read JSON-format files from HDFS. The following syntax creates a Greenplum Database readable external table that references such a file:

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>?PROFILE=hdfs:json[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

The specific keywords and values used in the Greenplum Database CREATE EXTERNAL TABLE command are described in the table below.

Keyword | Value
<path-to-hdfs-file> | The path to the directory or file in the HDFS data store. When the <server_name> configuration includes a pxf.fs.basePath property setting, PXF considers <path-to-hdfs-file> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. <path-to-hdfs-file> must not specify a relative path or include the dollar sign ($) character.
PROFILE | The PROFILE keyword must specify hdfs:json.
SERVER=<server_name> | The named server configuration that PXF uses to access the data. PXF uses the default server if not specified.
<custom-option> | <custom-option>s are discussed below.
FORMAT 'CUSTOM' | Use FORMAT 'CUSTOM' with the hdfs:json profile. The CUSTOM FORMAT requires that you specify (FORMATTER='pxfwritable_import').

PXF supports single- and multi-line JSON records. When you want to read multi-line JSON records, you must provide an IDENTIFIER <custom-option> and value. Use this <custom-option> to identify the name of a field whose parent JSON object you want to be returned as individual tuples.

The hdfs:json profile supports the following <custom-option>s:

Option Keyword | Syntax, Example(s) | Description
IDENTIFIER | &IDENTIFIER=<value>; for example, &IDENTIFIER=created_at | You must include the IDENTIFIER keyword and <value> in the LOCATION string only when you are accessing JSON data comprised of multi-line records. Use the <value> to identify the name of the field whose parent JSON object you want to be returned as individual tuples.
IGNORE_MISSING_PATH | &IGNORE_MISSING_PATH=<boolean> | Specify the action to take when <path-to-hdfs-file> is missing or invalid. The default value is false; in this case, PXF returns an error. When the value is true, PXF ignores missing path errors and returns an empty fragment.
Note: When a nested object in a multi-line record JSON file includes a field with the same name as that of a parent object field and the field name is also specified as the IDENTIFIER, there is a possibility that PXF could return incorrect results. Should you need to, you can work around this edge case by compressing the JSON file, and having PXF read the compressed file.
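
For example, the following sketch combines the SERVER and IGNORE_MISSING_PATH options in a single LOCATION clause. The server name hdfs_ha and the file name maybe_missing.json are placeholders for illustration only:

-- Illustrative only: "hdfs_ha" is a hypothetical named server configuration and
-- the file path is a placeholder. With IGNORE_MISSING_PATH=true, PXF returns no
-- rows instead of an error if the path does not exist.
CREATE EXTERNAL TABLE optional_json_tbl(
  created_at TEXT,
  id_str TEXT,
  "user.id" INTEGER,
  "user.location" TEXT,
  "coordinates.values" TEXT[]
)
LOCATION('pxf://data/pxf_examples/maybe_missing.json?PROFILE=hdfs:json&SERVER=hdfs_ha&IGNORE_MISSING_PATH=true')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');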

Example: Reading a JSON File with Single Line Records

Use the following CREATE EXTERNAL TABLE SQL command to create a readable external table that references the single-line-per-record JSON data file and uses the PXF default server.

CREATE EXTERNAL TABLE singleline_json_tbl(
  created_at TEXT,
  id_str TEXT,
  "user.id" INTEGER,
  "user.location" TEXT,
  "coordinates.values" TEXT[]
)
LOCATION('pxf://data/pxf_examples/singleline.json?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

Notice the use of dot (.) projection to access the nested fields in the user and coordinates objects.

To query the JSON data in the external table:

SELECT * FROM singleline_json_tbl;

To access specific elements of the coordinates.values array, specify the array subscript in square brackets (Greenplum Database array subscripts start at 1):

SELECT "coordinates.values"[1], "coordinates.values"[2] FROM singleline_json_tbl;

To access the array elements as some type other than TEXT, you can either cast the whole column:

SELECT "coordinates.values"::int[] FROM singleline_json_tbl;

or cast specific elements:

SELECT "coordinates.values"[1]::int, "coordinates.values"[2]::float FROM singleline_json_tbl;

Example: Reading a JSON File with Multi-Line Records

The SQL command to create a readable external table from the multi-line-per-record JSON file is very similar to that of the single-line data set above. You must additionally specify the LOCATION clause IDENTIFIER keyword and an associated value when you want to read multi-line JSON records. For example:

CREATE EXTERNAL TABLE multiline_json_tbl(
  created_at TEXT,
  id_str TEXT,
  "user.id" INTEGER,
  "user.location" TEXT,
  "coordinates.values" TEXT[]
)
LOCATION('pxf://data/pxf_examples/multiline.json?PROFILE=hdfs:json&IDENTIFIER=created_at')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

created_at identifies the member name of the first field in the JSON record record_obj in the sample data schema.

To query the JSON data in this external table:

SELECT * FROM multiline_json_tbl;
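
Because the dotted column names are quoted identifiers, you must also double-quote them in your queries. For example, the following sketch lists only the records that include coordinates:

SELECT "user.id", "user.location", "coordinates.values"
FROM multiline_json_tbl
WHERE "coordinates.values" IS NOT NULL;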

Other Methods to Read a JSON Array

Starting in version 6.2.0, PXF supports reading a JSON array into a TEXT[] column. PXF still supports the old methods of using array element projection or a single text-type column to read a JSON array. These access methods are described here.

Using Array Element Projection

PXF supports accessing specific elements of a JSON array by using the [n] syntax in the column definition to identify the element at (zero-based) index n. For example:

CREATE EXTERNAL TABLE singleline_json_tbl_aep(
  created_at TEXT,
  id_str TEXT,
  "user.id" INTEGER,
  "user.location" TEXT,
  "coordinates.values[0]" INTEGER,
  "coordinates.values[1]" INTEGER
)
LOCATION('pxf://data/pxf_examples/singleline.json?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

Note: When you use this method to identify specific array elements, PXF provides only those values to Greenplum Database, not the whole JSON array.
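
For example, a query such as the following (a sketch against the table defined above) returns each projected array element as its own INTEGER column:

SELECT id_str, "coordinates.values[0]", "coordinates.values[1]"
FROM singleline_json_tbl_aep;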

If your existing external table definition uses array element projection and you want to read the array into a TEXT[] column, you can use the ALTER EXTERNAL TABLE command to update the table definition. For example:

ALTER EXTERNAL TABLE singleline_json_tbl_aep DROP COLUMN "coordinates.values[0]", DROP COLUMN "coordinates.values[1]", ADD COLUMN "coordinates.values" TEXT[];

If you choose to alter the external table definition in this manner, be sure to update any existing queries on the external table to account for the changes to column name and type.

Specifying a Single Text-type Column

PXF supports accessing all of the elements within an array as a single string containing the serialized JSON array by defining the corresponding Greenplum table column with one of the following data types: TEXT, VARCHAR, or BPCHAR.

CREATE EXTERNAL TABLE singleline_json_tbl_stc(
  created_at TEXT,
  id_str TEXT,
  "user.id" INTEGER,
  "user.location" TEXT,
  "coordinates.values" TEXT
)
LOCATION('pxf://data/pxf_examples/singleline.json?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

If you retrieve the JSON array in a single text-type column and wish to convert the JSON array serialized as TEXT back into a native Greenplum array type, you can use the example query below:

SELECT "user.id",
       ARRAY(SELECT json_array_elements_text("coordinates.values"::json))::int[] AS coords
FROM singleline_json_tbl_stc;

Note: This conversion is possible only when you are using PXF with Greenplum Database 6.x; the function json_array_elements_text() is not available in Greenplum 5.x.

If your external table definition uses a single text-type column for a JSON array and you want to read the array into a TEXT[] column, you can use the ALTER EXTERNAL TABLE command to update the table definition. For example:

ALTER EXTERNAL TABLE singleline_json_tbl_stc ALTER COLUMN "coordinates.values" TYPE TEXT[];

If you choose to alter the external table definition in this manner, be sure to update any existing queries on the external table to account for the change in column type.
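
For example, after the ALTER above, the json_array_elements_text() conversion shown earlier is no longer needed; a query can cast the text array column directly (a sketch assuming the altered definition):

-- After the column type changes to TEXT[], cast the array directly.
SELECT "user.id",
       "coordinates.values"::int[] AS coords
FROM singleline_json_tbl_stc;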
