Use the PXF HDFS Connector to read and write JSON-format data. This section describes how to use PXF and Greenplum Database external tables to read JSON data from, and write JSON data to, HDFS.

Prerequisites

Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read data from or write data to HDFS.

Working with JSON Data

JSON is a text-based data-interchange format. A JSON data file contains one or more JSON objects. A JSON object is an unordered collection of name/value pairs. A value can be a string, a number, true, false, null, an object, or an array. You can define nested JSON objects and arrays.

JSON data is typically stored in a file with a .json or .jsonl (JSON Lines) suffix as described in the sections below.

About the PXF JSON Data Access Modes

PXF supports two data access modes for JSON files. The default mode expects one full JSON record per row (JSONL). PXF also supports an access mode that expects one JSON object per file where the JSON records may (but are not required to) span multiple lines.

Single Object Per Row

A JSON file can contain a single JSON object per row, where each row represents a database tuple. A single-object-per-row JSON file that PXF reads may have any suffix, or none at all. When writing, PXF creates the file with a .jsonl suffix.

Excerpt of sample single-object-per-row JSON data file:

{"id":1,"color":"red"}
{"id":2,"color":"yellow"}
{"id":3,"color":"green"}

Refer to JSON Lines for detailed information about this JSON syntax.

Single Object Per File

A JSON file can also contain a single, named, root level JSON object whose value is an array of JSON objects. When reading, the array may contain objects with arbitrary complexity and nesting, and PXF forms database tuples from objects that have a property named the same as that specified for the IDENTIFIER (discussed below). When writing, each JSON object in the array represents a database tuple. JSON files of this type have the .json suffix.

In the following example JSON data file, the root-level records object is an array of three objects (tuples):

{"records":[
{"id":1,"color":"red"}
,{"id":2,"color":"yellow"}
,{"id":3,"color":"green"}
]}

The records in the single JSON object may also span multiple lines:

{
  "records":[
    {
      "id":1,
      "color":"red"
    },
    {
      "id":2,
      "color":"yellow"
    },
    {
      "id":3,
      "color":"green"
    }
  ]
}

Refer to Introducing JSON for detailed information about this JSON syntax.

Data Type Mapping

To represent JSON data in Greenplum Database, map data values that use a primitive data type to Greenplum Database columns of the same type. JSON supports complex data types including objects and arrays.

Read Mapping

PXF uses the following data type mapping when reading JSON data:

JSON Data Type -> PXF/Greenplum Data Type

boolean -> boolean
number -> { bigint | float8 | integer | numeric | real | smallint }
string -> text
string (base64-encoded value) -> bytea
string (date, time, timestamp, timestamptz in a text format that Greenplum understands) [1] -> { date | time | timestamp | timestamptz }
Array (one dimension) of type boolean[] -> boolean[]
Array (one dimension) of type number[] -> { bigint[] | float8[] | integer[] | numeric[] | real[] | smallint[] }
Array (one dimension) of type string[] (base64-encoded value) -> bytea[]
Array (one dimension) of type string[] (date, time, timestamp in a text format that Greenplum understands) [1] -> { date[] | time[] | timestamp[] | timestamptz[] }
Array (one dimension) of type string[] -> text[]
Array of other types -> text[]
Object -> Use dot (.) notation to specify each level of projection (nesting) to a member of a primitive or Array type.

[1] PXF returns an error if Greenplum cannot convert the date or time string to the target type.

When reading, you can use N-level projection to map members of nested objects and arrays to primitive data types.
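For instance, the following sketch applies several of these read mappings; the table name, column names, and HDFS path are invented for illustration and are not part of the sample data sets used later in this topic:

-- A sketch only; assumes a hypothetical JSON file whose records contain
-- a boolean member, a number member, a base64-encoded string member, and
-- a nested object named "stats" with a one-dimensional number array "scores"
CREATE EXTERNAL TABLE read_mapping_sketch_tbl (
  is_active      boolean,    -- JSON boolean
  view_count     bigint,     -- JSON number
  payload        bytea,      -- JSON string holding a base64-encoded value
  "stats.scores" float8[]    -- nested JSON number array, accessed with dot projection
)
LOCATION ('pxf://data/pxf_examples/read_mapping_sketch?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');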

Write Mapping

PXF supports writing primitive types and one-dimensional arrays of primitive types. PXF writes other complex types to JSON as strings.

PXF uses the following data type mapping when writing JSON data:

PXF/Greenplum Data Type -> JSON Data Type

bigint, float8, integer, numeric, real, smallint -> number
boolean -> boolean
bpchar, text, varchar -> string
bytea -> string (base64-encoded value)
date, time, timestamp, timestamptz -> string
boolean[] -> boolean[]
bigint[], float8[], int[], numeric[], real[], smallint[] -> number[]
bytea[] -> string[] (base64-encoded value)
date[], time[], timestamp[], timestamptz[] -> string[]
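As a sketch of these write mappings, the writable external table below (table name, column names, and path are hypothetical) exercises scalar, bytea, and array columns; each inserted row becomes one JSON record when written:

-- A sketch only; table, column, and path names are illustrative
CREATE WRITABLE EXTERNAL TABLE write_mapping_sketch_tbl (
  event_name  text,       -- written as a JSON string
  event_count bigint,     -- written as a JSON number
  is_valid    boolean,    -- written as a JSON boolean
  raw_payload bytea,      -- written as a JSON string containing the base64-encoded value
  readings    float8[]    -- written as a JSON number array
)
LOCATION ('pxf://data/pxf_examples/write_mapping_sketch?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');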

About Using Projection (Read)

In the example JSON data file excerpt below, user is an object composed of fields named id and location:

  {
    "created_at":"MonSep3004:04:53+00002013",
    "id_str":"384529256681725952",
    "user": {
      "id":31424214,
      "location":"COLUMBUS"
    },
    "coordinates":{
      "type":"Point",
      "values":[
         13,
         99
      ]
    }
  }

To specify the nested fields in the user object directly as Greenplum Database external table columns, use dot (.) projection:

user.id
user.location

coordinates is an object composed of a text field named type and an array of integers named values.

To read all of the elements of the values array in a single column, define the corresponding Greenplum Database external table column as type int[].

"coordinates.values" int[]

Creating the External Table

Use the hdfs:json profile to read or write JSON-format data in HDFS. The following syntax creates a Greenplum Database external table that references such a file:

CREATE [WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>?PROFILE=hdfs:json[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];

The specific keywords and values used in the Greenplum Database CREATE EXTERNAL TABLE command are described below.

<path-to-hdfs-file>
  The path to the directory or file in the HDFS data store. When the <server_name> configuration includes a pxf.fs.basePath property setting, PXF considers <path-to-hdfs-file> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. <path-to-hdfs-file> must not specify a relative path nor include the dollar sign ($) character.

PROFILE
  The PROFILE keyword must specify hdfs:json.

SERVER=<server_name>
  The named server configuration that PXF uses to access the data. PXF uses the default server if not specified.

<custom-option>
  <custom-option>s for read and write operations are identified below.

FORMAT 'CUSTOM'
  Use FORMAT 'CUSTOM' with (FORMATTER='pxfwritable_export') (write) or (FORMATTER='pxfwritable_import') (read).

PXF supports reading from and writing to JSON files that contain either one object per row (the default) or a single JSON object. When the JSON file(s) that you want to read contain a single object, you must provide the IDENTIFIER <custom-option> and value; use this option to identify the name of a field whose parent JSON object you want PXF to return as an individual tuple. When you want PXF to write a single JSON object per file, you must provide the ROOT <custom-option> described below.

The hdfs:json profile supports the following custom read options:

IDENTIFIER=<value>
  When the JSON data that you are reading is comprised of a single JSON object, you must specify an IDENTIFIER to identify the name of the field whose parent JSON object you want PXF to return as an individual tuple.

SPLIT_BY_FILE=<boolean>
  Specifies how PXF splits the data in <path-to-hdfs-file>. The default value is false; PXF creates multiple splits for each file that it will process in parallel. When set to true, PXF creates and processes a single split per file.

IGNORE_MISSING_PATH=<boolean>
  Specifies the action to take when <path-to-hdfs-file> is missing or invalid. The default value is false; PXF returns an error in this situation. When the value is true, PXF ignores missing path errors and returns an empty fragment.

Note: When a nested object in a single-object JSON file includes a field with the same name as that of a parent object field, and that field name is also specified as the IDENTIFIER, PXF could return incorrect results. Should you need to, you can work around this edge case by compressing the JSON file and using PXF to read the compressed file, as shown in the sketch below.
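For example, assuming a problematic single-object file named nested_dupes.json (a hypothetical file name), the workaround might look like the following; you then reference the compressed file in the LOCATION clause just as you would the uncompressed file:

$ gzip nested_dupes.json
$ hdfs dfs -put nested_dupes.json.gz /data/pxf_examples/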

The hdfs:json profile supports the following custom write options:

ROOT=<value>
  When writing a single JSON object per file, identifies the name of the root-level object attribute.

COMPRESSION_CODEC=<value>
  The compression codec alias. Supported compression codecs for writing JSON data include: default, bzip2, gzip, and uncompressed. If this option is not provided, Greenplum Database performs no data compression.

DISTRIBUTED BY
  If you are loading data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or <column_name> on both tables. Doing so will avoid extra motion of data between segments during the load operation; see the sketch below.
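As a sketch of the DISTRIBUTED BY recommendation, suppose the data to export already resides in a Greenplum table named sales that is distributed by order_id (both names are hypothetical); you could mirror that policy on the writable external table and then load it with INSERT ... SELECT:

-- A sketch only; sales, order_id, and the HDFS path are hypothetical
CREATE WRITABLE EXTERNAL TABLE sales_json_export_tbl (
  order_id integer,
  amount   numeric
)
LOCATION ('pxf://data/pxf_examples/sales_export?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export')
DISTRIBUTED BY (order_id);

INSERT INTO sales_json_export_tbl SELECT order_id, amount FROM sales;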

When you specify compression for a JSON write operation, PXF names the files that it writes <basename>.<json_file_type>.<compression_extension>. For example: jan_sales.jsonl.gz.
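For instance, the following sketch (the HDFS directory name is illustrative) creates a writable external table that gzip-compresses the single-object-per-row files it writes; the files land in the named directory with a .jsonl.gz suffix:

-- A sketch only; the HDFS directory name is hypothetical
CREATE WRITABLE EXTERNAL TABLE compressed_colors_tbl (
  id    integer,
  color text
)
LOCATION ('pxf://data/pxf_examples/compressed_colors?PROFILE=hdfs:json&COMPRESSION_CODEC=gzip')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');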

Read Examples

Example Data Sets

The read examples that follow use both JSON access modes to operate on a sample data set. The schema of the sample data set defines objects with the following member names and value data types:

  • "created_at" - text
  • "id_str" - text
  • "user" - object
    • "id" - integer
    • "location" - text
  • "coordinates" - object (optional)
    • "type" - text
    • "values" - array
      • [0] - integer
      • [1] - integer

The data set for the single-object-per-row (JSONL) access mode follows:

{"created_at":"FriJun0722:45:03+00002013","id_str":"343136551322136576","user":{"id":395504494,"location":"NearCornwall"},"coordinates":{"type":"Point","values": [ 6, 50 ]}},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547115253761","user":{"id":26643566,"location":"Austin,Texas"}, "coordinates": null},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547136233472","user":{"id":287819058,"location":""}, "coordinates": null}

The data set for the single-object-per-file JSON access mode follows:

{
  "root":[
    {
      "record_obj":{
        "created_at":"MonSep3004:04:53+00002013",
        "id_str":"384529256681725952",
        "user":{
          "id":31424214,
          "location":"COLUMBUS"
        },
        "coordinates":null
      }
    },
    {
      "record_obj":{
        "created_at":"MonSep3004:04:54+00002013",
        "id_str":"384529260872228864",
        "user":{
          "id":67600981,
          "location":"KryberWorld"
        },
        "coordinates":{
          "type":"Point",
          "values":[
             8,
             52
          ]
        }
      }
    }
  ]
}

You will create JSON files for the sample data sets and add them to HDFS in the next section.

Loading the Sample JSON Data to HDFS

The PXF HDFS connector can read and write native JSON stored in HDFS.

Copy and paste the single-object-per-row JSON sample data set above to a file named objperrow.jsonl. Similarly, copy and paste the single-object-per-file JSON data set to a file named singleobj.json.

Note

Ensure that there are no blank lines in your JSON files.

Copy the JSON data files that you just created to your HDFS data store. Create the /data/pxf_examples directory if you did not do so in a previous exercise. For example:

$ hdfs dfs -mkdir /data/pxf_examples
$ hdfs dfs -put objperrow.jsonl /data/pxf_examples/
$ hdfs dfs -put singleobj.json /data/pxf_examples/
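Optionally, list the directory to confirm that both files are in place (the listing details depend on your environment):

$ hdfs dfs -ls /data/pxf_examples/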

Once the data is loaded to HDFS, you can use Greenplum Database and PXF to query and add to the JSON data.

Example: Single Object Per Row (Read)

Use the following CREATE EXTERNAL TABLE SQL command to create a readable external table that references the single-object-per-row JSON data file and uses the PXF default server.

CREATE EXTERNAL TABLE objperrow_json_tbl(
  created_at TEXT,
  id_str TEXT,
  "user.id" INTEGER,
  "user.location" TEXT,
  "coordinates.values" INTEGER[]
)
LOCATION('pxf://data/pxf_examples/objperrow.jsonl?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

This table reads selected fields in the JSON file. Notice the use of dot (.) projection to access the nested fields in the user and coordinates objects.

To view the JSON data in the file, query the external table:

SELECT * FROM objperrow_json_tbl;

To access specific elements of the coordinates.values array, you can specify the array subscript number in square brackets:

SELECT "coordinates.values"[1], "coordinates.values"[2] FROM objperrow_json_tbl;

Example: Single Object Per File (Read)

The SQL command to create a readable external table for a single object JSON file is very similar to that of the single object per row data set above. You must additionally specify the LOCATION clause IDENTIFIER keyword and an associated value. For example:

CREATE EXTERNAL TABLE singleobj_json_tbl(
  created_at TEXT,
  id_str TEXT,
  "user.id" INTEGER,
  "user.location" TEXT,
  "coordinates.values" INTEGER[]
)
LOCATION('pxf://data/pxf_examples/singleobj.json?PROFILE=hdfs:json&IDENTIFIER=created_at')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

created_at identifies the member name of the first field in each record_obj object in the sample data set; PXF returns the parent object of every field with this name as an individual tuple.

To view the JSON data in the file, query the external table:

SELECT * FROM singleobj_json_tbl;

Other Methods to Read a JSON Array

Starting in version 6.2.0, PXF supports reading a JSON array into a TEXT[] column. PXF continues to support the older methods of using array element projection or a single text-type column to read a JSON array; these methods are described below.

Using Array Element Projection

PXF supports accessing specific elements of a JSON array using the syntax [n] in the table definition to identify the specific element.

CREATE EXTERNAL TABLE objperrow_json_tbl_aep(
  created_at TEXT,
  id_str TEXT,
  "user.id" INTEGER,
  "user.location" TEXT,
  "coordinates.values[0]" INTEGER,
  "coordinates.values[1]" INTEGER
)
LOCATION('pxf://data/pxf_examples/objperrow.jsonl?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

Note: When you use this method to identify specific array elements, PXF provides only those values to Greenplum Database, not the whole JSON array.

If your existing external table definition uses array element projection and you want to read the array into a TEXT[] column, you can use the ALTER EXTERNAL TABLE command to update the table definition. For example:

ALTER EXTERNAL TABLE objperrow_json_tbl_aep DROP COLUMN "coordinates.values[0]", DROP COLUMN "coordinates.values[1]", ADD COLUMN "coordinates.values" TEXT[];

If you choose to alter the external table definition in this manner, be sure to update any existing queries on the external table to account for the changes to column name and type.

Specifying a Single Text-type Column

PXF supports accessing all of the elements within an array as a single string containing the serialized JSON array by defining the corresponding Greenplum table column with one of the following data types: TEXT, VARCHAR, or BPCHAR.

CREATE EXTERNAL TABLE objperrow_json_tbl_stc(
  created_at TEXT,
  id_str TEXT,
  "user.id" INTEGER,
  "user.location" TEXT,
  "coordinates.values" TEXT
)
LOCATION('pxf://data/pxf_examples/objperrow.jsonl?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

If you retrieve the JSON array in a single text-type column and wish to convert the JSON array serialized as TEXT back into a native Greenplum array type, you can use the example query below:

SELECT "user.id",
       ARRAY(SELECT json_array_elements_text("coordinates.values"::json))::int[] AS coords
FROM objperrow_json_tbl_stc;

Note: This conversion is possible only when you are using PXF with Greenplum Database 6.x; the function json_array_elements_text() is not available in Greenplum 5.x.

If your external table definition uses a single text-type column for a JSON array and you want to read the array into a TEXT[] column, you can use the ALTER EXTERNAL TABLE command to update the table definition. For example:

ALTER EXTERNAL TABLE objperrow_json_tbl_stc ALTER COLUMN "coordinates.values" TYPE TEXT[];

If you choose to alter the external table definition in this manner, be sure to update any existing queries on the external table to account for the change in column type.

Writing JSON Data

To write JSON data, you create a writable external table that references the name of a directory on HDFS. When you insert records into the writable external table, PXF writes the block(s) of data that you insert to one or more files in the directory that you specified. In the default case (single object per row), PXF writes the data to a .jsonl file. When you specify a ROOT attribute (single object per file), PXF writes to a .json file.

Note

When writing JSON data, PXF supports only scalar or one dimensional arrays of Greenplum data types. PXF does not support column projection when writing JSON data.

Writable external tables can only be used for INSERT operations. If you want to query the data that you inserted, you must create a separate readable external table that references the HDFS directory and read from that table.

The write examples use a data schema similar to that of the read examples.

Example: Single Object Per Row (Write)

In this example, you add data to a directory named jsopr.

Use the following CREATE EXTERNAL TABLE SQL command to create a writable external table that writes JSON data in single-object-per-row format and uses the PXF default server.

CREATE WRITABLE EXTERNAL TABLE add_objperrow_json_tbl(
  created_at TEXT,
  id_str TEXT,
  id INTEGER,
  location TEXT,
  coordinates INTEGER[]
)
LOCATION('pxf://data/pxf_examples/jsopr?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');

Write data to the table:

INSERT INTO add_objperrow_json_tbl VALUES ( 'SunJun0912:59:07+00002013', '343136551111111111', 311111111, 'FarAway', '{ 6, 50 }' );
INSERT INTO add_objperrow_json_tbl VALUES ( 'MonJun1002:12:06+00002013', '343136557777777777', 377777777, 'NearHere', '{ 13, 93 }' );

Read the data that you just wrote. Recall that you must first create a readable external table:

CREATE EXTERNAL TABLE jsopr_tbl(
  created_at TEXT,
  id_str TEXT,
  id INTEGER,
  location TEXT,
  coordinates INTEGER[]
)
LOCATION('pxf://data/pxf_examples/jsopr?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

Query the table:

SELECT * FROM jsopr_tbl;

        created_at         |       id_str       |    id     | location | coordinates 
---------------------------+--------------------+-----------+----------+-------------
 MonJun1002:12:06+00002013 | 343136557777777777 | 377777777 | NearHere | {13,93}
 SunJun0912:59:07+00002013 | 343136551111111111 | 311111111 | FarAway  | {6,50}
(2 rows)

View the files added to HDFS:

$ hdfs dfs -cat /data/pxf_examples/jsopr/*
{"created_at":"SunJun0912:59:07+00002013","id_str":"343136551111111111","id":311111111,"location":"FarAway","coordinates":[6,50]}
{"created_at":"MonJun1002:12:06+00002013","id_str":"343136557777777777","id":377777777,"location":"NearHere","coordinates":[13,93]}

Notice that PXF creates a flat JSON structure.

Example: Single Object Per File (Write)

Use the following CREATE EXTERNAL TABLE SQL command to create a writable external table that writes JSON data in single object format and uses the PXF default server.

You must specify the ROOT keyword and associated value in the LOCATION clause. For example:

CREATE WRITABLE EXTERNAL TABLE add_singleobj_json_tbl(
  created_at TEXT,
  id_str TEXT,
  id INTEGER,
  location TEXT,
  coordinates INTEGER[]
)
LOCATION('pxf://data/pxf_examples/jso?PROFILE=hdfs:json&ROOT=root')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');

root identifies the name of the root attribute of the single object.

Write data to the table:

INSERT INTO add_singleobj_json_tbl VALUES ( 'SunJun0912:59:07+00002013', '343136551111111111', 311111111, 'FarAway', '{ 6, 50 }' );
INSERT INTO add_singleobj_json_tbl VALUES ( 'WedJun1212:37:02+00002013', '333333333333333333', 333333333, 'NetherWorld', '{ 9, 63 }' );

Read the data that you just wrote. Recall that you must first create a new readable external table:

CREATE EXTERNAL TABLE jso_tbl(
  created_at TEXT,
  id_str TEXT,
  id INTEGER,
  location TEXT,
  coordinates INTEGER[]
)
LOCATION('pxf://data/pxf_examples/jso?PROFILE=hdfs:json&IDENTIFIER=created_at')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

The column names that you specify in the create command must match those of the writable external table. Recall that to read a JSON file that contains a single object, you must specify the IDENTIFIER option.

Query the table to read the data:

SELECT * FROM jso_tbl;

        created_at         |       id_str       |    id     |   location   | coordinates 
---------------------------+--------------------+-----------+--------------+-------------
 WedJun1212:37:02+00002013 | 333333333333333333 | 333333333 | NetherWorld  | {9,63}
 SunJun0912:59:07+00002013 | 343136551111111111 | 311111111 | FarAway      | {6,50}
(2 rows)

View the files added to HDFS:

$ hdfs dfs -cat /data/pxf_examples/jso/*
{"root":[
{"created_at":"SunJun0912:59:07+00002013","id_str":"343136551111111111","id":311111111,"location":"FarAway","coordinates":[6,50]}
]}
{"root":[
{"created_at":"WedJun1212:37:02+00002013","id_str":"333333333333333333","id":333333333,"location":"NetherWorld","coordinates":[9,63]}
]}