gpmapreduce configuration file.

%YAML 1.1
---
[VERSION](#VERSION): 1.0.0.2
[DATABASE](#DATABASE): dbname
[USER](#USER): db_username
[HOST](#HOST): master_hostname
[PORT](#PORT): master_port
[DEFINE](#DEFINE):
  - [INPUT](#INPUT):
      [NAME](#NAME): input_name
      [FILE](#FILE):
        - *hostname*:/path/to/file
      [GPFDIST](#GPFDIST):
        - *hostname*:port/file_pattern
      [TABLE](#TABLE): table_name
      [QUERY](#QUERY): SELECT_statement
      [EXEC](#EXEC): command_string
      [COLUMNS](#COLUMNS):
        - field_name data_type
      [FORMAT](#FORMAT): TEXT | CSV
      [DELIMITER](#DELIMITER): delimiter_character
      [ESCAPE](#ESCAPE): escape_character
      [NULL](#NULL): null_string
      [QUOTE](#QUOTE): csv_quote_character
      [ERROR\_LIMIT](#ERROR_LIMIT): integer
      [ENCODING](#ENCODING): database_encoding
  - [OUTPUT](#OUTPUT):
      [NAME](#OUTPUTNAME): output_name
      [FILE](#OUTPUTFILE): file_path_on_client
      [TABLE](#OUTPUTTABLE): table_name
      [KEYS](#KEYS):
        - column_name
      [MODE](#MODE): REPLACE | APPEND
  - [MAP](#MAP):
      [NAME](#NAME): function_name
      [FUNCTION](#FUNCTION): function_definition
      [LANGUAGE](#LANGUAGE): perl | python | c
      [LIBRARY](#LIBRARY): /path/filename.so
      [PARAMETERS](#PARAMETERS):
        - name type
      [RETURNS](#RETURNS):
        - name type
      [OPTIMIZE](#OPTIMIZE): STRICT IMMUTABLE
      [MODE](#MODE): SINGLE | MULTI
  - [TRANSITION \| CONSOLIDATE \| FINALIZE](#TCF):
      [NAME](#TCFNAME): function_name
      [FUNCTION](#FUNCTION): function_definition
      [LANGUAGE](#LANGUAGE): perl | python | c
      [LIBRARY](#LIBRARY): /path/filename.so
      [PARAMETERS](#PARAMETERS):
        - name type
      [RETURNS](#RETURNS):
        - name type
      [OPTIMIZE](#OPTIMIZE): STRICT IMMUTABLE
      [MODE](#TCFMODE): SINGLE | MULTI
  - [REDUCE](#REDUCE):
      [NAME](#REDUCENAME): reduce_job_name
      [TRANSITION](#TRANSITION): transition_function_name
      [CONSOLIDATE](#CONSOLIDATE): consolidate_function_name
      [FINALIZE](#FINALIZE): finalize_function_name
      [INITIALIZE](#INITIALIZE): value
      [KEYS](#REDUCEKEYS):
        - key_name
  - [TASK](#TASK):
      [NAME](#TASKNAME): task_name
      [SOURCE](#SOURCE): input_name
      [MAP](#TASKMAP): map_function_name
      [REDUCE](#REDUCE): reduce_function_name
[EXECUTE](#EXECUTE):
  - [RUN](#RUN):
      [SOURCE](#EXECUTESOURCE): input_or_task_name
      [TARGET](#TARGET): output_name
      [MAP](#EXECUTEMAP): map_function_name
      [REDUCE](#EXECUTEREDUCE): reduce_function_name
You specify the input, map and reduce tasks, and the output for the Greenplum MapReduce gpmapreduce program in a YAML-formatted configuration file. (This reference page uses the name gpmapreduce.yaml when referring to this file; you may choose your own name for the file.)

The gpmapreduce utility processes the YAML configuration file in order, using indentation (spaces) to determine the document hierarchy and the relationships between the sections. The use of white space in the file is significant.
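For orientation, here is a minimal word-count style document under this layout. This is a sketch only: the database, input, and function names are hypothetical, and the Python body uses the yield-based MULTI convention described later on this page.

```yaml
%YAML 1.1
---
VERSION: 1.0.0.2
DATABASE: webdata          # hypothetical database name
USER: gpadmin
DEFINE:
  - INPUT:
      NAME: doc_input
      TABLE: documents     # assumes an existing table with a text column named value
  - MAP:
      NAME: wordsplit
      LANGUAGE: python
      FUNCTION: |
        for word in value.split():
          yield [word, 1]
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - value text
      RETURNS:
        - key text
        - value integer
EXECUTE:
  - RUN:
      SOURCE: doc_input
      MAP: wordsplit
      REDUCE: SUM          # predefined REDUCE job
```

Such a file would be run with `gpmapreduce -f gpmapreduce.yaml`.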
VERSION: Required. The version of the Greenplum MapReduce YAML specification.

DATABASE: Optional. Specifies which database in Greenplum to connect to. If not specified, defaults to the default database or `$PGDATABASE` if set.
USER: Optional. Specifies which database role to use to connect. If not specified, defaults to the current user or `$PGUSER` if set. You must be a Greenplum superuser to run functions written in untrusted Python and Perl. Regular database users can run functions written in trusted Perl. You also must be a database superuser to run MapReduce jobs that contain FILE, GPFDIST and EXEC input types.
HOST: Optional. Specifies the Greenplum master host name. If not specified, defaults to localhost or `$PGHOST` if set.
PORT: Optional. Specifies the Greenplum master port. If not specified, defaults to 5432 or `$PGPORT` if set.
DEFINE: Required. A sequence of definitions for this MapReduce document. The DEFINE section must have at least one INPUT definition.
INPUT: Required. Defines the input data. Every MapReduce document must have at least one input defined. Multiple input definitions are allowed, and an input definition can be a file, a gpfdist file reference, a table in the database, an SQL command, or an operating system command.

NAME: A name for this input. Names must be unique with regards to the names of other objects in this MapReduce job (such as map function, task, reduce function and output names). Also, names cannot conflict with existing objects in the database (such as tables, functions or views).
FILE: A sequence of one or more input files, in the format: seghostname:/path/to/filename. You must be a Greenplum Database superuser to run MapReduce jobs with FILE input. The file must reside on a Greenplum segment host.
GPFDIST: A sequence of one or more running gpfdist file servers, in the format: hostname[:port]/file_pattern. You must be a Greenplum Database superuser to run MapReduce jobs with GPFDIST input.
TABLE: The name of an existing table in the database.

QUERY: A SELECT command to run within the database.

EXEC: An operating system command to run on the Greenplum segment hosts. You must be a Greenplum Database superuser to run MapReduce jobs with EXEC input.
COLUMNS: Optional. Columns are specified as: `column_name [data_type]`. If not specified, the default is `value text`. The DELIMITER character is what separates two data value fields (columns). A row is determined by a line feed character (0x0a).
FORMAT: Optional. Specifies the format of the data: either delimited text (TEXT) or comma separated values (CSV) format. If the data format is not specified, defaults to TEXT.
DELIMITER: Optional. Specifies a single character that separates data values. The default is a tab character in TEXT mode, a comma in CSV mode. The delimiter character must only appear between any two data value fields. Do not place a delimiter at the beginning or end of a row.
ESCAPE: Optional. Specifies the single character that is used for C escape sequences (such as \n, \t, \100, and so on) and for escaping data characters that might otherwise be taken as row or column delimiters. Make sure to choose an escape character that is not used anywhere in your actual column data. The default escape character is a \ (backslash) for text-formatted files and a " (double quote) for csv-formatted files; however, it is possible to specify another character to represent an escape. It is also possible to deactivate escaping by specifying the value 'OFF' as the escape value. This is useful for data such as text-formatted web log data that has many embedded backslashes that are not intended to be escapes.
NULL: Optional. Specifies the string that represents a null value. The default is \N in TEXT format, and an empty value with no quotations in CSV format. You might prefer an empty string even in TEXT mode for cases where you do not want to distinguish nulls from empty strings. Any input data item that matches this string will be considered a null value.
QUOTE: Optional. Specifies the quotation character for CSV formatted files. The default is a double quote ("). In CSV formatted files, data value fields must be enclosed in double quotes if they contain any commas or embedded new lines. Fields that contain double quote characters must be surrounded by double quotes, and the embedded double quotes must each be represented by a pair of consecutive double quotes. It is important to always open and close quotes correctly in order for data rows to be parsed correctly.
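The doubled-quote rule can be checked with any CSV parser; for example, in Python (illustrative only, outside gpmapreduce itself):

```python
import csv
import io

# A CSV row whose middle field contains a comma and an embedded
# double quote; the embedded quote is written as two consecutive quotes.
raw = '"alpha","say ""hi"", please","omega"\r\n'

row = next(csv.reader(io.StringIO(raw)))
print(row)  # ['alpha', 'say "hi", please', 'omega']
```

The same doubling convention applies when preparing CSV input files for a MapReduce job.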
ERROR_LIMIT: Optional. Input rows with format errors are discarded, provided that the error limit count is not reached on any Greenplum segment instance during input processing. If the error limit is reached, the job fails.

ENCODING: Optional. Character set encoding to use for the data. Specify a string constant (such as 'SQL_ASCII'), an integer encoding number, or DEFAULT to use the default client encoding. See Character Set Support for more information.
OUTPUT: Optional. Defines where to output the formatted data of this MapReduce job. If output is not defined, the default is STDOUT (standard output of the client). You can send output to a file on the client host or to an existing table in the database.
NAME: A name for this output. The default output name is STDOUT. Names must be unique with regards to the names of other objects in this MapReduce job (such as map function, task, reduce function and input names). Also, names cannot conflict with existing objects in the database (such as tables, functions or views).
FILE: Specifies a file location on the MapReduce client machine to output data, in the format: /path/to/filename.
TABLE: Specifies the name of a table in the database to output data to.

KEYS: Optional. The column name(s) to use as the table distribution key. If the EXECUTE task contains a REDUCE definition, then the REDUCE keys will be used as the table distribution key by default. Otherwise, the first column of the table will be used as the distribution key.
MODE: Optional. Specifies how output is written to the table. Declaring APPEND adds output data to an existing table (provided the table schema matches the output format) without removing any existing data. Declaring REPLACE will drop the table if it exists and then recreate it. Both APPEND and REPLACE will create a new table if one does not exist.
MAP: Required. Each MAP function takes data structured in (key, value) pairs, processes each pair, and generates zero or more output (key, value) pairs. The Greenplum MapReduce framework then collects all pairs with the same key from all output lists and groups them together. This output is then passed to the REDUCE task, which is comprised of TRANSITION | CONSOLIDATE | FINALIZE functions. There is one predefined MAP function named IDENTITY that returns (key, value) pairs unchanged. Although (key, value) are the default parameters, you can specify other prototypes as needed.
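Outside the database, the contract of a MAP function can be sketched in plain Python (a hypothetical helper for illustration, not the gpmapreduce runtime):

```python
def wordsplit_map(key, value):
    """Hypothetical MAP function: emit one (word, 1) pair per word.

    A MAP function receives one (key, value) pair and may emit zero
    or more output pairs; here the input key is ignored.
    """
    for word in value.split():
        yield (word, 1)

pairs = list(wordsplit_map('doc1', 'to be or not to be'))
print(pairs)
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```

The framework would then group these pairs by key ('to', 'be', ...) before handing them to the REDUCE stage.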
TRANSITION | CONSOLIDATE | FINALIZE: TRANSITION, CONSOLIDATE and FINALIZE are all component pieces of REDUCE. A TRANSITION function is required. CONSOLIDATE and FINALIZE functions are optional. By default, all take state as the first of their input PARAMETERS, but other prototypes can be defined as well.
A TRANSITION function iterates through each value of a given key and accumulates values in a state variable. When the transition function is called on the first value of a key, the state is set to the value specified by INITIALIZE of a REDUCE job (or the default state value for the data type). A transition takes two arguments as input: the current state of the key reduction, and the next value, which then produces a new state.
If a CONSOLIDATE function is specified, TRANSITION processing is performed at the segment-level before redistributing the keys across the Greenplum interconnect for final aggregation (two-phase aggregation). Only the resulting state value for a given key is redistributed, resulting in lower interconnect traffic and greater parallelism. CONSOLIDATE is handled like a TRANSITION, except that instead of (state + value) => state, it is (state + state) => state.
If a FINALIZE function is specified, it takes the final state produced by CONSOLIDATE (if present) or TRANSITION and does any final processing before emitting the final result. TRANSITION and CONSOLIDATE functions cannot return a set of values. If you need a REDUCE job to return a set, then a FINALIZE is necessary to transform the final state into a set of output values.
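An average is the classic case that needs a FINALIZE: the running state must carry both a sum and a count, and only the final step divides. Sketched in plain Python (illustrative, not the in-database functions):

```python
def avg_transition(state, value):
    # state is a (sum, count) pair
    s, n = state
    return (s + value, n + 1)

def avg_consolidate(state1, state2):
    # merge two (sum, count) partial states
    return (state1[0] + state2[0], state1[1] + state2[1])

def avg_finalize(state):
    # only here does the state become the final result
    s, n = state
    return s / n

state = (0, 0)  # the INITIALIZE value
for v in [2, 4, 9]:
    state = avg_transition(state, v)
print(avg_finalize(state))  # 5.0
```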
FUNCTION: Optional. Specifies the full definition of the function. If FUNCTION is not specified, then a built-in database function corresponding to NAME is used.
LANGUAGE: Required when FUNCTION is declared. Specifies the language used to interpret the function. Available languages are perl, python, and C. If calling a built-in database function, LANGUAGE should not be specified.
PARAMETERS: Optional. Function input parameters. The default type is text.

- MAP default - key text, value text
- TRANSITION default - state text, value text
- CONSOLIDATE default - state1 text, state2 text (must have exactly two input parameters of the same data type)
- FINALIZE default - state text (single parameter only)
RETURNS: Optional. Function return values. The default type is text.

- MAP default - key text, value text
- TRANSITION default - state text (single return value only)
- CONSOLIDATE default - state text (single return value only)
- FINALIZE default - value text
OPTIMIZE: Optional optimization parameters for the function:

- STRICT - function is not affected by NULL values
- IMMUTABLE - function will always return the same value for a given input
MODE: Optional. Specifies the number of rows the function returns.

- MULTI - returns 0 or more rows per input record. The return value of the function must be an array of rows to return, or the function must be written as an iterator using yield in Python or return_next in Perl. MULTI is the default mode for MAP and FINALIZE functions.
- SINGLE - returns exactly one row per input record. SINGLE is the only mode supported for TRANSITION and CONSOLIDATE functions. When used with MAP and FINALIZE functions, SINGLE mode can provide modest performance improvement.
REDUCE: Required. A REDUCE definition names the TRANSITION | CONSOLIDATE | FINALIZE functions that comprise the reduction of (key, value) pairs to the final result set. There are also several predefined REDUCE jobs you can run, which all operate over a column named value:

- IDENTITY - returns (key, value) pairs unchanged
- SUM - calculates the sum of numeric data
- AVG - calculates the average of numeric data
- COUNT - calculates the count of input data
- MIN - calculates the minimum value of numeric data
- MAX - calculates the maximum value of numeric data
NAME: Required. The name of this REDUCE job. Names must be unique with regards to the names of other objects in this MapReduce job (function, task, input and output names). Also, names cannot conflict with existing objects in the database (such as tables, functions or views).
TRANSITION: Required. The name of the TRANSITION function.

CONSOLIDATE: Optional. The name of the CONSOLIDATE function.

FINALIZE: Optional. The name of the FINALIZE function.
INITIALIZE: Optional for text and float data types; required for all other data types. The default value for text is ''. The default value for float is 0.0. Sets the initial state value of the TRANSITION function.
KEYS: Optional. Defaults to [key, *]. When using a multi-column reduce it may be necessary to specify which columns are key columns and which columns are value columns. By default, any input columns that are not passed to the TRANSITION function are key columns, and a column named key is always a key column even if it is passed to the TRANSITION function. The special indicator * indicates all columns not passed to the TRANSITION function. If this indicator is not present in the list of keys then any unmatched columns are discarded.
TASK: Optional. A TASK defines a complete end-to-end INPUT/MAP/REDUCE stage within a Greenplum MapReduce job pipeline. It is similar to EXECUTE except it is not immediately run. A task object can be called as INPUT to further processing stages.
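For example, a TASK can prepare mapped pairs that a later RUN block reduces (all names here are hypothetical):

```yaml
DEFINE:
  - INPUT:
      NAME: doc_input
      TABLE: documents        # assumes an existing table
  - TASK:
      NAME: prepare_words     # not run by itself
      SOURCE: doc_input
      MAP: wordsplit          # a MAP function defined elsewhere in the file
EXECUTE:
  - RUN:
      SOURCE: prepare_words   # the TASK acts as the input here
      REDUCE: SUM
```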
NAME: Required. The name of this TASK.

SOURCE: Required. The name of an INPUT or another TASK.

MAP: Optional. The name of a MAP function. If not specified, defaults to IDENTITY.

REDUCE: Optional. The name of a REDUCE function. If not specified, defaults to IDENTITY.
EXECUTE: Required. EXECUTE defines the final INPUT/MAP/REDUCE stage within a Greenplum MapReduce job pipeline.
SOURCE: Required. The name of an INPUT or TASK.

TARGET: Optional. The name of an OUTPUT. The default output is STDOUT.

MAP: Optional. The name of a MAP function. If not specified, defaults to IDENTITY.

REDUCE: Optional. The name of a REDUCE function. If not specified, defaults to IDENTITY.