Runs Greenplum MapReduce jobs as defined in a YAML specification document.
gpmapreduce -f <yaml_file> [<dbname> [<username>]]
     [-k <name=value> | --key <name=value>]
     [-h <hostname> | --host <hostname>] [-p <port> | --port <port>]
     [-U <username> | --username <username>] [-W] [-v]
gpmapreduce -x | --explain
gpmapreduce -X | --explain-analyze
gpmapreduce -V | --version
gpmapreduce -h | --help
The following are required prior to running this program:
- You must be a Greenplum Database superuser to run MapReduce jobs with EXEC and FILE inputs.
- You must be a Greenplum Database superuser to run MapReduce jobs with GPFDIST input unless the user has the appropriate rights granted. See the Greenplum Database Reference Guide for more information.
MapReduce is a programming model developed by Google for processing and generating large data sets on an array of commodity servers. Greenplum MapReduce allows programmers who are familiar with the MapReduce paradigm to write map and reduce functions and submit them to the Greenplum Database parallel engine for processing.
For Greenplum to process MapReduce functions, the functions must be defined in a YAML document, which is then passed to the Greenplum MapReduce program, gpmapreduce, for execution by the Greenplum Database parallel engine. The Greenplum system takes care of the details of distributing the input data, executing the program across a set of machines, handling machine failures, and managing the required inter-machine communication.
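Connection Options
-h <hostname> | --host <hostname>: If not specified, reads from the environment variable PGHOST or defaults to localhost.
-p <port> | --port <port>: If not specified, reads from the environment variable PGPORT or defaults to 5432.
-U <username> | --username <username>: If not specified, reads from the environment variable PGUSER or defaults to the current system user name.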
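The fallback order for the connection settings (explicit value, then environment variable, then built-in default) can be sketched in shell. This is only an illustration of the lookup order, not how gpmapreduce is implemented; the function name resolve_conn is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper that mirrors the documented fallback order for
# the connection settings. It is not part of gpmapreduce.
resolve_conn() {
    # $1, $2, $3: optional explicit host, port, and user name,
    # standing in for values that -h, -p, and -U would supply.
    # An empty or missing argument falls through to the environment
    # variable, and then to the documented default.
    host="${1:-${PGHOST:-localhost}}"
    port="${2:-${PGPORT:-5432}}"
    user="${3:-${PGUSER:-$(id -un)}}"
    printf '%s %s %s\n' "$host" "$port" "$user"
}
```

For example, with PGHOST, PGPORT, and PGUSER all unset, calling resolve_conn with no arguments prints localhost, 5432, and the current system user name.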
Run a MapReduce job as defined in my_yaml.txt and connect to the database mydatabase:
gpmapreduce -f my_yaml.txt mydatabase
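When the master is not reachable through the defaults, the connection options from the synopsis can be given explicitly. The sketch below assembles such an invocation and prints it rather than running it, so it works even where gpmapreduce is not installed. The host name mdw, the port 5432, and the role gpadmin are placeholder values chosen for illustration, not values taken from this reference.

```shell
#!/bin/sh
# Placeholder values for illustration only.
yaml_file="my_yaml.txt"
dbname="mydatabase"
host="mdw"
port="5432"
user="gpadmin"

# Assemble the invocation using only flags listed in the synopsis,
# in the same order the synopsis shows them.
cmd="gpmapreduce -f $yaml_file $dbname -h $host -p $port -U $user"

# Dry run: print the command instead of executing it.
echo "$cmd"
```

Remove the dry-run step and execute the command directly once the values match your own system.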
See also: Greenplum MapReduce specification in the Greenplum Database Reference Guide.