With the explosion of data stores and cloud services, data now resides across many disparate systems and in a variety of formats. Data is often classified by its location, by the operations performed on it, and by how often it is accessed: real-time or transactional (hot), less frequent (warm), or archival (cold).
The diagram below describes a data source that tracks monthly sales across many years. Real-time operational data is stored in MySQL. Data subject to analytic and business intelligence operations is stored in Greenplum Database. The rarely accessed, archival data resides in AWS S3.
When multiple related data sets exist in external systems, it is often more efficient to join the data sets remotely and return only the results than to incur the time and storage costs of an expensive full data load operation. The Greenplum Platform Extension Framework (PXF) provides this capability. PXF is a Greenplum extension that offers parallel, high-throughput data access and federated query processing.
With PXF, you can use Greenplum and SQL to query these heterogeneous data sources:
And these data formats:
You use PXF to map data from an external source to a Greenplum Database external table definition. You can then use the PXF external table and SQL to:
Refer to the PXF introduction for a high-level overview of important PXF concepts.
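As an illustration of mapping an external source to an external table definition, the following minimal sketch defines a readable external table over archival CSV files in S3. The table name, column list, bucket path, and the `s3_archive` server configuration name are all hypothetical; the sketch assumes an administrator has already configured a PXF server with that name.

``` sql
-- Hypothetical example: names, columns, and paths are illustrative only.
-- Assumes a PXF server configuration named "s3_archive" already exists.
CREATE EXTERNAL TABLE pxf_sales_archive (
    region       text,
    sale_month   text,
    num_orders   int,
    total_sales  numeric(12,2)
)
LOCATION ('pxf://example-bucket/sales/archive/?PROFILE=s3:text&SERVER=s3_archive')
FORMAT 'CSV';

-- The external table can then be queried like any other table;
-- the data stays in S3 and is read at query time.
SELECT sale_month, sum(total_sales) AS monthly_total
FROM pxf_sales_archive
GROUP BY sale_month
ORDER BY sale_month;
```

The `PROFILE` option selects the data store and format, and `SERVER` names the configuration that holds the connection details, so credentials do not need to appear in the table definition.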
The Greenplum Database administrator manages PXF, Greenplum Database user privileges, and external data source configuration. Tasks include:
A Greenplum Database user creates a PXF external table that references a file or other data in the external data source, and uses the external table to query or load the external data in Greenplum. Tasks are external data store-dependent:
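The user workflow described above might look like the following sketch for the operational (hot) data in MySQL. It assumes a hypothetical PXF JDBC server configuration named `mysql_sales`, a hypothetical remote table `monthly_sales`, and a hypothetical local Greenplum table `sales_history`; the column definitions are illustrative only.

``` sql
-- Hypothetical example: all object names and columns are assumptions.
-- Assumes a PXF JDBC server configuration named "mysql_sales" already exists.
CREATE EXTERNAL TABLE pxf_sales_current (
    region       text,
    sale_month   text,
    num_orders   int,
    total_sales  numeric(12,2)
)
LOCATION ('pxf://monthly_sales?PROFILE=jdbc&SERVER=mysql_sales')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

-- A local Greenplum table holding the analytic (warm) data.
CREATE TABLE sales_history (
    region       text,
    sale_month   text,
    num_orders   int,
    total_sales  numeric(12,2)
) DISTRIBUTED BY (region);

-- Query in place: join remote operational data with local data
-- and return only the combined result.
SELECT h.region,
       h.sale_month  AS past_month,
       h.total_sales AS past_sales,
       c.total_sales AS current_sales
FROM sales_history h
JOIN pxf_sales_current c ON c.region = h.region;

-- Or load: copy the external rows into the local table.
INSERT INTO sales_history
SELECT region, sale_month, num_orders, total_sales
FROM pxf_sales_current;
```

Querying in place avoids a full load when only the joined result is needed, while the `INSERT ... SELECT` form suits data that will be analyzed repeatedly in Greenplum.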