Data Science Package for Python

Greenplum Database provides a collection of data science-related Python modules that can be used with the Greenplum Database PL/Python language. You can download these modules in .gppkg format from the Broadcom Support Portal under the desired Greenplum release. Separate modules are provided for Python 2.7 and Python 3.9 development on RHEL 7, RHEL 8, and Ubuntu platforms.

Note
For more information about download prerequisites, troubleshooting, and instructions, see Download Broadcom products and software.

This section contains the following information:

Data Science Package for Python 2.7 Modules
Data Science Package for Python 3.9 Modules
Installing a Data Science Package for Python
Uninstalling a Data Science Package for Python

For information about the Greenplum Database PL/Python Language, see Greenplum PL/Python Language Extension.

Data Science Package for Python 2.7 Modules

The following table lists the modules that are provided in the Data Science Package for Python 2.7.

The packages required for the deep learning features of MADlib are now included. Note that these packages are not supported on RHEL 6.

Module Name	Description/Used For
atomicwrites	Atomic file writes
attrs	Declarative approach for defining class attributes
Autograd	Gradient-based optimization
backports.functools-lru-cache	Backports `functools.lru_cache` from Python 3.3
Beautiful Soup	Navigating HTML and XML
Blis	Blis linear algebra routines
Boto	Amazon Web Services library
Boto3	The AWS SDK
botocore	Low-level, data-driven core of boto3
Bottleneck	Fast NumPy array functions
Bz2file	Read and write bzip2-compressed files
Certifi	Provides Mozilla CA bundle
Chardet	Universal encoding detector for Python 2 and 3
ConfigParser	Updated `configparser` module
contextlib2	Backports and enhancements for the `contextlib` module
Cycler	Composable style cycles
cymem	Manage calls to calloc/free through Cython
Docutils	Python documentation utilities
enum34	Backport of Python 3.4 Enum
Funcsigs	Python function signatures from PEP362
functools32	Backport of the `functools` module from Python 3.2.3
funcy	Functional tools focused on practicality
future	Compatibility layer between Python 2 and Python 3
futures	Backport of the `concurrent.futures` package from Python 3
Gensim	Topic modeling and document indexing
h5py	Read and write HDF5 files
idna	Internationalized Domain Names in Applications (IDNA)
importlib-metadata	Read metadata from Python packages
Jinja2	Stand-alone template engine
JMESPath	JSON Matching Expressions
Joblib	Python functions as pipeline jobs
jsonschema	JSON Schema validation
Keras (RHEL/CentOS 7 only)	Deep learning
Keras Applications	Reference implementations of popular deep learning models
Keras Preprocessing	Easy data preprocessing and data augmentation for deep learning models
kiwisolver	A fast implementation of the Cassowary constraint solver
Lifelines	Survival analysis
lxml	XML and HTML processing
MarkupSafe	Safely add untrusted strings to HTML/XML markup
Matplotlib	Python plotting package
mock	Rolling backport of `unittest.mock`
more-itertools	More routines for operating on iterables, beyond itertools
MurmurHash	Cython bindings for MurmurHash
NLTK	Natural language toolkit
NumExpr	Fast numerical expression evaluator for NumPy
NumPy	Scientific computing
packaging	Core utilities for Python packages
Pandas	Data analysis
pathlib, pathlib2	Object-oriented filesystem paths
patsy	Package for describing statistical models and for building design matrices
Pattern-en	Part-of-speech tagging
pip	Tool for installing Python packages
plac	Command line arguments parser
pluggy	Plugin and hook calling mechanisms
preshed	Cython hash table that trusts the keys are pre-hashed
protobuf	Protocol buffers
py	Cross-python path, ini-parsing, io, code, log facilities
pyLDAvis	Interactive topic model visualization
PyMC3	Statistical modeling and probabilistic machine learning
pyparsing	Python parsing
pytest	Testing framework
python-dateutil	Extensions to the standard Python datetime module
pytz	World timezone definitions, modern and historical
PyXB-X (Python3 only)	To generate Python code for classes that correspond to data structures defined by XMLSchema
PyYAML	YAML parser and emitter
regex	Alternative regular expression module, to replace re
requests	HTTP library
s3transfer	Amazon S3 transfer manager
scandir	Directory iteration function
scikit-learn	Machine learning, data mining, and analysis
SciPy	Scientific computing
setuptools	Download, build, install, upgrade, and uninstall Python packages
six	Python 2 and 3 compatibility library
smart-open	Utilities for streaming large files (S3, HDFS, gzip, bz2, and so forth)
spaCy	Large scale natural language processing
srsly	Modern high-performance serialization utilities for Python
StatsModels	Statistical modeling
subprocess32	Backport of the subprocess module from Python 3
Tensorflow (RHEL/CentOS 7 only)	Numerical computation using data flow graphs
Theano	Optimizing compiler for evaluating mathematical expressions on CPUs and GPUs
thinc	Practical Machine Learning for NLP
tqdm	Fast, extensible progress meter
urllib3	HTTP library with thread-safe connection pooling, file post, and more
wasabi	Lightweight console printing and formatting toolkit
wcwidth	Measures number of Terminal column cells of wide-character codes
Werkzeug	Comprehensive WSGI web application library
wheel	A built-package format for Python
XGBoost	Gradient boosting, classifying, ranking
zipp	Backport of pathlib-compatible object wrapper for zip files

Data Science Package for Python 3.9 Modules

The following table lists the modules that are provided in the Data Science Package for Python 3.9.

Module Name	Description/Used For
absl-py	Abseil Python Common Libraries
arviz	Exploratory analysis of Bayesian models
astor	Read/rewrite/write Python ASTs
astunparse	An AST unparser for Python
autograd	Efficiently computes derivatives of numpy code
autograd-gamma	autograd compatible approximations to the derivatives of the Gamma-family of functions
backports.csv	Backport of Python 3 csv module
beautifulsoup4	Screen-scraping library
blis	The Blis BLAS-like linear algebra library, as a self-contained C-extension
cachetools	Extensible memoizing collections and decorators
catalogue	Super lightweight function registries for your library
certifi	Python package for providing Mozilla's CA Bundle
cffi	Foreign Function Interface for Python calling C code
cftime	Time-handling functionality from netcdf4-python
charset-normalizer	The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.
cheroot	Highly-optimized, pure-python HTTP server
CherryPy	Object-Oriented HTTP framework
click	Composable command line interface toolkit
convertdate	Converts between Gregorian dates and other calendar systems
cryptography	Cryptographic recipes and primitives for Python
cycler	Composable style cycles
cymem	Manage calls to calloc/free through Cython
Cython	The Cython compiler for writing C extensions for the Python language
deprecat	Python @deprecat decorator to deprecate old python classes, functions or methods
dill	Serialize all of Python
fastprogress	A nested progress with plotting options for fastai
feedparser	Universal feed parser, handles RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0 feeds
filelock	A platform independent file lock
flatbuffers	The FlatBuffers serialization format for Python
fonttools	Tools to manipulate font files
formulaic	An implementation of Wilkinson formulas
funcy	Fancy and practical functional tools
future	Clean single-source support for Python 3 and 2
gast	Python AST that abstracts the underlying Python version
gensim	Python framework for fast Vector Space Modelling
gluonts	GluonTS is a Python toolkit for probabilistic time series modeling, built around MXNet
google-auth	Google Authentication Library
google-auth-oauthlib	Google Authentication Library
google-pasta	pasta is an AST-based Python refactoring library
graphviz	Simple Python interface for Graphviz
greenlet	Lightweight in-process concurrent programming
grpcio	HTTP/2-based RPC framework
h5py	Read and write HDF5 files from Python
hijri-converter	Accurate Hijri-Gregorian dates converter based on the Umm al-Qura calendar
holidays	Generate and work with holidays in Python
idna	Internationalized Domain Names in Applications (IDNA)
importlib-metadata	Read metadata from Python packages
interface-meta	Provides a convenient way to expose an extensible API with enforced method signatures and consistent documentation
jaraco.classes	Utility functions for Python class constructs
jaraco.collections	Collection objects similar to those in stdlib by jaraco
jaraco.context	Context managers by jaraco
jaraco.functools	Functools like those found in stdlib
jaraco.text	Module for text manipulation
Jinja2	A very fast and expressive template engine
joblib	Lightweight pipelining with Python functions
keras	Deep learning for humans
Keras-Preprocessing	Easy data preprocessing and data augmentation for deep learning models
kiwisolver	A fast implementation of the Cassowary constraint solver
korean-lunar-calendar	Korean Lunar Calendar
langcodes	Tools for labeling human languages with IETF language tags
libclang	Clang Python Bindings, mirrored from the official LLVM repo
lifelines	Survival analysis in Python, including Kaplan Meier, Nelson Aalen and regression
llvmlite	Lightweight wrapper around basic LLVM functionality
lxml	Powerful and Pythonic XML processing library combining libxml2/libxslt with the ElementTree API
Markdown	Python implementation of Markdown
MarkupSafe	Safely add untrusted strings to HTML/XML markup
matplotlib	Python plotting package
more-itertools	More routines for operating on iterables, beyond itertools
murmurhash	Cython bindings for MurmurHash
mxnet	An ultra-scalable deep learning framework
mysqlclient	Python interface to MySQL
netCDF4	Provides an object-oriented python interface to the netCDF version 4 library
nltk	Natural language toolkit
numba	Compiling Python code using LLVM
numexpr	Fast numerical expression evaluator for NumPy
numpy	Scientific computing
oauthlib	A generic, spec-compliant, thorough implementation of the OAuth request-signing logic
opt-einsum	Optimizing NumPy's einsum function
packaging	Core utilities for Python packages
pandas	Data analysis
pathy	pathlib.Path subclasses for local and cloud bucket storage
patsy	Package for describing statistical models and for building design matrices
Pattern	Web mining module for Python
pdfminer.six	PDF parser and analyzer
Pillow	Python Imaging Library
pmdarima	Python's forecast::auto.arima equivalent
portend	TCP port monitoring and discovery
preshed	Cython hash table that trusts the keys are pre-hashed
prophet	Automatic Forecasting Procedure
protobuf	Protocol buffers
psycopg2	PostgreSQL database adapter for Python
psycopg2-binary	psycopg2 - Python-PostgreSQL Database Adapter
pyasn1	ASN.1 types and codecs
pyasn1-modules	Collection of ASN.1-based protocol modules
pycparser	C parser in Python
pydantic	Data validation and settings management using python type hints
pyLDAvis	Interactive topic model visualization
pymc3	Statistical modeling and probabilistic machine learning
PyMeeus	Python implementation of Jean Meeus astronomical routines
pyparsing	Python parsing
python-dateutil	Extensions to the standard Python datetime module
python-docx	Create and update Microsoft Word .docx files
PyTorch	Tensors and Dynamic neural networks in Python with strong GPU acceleration
pytz	World timezone definitions, modern and historical
regex	Alternative regular expression module, to replace re
requests	HTTP library
requests-oauthlib	OAuthlib authentication support for Requests
rsa	Pure-Python RSA implementation
scikit-learn	Machine learning, data mining, and analysis
scipy	Scientific computing
semver	Python helper for Semantic Versioning
sgmllib3k	Py3k port of sgmllib
six	Python 2 and 3 compatibility library
sklearn	A set of python modules for machine learning and data mining
smart-open	Utilities for streaming large files (S3, HDFS, gzip, bz2, and so forth)
soupsieve	A modern CSS selector implementation for Beautiful Soup
spacy	Large scale natural language processing
spacy-legacy	Legacy registered functions for spaCy backwards compatibility
spacy-loggers	Logging utilities for SpaCy
spectrum	Spectrum Analysis Tools
SQLAlchemy	Database Abstraction Library
srsly	Modern high-performance serialization utilities for Python
statsmodels	Statistical modeling
tempora	Objects and routines pertaining to date and time
tensorboard	TensorBoard lets you watch Tensors Flow
tensorboard-data-server	Fast data loading for TensorBoard
tensorboard-plugin-wit	What-If Tool TensorBoard plugin
tensorflow	Numerical computation using data flow graphs
tensorflow-estimator	High-level TensorFlow Estimator API
tensorflow-io-gcs-filesystem	TensorFlow IO
termcolor	simple termcolor wrapper
Theano-PyMC	Theano-PyMC
thinc	Practical Machine Learning for NLP
threadpoolctl	Python helpers to limit the number of threads used in the thread pools of common native libraries used for scientific computing and data science
toolz	List processing tools and functional utilities
tqdm	Fast, extensible progress meter
tslearn	A machine learning toolkit dedicated to time-series data
typer	Typer, build great CLIs. Easy to code. Based on Python type hints
typing_extensions	Backported and Experimental Type Hints for Python 3.7+
urllib3	HTTP library with thread-safe connection pooling, file post, and more
wasabi	Lightweight console printing and formatting toolkit
Werkzeug	Comprehensive WSGI web application library
wrapt	Module for decorators, wrappers and monkey patching
xarray	N-D labeled arrays and datasets in Python
xarray-einstats	Stats, linear algebra and einops for xarray
xgboost	Gradient boosting, classifying, ranking
xmltodict	Makes working with XML feel like you are working with JSON
zc.lockfile	Basic inter-process locks
zipp	Backport of pathlib-compatible object wrapper for zip files
tensorflow-gpu	An open source software library for high performance numerical computation
tensorflow	Numerical computation using data flow graphs
keras	An implementation of the Keras API that uses TensorFlow as a backend

Installing a Data Science Package for Python

Before you install a Data Science Package for Python, make sure that Greenplum Database is running, that you have sourced greenplum_path.sh, and that the $MASTER_DATA_DIRECTORY and $GPHOME environment variables are set.
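
The environment variable part of this check can be scripted. The following is a minimal sketch, not part of the Greenplum tooling: the helper name `missing_env_vars` is hypothetical, and it only confirms that the two variables are non-empty. It does not verify that the database is running or that greenplum_path.sh was sourced.

```python
import os

# Variables that the prerequisites above say must be set.
REQUIRED_VARS = ("MASTER_DATA_DIRECTORY", "GPHOME")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the current process environment; pass a dict to
    check a different mapping (hypothetical helper, for illustration).
    """
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments inspects the current shell environment; an empty list means both variables are present.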

Note
The PyMC3 module depends on Tk. If you want to use PyMC3, you must install the tk OS package on every node in your cluster. For example:

$ sudo yum install tk

Locate the Data Science Package for Python that you built or downloaded.

The file name format of the package is DataSciencePython<pythonversion>-<version>-gp6-rhel<n>_x86_64.gppkg. For example, the Data Science Package for Python 2.7 file for RHEL 8 is DataSciencePython2.7-2.0.4-gp6-rhel8_x86_64.gppkg, and the Python 3.9 package is DataSciencePython3.9-3.0.0-gp6-rhel8_x86_64.gppkg.
Copy the package to the Greenplum Database master host.
Follow the instructions in Verifying the Greenplum Database Software Download to verify the integrity of the Data Science Package for Python software.
Use the gppkg command to install the package. For example:
```
$ gppkg -i DataSciencePython<pythonversion>-<version>-gp6-rhel<n>_x86_64.gppkg
```
gppkg installs the Data Science Package for Python modules on all nodes in your Greenplum Database cluster. The command also updates the PYTHONPATH, PATH, and LD_LIBRARY_PATH environment variables in your greenplum_path.sh file.
Restart Greenplum Database. You must re-source greenplum_path.sh before restarting your Greenplum cluster:
```
$ source /usr/local/greenplum-db/greenplum_path.sh
$ gpstop -r
```
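
When several packages are downloaded side by side, it can help to confirm which Python version and platform a file targets before installing it. The sketch below parses the file name into its parts. The pattern is an assumption inferred from the two example file names in the steps above, not an official naming specification, and it covers only RHEL-style platform tags such as rhel8. The function name `parse_package_name` is hypothetical.

```python
import re

# Pattern inferred from the example file names in this topic (an assumption).
# Either "-" or "_" is accepted before the architecture suffix.
PACKAGE_NAME_RE = re.compile(
    r"^DataSciencePython(?P<python>\d+\.\d+)"   # Python version, e.g. 3.9
    r"-(?P<package>\d+(?:\.\d+)*)"              # package version, e.g. 3.0.0
    r"-gp(?P<greenplum>\d+)"                    # Greenplum major version
    r"-(?P<platform>[a-z]+\d+)"                 # platform tag, e.g. rhel8
    r"[-_]x86_64\.gppkg$"
)


def parse_package_name(file_name):
    """Split a package file name into its parts, or return None."""
    match = PACKAGE_NAME_RE.match(file_name)
    return match.groupdict() if match else None


info = parse_package_name("DataSciencePython3.9-3.0.0-gp6-rhel8_x86_64.gppkg")
print(info)
```

A return value of `None` signals that the name does not follow the assumed pattern, which is a reasonable cue to check the download by hand.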

The Data Science Package for Python modules are installed in the following directory for Python 2.7:

$GPHOME/ext/DataSciencePython/lib/python2.7/site-packages/

For Python 3.9 the directory is:

$GPHOME/ext/DataSciencePython/lib/python3.9/site-packages/
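
To confirm that an interpreter can see the installed modules, you can build the directory shown above and look for it on the module search path. This is a rough sketch under stated assumptions: the helper names are hypothetical, and the comparison is a plain string match that does not resolve symbolic links.

```python
import posixpath
import sys


def data_science_site_packages(gphome, python_version):
    """Build the module directory described above for one Python version."""
    return posixpath.join(
        gphome, "ext", "DataSciencePython", "lib",
        "python" + python_version, "site-packages",
    )


def is_on_path(directory, search_path=None):
    """Report whether `directory` appears in a module search path.

    `search_path` defaults to sys.path of the running interpreter.
    Trailing slashes are ignored when comparing entries.
    """
    if search_path is None:
        search_path = sys.path
    wanted = directory.rstrip("/")
    return any(str(entry).rstrip("/") == wanted for entry in search_path)
```

For example, `is_on_path(data_science_site_packages(os.environ["GPHOME"], "3.9"))` (after `import os`) checks the Python 3.9 directory against the current interpreter.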

Uninstalling a Data Science Package for Python

Use the gppkg utility to uninstall a Data Science Package for Python. You must include the version number in the package name you provide to gppkg.

To determine your Data Science Package for Python version number and remove this package:

$ gppkg -q --all | grep DataSciencePython
DataSciencePython-<version>
$ gppkg -r DataSciencePython-<version>

The command removes the Data Science Package for Python modules from your Greenplum Database cluster. It also updates the PYTHONPATH, PATH, and LD_LIBRARY_PATH environment variables in your greenplum_path.sh file to their pre-installation values.
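
If you script the version lookup rather than using grep, the filtering step might look like the following sketch. The function name is hypothetical, and the sample text is made up for illustration; real `gppkg -q --all` output may contain additional lines or a different layout.

```python
def installed_data_science_packages(query_output):
    """Return the DataSciencePython entries found in package query output."""
    return [
        line.strip()
        for line in query_output.splitlines()
        if line.strip().startswith("DataSciencePython")
    ]


# Hypothetical sample output, for illustration only.
sample = "SomeOtherPackage-1.0\nDataSciencePython-2.0.4\n"
print(installed_data_science_packages(sample))
```

Each returned entry is a candidate name to pass to `gppkg -r`.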

Re-source greenplum_path.sh and restart Greenplum Database after you remove the Data Science Package for Python:

$ . /usr/local/greenplum-db/greenplum_path.sh
$ gpstop -r

Note
After you uninstall a Data Science Package for Python from your Greenplum Database cluster, any UDFs that you have created that import Python modules installed with this package will return an error.
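
One way to make such failures easier to diagnose is to wrap the import inside the function body so that the error message names the missing module. The sketch below shows only the Python side; the helper name `import_or_explain` is hypothetical, and the wording of the message is an assumption about the likely cause, not something Greenplum reports itself.

```python
import importlib


def import_or_explain(module_name):
    """Import a module, raising a descriptive error if it is unavailable."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise RuntimeError(
            "Module %r is not available; the Data Science Package for "
            "Python may have been uninstalled." % module_name
        )
```

The function still fails when the module is missing, but the message points at the probable cause instead of a bare import error.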