Greenplum Database provides a collection of data science-related Python modules that can be used with the Greenplum Database PL/Python language. You can download these modules in .gppkg format from the Broadcom Support Portal under the desired Greenplum release.
Note: For more information about download prerequisites, troubleshooting, and instructions, see Download Broadcom products and software.
For information about the Greenplum Database PL/Python Language, see Greenplum PL/Python Language Extension.
The following table lists the modules that are provided in the Data Science Package for Python 3.11.
Module Name | Description/Used For |
---|---|
absl-py | Abseil Python Common Libraries |
arviz | Exploratory analysis of Bayesian models |
astor | Read/rewrite/write Python ASTs |
astunparse | An AST unparser for Python |
autograd | Efficiently computes derivatives of numpy code |
autograd-gamma | autograd compatible approximations to the derivatives of the Gamma-family of functions |
backports.csv | Backport of Python 3 csv module |
beautifulsoup4 | Screen-scraping library |
blis | The Blis BLAS-like linear algebra library, as a self-contained C-extension |
cachetools | Extensible memoizing collections and decorators |
catalogue | Super lightweight function registries for your library |
catboost | A high-performance open source library for gradient boosting on decision trees |
certifi | Python package for providing Mozilla's CA Bundle |
cffi | Foreign Function Interface for Python calling C code |
cftime | Time-handling functionality from netcdf4-python |
charset-normalizer | The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet. |
cheroot | Highly-optimized, pure-python HTTP server |
CherryPy | Object-Oriented HTTP framework |
click | Composable command line interface toolkit |
convertdate | Converts between Gregorian dates and other calendar systems |
cryptography | Cryptographic recipes and primitives for Python |
cycler | Composable style cycles |
cymem | Manage calls to calloc/free through Cython |
Cython | The Cython compiler for writing C extensions for the Python language |
datasets | HuggingFace community-driven open-source library of datasets |
deprecat | Python @deprecat decorator to deprecate old python classes, functions or methods |
dill | serialize all of python |
fastprogress | A nested progress with plotting options for fastai |
feedparser | Universal feed parser, handles RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0 feeds |
filelock | A platform independent file lock |
flatbuffers | The FlatBuffers serialization format for Python |
fonttools | Tools to manipulate font files |
formulaic | An implementation of Wilkinson formulas |
funcy | Fancy and practical functional tools |
future | Clean single-source support for Python 3 and 2 |
gast | Python AST that abstracts the underlying Python version |
gensim | Python framework for fast Vector Space Modelling |
gluonts | GluonTS is a Python toolkit for probabilistic time series modeling, built around MXNet |
google-auth | Google Authentication Library |
google-auth-oauthlib | Google Authentication Library |
google-pasta | pasta is an AST-based Python refactoring library |
graphviz | Simple Python interface for Graphviz |
greenlet | Lightweight in-process concurrent programming |
grpcio | HTTP/2-based RPC framework |
h5py | Read and write HDF5 files from Python |
hijri-converter | Accurate Hijri-Gregorian dates converter based on the Umm al-Qura calendar |
holidays | Generate and work with holidays in Python |
idna | Internationalized Domain Names in Applications (IDNA) |
importlib-metadata | Read metadata from Python packages |
InstructorEmbedding | Text embedding tool |
interface-meta | Provides a convenient way to expose an extensible API with enforced method signatures and consistent documentation |
jaraco.classes | Utility functions for Python class constructs |
jaraco.collections | Collection objects similar to those in stdlib by jaraco |
jaraco.context | Context managers by jaraco |
jaraco.functools | Functools like those found in stdlib |
jaraco.text | Module for text manipulation |
Jinja2 | A very fast and expressive template engine |
joblib | Lightweight pipelining with Python functions |
keras | Deep learning for humans |
Keras-Preprocessing | Easy data preprocessing and data augmentation for deep learning models |
kiwisolver | A fast implementation of the Cassowary constraint solver |
korean-lunar-calendar | Korean Lunar Calendar |
langcodes | Tools for labeling human languages with IETF language tags |
libclang | Clang Python Bindings, mirrored from the official LLVM repo |
lifelines | Survival analysis in Python, including Kaplan Meier, Nelson Aalen and regression |
lime | Local Interpretable Model-Agnostic Explanations for machine learning classifiers |
llvmlite | lightweight wrapper around basic LLVM functionality |
lxml | Powerful and Pythonic XML processing library combining libxml2/libxslt with the ElementTree API |
Markdown | Python implementation of Markdown |
MarkupSafe | Safely add untrusted strings to HTML/XML markup |
matplotlib | Python plotting package |
more-itertools | More routines for operating on iterables, beyond itertools |
murmurhash | Cython bindings for MurmurHash |
mxnet | An ultra-scalable deep learning framework |
mysqlclient | Python interface to MySQL |
netCDF4 | Provides an object-oriented python interface to the netCDF version 4 library |
nltk | Natural language toolkit |
numba | Compiling Python code using LLVM |
numexpr | Fast numerical expression evaluator for NumPy |
numpy | Scientific computing |
oauthlib | A generic, spec-compliant, thorough implementation of the OAuth request-signing logic |
opt-einsum | Optimizing NumPy's einsum function |
orjson | Fast, correct Python JSON library supporting dataclasses, datetimes, and numpy |
packaging | Core utilities for Python packages |
pandas | Data analysis |
pathy | pathlib.Path subclasses for local and cloud bucket storage |
patsy | Package for describing statistical models and for building design matrices |
Pattern | Web mining module for Python |
pdfminer.six | PDF parser and analyzer |
Pillow | Python Imaging Library |
pmdarima | Python's forecast::auto.arima equivalent |
portend | TCP port monitoring and discovery |
preshed | Cython hash table that trusts the keys are pre-hashed |
prophet | Automatic Forecasting Procedure |
protobuf | Protocol buffers |
psycopg2 | PostgreSQL database adapter for Python |
pyasn1 | ASN.1 types and codecs |
pyasn1-modules | Collection of ASN.1-based protocol modules |
pycparser | C parser in Python |
pydantic | Data validation and settings management using python type hints |
pyLDAvis | Interactive topic model visualization |
pymc3 | Statistical modeling and probabilistic machine learning |
PyMeeus | Python implementation of Jean Meeus astronomical routines |
pyparsing | Python parsing |
python-dateutil | Extensions to the standard Python datetime module |
python-docx | Create and update Microsoft Word .docx files |
PyTorch | Tensors and Dynamic neural networks in Python with strong GPU acceleration |
pytz | World timezone definitions, modern and historical |
PyXB-X | To generate Python code for classes that correspond to data structures defined by XMLSchema |
regex | Alternative regular expression module, to replace re |
requests | HTTP library |
requests-oauthlib | OAuthlib authentication support for Requests |
rouge | Full Python ROUGE Score Implementation (not a wrapper) |
rsa | Pure-Python RSA implementation |
sacrebleu | Hassle-free computation of shareable, comparable, and reproducible BLEU, chrF, and TER scores |
scikit-learn | Machine learning data mining and analysis |
scipy | Scientific computing |
semver | Python helper for Semantic Versioning |
sentence_transformers | Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co. |
sgmllib3k | Py3k port of sgmllib |
shap | A unified approach to explain the output of any machine learning model |
six | Python 2 and 3 compatibility library |
sklearn | A set of python modules for machine learning and data mining |
smart-open | Utilities for streaming large files (S3, HDFS, gzip, bz2, and so forth) |
soupsieve | A modern CSS selector implementation for Beautiful Soup |
spacy | Large scale natural language processing |
spacy-legacy | Legacy registered functions for spaCy backwards compatibility |
spacy-loggers | Logging utilities for SpaCy |
spectrum | Spectrum Analysis Tools |
SQLAlchemy | Database Abstraction Library |
srsly | Modern high-performance serialization utilities for Python |
statsmodels | Statistical modeling |
tempora | Objects and routines pertaining to date and time |
tensorboard | TensorBoard lets you watch Tensors Flow |
tensorboard-data-server | Fast data loading for TensorBoard |
tensorboard-plugin-wit | What-If Tool TensorBoard plugin |
tensorflow | Numerical computation using data flow graphs |
tensorflow-estimator | TensorFlow Estimator, a high-level API for TensorFlow |
tensorflow-io-gcs-filesystem | TensorFlow IO |
termcolor | simple termcolor wrapper |
Theano-PyMC | Optimizing compiler for evaluating mathematical expressions on CPUs and GPUs |
thinc | Practical Machine Learning for NLP |
threadpoolctl | Python helpers to limit the number of threads used in the thread pools of common native libraries used for scientific computing and data science |
toolz | List processing tools and functional utilities |
tqdm | Fast, extensible progress meter |
transformers | State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow |
tslearn | A machine learning toolkit dedicated to time-series data |
typer | Typer, build great CLIs. Easy to code. Based on Python type hints |
typing_extensions | Backported and Experimental Type Hints for Python 3.7+ |
urllib3 | HTTP library with thread-safe connection pooling, file post, and more |
wasabi | Lightweight console printing and formatting toolkit |
Werkzeug | Comprehensive WSGI web application library |
wrapt | Module for decorators, wrappers and monkey patching |
xarray | N-D labeled arrays and datasets in Python |
xarray-einstats | Stats, linear algebra and einops for xarray |
xgboost | Gradient boosting, classifying, ranking |
xmltodict | Makes working with XML feel like you are working with JSON |
zc.lockfile | Basic inter-process locks |
zipp | Backport of pathlib-compatible object wrapper for zip files |
Before you install a Data Science Package for Python, make sure that your Greenplum Database is running, that you have sourced greenplum_path.sh, and that the $COORDINATOR_DATA_DIRECTORY and $GPHOME environment variables are set.
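The environment-variable part of this prerequisite can be checked with a short script. The following is a minimal sketch, not part of the Greenplum tooling: the helper name `missing_env_vars` is hypothetical, and the check only confirms that the two variables are non-empty. It does not confirm that the database is running or that greenplum_path.sh was sourced.

```python
import os

# Variables that the prerequisite above says must be set.
REQUIRED_VARS = ("COORDINATOR_DATA_DIRECTORY", "GPHOME")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty.

    `env` defaults to the current process environment; passing a dict
    makes the helper easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("Required environment variables are set.")
```

Running the script in the same shell session you intend to use for gppkg gives the most meaningful result, because the variables are inherited from that shell.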
Note: The PyMC3 module depends on Tk. If you want to use PyMC3, you must install the tk OS package on every node in your cluster. For example:
$ sudo yum install tk
Locate the Data Science Package for Python that you built or downloaded.
The file name format of the package is DataSciencePython<pythonversion>-<data-science-bundle-version>-gp7-rhel<n>-x86_64.gppkg. For example: DataSciencePython3.11-3.0.0-gp7-rhel8_x86_64.gppkg.
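If you manage several packages, it can help to pull the Python version and bundle version out of a file name programmatically. The sketch below is illustrative only: the regular expression is an assumption derived from the format and examples in this section, and `parse_package_name` is a hypothetical helper. The pattern deliberately accepts either `rhel<n>` or `el<n>`, and either `-` or `_` before the architecture, because both spellings appear in this section.

```python
import re

# Assumed pattern, inferred from the file name format described above.
# It is not an official specification of .gppkg file names.
PACKAGE_RE = re.compile(
    r"^DataSciencePython(?P<python>\d+\.\d+)"   # Python version, e.g. 3.11
    r"-(?P<bundle>\d+\.\d+\.\d+)"               # data science bundle version
    r"-gp(?P<gp>\d+)"                           # Greenplum major version
    r"-(?:rh)?el(?P<os>\d+)"                    # OS major version
    r"[-_](?P<arch>x86_64)\.gppkg$"             # architecture and extension
)


def parse_package_name(filename):
    """Return the name components as a dict, or None if there is no match."""
    match = PACKAGE_RE.match(filename)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_package_name("DataSciencePython3.11-3.0.0-gp7-rhel8_x86_64.gppkg"))
```

A `None` result only means that the name does not fit this assumed pattern; it says nothing about whether the package itself is valid.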
Copy the package to the Greenplum Database coordinator host.
Follow the instructions in Verifying the Greenplum Database Software Download to verify the integrity of the Greenplum Procedural Languages Python Data Science Package software.
Use the gppkg command to install the package. For example:
$ gppkg install DataSciencePython3.11-1.2.0-gp7-el8_x86_64.gppkg
gppkg installs the Data Science Package for Python modules on all nodes in your Greenplum Database cluster. The command also updates the PYTHONPATH, PATH, and LD_LIBRARY_PATH environment variables in your greenplum_path.sh file.
Restart Greenplum Database. You must re-source greenplum_path.sh before restarting your Greenplum cluster:
$ source /usr/local/greenplum-db/greenplum_path.sh
$ gpstop -r
The Data Science Package for Python modules are installed in the following directory:
$GPHOME/ext/DataSciencePython/lib/python3.11/site-packages/
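One way to confirm that the modules are visible to Python after installation is to ask the import system whether it can locate them. The following is a minimal sketch under stated assumptions: `missing_modules` is a hypothetical helper, and it must run in a Python 3.11 interpreter whose PYTHONPATH includes the directory above (for example, after re-sourcing greenplum_path.sh). Note that the import name can differ from the package name in the table: scikit-learn is imported as `sklearn`, beautifulsoup4 as `bs4`, and Pillow as `PIL`.

```python
import importlib.util


def missing_modules(names):
    """Return the import names that cannot be located on the current path.

    find_spec() locates a top-level module without executing it, so the
    check stays fast even for heavy packages.
    """
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for some dotted names or modules without a usable spec.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Import names chosen for illustration; adjust to the modules you need.
    print(missing_modules(["numpy", "pandas", "sklearn"]))
```

An empty list means every name was found. This check only covers the host where it runs, so it does not replace verifying the installation on the segment hosts.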
Use the gppkg utility to uninstall a Data Science Package for Python. You must include the version number in the package name you provide to gppkg.
To determine your Data Science Package for Python version number and remove this package:
$ gppkg query | grep DataSciencePython
DataSciencePython-<version>
$ gppkg remove DataSciencePython-<version>
The command removes the Data Science Package for Python modules from your Greenplum Database cluster. It also updates the PYTHONPATH, PATH, and LD_LIBRARY_PATH environment variables in your greenplum_path.sh file to their pre-installation values.
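The lookup-then-remove sequence shown earlier can also be scripted. The sketch below mirrors the `grep DataSciencePython` filter and builds the matching `gppkg remove` argument list without running it. Both helper names are hypothetical, and the sample output string is made up for illustration; the real `gppkg query` output may contain additional columns or lines.

```python
def data_science_packages(query_output):
    """Return the lines of `gppkg query` output that mention the package.

    Equivalent in spirit to piping the output through
    `grep DataSciencePython`.
    """
    return [
        line.strip()
        for line in query_output.splitlines()
        if "DataSciencePython" in line
    ]


def remove_command(package_name):
    """Build the argument list for removing one package (not executed)."""
    return ["gppkg", "remove", package_name]


if __name__ == "__main__":
    # Hypothetical output, for illustration only.
    sample = "DataSciencePython-<version>\nsome-other-package-<version>\n"
    for name in data_science_packages(sample):
        print(" ".join(remove_command(name)))
```

Building the command as a list rather than a single string keeps it safe to pass to `subprocess.run` later, should you decide to execute it.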
Re-source greenplum_path.sh and restart Greenplum Database after you remove the Data Science Package for Python:
$ . /usr/local/greenplum-db/greenplum_path.sh
$ gpstop -r
Note: After you uninstall a Data Science Package for Python from your Greenplum Database cluster, any UDFs that you created that import Python modules installed with this package will return an error.
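To make such failures easier to diagnose, Python code that depends on these modules can wrap the import and raise a more descriptive error. This is a minimal sketch rather than a Greenplum feature: `import_or_explain` is a hypothetical helper, and how you make it available inside a PL/Python function body is left to your own conventions.

```python
import importlib


def import_or_explain(module_name):
    """Import a module, or raise an error that points at the likely cause."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"Cannot import '{module_name}'. Check that the Data Science "
            "Package for Python is installed on this cluster."
        ) from exc


if __name__ == "__main__":
    json = import_or_explain("json")  # standard-library module, always present
    print(json.dumps({"ok": True}))
```

The original ImportError stays attached as the cause of the new exception, so the full traceback is still available when you need it.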