Python Data Science Module Package

Greenplum Database provides a collection of data science-related Python modules that can be used with the Greenplum Database PL/Python language. You can download these modules in .gppkg format from the Broadcom Support Portal, under your specific Greenplum release.

Note

For more information about download prerequisites, troubleshooting, and instructions, see Download Broadcom products and software.

This section lists the modules included in the package and describes how to install and uninstall the package.

For information about the Greenplum Database PL/Python Language, see Greenplum PL/Python Language Extension.


Python Data Science Modules

Modules provided in the Python Data Science package include:

Module Name                        Description/Used For
---------------------------------  ----------------------------------------------------
Beautiful Soup                     Navigating HTML and XML
Gensim                             Topic modeling and document indexing
Keras (RHEL/CentOS 7 only)         Deep learning
Lifelines                          Survival analysis
lxml                               XML and HTML processing
NLTK                               Natural language toolkit
NumPy                              Scientific computing
Pandas                             Data analysis
Pattern-en                         Part-of-speech tagging
pyLDAvis                           Interactive topic model visualization
PyMC3                              Statistical modeling and probabilistic machine learning
scikit-learn                       Machine learning, data mining, and analysis
SciPy                              Scientific computing
spaCy                              Large-scale natural language processing
StatsModels                        Statistical modeling
TensorFlow (RHEL/CentOS 7 only)    Numerical computation using data flow graphs
XGBoost                            Gradient boosting, classifying, ranking
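
Once the package is installed, one way to confirm which of these modules can actually be imported is a small check like the sketch below. The import names in IMPORT_NAMES are assumptions based on each project's usual top-level package name (for example, bs4 for Beautiful Soup and sklearn for scikit-learn); they are not given in the table above. Pattern-en is left out because its import name is not certain. The helper check_modules is a hypothetical name introduced for this illustration.

```python
import importlib

# Assumed import names (not stated in the module table itself).
IMPORT_NAMES = {
    "Beautiful Soup": "bs4",
    "Gensim": "gensim",
    "Keras": "keras",
    "Lifelines": "lifelines",
    "lxml": "lxml",
    "NLTK": "nltk",
    "NumPy": "numpy",
    "Pandas": "pandas",
    "pyLDAvis": "pyLDAvis",
    "PyMC3": "pymc3",
    "scikit-learn": "sklearn",
    "SciPy": "scipy",
    "spaCy": "spacy",
    "StatsModels": "statsmodels",
    "TensorFlow": "tensorflow",
    "XGBoost": "xgboost",
}


def check_modules(import_names):
    """Return {display_name: bool}, True when the module imports cleanly."""
    status = {}
    for display_name, module_name in import_names.items():
        try:
            importlib.import_module(module_name)
            status[display_name] = True
        except Exception:
            # Any failure during import (module absent, missing shared
            # library, and so on) is reported as "not available".
            status[display_name] = False
    return status
```

Calling check_modules(IMPORT_NAMES) returns a dictionary that maps each module name to True or False. Keras and TensorFlow are expected to be unavailable on platforms other than RHEL/CentOS 7.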

Installing the Python Data Science Module Package

Before you install the Python Data Science Module package, make sure that Greenplum Database is running, that you have sourced greenplum_path.sh, and that the $MASTER_DATA_DIRECTORY and $GPHOME environment variables are set.
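
The environment variable part of this check can be scripted. The following is a minimal sketch, assuming it runs in the same shell session that sourced greenplum_path.sh; the function name missing_env_vars is hypothetical. It does not verify that Greenplum Database is running.

```python
import os

# The two variables named in the prerequisites above.
REQUIRED_VARS = ("MASTER_DATA_DIRECTORY", "GPHOME")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    if env is None:
        env = os.environ
    return [name for name in required if not env.get(name)]
```

An empty list from missing_env_vars() means both variables are set; otherwise the list names the variables that still need to be set before you run gppkg.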

Note: The PyMC3 module depends on Tk. If you want to use PyMC3, you must install the tk OS package on every node in your cluster. For example:

$ yum install tk

  1. Locate the Python Data Science module package that you built or downloaded.

    The file name format of the package is DataSciencePython-<version>-rhel<N>-x86_64.gppkg.

  2. Copy the package to the Greenplum Database master host.

  3. Follow the instructions in Verifying the Greenplum Database Software Download to verify the integrity of the Greenplum Procedural Languages Python Data Science Package software.

  4. Use the gppkg command to install the package. For example:

    $ gppkg -i DataSciencePython-<version>-rhel<N>-x86_64.gppkg
    

    gppkg installs the Python Data Science modules on all nodes in your Greenplum Database cluster. The command also updates the PYTHONPATH, PATH, and LD_LIBRARY_PATH environment variables in your greenplum_path.sh file.

  5. Restart Greenplum Database. You must re-source greenplum_path.sh before restarting your Greenplum cluster:

    $ source /usr/local/greenplum-db/greenplum_path.sh
    $ gpstop -r
    

The Greenplum Database Python Data Science Modules are installed in the following directory:

$GPHOME/ext/DataSciencePython/lib/python2.7/site-packages/
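
Because gppkg updates PYTHONPATH in greenplum_path.sh, a Python interpreter started from a shell that has re-sourced that file should list this directory on its module search path. The sketch below builds the directory from a given $GPHOME value and checks for it in a search path. Both function names are hypothetical, and whether PL/Python inside the database sees the same path is an assumption to verify on your own cluster.

```python
import posixpath
import sys


def data_science_site_packages(gphome):
    """Build the install directory documented above from a $GPHOME value."""
    return posixpath.join(
        gphome, "ext", "DataSciencePython", "lib", "python2.7", "site-packages"
    )


def is_on_search_path(directory, search_path=None):
    """Return True if `directory` appears in `search_path` (default: sys.path).

    Paths are normalized first so that a trailing slash does not matter.
    """
    if search_path is None:
        search_path = sys.path
    target = posixpath.normpath(directory)
    return any(
        posixpath.normpath(entry) == target for entry in search_path if entry
    )
```

For example, is_on_search_path(data_science_site_packages(os.environ["GPHOME"])) (after importing os) reports whether the current interpreter would pick up the installed modules.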

Uninstalling the Python Data Science Module Package

Use the gppkg utility to uninstall the Python Data Science Module package. You must include the version number in the package name you provide to gppkg.

To determine your Python Data Science Module package version number and remove this package:

$ gppkg -q --all | grep DataSciencePython
DataSciencePython-<version>
$ gppkg -r DataSciencePython-<version>
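
If you script the removal, the package name can be picked out of the query output before it is passed to gppkg -r. This is a minimal sketch that assumes the output contains one package name per line, as in the example above; the function names are hypothetical, and the code only builds the command rather than running it.

```python
def find_installed_package(query_output, prefix="DataSciencePython"):
    """Return the first line of the form '<prefix>-<version>', or None."""
    for line in query_output.splitlines():
        name = line.strip()
        if name.startswith(prefix + "-"):
            return name
    return None


def removal_command(package_name):
    """Build the argument list for removing the named package with gppkg."""
    return ["gppkg", "-r", package_name]
```

The list returned by removal_command can be handed to subprocess.check_call once you have confirmed that the package name is the one you intend to remove.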

The command removes the Python Data Science modules from your Greenplum Database cluster. It also updates the PYTHONPATH, PATH, and LD_LIBRARY_PATH environment variables in your greenplum_path.sh file to their pre-installation values.

Re-source greenplum_path.sh and restart Greenplum Database after you remove the Python Data Science Module package:

$ . /usr/local/greenplum-db/greenplum_path.sh
$ gpstop -r 

Note: After you uninstall the Python Data Science Module package from your Greenplum Database cluster, any user-defined functions (UDFs) that import Python modules installed with this package will return an error.
