Greenplum Database provides a collection of data science-related Python modules that can be used with the Greenplum Database PL/Python language. You can download these modules in .gppkg format from VMware Tanzu Network. Separate modules are provided for Python 2.7 and Python 3.9 development on RHEL7, RHEL8, and Ubuntu platforms.

For information about the Greenplum Database PL/Python Language, see Greenplum PL/Python Language Extension.

Data Science Package for Python 2.7 Modules

The following table lists the modules that are provided in the Data Science Package for Python 2.7.

Packages required for the Deep Learning features of MADlib are now included. Note that these packages are not supported on RHEL 6.

| Module Name | Description/Used For |
|---|---|
| atomicwrites | Atomic file writes |
| attrs | Declarative approach for defining class attributes |
| Autograd | Gradient-based optimization |
| backports.functools-lru-cache | Backports functools.lru_cache from Python 3.3 |
| Beautiful Soup | Navigating HTML and XML |
| Blis | Blis linear algebra routines |
| Boto | Amazon Web Services library |
| Boto3 | The AWS SDK |
| botocore | Low-level, data-driven core of boto3 |
| Bottleneck | Fast NumPy array functions |
| Bz2file | Read and write bzip2-compressed files |
| Certifi | Provides Mozilla CA bundle |
| Chardet | Universal encoding detector for Python 2 and 3 |
| ConfigParser | Updated configparser module |
| contextlib2 | Backports and enhancements for the contextlib module |
| Cycler | Composable style cycles |
| cymem | Manage calls to calloc/free through Cython |
| Docutils | Python documentation utilities |
| enum34 | Backport of Python 3.4 Enum |
| Funcsigs | Python function signatures from PEP362 |
| functools32 | Backport of the functools module from Python 3.2.3 |
| funcy | Functional tools focused on practicality |
| future | Compatibility layer between Python 2 and Python 3 |
| futures | Backport of the concurrent.futures package from Python 3 |
| Gensim | Topic modeling and document indexing |
| h5py | Read and write HDF5 files |
| idna | Internationalized Domain Names in Applications (IDNA) |
| importlib-metadata | Read metadata from Python packages |
| Jinja2 | Stand-alone template engine |
| JMESPath | JSON Matching Expressions |
| Joblib | Python functions as pipeline jobs |
| jsonschema | JSON Schema validation |
| Keras (RHEL/CentOS 7 only) | Deep learning |
| Keras Applications | Reference implementations of popular deep learning models |
| Keras Preprocessing | Easy data preprocessing and data augmentation for deep learning models |
| kiwisolver | A fast implementation of the Cassowary constraint solver |
| Lifelines | Survival analysis |
| lxml | XML and HTML processing |
| MarkupSafe | Safely add untrusted strings to HTML/XML markup |
| Matplotlib | Python plotting package |
| mock | Rolling backport of unittest.mock |
| more-itertools | More routines for operating on iterables, beyond itertools |
| MurmurHash | Cython bindings for MurmurHash |
| NLTK | Natural language toolkit |
| NumExpr | Fast numerical expression evaluator for NumPy |
| NumPy | Scientific computing |
| packaging | Core utilities for Python packages |
| Pandas | Data analysis |
| pathlib, pathlib2 | Object-oriented filesystem paths |
| patsy | Package for describing statistical models and for building design matrices |
| Pattern-en | Part-of-speech tagging |
| pip | Tool for installing Python packages |
| plac | Command line arguments parser |
| pluggy | Plugin and hook calling mechanisms |
| preshed | Cython hash table that trusts the keys are pre-hashed |
| protobuf | Protocol buffers |
| py | Cross-python path, ini-parsing, io, code, log facilities |
| pyLDAvis | Interactive topic model visualization |
| PyMC3 | Statistical modeling and probabilistic machine learning |
| pyparsing | Python parsing |
| pytest | Testing framework |
| python-dateutil | Extensions to the standard Python datetime module |
| pytz | World timezone definitions, modern and historical |
| PyYAML | YAML parser and emitter |
| regex | Alternative regular expression module, to replace re |
| requests | HTTP library |
| s3transfer | Amazon S3 transfer manager |
| scandir | Directory iteration function |
| scikit-learn | Machine learning data mining and analysis |
| SciPy | Scientific computing |
| setuptools | Download, build, install, upgrade, and uninstall Python packages |
| six | Python 2 and 3 compatibility library |
| smart-open | Utilities for streaming large files (S3, HDFS, gzip, bz2, and so forth) |
| spaCy | Large scale natural language processing |
| srsly | Modern high-performance serialization utilities for Python |
| StatsModels | Statistical modeling |
| subprocess32 | Backport of the subprocess module from Python 3 |
| Tensorflow (RHEL/CentOS 7 only) | Numerical computation using data flow graphs |
| Theano | Optimizing compiler for evaluating mathematical expressions on CPUs and GPUs |
| thinc | Practical Machine Learning for NLP |
| tqdm | Fast, extensible progress meter |
| urllib3 | HTTP library with thread-safe connection pooling, file post, and more |
| wasabi | Lightweight console printing and formatting toolkit |
| wcwidth | Measures number of Terminal column cells of wide-character codes |
| Werkzeug | Comprehensive WSGI web application library |
| wheel | A built-package format for Python |
| XGBoost | Gradient boosting, classifying, ranking |
| zipp | Backport of pathlib-compatible object wrapper for zip files |

Data Science Package for Python 3.9 Modules

The following table lists the modules that are provided in the Data Science Package for Python 3.9.

| Module Name | Description/Used For |
|---|---|
| absl-py | Abseil Python Common Libraries |
| arviz | Exploratory analysis of Bayesian models |
| astor | Read/rewrite/write Python ASTs |
| astunparse | An AST unparser for Python |
| autograd | Efficiently computes derivatives of numpy code |
| autograd-gamma | autograd compatible approximations to the derivatives of the Gamma-family of functions |
| backports.csv | Backport of Python 3 csv module |
| beautifulsoup4 | Screen-scraping library |
| blis | The Blis BLAS-like linear algebra library, as a self-contained C-extension |
| cachetools | Extensible memoizing collections and decorators |
| catalogue | Super lightweight function registries for your library |
| certifi | Python package for providing Mozilla’s CA Bundle |
| cffi | Foreign Function Interface for Python calling C code |
| cftime | Time-handling functionality from netcdf4-python |
| charset-normalizer | The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet. |
| cheroot | Highly-optimized, pure-python HTTP server |
| CherryPy | Object-Oriented HTTP framework |
| click | Composable command line interface toolkit |
| convertdate | Converts between Gregorian dates and other calendar systems |
| cryptography | A set of functions useful in cryptography and linear algebra |
| cycler | Composable style cycles |
| cymem | Manage calls to calloc/free through Cython |
| Cython | The Cython compiler for writing C extensions for the Python language |
| deprecat | Python @deprecat decorator to deprecate old python classes, functions or methods |
| dill | serialize all of python |
| fastprogress | A nested progress with plotting options for fastai |
| feedparser | Universal feed parser, handles RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0 feeds |
| filelock | A platform independent file lock |
| flatbuffers | The FlatBuffers serialization format for Python |
| fonttools | Tools to manipulate font files |
| formulaic | An implementation of Wilkinson formulas |
| funcy | A fancy and practical functional tools |
| future | Clean single-source support for Python 3 and 2 |
| gast | Python AST that abstracts the underlying Python version |
| gensim | Python framework for fast Vector Space Modelling |
| gluonts | GluonTS is a Python toolkit for probabilistic time series modeling, built around MXNet |
| google-auth | Google Authentication Library |
| google-auth-oauthlib | Google Authentication Library |
| google-pasta | pasta is an AST-based Python refactoring library |
| graphviz | Simple Python interface for Graphviz |
| greenlet | Lightweight in-process concurrent programming |
| grpcio | HTTP/2-based RPC framework |
| h5py | Read and write HDF5 files from Python |
| hijri-converter | Accurate Hijri-Gregorian dates converter based on the Umm al-Qura calendar |
| holidays | Generate and work with holidays in Python |
| idna | Internationalized Domain Names in Applications (IDNA) |
| importlib-metadata | Read metadata from Python packages |
| interface-meta | Provides a convenient way to expose an extensible API with enforced method signatures and consistent documentation |
| jaraco.classes | Utility functions for Python class constructs |
| jaraco.collections | Collection objects similar to those in stdlib by jaraco |
| jaraco.context | Context managers by jaraco |
| jaraco.functools | Functools like those found in stdlib |
| jaraco.text | Module for text manipulation |
| Jinja2 | A very fast and expressive template engine |
| joblib | Lightweight pipelining with Python functions |
| keras | Deep learning for humans |
| Keras-Preprocessing | Easy data preprocessing and data augmentation for deep learning models |
| kiwisolver | A fast implementation of the Cassowary constraint solver |
| korean-lunar-calendar | Korean Lunar Calendar |
| langcodes | Tools for labeling human languages with IETF language tags |
| libclang | Clang Python Bindings, mirrored from the official LLVM repo |
| lifelines | Survival analysis in Python, including Kaplan Meier, Nelson Aalen and regression |
| llvmlite | lightweight wrapper around basic LLVM functionality |
| lxml | Powerful and Pythonic XML processing library combining libxml2/libxslt with the ElementTree API |
| Markdown | Python implementation of Markdown |
| MarkupSafe | Safely add untrusted strings to HTML/XML markup |
| matplotlib | Python plotting package |
| more-itertools | More routines for operating on iterables, beyond itertools |
| murmurhash | Cython bindings for MurmurHash |
| mxnet | An ultra-scalable deep learning framework |
| mysqlclient | Python interface to MySQL |
| netCDF4 | Provides an object-oriented python interface to the netCDF version 4 library |
| nltk | Natural language toolkit |
| numba | Compiling Python code using LLVM |
| numexpr | Fast numerical expression evaluator for NumPy |
| numpy | Scientific computing |
| oauthlib | A generic, spec-compliant, thorough implementation of the OAuth request-signing logic |
| opt-einsum | Optimizing numpys einsum function |
| packaging | Core utilities for Python packages |
| pandas | Data analysis |
| pathy | pathlib.Path subclasses for local and cloud bucket storage |
| patsy | Package for describing statistical models and for building design matrices |
| Pattern | Web mining module for Python |
| pdfminer.six | PDF parser and analyzer |
| Pillow | Python Imaging Library |
| pmdarima | Python’s forecast::auto.arima equivalent |
| portend | TCP port monitoring and discovery |
| preshed | Cython hash table that trusts the keys are pre-hashed |
| prophet | Automatic Forecasting Procedure |
| protobuf | Protocol buffers |
| psycopg2 | PostgreSQL database adapter for Python |
| psycopg2-binary | psycopg2 - Python-PostgreSQL Database Adapter |
| pyasn1 | ASN.1 types and codecs |
| pyasn1-modules | pyasn1-modules |
| pycparser | C parser in Python |
| pydantic | Data validation and settings management using python type hints |
| pyLDAvis | Interactive topic model visualization |
| pymc3 | Statistical modeling and probabilistic machine learning |
| PyMeeus | Python implementation of Jean Meeus astronomical routines |
| pyparsing | Python parsing |
| python-dateutil | Extensions to the standard Python datetime module |
| python-docx | Create and update Microsoft Word .docx files |
| PyTorch | Tensors and Dynamic neural networks in Python with strong GPU acceleration |
| pytz | World timezone definitions, modern and historical |
| regex | Alternative regular expression module, to replace re |
| requests | HTTP library |
| requests-oauthlib | OAuthlib authentication support for Requests |
| rsa | Pure-Python RSA implementation |
| scikit-learn | Machine learning data mining and analysis |
| scipy | Scientific computing |
| semver | Python helper for Semantic Versioning |
| sgmllib3k | Py3k port of sgmllib |
| six | Python 2 and 3 compatibility library |
| sklearn | A set of python modules for machine learning and data mining |
| smart-open | Utilities for streaming large files (S3, HDFS, gzip, bz2, and so forth) |
| soupsieve | A modern CSS selector implementation for Beautiful Soup |
| spacy | Large scale natural language processing |
| spacy-legacy | Legacy registered functions for spaCy backwards compatibility |
| spacy-loggers | Logging utilities for SpaCy |
| spectrum | Spectrum Analysis Tools |
| SQLAlchemy | Database Abstraction Library |
| srsly | Modern high-performance serialization utilities for Python |
| statsmodels | Statistical modeling |
| tempora | Objects and routines pertaining to date and time |
| tensorboard | TensorBoard lets you watch Tensors Flow |
| tensorboard-data-server | Fast data loading for TensorBoard |
| tensorboard-plugin-wit | What-If Tool TensorBoard plugin |
| tensorflow | Numerical computation using data flow graphs |
| tensorflow-estimator | TensorFlow Estimator |
| tensorflow-io-gcs-filesystem | TensorFlow IO |
| termcolor | simple termcolor wrapper |
| Theano-PyMC | Theano-PyMC |
| thinc | Practical Machine Learning for NLP |
| threadpoolctl | Python helpers to limit the number of threads used in the threadpool-backed of common native libraries used for scientific computing and data science |
| toolz | List processing tools and functional utilities |
| tqdm | Fast, extensible progress meter |
| tslearn | A machine learning toolkit dedicated to time-series data |
| typer | Typer, build great CLIs. Easy to code. Based on Python type hints |
| typing_extensions | Backported and Experimental Type Hints for Python 3.7+ |
| urllib3 | HTTP library with thread-safe connection pooling, file post, and more |
| wasabi | Lightweight console printing and formatting toolkit |
| Werkzeug | Comprehensive WSGI web application library |
| wrapt | Module for decorators, wrappers and monkey patching |
| xarray | N-D labeled arrays and datasets in Python |
| xarray-einstats | Stats, linear algebra and einops for xarray |
| xgboost | Gradient boosting, classifying, ranking |
| xmltodict | Makes working with XML feel like you are working with JSON |
| zc.lockfile | Basic inter-process locks |
| zipp | Backport of pathlib-compatible object wrapper for zip files |
| tensorflow-gpu | An open source software library for high performance numerical computation |
| tensorflow | Numerical computation using data flow graphs |
| keras | An implementation of the Keras API that uses TensorFlow as a backend |

Installing a Data Science Package for Python

Before you install a Data Science Package for Python, make sure that Greenplum Database is running, that you have sourced greenplum_path.sh, and that the $MASTER_DATA_DIRECTORY and $GPHOME environment variables are set.
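The environment part of these prerequisites can be checked before running gppkg. The following is a minimal sketch, not part of the Greenplum tooling: check_gp_env is a hypothetical helper name, and it only tests that the two variables are non-empty. It does not verify that the database is running.

```shell
# Hypothetical pre-flight helper (not shipped with Greenplum).
# Reports each required environment variable that is unset or empty
# and returns a non-zero status if any is missing.
check_gp_env() {
    missing=0
    for var in GPHOME MASTER_DATA_DIRECTORY; do
        # Indirect lookup of the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "missing: $var" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_gp_env after sourcing greenplum_path.sh prints nothing and returns 0 when both variables are set.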

Note

The PyMC3 module depends on Tk. If you want to use PyMC3, you must install the tk OS package on every node in your cluster. For example:

    $ sudo yum install tk

  1. Locate the Data Science Package for Python that you built or downloaded.

    The file name format of the package is DataSciencePython<pythonversion>-gp6-rhel<n>-x86_64.gppkg. For example, the Data Science Package for Python 2.7 file for RHEL 8 is DataSciencePython2.7-2.0.4-gp6-rhel8_x86_64.gppkg, and the Python 3.9 package is DataSciencePython3.9-3.0.0-gp6-rhel8_x86_64.gppkg.

  2. Copy the package to the Greenplum Database master host.

  3. Follow the instructions in Verifying the Greenplum Database Software Download to verify the integrity of the Greenplum Procedural Languages Python Data Science Package software.

  4. Use the gppkg command to install the package. For example:

        $ gppkg -i DataSciencePython<pythonversion>-gp6-rhel<n>-x86_64.gppkg

    gppkg installs the Data Science Package for Python modules on all nodes in your Greenplum Database cluster. The command also updates the PYTHONPATH, PATH, and LD_LIBRARY_PATH environment variables in your greenplum_path.sh file.

  5. Restart Greenplum Database. You must re-source greenplum_path.sh before restarting your Greenplum cluster:

        $ source /usr/local/greenplum-db/greenplum_path.sh
        $ gpstop -r

The Data Science Package for Python modules are installed in the following directory for Python 2.7:

    $GPHOME/ext/DataSciencePython/lib/python2.7/site-packages/

For Python 3.9 the directory is:

    $GPHOME/ext/DataSciencePython/lib/python3.9/site-packages/
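To confirm where the modules landed after installation, a small shell sketch such as the following may help. The helper names ds_site_packages and ds_has_module are hypothetical. The check assumes that a module's import name appears as a directory or a .py file directly under site-packages, which holds for many but not all packages (for example, scikit-learn is imported as sklearn).

```shell
# Hypothetical helpers (not shipped with Greenplum).

# Print the site-packages directory for a Python version such as 2.7 or 3.9.
ds_site_packages() {
    [ -n "${GPHOME:-}" ] || { echo "GPHOME is not set" >&2; return 1; }
    printf '%s\n' "$GPHOME/ext/DataSciencePython/lib/python$1/site-packages"
}

# Succeed if the named module is present as a package directory
# or a single-file module in that site-packages directory.
ds_has_module() {
    dir="$(ds_site_packages "$1")" || return 1
    [ -d "$dir/$2" ] || [ -f "$dir/$2.py" ]
}
```

For example, `ds_has_module 3.9 numpy && echo "numpy found"` checks the Python 3.9 directory. Run it on each host if you want to confirm that every node received the modules.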

Uninstalling a Data Science Package for Python

Use the gppkg utility to uninstall a Data Science Package for Python. You must include the version number in the package name you provide to gppkg.

To determine your Data Science Package for Python version number and remove this package:

    $ gppkg -q --all | grep DataSciencePython
    DataSciencePython-<version>
    $ gppkg -r DataSciencePython-<version>

The command removes the Data Science Package for Python modules from your Greenplum Database cluster. It also updates the PYTHONPATH, PATH, and LD_LIBRARY_PATH environment variables in your greenplum_path.sh file to their pre-installation values.
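The query and remove steps can be chained so that the version number does not have to be copied by hand. This is a sketch under the assumption that gppkg -q --all prints each installed package as a single whitespace-free token in the DataSciencePython-<version> form shown above. ds_installed_package is a hypothetical helper name.

```shell
# Hypothetical helper (not shipped with Greenplum).
# Reads package-listing output on stdin and prints the first
# DataSciencePython entry, or nothing if none is listed.
ds_installed_package() {
    grep -o 'DataSciencePython[^[:space:]]*' | head -n 1
}

# Possible usage on the master host:
#   pkg="$(gppkg -q --all | ds_installed_package)"
#   [ -n "$pkg" ] && gppkg -r "$pkg"
```

Checking that the result is non-empty before calling gppkg -r avoids running the remove command with no package name.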

Re-source greenplum_path.sh and restart Greenplum Database after you remove the Data Science Package for Python:

    $ . /usr/local/greenplum-db/greenplum_path.sh
    $ gpstop -r

Note

After you uninstall a Data Science Package for Python from your Greenplum Database cluster, any UDFs that you have created that import Python modules installed with this package will return an error.
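To get a rough list of functions that may be affected, you can search the function source stored in the system catalog. The sketch below only prints a query. udfs_mentioning_sql is a hypothetical helper, the match is a plain substring search on the function body (so it can report false positives), and it assumes that the PL/Python language names begin with plpython, for example plpythonu or plpython3u.

```shell
# Hypothetical helper (not shipped with Greenplum).
# Prints a catalog query listing PL/Python functions whose source text
# mentions the given module name. Pass a simple module name such as numpy.
udfs_mentioning_sql() {
    module="$1"
    cat <<EOF
SELECT n.nspname, p.proname
FROM pg_proc p
JOIN pg_namespace n ON n.oid = p.pronamespace
JOIN pg_language l ON l.oid = p.prolang
WHERE l.lanname LIKE 'plpython%'
  AND p.prosrc LIKE '%${module}%';
EOF
}
```

The output can be piped to psql, for example `udfs_mentioning_sql numpy | psql -d <database>`, and repeated for each module your functions rely on.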