TensorFlow is an end-to-end open-source platform for machine learning. Its comprehensive, flexible ecosystem of tools, libraries, and community resources lets researchers push the state of the art in ML and lets developers easily build and deploy ML-powered applications.

TensorFlow can be used across a range of tasks but has a particular focus on training and inference of deep neural networks. The platform is a symbolic math library based on dataflow and differentiable programming.

Install TensorFlow

In this example, TensorFlow is the machine learning framework that you use with vSphere Bitfusion.

Install TensorFlow by using pip3, which is the package installer for Python 3. The procedure is applicable for Ubuntu 20.04, CentOS 8, and Red Hat Linux 8.

Prerequisites

  • Verify that you have installed a vSphere Bitfusion client.
  • Verify that you have installed NVIDIA CUDA and NVIDIA cuDNN on your Linux operating system.

Procedure

  1. If you install TensorFlow on Ubuntu 20.04, install additional Python resources.
    sudo apt-get -y install python3-testresources
  2. Install pip3 by running the command sequence for your Linux distribution and version.
    • Ubuntu 20.04
      sudo apt-get install -y python3-pip
    • CentOS 8 and Red Hat Linux 8
      sudo yum install -y python36-devel
      sudo pip3 install -U pip setuptools
  3. Install TensorFlow by using the pip3 install command.
    sudo pip3 install tensorflow-gpu==2.4
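
After the installation, a quick import check can confirm that python3 finds the package. The following is a minimal sketch and not part of the vSphere Bitfusion or TensorFlow tooling: the function names tf_version and tf_gpu_count are hypothetical, and only tf.__version__ and tf.config.list_physical_devices come from the TensorFlow Python API.

```shell
# Hypothetical helper: print the installed TensorFlow version.
# Returns 1 and prints a message on stderr if the import fails.
tf_version() {
    # TensorFlow writes its startup log messages to stderr, so they are discarded here.
    python3 -c 'import tensorflow as tf; print(tf.__version__)' 2>/dev/null || {
        echo "TensorFlow is not importable by python3" >&2
        return 1
    }
}

# Hypothetical helper: print the number of GPUs that TensorFlow can see
# from the process in which python3 runs.
tf_gpu_count() {
    python3 -c 'import tensorflow as tf; print(len(tf.config.list_physical_devices("GPU")))' 2>/dev/null
}
```

After the steps above, tf_version should print a version string that begins with 2.4. On a vSphere Bitfusion client the GPUs are remote, so tf_gpu_count will likely report 0 unless the underlying python3 command is started through bitfusion run.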

Install TensorFlow Benchmarks

The TensorFlow benchmarks are open-source ML applications designed to test the performance of the TensorFlow framework.

You clone the TensorFlow benchmarks repository to your local environment and check out one of its branches. In Git, a branch is a separate line of development.

Prerequisites

Verify that you have installed TensorFlow.

Procedure

  1. Install git.
    • Ubuntu 20.04
      sudo apt install -y git
    • CentOS 8 and Red Hat Linux 8
      sudo yum -y update
      sudo yum install -y git
  2. Create the ~/bitfusion directory and make it your working directory.
    mkdir -p ~/bitfusion
    cd ~/bitfusion
  3. Clone the Git repository of the TensorFlow benchmarks to your local environment.
    git clone https://github.com/tensorflow/benchmarks.git
  4. Navigate to the benchmarks directory and list branches of the repository.
    cd benchmarks
    git branch -a
    master
    remotes/origin/HEAD -> origin/master
    ...
    remotes/origin/cnn_tf_v1.13_compatible
    ...
    remotes/origin/cnn_tf_v2.1_compatible
    ...
  5. Check out the cnn_tf_v2.1_compatible branch and list the local branches of the repository.
    git checkout cnn_tf_v2.1_compatible
    Branch cnn_tf_v2.1_compatible set up to track remote branch cnn_tf_v2.1_compatible from origin.
    Switched to a new branch 'cnn_tf_v2.1_compatible'
    git branch
    cnn_tf_v2.1_compatible
    master
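
The branch names in the listing from step 4 encode a TensorFlow version in the form cnn_tf_v<version>_compatible. As a rough sketch, those versions could be extracted from the git branch -a output with sed. The function name list_compatible_versions is hypothetical, and the pattern assumes that branch names follow the form shown in that listing.

```shell
# Hypothetical helper: read `git branch -a` output on stdin and print the
# TensorFlow versions named by cnn_tf_v<major>.<minor>_compatible branches,
# one per line, sorted numerically and without duplicates.
list_compatible_versions() {
    sed -n 's/.*cnn_tf_v\([0-9][0-9]*\.[0-9][0-9]*\)_compatible.*/\1/p' |
        sort -t . -k 1,1n -k 2,2n -u
}
```

A possible use is git branch -a | list_compatible_versions, run from the benchmarks directory. The sketch only lists versions. It does not decide which branch best matches a given TensorFlow release.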

Run TensorFlow Benchmarks

You can run the TensorFlow benchmarks to test the performance of your vSphere Bitfusion and TensorFlow deployment.

By running the TensorFlow benchmarks and using various configurations, you can understand how ML workloads respond in your vSphere Bitfusion environment.

Procedure

  1. To navigate to the ~/bitfusion/ directory, run cd ~/bitfusion/.
  2. To use the tf_cnn_benchmarks.py benchmark script, run the bitfusion run command.
    The example command uses the entire memory of a single GPU and the pre-installed ML data in the /data directory.
    bitfusion run -n 1 -- python3 \
    ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --data_format=NCHW \
    --batch_size=64 \
    --model=resnet50 \
    --variable_update=replicated \
    --local_parameter_device=gpu \
    --nodistortions \
    --num_gpus=1 \
    --num_batches=100 \
    --data_dir=/data \
    --data_name=imagenet \
    --use_fp16=False
  3. To use the tf_cnn_benchmarks.py benchmark script, run the bitfusion run command with the -p 0.67 parameter.
    The example command uses 67% of the memory of a single GPU and the pre-installed ML data in the /data directory. The -p 0.67 parameter leaves the remaining 33% of the GPU's memory available for another job.
    bitfusion run -n 1 -p 0.67 -- python3 \
    ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --data_format=NCHW \
    --batch_size=64 \
    --model=resnet50 \
    --variable_update=replicated \
    --local_parameter_device=gpu \
    --nodistortions \
    --num_gpus=1 \
    --num_batches=100 \
    --data_dir=/data \
    --data_name=imagenet \
    --use_fp16=False
  4. To use the tf_cnn_benchmarks.py benchmark script, run the bitfusion run command with synthesized data.
    The example command uses the entire memory of a single GPU and no pre-installed ML data. Without the --data_dir and --data_name parameters, the benchmark runs on synthesized data that stands in for a real set of images.
    bitfusion run -n 1 -- python3 \
    ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --data_format=NCHW \
    --batch_size=64 \
    --model=resnet50 \
    --variable_update=replicated \
    --local_parameter_device=gpu \
    --nodistortions \
    --num_gpus=1 \
    --num_batches=100 \
    --use_fp16=False
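
The three commands in this procedure differ only in the -p parameter and in the two data parameters. A small wrapper can make that difference explicit. This is a sketch under stated assumptions, not a vSphere Bitfusion feature: the function name run_benchmark and the DRY_RUN variable are hypothetical, every flag is copied from the commands above, and the working directory is assumed to be ~/bitfusion.

```shell
# Hypothetical wrapper around the benchmark commands in this procedure.
#   $1: GPU memory fraction for -p (empty string: whole GPU, no -p flag)
#   $2: data directory (empty string: synthesized data, no data flags)
# If DRY_RUN is set to a non-empty value, the command is printed, not run.
run_benchmark() {
    local fraction="${1:-}" data_dir="${2:-}"

    # Build the argument list in the positional parameters of this function.
    set -- bitfusion run -n 1
    if [ -n "$fraction" ]; then
        set -- "$@" -p "$fraction"
    fi
    set -- "$@" -- python3 \
        ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
        --data_format=NCHW --batch_size=64 --model=resnet50 \
        --variable_update=replicated --local_parameter_device=gpu \
        --nodistortions --num_gpus=1 --num_batches=100
    if [ -n "$data_dir" ]; then
        set -- "$@" --data_dir="$data_dir" --data_name=imagenet
    fi
    set -- "$@" --use_fp16=False

    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

For example, DRY_RUN=1 run_benchmark 0.67 /data prints the command from step 3 on a single line, and run_benchmark "" "" runs the synthesized-data command from step 4.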

Results

You can now run the TensorFlow benchmarks on vSphere Bitfusion by using shared GPUs from a remote server. The benchmarks support many models and parameters to help you explore a large space within the machine learning discipline. For more information, see the VMware vSphere Bitfusion User Guide.