You can run the TensorFlow benchmarks to test the performance of your vSphere Bitfusion and TensorFlow deployment.

Running the benchmarks with different configurations shows how ML workloads behave in your vSphere Bitfusion environment.

Procedure

  1. To navigate to the ~/bitfusion/ directory, run cd ~/bitfusion/.
  2. To run the tf_cnn_benchmarks.py benchmark script on an entire GPU, run the bitfusion run command.
    The command in the example uses the entire memory of a single GPU and the pre-installed ML data in the /data directory.
    bitfusion run -n 1 -- python3 \
    ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --data_format=NCHW \
    --batch_size=64 \
    --model=resnet50 \
    --variable_update=replicated \
    --local_parameter_device=gpu \
    --nodistortions \
    --num_gpus=1 \
    --num_batches=100 \
    --data_dir=/data \
    --data_name=imagenet \
    --use_fp16=False
  3. To run the tf_cnn_benchmarks.py benchmark script on a partial GPU, run the bitfusion run command with the -p 0.67 parameter.
    The command in the example uses 67% of the memory of a single GPU and the pre-installed ML data in the /data directory. The -p 0.67 parameter leaves the remaining 33% of the GPU's memory available for another job.
    bitfusion run -n 1 -p 0.67 -- python3 \
    ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --data_format=NCHW \
    --batch_size=64 \
    --model=resnet50 \
    --variable_update=replicated \
    --local_parameter_device=gpu \
    --nodistortions \
    --num_gpus=1 \
    --num_batches=100 \
    --data_dir=/data \
    --data_name=imagenet \
    --use_fp16=False
  4. To run the tf_cnn_benchmarks.py benchmark script with synthesized data, run the bitfusion run command without the --data_dir and --data_name options.
    The command in the example uses the entire memory of a single GPU and no pre-installed ML data. Instead, TensorFlow creates synthesized data that stands in for a real set of images.
    bitfusion run -n 1 -- python3 \
    ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --data_format=NCHW \
    --batch_size=64 \
    --model=resnet50 \
    --variable_update=replicated \
    --local_parameter_device=gpu \
    --nodistortions \
    --num_gpus=1 \
    --num_batches=100 \
    --use_fp16=False
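
The three commands in the procedure differ only in the -p parameter and the data options, so they can be generated from one helper. The following is a minimal sketch, not part of vSphere Bitfusion: run_benchmark, SCRIPT, and DRY_RUN are illustrative names introduced here, and the script only prints each command unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only: assembles the bitfusion commands used in the procedure.
# run_benchmark, SCRIPT, and DRY_RUN are illustrative names, not Bitfusion features.

SCRIPT=./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py

# run_benchmark PARTIAL DATA_DIR
#   PARTIAL   value for -p, or "" to use the entire GPU memory
#   DATA_DIR  dataset directory, or "" to let TensorFlow synthesize data
run_benchmark() {
  rb_partial=$1
  rb_data_dir=$2

  # Build the argument list in the positional parameters (POSIX sh has no arrays).
  set -- bitfusion run -n 1
  if [ -n "$rb_partial" ]; then
    set -- "$@" -p "$rb_partial"
  fi
  set -- "$@" -- python3 "$SCRIPT" \
    --data_format=NCHW --batch_size=64 --model=resnet50 \
    --variable_update=replicated --local_parameter_device=gpu \
    --nodistortions --num_gpus=1 --num_batches=100
  if [ -n "$rb_data_dir" ]; then
    set -- "$@" --data_dir="$rb_data_dir" --data_name=imagenet
  fi
  set -- "$@" --use_fp16=False

  # Print the command by default; set DRY_RUN=0 to execute it.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run_benchmark ""   /data   # step 2: entire GPU, pre-installed data
run_benchmark 0.67 /data   # step 3: 67% of the GPU memory
run_benchmark ""   ""      # step 4: entire GPU, synthesized data
```

Run the script from the ~/bitfusion/ directory so that the relative path in SCRIPT resolves. Printing first lets you check each command against the procedure before you use any GPU time.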

Results

You can now run TensorFlow benchmarks with vSphere Bitfusion on shared GPUs from a remote server. The benchmarks support many models and parameters, so you can explore a wide range of machine learning workloads. For more information, see the VMware vSphere Bitfusion User Guide.
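
To compare configurations, it can help to save each run's output and extract the throughput figure. The following is a minimal sketch that assumes the benchmark script ends its output with a summary line of the form "total images/sec: <number>". Check this against your own output before relying on it. extract_throughput and the log file name are hypothetical.

```shell
#!/bin/sh
# Sketch only: pull the throughput figure out of a saved benchmark log.
# extract_throughput is an illustrative helper name.
# Assumes the log contains a line of the form "total images/sec: <number>".

# extract_throughput LOGFILE
#   Prints the value from the last matching summary line, or nothing if none is found.
extract_throughput() {
  awk -F': *' '
    /^total images\/sec:/ { value = $2 }
    END { if (value != "") print value }
  ' "$1"
}

# Usage (the log file name is hypothetical):
#   bitfusion run -n 1 -- python3 ... | tee full_gpu.log
#   extract_throughput full_gpu.log
```

Keeping one log per configuration, for example one for the entire GPU and one for -p 0.67, lets you compare the extracted values side by side.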