How to Install TensorFlow and Run the TensorFlow Benchmarks in vSphere Bitfusion

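TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that helps researchers advance the state of the art in ML and helps developers easily build and deploy ML-powered applications.

TensorFlow can be used across a range of tasks, but it focuses in particular on the training and inference of deep neural networks. The platform is a symbolic math library based on dataflow and differentiable programming.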

Install TensorFlow

TensorFlow is the machine learning framework that you use with vSphere Bitfusion.

Install TensorFlow by using pip3, the package installer for Python 3. This procedure applies to Ubuntu 20.04, CentOS 8, and Red Hat Linux 8.

Prerequisites

Verify that you have installed the vSphere Bitfusion client.
Verify that you have installed NVIDIA CUDA and NVIDIA cuDNN on your Linux operating system.
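
The prerequisites above can be spot-checked from a shell before you continue. The following is a minimal sketch under stated assumptions: it assumes that the client provides a `bitfusion` executable on the PATH (the command used later in this procedure), that the CUDA toolkit exposes `nvcc` on the PATH, and that cuDNN is registered with the dynamic linker as `libcudnn`. The helper name `have_tool` is hypothetical, and your installation might place these components elsewhere.

```shell
# Return success if the named command is available on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Assumption: the Bitfusion client and the CUDA compiler are on the PATH.
for tool in bitfusion nvcc; do
    if have_tool "$tool"; then
        echo "found: $tool"
    else
        echo "missing: $tool"
    fi
done

# Assumption: cuDNN is a shared library listed in the linker cache.
if ldconfig -p 2>/dev/null | grep -q libcudnn; then
    echo "found: libcudnn"
else
    echo "missing: libcudnn"
fi
```

A "missing" line does not prove that a component is absent, only that it was not found in the assumed location.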

Procedure

If you are installing TensorFlow on Ubuntu 20.04, install the additional Python resources.
```
sudo apt-get -y install python3-testresources
```
Install pip3 by running the sequence of commands for your Linux distribution and version.
- Ubuntu 20.04
```
sudo apt-get install -y python3-pip
```
- CentOS 8 and Red Hat Linux 8
```
sudo yum install -y python36-devel
sudo pip3 install -U pip setuptools
```
Install TensorFlow by using the pip3 install command.
```
sudo pip3 install tensorflow-gpu==2.4
```
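
One way to confirm the installation is to import the package and print its version. The following is a minimal sketch, assuming that `python3` is the interpreter that pip3 installed into. The helper name `py_module_version` is hypothetical.

```shell
# Print the version of a Python module, or fail if it cannot be imported.
py_module_version() {
    python3 -c "import $1; print($1.__version__)" 2>/dev/null
}

if version=$(py_module_version tensorflow); then
    echo "TensorFlow $version is installed"
else
    echo "TensorFlow could not be imported"
fi
```

The check only confirms that the package imports. It does not verify that a GPU is reachable through vSphere Bitfusion.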

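Install the TensorFlow Benchmarks

The TensorFlow benchmarks are open source ML applications designed to test the performance of the TensorFlow framework.

You can clone the TensorFlow benchmarks to your local environment and check out a branch. In Git, a branch is an independent line of development.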

Prerequisites

Verify that you have installed TensorFlow.

Procedure

Install git.

- Ubuntu 20.04
```
sudo apt install -y git
```
- CentOS 8 and Red Hat Linux 8
```
sudo yum -y update
sudo yum install git
```

Create ~/bitfusion and make it your working directory.
```
mkdir -p ~/bitfusion
cd ~/bitfusion
```
Clone the Git repository of the TensorFlow benchmarks to your local environment.
```
git clone https://github.com/tensorflow/benchmarks.git
```

Navigate to the benchmarks directory and list the branches of the repository.

```
cd benchmarks
git branch -a
```

```
master
remotes/origin/HEAD -> origin/master
...
remotes/origin/cnn_tf_v1.13_compatible
...
remotes/origin/cnn_tf_v2.1_compatible
...
```
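
The full branch listing can be long, so it can help to filter it down to the version-specific benchmark branches. The following sketch assumes that those branches follow the `cnn_tf_v<version>_compatible` naming seen in the listing above. The helper name `list_compatible_branches` is hypothetical.

```shell
# Read `git branch -a` output on stdin and print only the names of
# remote branches that look like cnn_tf_v<version>_compatible.
list_compatible_branches() {
    sed -n 's|^[* ]*remotes/origin/\(cnn_tf_v[0-9.]*_compatible\)[[:space:]]*$|\1|p'
}

# Run the filter only when the current directory is inside a Git repository.
if git rev-parse --git-dir >/dev/null 2>&1; then
    git branch -a | list_compatible_branches
fi
```

Run it from the benchmarks directory. Choosing the branch that best matches your installed TensorFlow version is left to you.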

Check out the cnn_tf_v2.1_compatible branch and list the local branches of the TensorFlow benchmarks repository.

```
git checkout cnn_tf_v2.1_compatible
```

```
Branch cnn_tf_v2.1_compatible set up to track remote branch cnn_tf_v2.1_compatible
from origin.
Switched to a new branch 'cnn_tf_v2.1_compatible'
```

```
git branch
```

```
cnn_tf_v2.1_compatible
master
```

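Run the TensorFlow Benchmarks

You can run the TensorFlow benchmarks to test the performance of your vSphere Bitfusion and TensorFlow deployment.

By running the TensorFlow benchmarks with various configurations, you can understand how ML workloads respond in your vSphere Bitfusion environment.

Procedure

To navigate to the ~/bitfusion/ directory, run cd ~/bitfusion/.

To use the tf_cnn_benchmarks.py benchmark script, run the bitfusion run command.

By running the command in the example, you use the entire memory of a single GPU and the preinstalled ML data in the /data directory.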

```
bitfusion run -n 1 -- python3 \
./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--data_format=NCHW \
--batch_size=64 \
--model=resnet50 \
--variable_update=replicated \
--local_parameter_device=gpu \
--nodistortions \
--num_gpus=1 \
--num_batches=100 \
--data_dir=/data \
--data_name=imagenet \
--use_fp16=False
```
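
To compare runs, it helps to pull the throughput figure out of the benchmark output. The following sketch assumes that the script ends with a summary line of the form `total images/sec: <number>`. Check this against your own output. The helper name `extract_throughput` and the file name `benchmark.log` are hypothetical.

```shell
# Read benchmark output on stdin and print the number that follows
# "total images/sec:" (assumed format of the summary line).
extract_throughput() {
    awk -F': *' '/^total images\/sec:/ { print $2 }'
}

# benchmark.log is a hypothetical file holding the saved output of one run.
if [ -f benchmark.log ]; then
    extract_throughput < benchmark.log
fi
```

You can produce such a log by piping the output of the bitfusion run command through `tee benchmark.log`.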

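To use the tf_cnn_benchmarks.py benchmark script with part of a GPU, run the bitfusion run command with the -p 0.67 parameter.

By running the command in the example, you use 67% of the memory of a single GPU and the preinstalled ML data in the /data directory. The -p 0.67 parameter leaves the remaining 33% of the GPU memory available as a partition in which another job can run.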

```
bitfusion run -n 1 -p 0.67 -- python3 \
./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--data_format=NCHW \
--batch_size=64 \
--model=resnet50 \
--variable_update=replicated \
--local_parameter_device=gpu \
--nodistortions \
--num_gpus=1 \
--num_batches=100 \
--data_dir=/data \
--data_name=imagenet \
--use_fp16=False
```
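
To get a feel for what a fraction means in absolute terms, you can work out the split for your own GPU. The following sketch assumes that the memory granted to the job is simply the total multiplied by the value passed to -p. The 16 GB total is a hypothetical example, not a value from this procedure, and the helper name `gpu_memory_split` is also hypothetical.

```shell
# Print the memory given to the job and the memory left over,
# in the same unit as the total.
#   $1 = total GPU memory (hypothetical example value below)
#   $2 = fraction passed to -p
gpu_memory_split() {
    awk -v total="$1" -v frac="$2" \
        'BEGIN { printf "%.2f %.2f\n", total * frac, total * (1 - frac) }'
}

gpu_memory_split 16 0.67   # prints "10.72 5.28"
```

Under that assumption, a job started with -p 0.67 on a 16 GB GPU would receive about 10.72 GB and leave about 5.28 GB for other jobs.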

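To use the tf_cnn_benchmarks.py benchmark script with synthetic data, run the bitfusion run command without the --data_dir and --data_name parameters.

By running the command in the example, you use the entire memory of a single GPU and no preinstalled ML data. TensorFlow can generate synthetic data as a set of simulated images.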

```
bitfusion run -n 1 -- python3 \
./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--data_format=NCHW \
--batch_size=64 \
--model=resnet50 \
--variable_update=replicated \
--local_parameter_device=gpu \
--nodistortions \
--num_gpus=1 \
--num_batches=100 \
--use_fp16=False
```
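
The only difference between the real-data and synthetic-data commands is the pair of dataset parameters, so a script can choose between them automatically. The following sketch adds the parameters only when the data directory exists. The helper name `data_flags` is hypothetical, the flag list is shortened for illustration, and the command is printed with `echo` instead of being run.

```shell
# Print the dataset parameters if the directory exists.
# Print nothing otherwise, so the benchmark falls back to synthetic data.
data_flags() {
    if [ -d "$1" ]; then
        printf '%s\n' "--data_dir=$1 --data_name=imagenet"
    fi
}

DATA_DIR=/data
# $(data_flags ...) is left unquoted on purpose so that it splits into
# separate parameters. Remove `echo` to run the command for real.
echo bitfusion run -n 1 -- python3 \
    ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --model=resnet50 --batch_size=64 --num_gpus=1 --num_batches=100 \
    $(data_flags "$DATA_DIR")
```

The unquoted expansion works only for directory paths without spaces. A path that contains spaces would need a different approach.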

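Results

You can now run the TensorFlow benchmarks with vSphere Bitfusion by using shared GPUs from a remote server. The benchmarks support many models and parameters to help you explore a large space within the machine learning discipline. For more information, see the VMware vSphere Bitfusion User Guide.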