Updated on: 08 SEP 2020
VMware vSphere Bitfusion 2.0.2 | 08 SEP 2020 | Build 4
Several fixes, an update to support the Caffe framework, and a known issue when using the
Updated on: 04 AUG 2020
VMware vSphere Bitfusion 2.0.1 | 04 AUG 2020 | Build 3
A fix and a few small updates.
Updated on: 09 JUL 2020
VMware vSphere Bitfusion 2.0.0 | 09 JUL 2020 | Build 11
Check for additions and updates to these release notes.
What's in the Release Notes
The release notes cover the following topics:
What's New in vSphere Bitfusion
VMware vSphere Bitfusion shares accelerators such as GPUs to provide a pool of shared network-accessible resources capable of supporting resource-intensive artificial intelligence (AI) and machine learning (ML) workloads. vSphere Bitfusion works across AI frameworks, clouds, networks, and in environments such as virtual machines, containers, and notebooks. Here are some of the features highlighted in the vSphere Bitfusion releases.
vSphere Bitfusion 2.0.2
- Support for applications that register segmentation fault (SIGSEGV) handlers. This fixes an issue observed when using Caffe, a framework for deep learning.
- Fixes health check issues when using Paravirtual RDMA (PVRDMA).
- Fixes a potential freeze, or hang, when using the vSphere Bitfusion client.
- Fixes a race condition that can occur when updating vSphere Bitfusion cluster statistics.
vSphere Bitfusion 2.0.1
- Fixes license detection for vSphere 7.0b and later.
- Updated NVIDIA driver to version 440.95.01
- Support for multiple datacenters within a vCenter Server instance. (This does not mean that a Bitfusion Server can be supported by multiple vCenter Server instances.)
vSphere Bitfusion 2.0.0
- Dynamic, remote sharing. AI/ML applications do not need to be modified or recompiled. They run on client machines as-is, but their API calls to access GPUs are intercepted and sent for execution on Bitfusion server machines that house the physical GPUs. GPUs are allocated as needed and returned to the pool when a session or application completes.
- Partial sharing. GPU memory can be partitioned into slices of arbitrary, different sizes, then allocated to different clients for concurrent use.
- vCenter Server now hosts the vSphere Bitfusion management and analytic capabilities.
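The partial sharing described above can be pictured with a small toy model. This is only an illustrative sketch, not how vSphere Bitfusion is implemented: the `GpuPool` class, its method names, and the memory sizes are all hypothetical. It shows slices of different sizes being handed to different clients and returned to the pool when a session ends.

```python
from dataclasses import dataclass, field


@dataclass
class GpuPool:
    """Toy model of one GPU whose memory is shared in slices (hypothetical)."""
    total_mib: int
    allocations: dict = field(default_factory=dict)  # client name -> MiB held

    def free_mib(self) -> int:
        """Memory not currently held by any client."""
        return self.total_mib - sum(self.allocations.values())

    def allocate(self, client: str, mib: int) -> bool:
        """Give `client` a slice of `mib` MiB if enough memory is free."""
        if mib <= 0 or client in self.allocations or mib > self.free_mib():
            return False
        self.allocations[client] = mib
        return True

    def release(self, client: str) -> None:
        """Return the client's slice to the pool when its session ends."""
        self.allocations.pop(client, None)


pool = GpuPool(total_mib=16384)
pool.allocate("client-a", 4096)   # slices may have different sizes
pool.allocate("client-b", 10240)
print(pool.free_mib())  # 2048
```

In this model two clients hold slices of one GPU at the same time, and releasing a slice makes its memory available to the next request.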
The VMware Product Interoperability Matrix provides details about the compatibility of vSphere Bitfusion with various versions of VMware vSphere components, including ESXi, VMware vCenter Server, the vSphere Client, and optional VMware products.
To view a list of hardware devices that are compatible with vSphere Bitfusion 2.0, see the VMware Compatibility Guide.
Before You Begin
The Installation Guide covers the prerequisites, but a couple of items will be emphasized here. Bitfusion servers are VMs (deployed as OVAs) on physical servers with GPUs passed through.
- NVIDIA allows its commercial (data center class) GPUs to be passed through
Bitfusion clients run the AI/ML applications sharing the GPUs on the servers across the network.
- The minimum recommended network bandwidth is 10 Gbps
- The maximum recommended network latency between the client and servers is 50 microseconds
- vSphere Bitfusion supports TCP and RoCE transports
- IPv6 is not supported
- Do not attach two network adapters to the same network
- The Bitfusion servers must be deployed on ESXi 7 hosts. They require Enterprise Plus licensing and a Bitfusion Add-on for every two GPUs.
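The network recommendations above can be turned into a simple planning check. The following is a minimal sketch, assuming you already have measured or planned values for your network. The `NetworkPlan` class and `check_plan` function are hypothetical names used only for illustration and are not part of any vSphere Bitfusion tool.

```python
from dataclasses import dataclass
from typing import List

MIN_BANDWIDTH_GBPS = 10                  # minimum recommended bandwidth
MAX_LATENCY_US = 50                      # maximum recommended client-to-server latency
SUPPORTED_TRANSPORTS = {"tcp", "roce"}   # transports named in these release notes


@dataclass
class NetworkPlan:
    """Planned or measured properties of the client-to-server network (hypothetical)."""
    bandwidth_gbps: float
    latency_us: float
    transport: str
    uses_ipv6: bool


def check_plan(plan: NetworkPlan) -> List[str]:
    """Return a list of problems; an empty list means the plan meets the recommendations."""
    problems = []
    if plan.bandwidth_gbps < MIN_BANDWIDTH_GBPS:
        problems.append(f"bandwidth {plan.bandwidth_gbps} Gbps is below {MIN_BANDWIDTH_GBPS} Gbps")
    if plan.latency_us > MAX_LATENCY_US:
        problems.append(f"latency {plan.latency_us} us exceeds {MAX_LATENCY_US} us")
    if plan.transport.lower() not in SUPPORTED_TRANSPORTS:
        problems.append(f"transport {plan.transport!r} is not TCP or RoCE")
    if plan.uses_ipv6:
        problems.append("IPv6 is not supported")
    return problems


for issue in check_plan(NetworkPlan(1, 200, "tcp", uses_ipv6=True)):
    print("WARNING:", issue)
```

The bandwidth and latency values are recommendations rather than hard limits, so the sketch reports warnings instead of raising errors.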
Just for your information:
ulimits are increased for members of the bitfusion group on client machines. The file /etc/security/limits.d/bitfusion-limits.conf is automatically installed on the client by the client package. It contains the following settings:
    # max number of open files
    @bitfusion soft nofile 100000
    @bitfusion hard nofile 100000
    # Unlimited locked-in-memory address space
    @bitfusion soft memlock unlimited
    @bitfusion hard memlock unlimited
    # Unlimited max resident set size
    @bitfusion soft rss unlimited
    @bitfusion hard rss unlimited
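To confirm that these limits are in effect for a session, you can compare the limits of a running process against the values in the file. The sketch below uses Python's standard `resource` module (Unix only). The helper names `meets` and `check_limits` are hypothetical. It assumes the script is run by a member of the bitfusion group in a login session started after the client package was installed.

```python
import resource

# Expected values from bitfusion-limits.conf; None stands for "unlimited".
EXPECTED = {
    "nofile": (resource.RLIMIT_NOFILE, 100000),
    "memlock": (resource.RLIMIT_MEMLOCK, None),
    "rss": (resource.RLIMIT_RSS, None),
}


def meets(actual: int, expected) -> bool:
    """True if an rlimit value is unlimited or at least the expected number."""
    if actual == resource.RLIM_INFINITY:
        return True
    return expected is not None and actual >= expected


def check_limits() -> dict:
    """Map each limit name to (soft_ok, hard_ok) for the current process."""
    report = {}
    for name, (res, expected) in EXPECTED.items():
        soft, hard = resource.getrlimit(res)
        report[name] = (meets(soft, expected), meets(hard, expected))
    return report


for name, (soft_ok, hard_ok) in check_limits().items():
    print(f"{name}: soft={'ok' if soft_ok else 'low'} hard={'ok' if hard_ok else 'low'}")
```

A result of `low` for a user who belongs to the bitfusion group usually means the session predates the group membership or the file installation, so logging out and back in is a reasonable first step.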
Instructions for removing servers from the Bitfusion cluster:
- Delete the server from the vSphere Bitfusion GUI page in vCenter Server.
- Wait 1 to 2 minutes for the server to disconnect and update its cluster membership.
- Power off the VM.
However, when VMs or hosts go offline unexpectedly, removing the server is more difficult (because of the distributed database software). Use the following command instead:
- Log in to a surviving server VM and run `bitfusion removenode`. It should automatically detect downed nodes and update cluster membership accordingly.
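The two removal paths above can be summarized as a small decision helper. This is a minimal sketch for illustration only. The function names are hypothetical, and it assumes `bitfusion removenode` needs no additional arguments, which these release notes do not state. By default it only reports the command instead of running it.

```python
import subprocess
from typing import List

REMOVENODE_CMD = ["bitfusion", "removenode"]  # command named in these release notes


def removal_plan(server_reachable: bool) -> List[str]:
    """Return the removal steps that apply, depending on whether the server is still online."""
    if server_reachable:
        return [
            "Delete the server from the vSphere Bitfusion GUI page in vCenter Server",
            "Wait 1 to 2 minutes for the server to disconnect",
            "Power off the VM",
        ]
    return ["On a surviving server VM, run: " + " ".join(REMOVENODE_CMD)]


def run_removenode(dry_run: bool = True) -> str:
    """Run the cleanup command on a surviving server VM; with dry_run, only report it."""
    command = " ".join(REMOVENODE_CMD)
    if not dry_run:
        # check=True raises CalledProcessError if the command exits with an error
        subprocess.run(REMOVENODE_CMD, check=True)
    return command


for step in removal_plan(server_reachable=False):
    print(step)
```

Keeping the dry-run default makes it harder to change cluster membership by accident while testing the script.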
Open Source Components for vSphere Bitfusion 2.0
The copyright statements and licenses applicable to the open source software components distributed in vSphere Bitfusion 2.0 are available at http://www.vmware.com. You need to log in to your My VMware account. Then, from the Downloads menu, select vSphere Bitfusion. On the Open Source tab, you can also download the source files for any GPL, LGPL, or other similar licenses that require the source code or modifications to source code to be made available for the most recent available release of vSphere Bitfusion.
vSphere Bitfusion 2.0.2
- The `bitfusion net_perf` command might stop operating, or hang, when run on a cluster of GPU servers using PVRDMA. The `bitfusion net_perf` command tests bandwidth and latency between the vSphere Bitfusion client and vSphere Bitfusion servers.
vSphere Bitfusion 2.0.1
- During OVA deployment, the vSphere Bitfusion version is reported as 2.0.0 (not 2.0.1). After deployment, the version is reported correctly.
vSphere Bitfusion 2.0.0
- Downloading logs and backups from inside the plugin does not work in current versions of Chrome. This is because Chrome 83 and later restricts downloads within sandboxed iframes. The vSphere Bitfusion team is working with the vCenter Server team to determine a solution for the U1 release.
- Servers must be added to a cluster serially. Do not boot them all at the same time.
- You cannot add a node to a cluster if one node is down. You must either delete the down node or restart it, and then add the new node.
If you delete a server and eventually want it to rejoin, you must first remove its /etc/bitfusion/bitfusion-manager.yaml file.
- You cannot change whether or not to configure additional network adapters in the Customize vApp properties dialog during a server clone operation in vCenter Server. The "Configure Network Adapter" Yes/No values cannot be changed during the clone due to a bug in the 188.8.131.5200 version of the vSphere Client.
Here are two different ways that you can work around this issue:
- When creating the original copy that you intend to clone, make sure that you enable the network interfaces that you need.
- Use the vSphere Client vApp Options editor to change the values of these settings. vCenter > (select Bitfusion Server VM) > Configure > vApp Options > (pick field to edit) > EDIT.
- Scaling limit is not known. If you intend to have more than 25 servers, please confer with VMware support personnel.