When your ESXi host connects to a storage device, it might experience a connectivity problem. Storage connectivity problems can have various causes. Although ESXi cannot always detect why a device or its paths are unavailable, the host can determine whether the problem is permanent or temporary. In other words, the host can differentiate between a permanent device loss (PDL) state of the device and a transient all paths down (APD) state of storage.

Permanent Device Loss (PDL)
A condition that occurs when a storage device permanently fails or is administratively removed or excluded. It is not expected to become available. When the device becomes permanently unavailable, ESXi receives appropriate sense codes or a login rejection from storage arrays, and is able to recognize that the device is permanently lost.
All Paths Down (APD)
A condition that occurs when a storage device becomes inaccessible to the host and no paths to the device are available. ESXi treats this as a transient condition because typically the problems with the device are temporary and the device is expected to become available again.

Connectivity Problems and vSphere High Availability

When the device enters the PDL or APD state, vSphere High Availability (HA) can detect connectivity problems and provide automated recovery for affected virtual machines on the ESXi host. vSphere HA uses VM Component Protection (VMCP) to protect virtual machines running on the host in the vSphere HA cluster against accessibility failures. For more information about VMCP and how to configure responses for datastores and virtual machines when the APD or PDL condition occurs, see the vSphere Availability documentation.

Detecting PDL Conditions

A storage device is considered to be in the permanent device loss (PDL) state when it becomes permanently unavailable to your ESXi host.

Typically, the PDL condition occurs when a device is unintentionally removed, when its unique ID changes, or when the device experiences an unrecoverable hardware error.

When the storage array determines that the device is permanently unavailable, it sends SCSI error sense codes or NVMe error codes to the ESXi host. After receiving these errors, your host recognizes the device as failed and registers the state of the device as PDL. For the device to be considered permanently lost, the sense codes must be received on all its paths.

After registering the PDL state of the device, the host stops attempts to reestablish connectivity or to send commands to the device.

The vSphere Client displays the following information for the device:
  • The operational state of the device changes to Lost Communication.
  • All paths are shown as Dead.
  • Datastores on the device are not available.

If no open connections to the device exist, or after the last connection closes, the host removes the PDL device and all paths to the device. You can deactivate the automatic removal of paths by setting the advanced host parameter Disk.AutoremoveOnPDL to 0.

If the device returns from the PDL condition, the host can discover it, but treats it as a new device. Data consistency for virtual machines on the recovered device is not guaranteed.

Note: When a device fails without sending appropriate SCSI sense codes or NVMe error codes or an iSCSI login rejection, the host cannot detect PDL conditions. In this case, the host continues to treat the device connectivity problems as APD even when the device fails permanently.
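The detection rules above can be summarized in a small model. The following Python sketch is illustrative only and is not how ESXi implements detection. The PathReport type and both function names are hypothetical, and the default value of 1 for Disk.AutoremoveOnPDL is an assumption (the text only states that 0 deactivates automatic removal).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PathReport:
    """Hypothetical summary of what the host last observed on one path."""
    reachable: bool            # the path currently delivers I/O
    pdl_signal: bool = False   # a PDL sense code, NVMe error code, or login rejection arrived


def classify_device(paths: Iterable[PathReport]) -> str:
    """Return 'available', 'PDL', or 'APD' for a device, given its paths."""
    paths = list(paths)
    if not paths:
        raise ValueError("a device needs at least one path")
    if any(p.reachable for p in paths):
        return "available"
    # PDL requires the permanent-loss signal on every path.
    if all(p.pdl_signal for p in paths):
        return "PDL"
    # No usable path and no complete PDL evidence: treated as transient.
    return "APD"


def host_removes_device(state: str, open_connections: int,
                        autoremove_on_pdl: int = 1) -> bool:
    """True if the host would drop the PDL device and its paths."""
    return state == "PDL" and open_connections == 0 and autoremove_on_pdl != 0
```

A device whose paths are all down but where only some paths returned a PDL signal is classified as APD in this model, which matches the behavior described in the note above.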

Permanent Device Loss and SCSI Sense Codes

The following VMkernel log example of a SCSI sense code indicates that the device is in the PDL state.
H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0 or Logical Unit Not Supported

Permanent Device Loss and NVMe Error Codes

The following VMkernel log example of an NVMe error code indicates that the device is in the PDL state.
H:0x0 D:0xb P:0x0 or H:0x0 D:0x11a P:0x0
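When scanning log files, the SCSI and NVMe indicators shown in the examples above can be matched with simple patterns. The following Python sketch assumes that the status fields appear in a log line exactly as written in those examples. The function name pdl_indicator is hypothetical, and real VMkernel log lines contain additional fields that this sketch ignores.

```python
import re
from typing import Optional

# Patterns copied from the two log examples in this section.
SCSI_PDL = re.compile(r"H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0")
NVME_PDL = re.compile(r"H:0x0 D:0x(?:b|11a) P:0x0")


def pdl_indicator(log_line: str) -> Optional[str]:
    """Return 'scsi' or 'nvme' if the line carries a PDL indicator, else None."""
    if SCSI_PDL.search(log_line):
        return "scsi"
    if NVME_PDL.search(log_line):
        return "nvme"
    return None
```

A match on a single line is not sufficient to conclude that the device is in the PDL state, because the codes must be received on all paths to the device.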

Permanent Device Loss and iSCSI

On iSCSI arrays with a single LUN per target, PDL is detected through iSCSI login failure. The iSCSI storage array rejects your host's attempts to start an iSCSI session with the reason Target Unavailable. As with the sense codes, this response must be received on all paths for the device to be considered permanently lost.

Permanent Device Loss and Virtual Machines

After registering the PDL state of the device, the host closes all I/O from virtual machines. vSphere HA can detect PDL and restart failed virtual machines.

Performing Planned Storage Device Removal

When a storage device is malfunctioning, you can avoid permanent device loss (PDL) or all paths down (APD) conditions by performing a planned removal and reconnection of the device.

Planned device removal is an intentional disconnection of a storage device. You might also plan to remove a device for such reasons as upgrading your hardware or reconfiguring your storage devices. When you perform an orderly removal and reconnection of a storage device, you complete a number of tasks.

  1. Migrate virtual machines from the device you plan to detach. See vCenter Server and Host Management.
  2. Unmount the datastore deployed on the device. See Unmount Datastores.
  3. Detach the storage device. See Detach Storage Devices.
  4. For an iSCSI device with a single LUN per target, delete the static target entry from each iSCSI HBA that has a path to the storage device. See Remove Dynamic or Static iSCSI Targets.
  5. Perform any necessary reconfiguration of the storage device by using the array console. See your vendor documentation.
  6. Reattach the storage device. See Attach Storage Devices.
  7. Mount the datastore and restart the virtual machines. See Mount Datastores.

Detach Storage Devices

Safely detach a storage device from your ESXi host.

You might need to detach the device to make it inaccessible to your host, when, for example, you perform a hardware upgrade on the storage side.

Prerequisites

  • The device does not contain any datastores.
  • No virtual machines use the device as an RDM disk.
  • The device does not contain a diagnostic partition or a scratch partition.

Procedure

  1. In the vSphere Client, navigate to the ESXi host.
  2. Click the Configure tab.
  3. Under Storage, click Storage Devices.
  4. Select the device to detach and click the Detach icon.

Results

The device becomes inaccessible. The operational state of the device changes to Unmounted.

What to do next

If multiple hosts share the device, detach the device from each host.

Attach Storage Devices

Reattach a storage device that you previously detached from the ESXi host.

Procedure

  1. In the vSphere Client, navigate to the ESXi host.
  2. Click the Configure tab.
  3. Under Storage, click Storage Devices.
  4. Select the detached storage device and click the Attach icon.

Results

The device becomes accessible.
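If you script the detach and attach operations instead of using the vSphere Client, the commands can be assembled as follows. This Python sketch assumes the esxcli storage core device set command with the -d and --state options. The function name and the device ID in the usage line are hypothetical. Verify the command against the ESXCLI reference for your ESXi version before running it on a host.

```python
from typing import List


def device_state_command(device_id: str, attached: bool) -> List[str]:
    """Build the argument list that detaches (off) or attaches (on) a device."""
    if not device_id or any(ch.isspace() for ch in device_id):
        raise ValueError("device_id must be a non-empty identifier without whitespace")
    state = "on" if attached else "off"
    return ["esxcli", "storage", "core", "device", "set",
            "-d", device_id, f"--state={state}"]


# Hypothetical device ID, shown only to illustrate the call:
# device_state_command("naa.example_id", attached=False)
```

Returning an argument list instead of a single string avoids shell quoting problems if the list is later passed to a process launcher.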

Recovering from PDL Conditions

An unplanned permanent device loss (PDL) condition occurs when a storage device becomes permanently unavailable without being properly detached from the ESXi host.

The following items in the vSphere Client indicate that the device is in the PDL state:
  • The datastore deployed on the device is unavailable.
  • The operational state of the device changes to Lost Communication.
  • All paths are shown as Dead.
  • A warning about the device being permanently inaccessible appears in the VMkernel log file.

To recover from the unplanned PDL condition and remove the unavailable device from the host, perform the following tasks.

  1. Power off and unregister all virtual machines that are running on the datastore affected by the PDL condition. See vSphere Virtual Machine Administration.
  2. Unmount the datastore. See Unmount Datastores.
  3. Rescan all ESXi hosts that had access to the device. See Perform Storage Rescan.
    Note: If the rescan is not successful and the host continues to list the device, some pending I/O or active references to the device might still exist. Check for any items that might still have an active reference to the device or datastore. The items include virtual machines, templates, ISO images, raw device mappings, and so on.

Handling Transient APD Conditions

A storage device is considered to be in the all paths down (APD) state when it becomes unavailable to your ESXi host for an unspecified time period.

The reasons for an APD state can be, for example, a failed switch or a disconnected storage cable.

In contrast with the permanent device loss (PDL) state, the host treats the APD state as transient and expects the device to be available again.

The host continues to retry issued commands in an attempt to reestablish connectivity with the device. If the host's commands fail the retries for a prolonged period, the host might be at risk of having performance problems. Potentially, the host and its virtual machines might become unresponsive.

To avoid these problems, your host uses a default APD handling feature. When a device enters the APD state, the host turns on a timer. With the timer on, the host continues to retry non-virtual machine commands for a limited time period only.

By default, the APD timeout is set to 140 seconds. This value is typically longer than most devices require to recover from a connection loss. If the device becomes available within this time, the host and its virtual machines continue to run without experiencing any problems.

If the device does not recover and the timeout ends, the host stops its attempts at retries and stops any non-virtual machine I/O. Virtual machine I/O continues retrying. The vSphere Client displays the following information for the device with the expired APD timeout:
  • The operational state of the device changes to Dead or Error.
  • All paths are shown as Dead.
  • Datastores on the device are dimmed.

Even though the device and datastores are unavailable, virtual machines remain responsive. You can power off the virtual machines or migrate them to a different datastore or host.

If later the device paths become operational, the host can resume I/O to the device and end the special APD treatment.
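The timer behavior described above can be modeled as a small state tracker. The following Python sketch is a simplified illustration and not the ESXi implementation. The ApdTracker class and its method names are hypothetical, times are plain numbers in seconds supplied by the caller, and treating the exact moment of expiry as timed out is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_APD_TIMEOUT = 140  # seconds, the default stated above


@dataclass
class ApdTracker:
    timeout: int = DEFAULT_APD_TIMEOUT
    apd_start: Optional[float] = None  # when the device entered APD; None while paths are up

    def paths_lost(self, now: float) -> None:
        """All paths went down: start the timer unless it is already running."""
        if self.apd_start is None:
            self.apd_start = now

    def paths_restored(self) -> None:
        """Paths are operational again: end the special APD treatment."""
        self.apd_start = None

    def timed_out(self, now: float) -> bool:
        return self.apd_start is not None and now - self.apd_start >= self.timeout

    def io_attempted(self, now: float, from_vm: bool) -> bool:
        """True if the host still issues or retries this I/O at time `now`."""
        if self.apd_start is None:
            return True                 # device reachable, normal I/O
        if from_vm:
            return True                 # virtual machine I/O keeps retrying
        return not self.timed_out(now)  # other I/O stops after the timeout
```

For example, with the default timeout, non-virtual machine I/O is still retried 100 seconds after the paths are lost but no longer 200 seconds after, while virtual machine I/O is retried in both cases.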

Deactivate Storage APD Handling

The storage all paths down (APD) handling on your ESXi host is activated by default. When this functionality is activated and a storage device enters the APD state, the host continues to retry non-virtual machine I/O commands only for a limited time period. After the time period expires, the host stops its retry attempts and terminates any non-virtual machine I/O. You can deactivate the APD handling feature on your host.

If you deactivate the APD handling, the host will indefinitely continue to retry issued commands in an attempt to reconnect to the APD device. This behavior might cause virtual machines on the host to exceed their internal I/O timeout and become unresponsive or fail. The host might become disconnected from vCenter Server.

Procedure

  1. In the vSphere Client, navigate to the ESXi host.
  2. Click the Configure tab.
  3. Under System, click Advanced System Settings.
  4. In the Advanced System Settings table, select the Misc.APDHandlingEnable parameter and click the Edit icon.
  5. Change the value to 0.

Results

After you deactivate the APD handling, you can reactivate it by setting the value to 1, even while a device is in the APD state. The internal APD handling feature turns on immediately, and the timer starts with the current timeout value for each device in APD.
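Advanced host parameters such as Misc.APDHandlingEnable can typically also be set from the command line. The following Python sketch builds such a command, assuming the esxcli system settings advanced set syntax with the -o (option path) and -i (integer value) options and assuming that the parameter maps to the option path /Misc/APDHandlingEnable. The function name is hypothetical. Confirm the syntax in the ESXCLI reference for your release.

```python
from typing import List


def advanced_int_setting_command(option_path: str, value: int) -> List[str]:
    """Build the esxcli argument list that sets an integer advanced option."""
    if not option_path.startswith("/"):
        raise ValueError("option path is expected in /Group/Name form")
    return ["esxcli", "system", "settings", "advanced", "set",
            "-o", option_path, "-i", str(int(value))]


# Deactivate APD handling (assumed option path):
deactivate = advanced_int_setting_command("/Misc/APDHandlingEnable", 0)
# Reactivate it later:
reactivate = advanced_int_setting_command("/Misc/APDHandlingEnable", 1)
```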

Change Timeout Limits for Storage APD

The timeout parameter controls how many seconds the ESXi host must retry the I/O commands to a storage device in an all paths down (APD) state. You can change the default timeout value.

The timeout period begins immediately after the device enters the APD state. After the timeout ends, the host marks the APD device as unreachable. The host stops its attempts to retry any I/O that is not coming from virtual machines. The host continues to retry virtual machine I/O.

By default, the timeout parameter on your host is set to 140 seconds. You can increase the value of the timeout if, for example, storage devices connected to your ESXi host take longer than 140 seconds to recover from a connection loss.

Note: If you change the timeout parameter after the device becomes unavailable, the change does not take effect for that particular APD incident.

Procedure

  1. In the vSphere Client, navigate to the ESXi host.
  2. Click the Configure tab.
  3. Under System, click Advanced System Settings.
  4. In the Advanced System Settings table, select the Misc.APDTimeout parameter and click the Edit icon.
  5. Change the default value.
    You can enter a value between 20 and 99999 seconds.
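If you set the timeout from a script, it can help to check the value against the accepted range first. The following Python sketch uses the limits stated above and assumes that both limits are inclusive. The constant and function names are hypothetical.

```python
APD_TIMEOUT_MIN = 20      # seconds, lower limit stated above
APD_TIMEOUT_MAX = 99999   # seconds, upper limit stated above


def validate_apd_timeout(seconds: int) -> int:
    """Return `seconds` unchanged if it is a valid Misc.APDTimeout value."""
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise TypeError("timeout must be a whole number of seconds")
    if not APD_TIMEOUT_MIN <= seconds <= APD_TIMEOUT_MAX:
        raise ValueError(
            f"timeout must be between {APD_TIMEOUT_MIN} and {APD_TIMEOUT_MAX} seconds"
        )
    return seconds
```

Keep in mind that a new value does not apply to a device that is already in the APD state when you make the change.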

Verify the Connection Status of a Storage Device on an ESXi Host

Use the esxcli command to verify the connection status of a particular storage device.

Prerequisites

Install ESXCLI. See Getting Started with ESXCLI. For troubleshooting, run esxcli commands in the ESXi Shell.

Procedure

  1. Run the esxcli storage core device list -d=device_ID command.
  2. Review the connection status in the Status: area.
    • on - Device is connected.
    • dead - Device has entered the APD state. The APD timer starts.
    • dead timeout - The APD timeout has expired.
    • not connected - Device is in the PDL state.
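To check the status in a script, you can extract the Status field from the command output and map it to the meanings listed above. The following Python sketch assumes that the output contains one "Status: value" line per device, as the procedure indicates. The function names are hypothetical, and the sample text is handwritten for illustration, not captured from a host.

```python
STATUS_MEANING = {
    "on": "Device is connected.",
    "dead": "Device has entered the APD state. The APD timer starts.",
    "dead timeout": "The APD timeout has expired.",
    "not connected": "Device is in the PDL state.",
}


def read_status(device_list_output: str) -> str:
    """Return the value of the first 'Status:' field in the output."""
    for line in device_list_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "Status":
            return value.strip()
    raise ValueError("no Status: field found")


def explain_status(status: str) -> str:
    """Map a status value to the meaning given in the list above."""
    return STATUS_MEANING.get(status, "Status not covered by the list above.")


# Handwritten, trimmed sample; real output contains many more fields.
sample = "device_ID\n   Display Name: Example Disk\n   Status: dead timeout\n"
message = explain_status(read_status(sample))
```

The lookup falls back to a generic message for values outside the four listed states, so an unexpected status does not raise an error.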