We believe that applications must be modified to get the maximum possible benefit from PMEM devices. PMEM-aware applications work in conjunction with direct access (DAX) mode, in which the NVDIMM device is mounted in the guest OS with the DAX filesystem option. Applications use vPMEM by mapping a region from the device and accessing it directly in a byte-addressable manner.
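The mapping step can be illustrated with a minimal Python sketch. It assumes an ordinary temporary file as a stand-in for a file on a DAX-mounted filesystem (a path such as /mnt/pmem0/data would be hypothetical here), and the helper name write_and_read_back is made up for illustration. On a real DAX mount the same mapping would reach PMEM directly; PMEM libraries would also typically replace the flush call with CPU cache flushes.

```python
import mmap
import os
import tempfile


def write_and_read_back(path, size=4096):
    """Map `size` bytes of `path` and update them with plain memory stores."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        os.ftruncate(fd, size)               # the file must back the whole mapping
        with mmap.mmap(fd, size) as region:  # shared read/write mapping by default
            region[0:5] = b"hello"           # byte-addressable store, no write() call
            region[64] = 0x2A                # single-byte update at an arbitrary offset
            region.flush()                   # push the changes to the backing file
            return bytes(region[0:5]) + bytes([region[64]])
    finally:
        os.close(fd)


# Assumption: a temporary file stands in for a file on a DAX filesystem.
with tempfile.TemporaryDirectory() as d:
    result = write_and_read_back(os.path.join(d, "region.bin"))
```

The point of the sketch is the access pattern: once the region is mapped, updates are ordinary loads and stores at byte offsets rather than block-sized I/O requests.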

Examples of such modified applications are SQL Server using tail-of-the-log (TOFL) optimization and VMware’s modified Redis.

SQL Server and TOFL Optimization

We used HammerDB v2.23 to test SQL Server 2016 with various configurations.


  • Up to 16% increase in application performance using SQL TOFL optimization.
  • Up to 22% increase in application performance using vPMEM on top of SQL TOFL.
  • 93x decrease in LogFlushWaitTime.

How SQL TOFL works

SQL Server 2016 introduced an optimization known as tail-of-the-log, in which a portion of the log buffer is mapped in PMEM instead of DRAM. TOFL works as follows:

  1. Log records are copied persistently into the buffer.
  2. Whenever a commit arrives, the transaction completes, returning control to SQL Server (since it is persistent already).
  3. When the log block is full (typically 64 KB), it is closed out, ready to be hardened to disk.
  4. I/O is scheduled to persist the full block to disk.

Note: Without TOFL, SQL Server must wait until the log block is hardened to the disk.
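The four steps can be modeled with a toy Python sketch. This is not SQL Server's implementation: the class TailOfLog and its fields are hypothetical, a bytearray stands in for the PMEM-mapped tail of the log buffer, and a list stands in for the disk I/O queue. Only the 64 KB block size comes from the text.

```python
from dataclasses import dataclass, field

LOG_BLOCK_SIZE = 64 * 1024  # typical log block size quoted in the text


@dataclass
class TailOfLog:
    """Toy model: `tail` stands in for the PMEM-mapped part of the log buffer."""
    tail: bytearray = field(default_factory=bytearray)
    scheduled_io: list = field(default_factory=list)  # closed blocks queued for disk

    def append(self, record):
        # Step 1: the record counts as persistent once it is in the PMEM tail.
        self.tail += record
        # Steps 3 and 4: close out each full block and schedule it for hardening.
        while len(self.tail) >= LOG_BLOCK_SIZE:
            self.scheduled_io.append(bytes(self.tail[:LOG_BLOCK_SIZE]))
            del self.tail[:LOG_BLOCK_SIZE]

    def commit(self):
        # Step 2: nothing to wait for, because the records are already persistent.
        return True


log = TailOfLog()
log.append(b"x" * 1000)
committed_before_any_io = log.commit() and not log.scheduled_io
log.append(b"y" * LOG_BLOCK_SIZE)
```

In this model the first commit returns before any block has been queued for disk, which is the behavior the note above contrasts with the non-TOFL path.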

Figure 16: SQL Server-TOFL



Windows Server 2016


8 vCPUs


16 GB (SQL.Max.Memory = 14 GB)


100 GB DB, 22 GB logs


100 GB DB, 22 GB logs

Table 11: SQL Server VM configuration

  • Virtual Users:
  • Warm Up Time: 5 minutes
  • Run Time: 25 minutes


Table 12: HammerDB parameters for SQL Server

We used five configurations to quantify the performance gains:

  1. SSD – DB and logs are on NVMe SSD
  2. Logs-vPMEMDisk – Only logs are moved to vPMEMDisk (DB stays on SSD)
  3. Logs-vPMEM – Only logs are moved to vPMEM (DB stays on SSD)
  4. TOFL (Tail-of-the-log) – Only a portion of the log buffer (a.k.a. TOFL) is moved to a vPMEM DAX volume
  5. vPMEM-TOFL – Both DB and logs are moved to vPMEM, and a portion of the log buffer is on a vPMEM DAX volume


The numbers reported in Figures 17 and 18 and in Table 13 were obtained using Windows perfmon counters and SQL Server perfmon events. Appendix C describes the counters in more detail.

Figure 17 shows the IOPS breakdown for the different configurations. The first three configurations involve no SQL-specific PMEM change. Moving only the logs (22 GB) to vPMEM yields a 20% increase in log writes per second, which results in a 12% increase in performance, as shown in Figure 19. Log writes increase because LogFlushWaitTime is reduced by 4.5x to 65 msecs. In the TOFL configuration, only 20 MB of log buffer is mapped as byte-addressable PMEM.

With TOFL, the LogFlushWaits/sec event in Figure 18 drops by 114x to a value of 53. This is because SQL Server does not need to wait until the I/O is persisted on the disk: as soon as the commit arrives, the transaction is complete. The DB log writes per second are just 340, as shown in Figure 18, because the perfmon counter does not capture direct writes (memory load/store) to PMEM in DAX mode. The log write size increases to 61 KB with TOFL, compared to 3.6 KB in the SSD case. In the vPMEM-TOFL configuration, LogFlushWaitTime is reduced to 3.2 msecs, and performance increases by 22% compared to the baseline NVMe SSD case. The CPU utilization of the VM is 95% in the vPMEM-TOFL configuration, compared to 81% in the baseline.


  1. % Processor Time perfmon counter is used for the CPU numbers.
  2. LogFlushes/sec in Figure 18 is roughly equal to DB log writes/sec in Figure 17.
  3. We do not show latency numbers because Windows perfmon latency counters cannot capture anything less than 10 milliseconds of latency.
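As a rough cross-check of the quoted factors, the short Python calculation below back-computes the baseline values they imply. It assumes that the 4.5x and 114x reductions are both measured against the NVMe SSD baseline, which the text suggests but does not state outright. The results are derived from rounded inputs, so they are estimates rather than measured data.

```python
# Figures quoted in the text.
logs_vpmem_wait_ms = 65.0         # LogFlushWaitTime with logs on vPMEM
logs_vpmem_reduction = 4.5        # "reduced by 4.5x"
vpmem_tofl_wait_ms = 3.2          # LogFlushWaitTime in vPMEM-TOFL
tofl_flush_waits = 53             # LogFlushWaits/sec with TOFL
tofl_flush_waits_reduction = 114  # "reduced by 114x"

# Assumption: both reductions are relative to the NVMe SSD baseline.
implied_baseline_wait_ms = logs_vpmem_wait_ms * logs_vpmem_reduction
implied_total_reduction = implied_baseline_wait_ms / vpmem_tofl_wait_ms
implied_baseline_flush_waits = tofl_flush_waits * tofl_flush_waits_reduction

print(f"implied baseline LogFlushWaitTime: {implied_baseline_wait_ms:.1f} ms")
print(f"implied reduction in vPMEM-TOFL:   {implied_total_reduction:.1f}x")
print(f"implied baseline LogFlushWaits/s:  {implied_baseline_flush_waits}")
```

The implied overall reduction lands close to the 93x reported in the highlights; the small gap is consistent with the inputs having been rounded.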

Figure 17: HammerDB with SQL Server IOPs breakdown



Figure 18: SQL Server Perfmon Flush Events



Table 13: SQL Server Perfmon flush time event (LogFlushWaitTime, msecs)

Figure 19: HammerDB throughput gain with SQL Server

VMware’s Modified Redis

Redis is a popular open source, in-memory key-value store [18]. Redis can be used as a database, cache, or message broker. In this instance, we used Redis as a database with perfect consistency (every SET operation is synced to backing storage) and compared it to the PMEM-aware Redis developed in-house by VMware using the new PMEM software libraries [17]. PMEM-aware Redis offers perfect consistency by default because both SET and GET operations are done in PMEM rather than in DRAM, as is the case with unmodified Redis.

PMEM-aware Redis offers the following benefits:

  • Better performance with persistence and perfect consistency for all operations. This makes a case for using PMEM-aware Redis as a primary database.
  • Instant recovery from a crash. PMEM-aware Redis does all operations to and from PMEM, which persists across a crash. It does not need to load any data from disk to memory at initialization. Unmodified Redis, on the other hand, must load the database from disk to memory at initialization, which can take up to several minutes.
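The cost of the unmodified approach can be sketched with a toy Python key-value store. This is not how Redis implements persistence, and the source does not say which Redis persistence setting was used; the class SyncedLogStore and its log format are hypothetical. The sketch only illustrates the two properties discussed above: every SET waits for backing storage, and recovery has to read the stored data back into memory.

```python
import os
import tempfile


class SyncedLogStore:
    """Toy store: data lives in a dict (DRAM); durability comes from a log file."""

    def __init__(self, path):
        self.data = {}
        self.replayed = 0
        if os.path.exists(path):
            # Recovery: every logged SET is read back before serving requests,
            # so startup time grows with the size of the database.
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    key, value = line.rstrip("\n").split("\t", 1)
                    self.data[key] = value
                    self.replayed += 1
        self.log = open(path, "a", encoding="utf-8")

    def set(self, key, value):
        # Keys and values must not contain tabs or newlines in this toy format.
        self.log.write(f"{key}\t{value}\n")
        self.log.flush()
        os.fsync(self.log.fileno())  # wait for backing storage on every SET
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def close(self):
        self.log.close()


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "store.log")
    store = SyncedLogStore(path)
    store.set("a", "1")
    store.set("b", "2")
    store.close()                     # stand-in for a crash after the SETs were synced
    recovered = SyncedLogStore(path)  # must replay the log before it can serve GETs
    recovered_value = recovered.get("b")
    replayed = recovered.replayed
    recovered.close()
```

A PMEM-aware design avoids both costs in principle: the data structure itself lives in persistent memory, so there is no per-SET storage round trip and nothing to replay at startup.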

We used memtier_benchmark, an open source benchmarking tool for key-value stores, to drive Redis [19].


  • Redis throughput is 2.4x better in the vPMEM-aware case compared to NVMe SSD.
  • The vPMEM-aware configuration yields 12x better latency.
  • The vPMEM-aware configuration performs much better than vPMEM, making a case to modify applications.


Table 14 gives the details about the Redis VM.



CentOS 7.4


4 vCPUs


128 GB


128 GB


128 GB


128 GB

Table 14: Redis VM configuration

Table 15 shows the Redis parameters used.

  • DB Size: 25M keys (120 GB in memory)
  • Data Size: 4 KB
  • Throughput parameters: pipeline=4; clients=50; threads=4
  • Latency parameters: pipeline=1; clients=1; threads=1
  • 0% SETs; 20% SETs; 50% SETs; 80% SETs; 100% SETs

Table 15: Redis workload configuration
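The Table 15 parameters could map onto memtier_benchmark command lines roughly as in the Python sketch below. The exact invocation used in the tests is not given in the source, so this is an assumption throughout: the helper names are hypothetical, 4 KB is taken as 4096 bytes, and the key range is taken as 25,000,000. The flags themselves (--ratio as SET:GET, --data-size, --key-maximum, --pipeline, --clients, --threads) are standard memtier_benchmark options. Note that memtier_benchmark treats --clients as clients per thread; the source does not say whether clients=50 is per thread or in total.

```python
import math

SET_PERCENTAGES = [0, 20, 50, 80, 100]  # workload mixes listed in Table 15


def set_get_ratio(set_pct):
    """Convert a SET percentage into a reduced SET:GET ratio string."""
    get_pct = 100 - set_pct
    g = math.gcd(set_pct, get_pct)
    return f"{set_pct // g}:{get_pct // g}"


def memtier_args(set_pct, pipeline, clients, threads):
    """Build an argument list for one run (assumed mapping, not the authors' script)."""
    return [
        "memtier_benchmark",
        f"--ratio={set_get_ratio(set_pct)}",
        "--data-size=4096",        # assumption: 4 KB means 4096 bytes
        "--key-maximum=25000000",  # assumption: 25M keys as the key range
        f"--pipeline={pipeline}",
        f"--clients={clients}",
        f"--threads={threads}",
    ]


throughput_runs = [memtier_args(p, pipeline=4, clients=50, threads=4) for p in SET_PERCENTAGES]
latency_runs = [memtier_args(p, pipeline=1, clients=1, threads=1) for p in SET_PERCENTAGES]

for args in throughput_runs:
    print(" ".join(args))
```

Each list could be passed to subprocess.run against a running Redis server; server address and test duration are left out because the source does not specify them.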


Figure 20 shows the normalized throughput reported by memtier_benchmark in the different configurations. The 100% SETs case stresses storage the most; in that case, the vPMEM-aware configuration provides 2.4x the throughput of NVMe SSD and 1.7x the throughput of vPMEM.

Figure 20: Redis throughput


Figure 21 shows the normalized latency per operation reported by memtier_benchmark. Again, in the 100% SETs case, latency in the vPMEM-aware configuration is 12x better than with NVMe SSD and 2.8x better than with vPMEM.

Figure 21: Redis latency


Figure 22 shows the crash recovery time in seconds. Note that vPMEM-aware Redis recovers instantly.

Figure 22: Redis crash recovery time
