Tuesday, April 21, 2015

3rd Try Charm: Why does pvscsi rather than LSI matter so much for SQL Server on VMware?

I've written two blog posts recently on the performance benefits of the VMware pvscsi vHBA over the LSI vHBA.

March 23, 2015
SQL Server on VMware: LSI vHBA vs pvscsi vHBA

April 7, 2015
Another SQL Server VMware LSI vs pvscsi vHBA blog post

These posts give details from different systems (although running a similar workload), and claim a benefit of roughly 10-fold in peak throughput and peak disk response times from switching the SQL Server LUNs from the default LSI vHBA to the pvscsi vHBA.  That sounds a little fishy, doesn't it?  Especially if you are familiar with...

Achieving a Million I/O Operations per Second from a Single VMware vSphere® 5.0 Host

Page 10 of the performance study above includes the following text.
"… a PVSCSI adapter provides 8% better throughput at 10% lower CPU cost."

That is a much more modest (and probably more believable) claim than 10x performance benefit.  What gives?

Here are a few details about the mechanics:
The LSI vHBA has an adapter queue depth of 128, which cannot be increased.  Its LUN queue depth cannot be increased from the default of 32.
The pvscsi vHBA has a default adapter queue depth of 256 and a default LUN queue depth of 64.  With Windows registry settings, the adapter queue depth can be increased to 1024 and the LUN queue depth to 256.
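To make the arithmetic concrete, here's a small sketch using the queue depths above.  The function is mine, purely illustrative: when a burst of outstanding IOs exceeds the aggregate adapter queue depth, the excess IOs have to wait in the guest before they ever reach the hypervisor.

```python
def overflow(outstanding_io, adapter_depth, adapters=1):
    """Return how many outstanding IOs exceed the aggregate adapter queue depth."""
    capacity = adapter_depth * adapters
    return max(0, outstanding_io - capacity)

# LSI: fixed adapter queue depth of 128; even 4 adapters only reach 512
print(overflow(1000, 128, adapters=4))   # 488 IOs left queuing in the guest

# pvscsi: adapter queue depth tunable up to 1024
print(overflow(1000, 1024))              # 0 - the whole burst fits
```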

And here's a detail about the testing that I just happened to come across in the Longwhiteclouds blog.

"Maximum queue depth supported by LSI driver (in guest) cannot be changed. So to keep the aggregate outstanding I/Os per controller lower than the max queue depth we had to use 16 OIOs per vDisk. To have a fair comparison between LSI and pvSCSI, second test also had 16 OIOs per vDisk for pvSCSI as well. " Chethan Kumar, author of the VMware paper, as quoted on Longwhiteclouds.

So, comparison testing was done within the queue depth constraints of the LSI vHBA.  But in the case of these Enterprise Edition SQL Server workloads, the number of outstanding IOs would often exceed 600, with microbursts as high as 1000 outstanding IOs.  That's well outside the LSI adapter queue depth, and the queuing penalty can be high.  Even with 4 LSI adapters in a VM, the aggregate adapter queue depth would be only 512.

If a SQL Server workload doesn't burst to more than 32 outstanding IOs per LUN or more than 128 outstanding IOs per vHBA adapter, the change to pvscsi would most likely bring rather modest performance benefits - along the lines of the 8% better throughput at 10% lower CPU utilization indicated in the whitepaper.  In fact, at that low level of outstanding IO, there might even be a slight performance decline.  That's because the LSI vHBA can allow an IO request up to 32 MB in size.  SQL Server won't (yet) perform disk IO that large - the largest disk IO I've seen from SQL Server has been 4 MB.*  The pvscsi vHBA currently allows a maximum disk IO size of 512 KB.
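A quick sketch of what that size cap means (the helper is mine, for illustration only): a single large SQL Server read that fits in one LSI-sized IO has to be split into multiple transfers when it passes through the pvscsi adapter.

```python
import math

def ios_needed(request_bytes, max_io_bytes):
    """Number of transfers a single large IO request splits into."""
    return math.ceil(request_bytes / max_io_bytes)

KB = 1024
MB = 1024 * KB

# A 4 MB SQL Server read fits in one IO under the LSI 32 MB cap...
print(ios_needed(4 * MB, 32 * MB))    # 1
# ...but splits into 8 transfers under the pvscsi 512 KB cap
print(ios_needed(4 * MB, 512 * KB))   # 8
```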

However - really large disk IOs from SQL Server are, in my experience, fairly rare, and high aggregate queue length is more common.  For that reason, I heartily recommend using the pvscsi vHBA for SQL Server VMs.  Retaining the LSI vHBA for the boot drive is common, even when pvscsi vHBAs are added for database LUNs.  I've got nothing against that approach.  But it's important to ensure that a SQL Server VM can handle the outstanding IO generated by its workload.  CPUs are hungry - feed them lots of data quickly :-).

*But Niko has shown that columnstore will do read IO up to 8 MB :-)
Clustered Columnstore Indexes – part 50 ("Columnstore IO")

Thursday, April 16, 2015

SQL Server PLE is a *very* tricky performance metric

PLE, or page life expectancy, is a metric that can be retrieved from SQL Server DMVs or from perfmon when monitoring a system.

It is a measure of the expected time a database page entering the buffer pool will remain there before being evicted.  Pressure on the buffer pool causes pages to be evicted so that they can be added to the free list.  That pressure can come from database pages being read in, from pages being allocated and written to for the first time (first writes to a page don't require the page to be read), or from buffer pool pages being stolen by other memory consumers such as the plan cache, query memory, etc.  Finally, the buffer pool may be pressured to evict contents so that "total server memory" can shrink in response to low server memory.
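A back-of-the-envelope way to think about this (my simplification, not SQL Server's internal formula): if pages enter and leave the buffer pool at a steady rate, the expected lifetime of a page is roughly pool size divided by churn rate.

```python
def approx_ple(buffer_pool_gb, churn_mb_per_sec):
    """Rough steady-state page life expectancy, in seconds."""
    pool_mb = buffer_pool_gb * 1024
    return pool_mb / churn_mb_per_sec

# Hypothetical numbers: a 128 GB buffer pool churning at 400 MB/s
print(round(approx_ple(128, 400)))   # ~328 seconds
```

Note that every pressure source listed above - reads, first writes, stolen pages, pool shrink - contributes to the churn term, which is why PLE alone can't tell you *which* pressure is at work.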

In addition to the mix of factors which can change the buffer pool eviction rate, PLE is further complicated by the fact that on a NUMA server, the buffer pool is by default split into as many sections as there are NUMA nodes - and each section has its own PLE.  The general PLE reported for the instance is an aggregation of the per-NUMA-node PLEs.
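The exact aggregation SQL Server uses isn't spelled out here, so treat the following as an illustration under an assumed harmonic-mean aggregation: the point is that one memory-starved NUMA node can drag the instance-level PLE down far more than a simple average would suggest.

```python
def harmonic_mean(values):
    """Harmonic mean - dominated by the smallest values."""
    return len(values) / sum(1 / v for v in values)

node_ples = [5000, 5000, 5000, 100]        # one starved NUMA node
print(round(sum(node_ples) / len(node_ples)))  # arithmetic mean: 3775
print(round(harmonic_mean(node_ples)))         # harmonic mean:   ~377
```

Either way, an instance-level PLE can mask (or exaggerate) what's happening on an individual node, which is one more reason to look at the per-node numbers when chasing memory pressure.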

But - it gets trickier :-)  Online there are still lots of references to a rule-of-thumb of 300 as a PLE performance threshold.  My response to that particular recommendation, which is now widely recognized as outdated, is that the performance target for PLE is extremely workload and system dependent.  Without a baseline of PLE for a workload on a system, and without other supporting metrics for the same workload on the same system, most decisions based on PLE will really boil down to guesses.

I'll show you what I mean.  Here are two graphs, of the same system running the same ETL (although with daily variation in data) separated by a few weeks.

On the earlier ETL execution, from 2:10 to 3:00 there were up to 100 queries executing at a time, and PLE ranged from just over 100 to almost 400.  Knowing that the old rule-of-thumb for PLE is 300, maybe I'd want to start investigating.  But at the same time - it didn't stay below 300 - so maybe it's not a problem?

Fast forward a few weeks.  Still fewer than 100 concurrent queries - in fact, most of the time from 2:10 to 3:00 there are fewer queries running in this graph than there were before.  But PLE never gets above 100!  Now there's for sure a problem!

Not so fast.  Let's throw in CPU utilization for SQL Server and compare.

Huh.  Quite a bit more CPU utilization in the second, green graph than in the first, red one.  Since the workload is the same (same ETL jobs/packages running in the same order, although threading/dependencies can cause some timing differences), we can also look at request completion as a gauge of the pace of work.

Just a moment ago it looked like the second day's ETL was experiencing a performance problem.  Now it looks like maybe performance was better on the second day - it was able to accomplish more work per unit of time?

Because this particular system has a working set much larger than server RAM, bytes/sec processed can also be used as a measure of work.

Oh.  So, during the first time period the system was experiencing major performance degradation due to sky-high read response times, which limited read bytes/sec.  During the second time period latency was much, much lower and throughput was much higher.

No queries were tuned in the interim - I ain't necessarily the guy you want tuning queries :-)  But we took steps to make the storage accommodate the workload much better.  In turn, the pace of work increased - the pace of query completion increased.  Because data was returning from disk faster, disk IO wait decreased and CPU utilization increased.  Reads were making it into the buffer pool faster, forcing pages to be evicted sooner.  That lowers PLE.  Because it was an ETL, the increased pace of reads allowed an increased rate of logical writes (assuming transactions weren't impeded by high WRITELOG or log buffer waits).  Every database page newly allocated in the buffer pool by writes also worked to lower PLE.
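That chain of cause and effect can be sketched with a steady-state view of the buffer pool (my simplification, with made-up numbers): faster storage raises the fill rate into the pool, which mechanically lowers PLE even though overall performance improved.

```python
def steady_state_ple(pool_mb, fill_mb_per_sec):
    """Rough page lifetime in seconds when the pool turns over at a steady rate."""
    return pool_mb / fill_mb_per_sec

pool = 128 * 1024                       # hypothetical 128 GB buffer pool, in MB
print(round(steady_state_ple(pool, 200)))  # slow storage, 200 MB/s in: ~655 s
print(round(steady_state_ple(pool, 800)))  # fast storage, 800 MB/s in: ~164 s
```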

So in this case, lower PLE was certainly not a symptom of a performance problem.  Rather, it was a sign that efforts to improve system performance were successful. :-)

This can happen often.  Replace an HDD storage system with an all-flash array (AFA) on a system that performs a workload like this?  Everything else being equal, I'd expect IO wait to drop, CPU utilization to increase, and PLE to drop.  That drop in PLE wouldn't be a sign of a performance problem - rather, a sign of the system performance improvement rendered by the AFA.  Similarly, if newer, faster CPUs are applied to the workload with the same amount of RAM, then to the extent that the workload finishes faster I'd expect PLE to drop.

This is a very important type of performance analysis and evaluation for batch-focused workloads, and it sometimes gets lost in the (largely) OLTP-focused world.  PLE and similar system metrics have a whole different meaning for OLTP systems - especially for end-user OLTP systems.  That's because improving system performance for a given end-user system with a fixed number of end users has a hard stopping point: no amount of system tuning can get rid of the minimal 'user think time' required to continue driving the workload :-)  Also, batch workloads can be greatly assisted by readahead.  The smaller queries typical of an OLTP setting reap minimal benefit from readahead, instead relying on data already being in cache to avoid disk IO wait time.  So for OLTP, PLE becomes a much more important concept.

I don't want anyone to ignore PLE - it can be a very important measure.  But it can also be very tricky - gotta have supporting evidence to be sure of what it tells you.