Monday, November 6, 2017

SQL Server 2017 - soft-NUMA limits MAXDOP? Nope.

While at PASS Summit this year (2017), I was in a discussion about the distribution of SQL Server tasks and connections.  It's a common theme with me - I often work with batch workloads, and I want work distributed as evenly as possible over available compute resources, for as long as possible, in order to minimize elapsed workload time.

I mentioned that I was planning to set up a soft-NUMA node for each vcpu on a 16 vcpu VM, to distribute incoming connections - and thus DOP 1 queries - evenly over the vcpus.  Thomas Kejser et al. used this strategy to good effect in "The Data Loading Performance Guide", which was written against SQL Server 2008.
https://technet.microsoft.com/en-us/library/dd425070(v=sql.100).aspx
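
(As a refresher on the connection-placement mechanism - this is a sketch from memory of the general feature, not a quote from the guide: with soft-NUMA in place, connections can additionally be affinitized by appending a node bitmap to the TCP port in SQL Server Configuration Manager.  Something like 1433[0x1],1434[0x2] in the TCP Port box would tie port 1433 to the first node (bit 0 of the mask) and port 1434 to the second (bit 1); see "Map TCP IP Ports to NUMA Nodes" in the documentation for the exact semantics.)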

My conversation partner cautioned me that leaving this soft-NUMA configuration in place after the specialized workload would result in DOP 1 queries whether I wanted them or not.  The claim was, effectively, that a parallel query plan generated by a connection within a soft-NUMA node would have its MAXDOP restricted by that node's scheduler count (if lower than the other MAXDOP-contributing factors).  Though I wasn't able to test at the time, I was skeptical: I'd always thought that soft-NUMA was consequential to connection placement, but not to MAXDOP or to where parallel query workers would be assigned.
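
For context while reading the test, the other MAXDOP-contributing settings in play - the instance-level sp_configure value and the database scoped configuration - can be checked like this (just a convenience check, not part of the original conversation):

SELECT name, value_in_use
FROM sys.configurations
WHERE name = 'max degree of parallelism';

SELECT name, value
FROM sys.database_scoped_configurations
WHERE name = 'MAXDOP';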

I'm back home now... time to test!!  This test is on Microsoft SQL Server 2017 (RC2) - 14.0.900.75, on my dual-core (with HT) laptop.  I've set up soft-NUMA, with one soft-NUMA node for each scheduler/logical CPU.  Notice below that all of the soft-NUMA nodes (in sys.dm_os_nodes) are in a single memory node.  That's actually the single physical NUMA node of my laptop.
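
A quick way to confirm that the soft-NUMA configuration took effect (not shown in the screenshots) is sys.dm_os_sys_info, which reports whether soft-NUMA is off, automatic, or manually configured:

SELECT cpu_count, scheduler_count, softnuma_configuration, softnuma_configuration_desc
FROM sys.dm_os_sys_info;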


So what happens when a parallel query is attempted on this system with soft-NUMA nodes of 1 scheduler each? It still gets DOP 4. And the actual plan shows each of the 4 parallel workers doing some work.
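
If you'd rather not rely on the graphical plan, the runtime DOP can also be pulled from sys.dm_exec_query_stats - a quick sanity check I'll sketch here, assuming the test query text is distinctive enough to filter on:

SELECT TOP (5) qs.last_dop, qs.execution_count, st.text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
WHERE st.text LIKE '%dm_os_nodes%'
ORDER BY qs.last_execution_time DESC;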



For the sake of completeness, I also want to show that work is being distributed across the schedulers in this setup.  Let's create and populate a temp table for the tare - the pre-test baseline of per-scheduler CPU usage.


Now let's run the test query 64 times.


OK, let's take the post-test measure and check the delta.


Excellent. So we've seen that even with single-scheduler soft-NUMA nodes, parallel queries are getting DOP > 1, and work is being distributed across all schedulers (although with short tasks and a single session, not very evenly).  I'm putting the code I used for this test below - although the most interesting part may be setting up soft-NUMA itself.  I'll leave that for another day :-)
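
One more convenience check, not something I captured above: for a live view while a workload like this runs, per-scheduler task counts show the distribution as it happens.

SELECT scheduler_id, current_tasks_count, runnable_tasks_count, active_workers_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';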


Ciao for now!


-- One row per visible online scheduler, joined to its parent soft-NUMA node.
-- With one scheduler per node, each scheduler sits in its own node_id while
-- every node reports the same memory_node_id (the laptop's single physical NUMA node).
SELECT oss.scheduler_id, oss.cpu_id, osn.node_id, osn.memory_node_id, 
       osn.online_scheduler_mask, osn.online_scheduler_count 
FROM sys.dm_os_nodes osn
JOIN sys.dm_os_schedulers oss ON osn.node_id = oss.parent_node_id
WHERE osn.node_state_desc = 'ONLINE'
  AND oss.status = 'VISIBLE ONLINE';

-- Temp table to hold per-scheduler CPU usage snapshots (the tare and the post-test measure).
DROP TABLE IF EXISTS #t;
CREATE TABLE #t
    (capture_tm DATETIME,
     scheduler_id INT,
     total_cpu_usage_ms BIGINT,
     CONSTRAINT pk_#t PRIMARY KEY CLUSTERED 
         (capture_tm, scheduler_id, total_cpu_usage_ms))
WITH (DATA_COMPRESSION = PAGE);

-- Tare: capture per-scheduler CPU usage before the test queries run.
INSERT INTO #t
SELECT getdate() capture_tm, scheduler_id, total_cpu_usage_ms
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';

-- Run the test query 64 times.  Undocumented trace flag 8649 is used here to
-- coax the optimizer into producing a parallel plan for this cheap query.
DECLARE @ij INT = 1, @redirect INT;
WHILE @ij < 65
BEGIN
    SELECT @redirect = SUM(1)
    FROM sys.dm_os_nodes os
    CROSS JOIN sys.dm_os_nodes os2
    OPTION (QUERYTRACEON 8649);
    SET @ij = @ij + 1;
END;

-- Post-test measure: capture per-scheduler CPU usage again.
INSERT INTO #t
SELECT getdate() capture_tm, scheduler_id, total_cpu_usage_ms
FROM sys.dm_os_schedulers WHERE status = 'VISIBLE ONLINE';

-- Delta per scheduler: post-test CPU ms minus the tare.
;WITH max_cte AS (SELECT scheduler_id, 
                         MAX(total_cpu_usage_ms) max_ms 
                  FROM #t GROUP BY scheduler_id),
      min_cte AS (SELECT scheduler_id, 
                         MIN(total_cpu_usage_ms) min_ms 
                  FROM #t GROUP BY scheduler_id)
SELECT max_cte.scheduler_id, 
       max_cte.max_ms - min_cte.min_ms AS delta_cpu_usage_ms
FROM max_cte 
JOIN min_cte ON max_cte.scheduler_id = min_cte.scheduler_id;

