Thursday, February 14, 2013

FCoE gives me MSSQL headaches!

Converged networking is all the rage.  I personally am a fibre channel guy (maybe I just like the word 'lossless'), but we've done some testing on FCoE due to interest from others.  The testing has not gone well.

Under moderate load from our ETL, IO-related errors occur.  Eventually, pending writes reach the system limit and SQL Server crashes.  Changing nothing other than the data network protocol allows the workflow to complete.  The errors are not reproducible with batch report workloads, only with the ETL workload.
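
For anyone who wants to watch the pending writes stack up while the ETL runs, a query along these lines - my own sketch, not part of the original test harness - lists the IOs SQL Server has issued but the OS hasn't yet completed:

~~~~~~
-- My sketch, not from the original testing: list outstanding IO requests
-- and how long each has been pending, resolved to the target file.
SELECT  DB_NAME(vfs.database_id)   AS database_name,
        mf.physical_name,
        pio.io_type,
        pio.io_pending,             -- 1 = still pending at the OS level
        pio.io_pending_ms_ticks     -- how long this IO has been outstanding
FROM    sys.dm_io_pending_io_requests AS pio
JOIN    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
            ON pio.io_handle = vfs.file_handle
JOIN    sys.master_files AS mf
            ON  mf.database_id = vfs.database_id
            AND mf.file_id     = vfs.file_id
ORDER BY pio.io_pending_ms_ticks DESC;
~~~~~~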

Although the errors listed below mention corruption as a possible cause, dbcc checkdb never indicated any type of corruption remaining after the system was recovered.

Here's the matrix of what we've tested.  Every FCoE test failed on a physical server and succeeded on VMware.  Fibre channel succeeded in every test.  The iSCSI protocol was not thoroughly tested, but some workflows were run against it and did not present the error conditions.

(Physical Server, VMware ESXi Hypervisor)
*
(Emulex CNA, Cisco CNA)
*
(Windows Server 2008 R2 SP1, Windows Server 2012)
*
(SQL Server 2008 R2, SQL Server 2012)


My colleague describes one of the first bouts with this condition at the following link, including discussion of the test scenario and some of the hardware setups we've worked with.
http://bit.ly/12HAMVN

The failed sessions start their anguish with a report of delayed IO return, even though perfmon shows good IO response times, and SQLIO tests of the storage subsystem showed good response times at IO rates above the peak activity observed during testing.
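
A quick way to cross-check perfmon from inside SQL Server is the virtual file stats DMV; this is just an illustrative query (not part of the original investigation) showing average read and write stall per database file:

~~~~~~
-- Illustrative only: average IO stall per database file as SQL Server sees it.
-- NULLIF avoids divide-by-zero for files with no reads or writes yet.
SELECT  DB_NAME(vfs.database_id)   AS database_name,
        mf.physical_name,
        vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
        vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms,
        vfs.num_of_reads,
        vfs.num_of_writes
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN    sys.master_files AS mf
            ON  mf.database_id = vfs.database_id
            AND mf.file_id     = vfs.file_id
ORDER BY avg_write_ms DESC;
~~~~~~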

The order of events in the SQL Server error log, with excerpted examples below:
1. Delayed IO message 833.  (Although the message indicates the C drive, the database file is on a mounted volume at directory c:\db\mydb\8.)
2. FlushCache message due to a delayed checkpoint.
3. Latch timeouts.
4. Failed checkpoint leads to an attempted rollback.
5. Rollback fails.


~~~~~~
SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [c:\db\mydb\8\MyDB_08.ndf] in database [MyDB_24] (5).  The OS file handle is 0x0000000000000DD8.  The offset of the latest long I/O is: 0x000001a8ef0000
2012-10-09 16:04:44.15 spid16s     FlushCache: cleaned up 102633 bufs with 27634 writes in 346041 ms (avoided 488 new dirty bufs) for db 5:0
2012-10-09 16:04:44.15 spid16s                 average throughput:   2.32 MB/sec, I/O saturation: 18149, context switches 22248
2012-10-09 16:04:44.15 spid16s                 last target outstanding: 4800, avgWriteLatency 0

2012-10-09 16:11:48.00 spid116     A time-out occurred while waiting for buffer latch -- type 3, bp 000000190555A5C0, page 3:2094792, stat 0x1010f, database id: 5, allocation unit Id: 6488064, task 0x0000001596308928 : 0, waittime 300 seconds, flags 0x19, owning task 0x000000178E73C188. Not continuing to wait.

2012-10-09 16:11:50.04 spid72      A time-out occurred while waiting for buffer latch -- type 4, bp 0000001939A62C40, page 3:11267982, stat 0x1010f, database id: 5, allocation unit Id: 72057595887288320, task 0x0000001814E05498 : 0, waittime 300 seconds, flags 0x19, owning task 0x000000178E73C188. Not continuing to wait.

2012-10-09 16:12:56.66 spid90      A time-out occurred while waiting for buffer latch -- type 2, bp 000000192ACEA380, page 10:4012592, stat 0x40d, database id: 5, allocation unit Id: 536870912/281475513581568, task 0x000000146EB48188 : 12, waittime 300 seconds, flags 0x3a, owning task 0x000000146EB48188. Not continuing to wait.

2012-10-09 16:17:33.69 spid60      Timeout occurred while waiting for latch: class 'ACCESS_METHODS_DATASET_PARENT', id 00000017D7C426C0, type 4, Task 0x00000012EE1A8928 : 31, waittime 300 seconds, flags 0x1a, owning task 0x00000017D7F7E558. Continuing to wait.

~~~~~~



Some background information on checkpoint, for my own future reference.

http://blogs.msdn.com/b/joaol/archive/2008/11/20/sql-server-checkpoint-problems.aspx
http://blogs.msdn.com/b/psssql/archive/2008/04/11/how-it-works-sql-server-checkpoint-flushcache-outstanding-i-o-target.aspx
http://blogs.msdn.com/b/psssql/archive/2012/06/01/how-it-works-when-is-the-flushcache-message-added-to-sql-server-error-log.aspx
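
As the last post above explains, the FlushCache message only shows up on its own when a checkpoint outruns the recovery interval.  If I recall correctly (verify before relying on it), trace flag 3504 makes SQL Server write checkpoint detail to the error log for every checkpoint, which helps when watching for this pattern:

~~~~~~
-- My recollection, not something covered in this post's testing: trace flag
-- 3504 writes checkpoint/FlushCache detail to the error log on every checkpoint.
DBCC TRACEON (3504, -1);    -- -1 = instance-wide
-- ...run the ETL workload and review the SQL Server error log...
DBCC TRACEOFF (3504, -1);
~~~~~~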


Poor storage performance could cause something like this - it's the usual cause of the delayed IO messages.  But there is no evidence of a performance problem on the storage.  The guy who configured the storage knows what he's doing, from a general storage config standpoint and certainly from a SQL Server standpoint.  Anyone can make a mistake, but after weeks of wondering about this, if there was an easy-to-catch storage config error, he would have found it.
 
Database corruption could cause a problem like this.  But dbcc checkdb shows nothing.  And, finally - simply switching to fibre channel access to the same disks fixes the problem.
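For reference, the integrity check was nothing exotic; something along these lines against the affected database (name taken from the error messages above, options are just my usual choices) kept coming back clean:

~~~~~~
-- Sketch of the post-recovery integrity check; the options are my usual
-- defaults, not necessarily the exact ones used in the original testing.
DBCC CHECKDB ('MyDB_24') WITH NO_INFOMSGS, ALL_ERRORMSGS;
~~~~~~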
By cranking up the logging level of the Emulex driver, we were able to get an IO error out of the Windows system log.  Before doing that, there was no indication of trouble in the Windows log at all.
 

I used this 97-page driver documentation PDF as the decoder ring.
On page 72 you'll find the beginning of:
Appendix A Error/Event Log Information, FC/FCoE Error/Event Logs

Here's what we got out of the system error log.

~~~~~~
Log Name:      System
Source:        elxcna
Date:          10/17/2012 2:31:55 PM
Event ID:      11
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      R710-07.domain.com
Description:   The driver detected a controller error on \Device\RaidPort1.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="elxcna" />
    <EventID Qualifiers="49156">11</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2012-10-17T19:31:55.129962700Z" />
    <EventRecordID>6255</EventRecordID>
    <Channel>System</Channel>
    <Computer>R710-07.domain.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data>\Device\RaidPort1</Data>
    <Binary>0F00180001000000000000000B0004C0AD02000000000000000000000000000000000000000000000000000000000000000000000B0004C00000000000000000</Binary>
  </EventData>
</Event>
~~~~~~
The Emulex codes had me thinking for a moment that there was a physical problem with the cabling - until the same workload succeeded on the same server with SQL Server running on VMware, accessing the storage through FCoE.
We've pulled in Emulex and Microsoft.  No resolution yet.  Failures reproduced with Cisco CNAs as well.  So - what's up?  
Anyone else seen such issues with FCoE and SQL Server?


3 comments:

  1. Yea... same error here on an HP BL460c blade hosting Win2k8r2 and with SQL 2008 R2. Happens outside of any patterns and only a few times a month.
    Firmware Version: 4.6.247.5
    Driver Version: 2.74.14
    Fiber Switch is HP Brocade 510
    OS Version 7.1.1

  2. Hello Steve! Wish I could say we've diagnosed or fixed this... but we've found something of a workaround for now: we increased the max size of the transaction log. Before the latch timeout, we noticed a tremendous number of attempted log truncations (this db is in simple recovery mode). By increasing the maximum txlog size, it seems like we bought ourselves a little more time and kept the problem from progressing - allowing the system enough time to recover.
    Unfortunately we have no idea how to predict when the txlog max size is insufficient... so we resort to increasing it by ~8000 MB whenever we run into the error condition.
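
    For anyone following along, the workaround boils down to something like this - the logical log file name and sizes below are placeholders, so check sys.database_files for the real name first:

    ~~~~~~
    -- Placeholder names and sizes; sketch of the workaround described above.
    -- First, see what (if anything) is holding up log truncation:
    SELECT name, log_reuse_wait_desc
    FROM   sys.databases
    WHERE  name = 'MyDB_24';

    -- Then give the log more headroom by raising its MAXSIZE:
    ALTER DATABASE [MyDB_24]
        MODIFY FILE (NAME = MyDB_24_log, MAXSIZE = 64000MB);
    ~~~~~~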

  3. Hey SQL_Sasquatch, any luck now that we're moving on almost 2 years! =) We still receive the errors and the frequency just picked up again.
