Saturday, April 26, 2014

Database Performance Investigation/Intervention, MTPoD, Last Resorts

In business continuity and disaster recovery planning, one of my favorite foundational ideas is Maximum Tolerable Period of Disruption, or MTPoD. Although most effort and budget will go into defining and implementing RPO (recovery point objective) and RTO (recovery time objective), having a settled MTPoD during a crisis can guide the team and its resources, and it gives some level of confidence when a last resort is considered or employed.

Although most of my work these days focuses on performance and scalability rather than BC or DR, I'm a "last resort" kinda guy - and this is a "last resort" kinda blog. I didn't plan for this to be my position, but I ain't complainin' either :-)

See, there is room - even a need, I believe - in performance and scalability work for "last resorts". In order to be confident when considering performance/scalability last resorts, I believe MTPoD is very important.

An ideal performance/scalability investigation and intervention may look like this:
1. User feedback, task monitoring, or system monitoring triggers investigation.
2. Activity, performance, resource utilization, error, and change logs are reviewed and compared between baseline and problem contexts.
3. Potential suspects are identified, and additional diagnostic monitoring may be put in place in production.
4. Problem recreation is attempted in a nonproduction environment.
5. If initial nonprod recreation attempts are unsuccessful, the additional production diagnostics may give more insight into reproducing the problem.
6. If problem is reproduced in nonproduction, further diagnostics can be performed there. That's important because sometimes conclusive diagnostics are too invasive or consume too many resources for production environments.
7. At this point, production or nonprod diagnostics may have yielded a tentative, or even conclusive, diagnosis. Corrective actions can then be implemented and validated in nonproduction.
8. Even if the issue cannot be reproduced in nonproduction, the absence of harmful side effects of potential correctives, such as a SQL Server service pack (SP) or cumulative update (CU) install, can be tested in nonprod.
9. Correctives can be promoted to production per change control processes.
10. Correctives in nonprod and prod are evaluated against expectations. Process iterates to step 2 if necessary.
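
To make step 2 a little more concrete, here is a minimal sketch of comparing a baseline window against a problem window. It is only an illustration, not a tool I'm describing from any real engagement: the metric names, the sample values, and the 1.5x threshold are all assumptions, and real comparisons need far more context than a single ratio.

```python
import math

def compare_to_baseline(baseline, problem, threshold=1.5):
    """Return (metric, ratio) pairs where the problem-window value differs
    from the baseline value by at least `threshold`x in either direction,
    ordered with the largest relative deviation first."""
    suspects = []
    for name, base in baseline.items():
        value = problem.get(name)
        # Skip metrics missing from the problem window, and non-positive
        # values, since a ratio is not meaningful for them in this sketch.
        if value is None or base <= 0 or value <= 0:
            continue
        ratio = value / base
        if ratio >= threshold or ratio <= 1 / threshold:
            suspects.append((name, ratio))
    # abs(log(ratio)) treats "5x higher" and "5x lower" as equally large.
    suspects.sort(key=lambda item: abs(math.log(item[1])), reverse=True)
    return suspects

# Hypothetical metric names and values, purely for illustration.
baseline = {"batch_requests_per_sec": 1200.0,
            "avg_read_latency_ms": 4.0,
            "cpu_percent": 55.0}
problem = {"batch_requests_per_sec": 1150.0,
           "avg_read_latency_ms": 22.0,
           "cpu_percent": 30.0}

suspects = compare_to_baseline(baseline, problem)
for name, ratio in suspects:
    print(f"{name}: {ratio:.2f}x baseline")
```

A list like this only points at where to look next (step 3); it doesn't diagnose anything by itself.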

But... sometimes it's even uglier than that.  Sometimes, system behavior is truly unhealthy... despite pulling in experts for weeks... gathering lots of diagnostic data... addressing evident tuning opportunities... the system is still unhealthy.

I'm an advocate for understanding database system behavior - healthy and unhealthy.  But I also know that there are times for last resorts... sometimes service disruption comes near to MTPoD. I've been there. And I've called for action... sometimes in a room of folks who look at my whiteboard scribbles and think I may not be in my right mind... and sometimes while other experts are voicing the protest that they have never "seen that work".  

I haven't always been right in those situations. I believe in stating my level of confidence, and not sugar-coating the risks.  But often, when I'm involved it's because the situation is already kinda desperate.

And, if I've had access to the diagnostics I ask for... I feel pretty good about my track record of being right :-)

Database performance and scalability investigation/intervention is tough work.  It's complex, in both breadth and depth.  It's risky.  Ask for a downtime to implement a "fix"... and if the benefit isn't evident afterward you may have burned a lot of trust.  (That's one reason I believe in expressing confidence level for both diagnoses and remedies.)  And if MTPoD is approaching, standard process may need to be suspended: maybe possible correctives need to be implemented even though potential diagnoses are highly speculative, maybe exceptions to normal change control need to be employed.  Having a sense of MTPoD can minimize regret when performance investigation/intervention deviates from standard operating procedure.*

*When an extraordinary "fix" resolves an issue that had no strong suspect for diagnosis, I do recommend continuing to pursue diagnosis in post-mortem analysis.  Sometimes, though, the nature of the previous problem won't be uncovered without an unreasonable cost.
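
One way to picture "MTPoD is approaching" is as simple arithmetic on a clock. The sketch below assumes an agreed MTPoD, a known start of the disruption, and a rough estimate of how long a last-resort corrective takes to implement. The function names, the safety margin, and all of the dates and durations are made up for illustration; they are not from any real incident.

```python
from datetime import datetime, timedelta

def last_resort_deadline(disruption_start, mtpod, time_to_implement,
                         safety_margin=timedelta(0)):
    """Latest moment a last-resort corrective can be started and still be
    finished (plus a safety margin) before MTPoD is reached."""
    return disruption_start + mtpod - time_to_implement - safety_margin

def should_escalate(now, disruption_start, mtpod, time_to_implement,
                    safety_margin=timedelta(0)):
    """True once the decision point for a last resort has arrived."""
    deadline = last_resort_deadline(disruption_start, mtpod,
                                    time_to_implement, safety_margin)
    return now >= deadline

# Hypothetical numbers: disruption began Monday 08:00, MTPoD is 72 hours,
# the last-resort change needs about 6 hours, and we want 2 hours of slack.
start = datetime(2014, 4, 21, 8, 0)
deadline = last_resort_deadline(start,
                                mtpod=timedelta(hours=72),
                                time_to_implement=timedelta(hours=6),
                                safety_margin=timedelta(hours=2))
print("Decide on the last resort no later than:", deadline)
```

The point isn't the arithmetic. It's that with MTPoD settled ahead of time, the decision to deviate from standard process has a defensible trigger instead of being a judgment call made under pressure.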

