Friday, August 30, 2013

IBM fixes "minor Enhanced Affinity Bug"

While doing some late-night reading, I came across the "minor Enhanced Affinity Bug" described below.  I expect this defect is one of many that can cause paging space writes even when there seems to be enough free server memory to avoid them.

My recommendation to IBM: go ahead and expose the memory pool/frameset numbers in vmstat, or some other utility that does not require root privileges to return its information.  With a good tool and a clear explanation, there are lots and lots of sharp AIX administrators who could help catch and correct these issues much earlier.

One alternative is to fix these issues one at a time and slowly, in part due to the difficulty in obtaining the stats that would speed up full diagnosis.  Another alternative is to continue trying to address these issues with Active System Optimizer, Dynamic System Optimizer (the separately licensed extension module for ASO), Dynamic Platform Optimizer, and the many kernel parameters and environment variables which each slightly tune a specific facet of memory management. 

I can't recommend that critical systems enter a cycle of an unknown number of iterations of defect trigger, diagnosis, intervention, and evaluation, especially since the intervention may require installing a fix, or changing a kernel parameter that needs a bosboot and reboot to take effect.

I also can't recommend addressing an issue that arises from a fairly complex set of factors by adding something to the mix which is itself complex and fairly young. Consider that the 7 APARs listed at the end of this post are all contained in one service pack - AIX 6.1 TL8 SP2.  That's a lotta fixes for one daemon.

If I've got Big Blue's attention, I'll share a secret: Oracle's engineered systems will mop the floor in a comparison test with Oracle running on a Power system that is experiencing a lot of paging space traffic. 

It's not unusual for a recently configured system to choke severely under moderate use of paging space disk.  QFULLs may be seen on the paging space pvol(s) under moderate use.  Additional degradation can result from psbuf waits and even free frame waits.

I've got nothing against engineered systems in general or Oracle's engineered systems in particular: for a given workload, let the best system strut its stuff.  But if the Power system is choking on paging space, it won't even be a fair fight.  Let's help folks make their IBM Power memory management predictable, and I bet real world performance will be quite grateful.


****

Append an APAR ID to the following base URL to get that APAR's documentation page:
http://www-01.ibm.com/support/docview.wss?uid=isg1
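For anyone looking up several of these at once, here is a minimal sketch of the "append the APAR ID" step.  The helper name apar_url is my own invention, not anything IBM provides, and the sketch only builds the string; it does not check that the page exists.

```python
# Base URL as given above; the APAR ID is appended directly to it.
BASE_URL = "http://www-01.ibm.com/support/docview.wss?uid=isg1"


def apar_url(apar_id: str) -> str:
    """Return the documentation URL for an APAR ID such as 'IV27739'.

    apar_url is a hypothetical helper name used only for illustration.
    """
    return BASE_URL + apar_id.strip().upper()


if __name__ == "__main__":
    # Two of the APAR IDs listed in this post.
    for apar in ("IV28494", "IV27739"):
        print(apar_url(apar))
```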

****

Fix Minor Enhanced Affinity Bug

sql_sasquatch defect description:
The ra_attach system call will, when able, allocate physical memory from the domain local to the thread.  If necessary, memory from a "near" domain can be used for an ra_attach allocation.  However, due to a bug, the first domain (0) is not considered "near" any other domain.  This may result in paging space writes when domains other than 0 are under pressure - even if domain 0 is near the stressed domain and has sufficient free memory for the allocation.

6100-06 IV28494
6100-07 IV28320
6100-08 IV27739
7100-00 IV28830
7100-01 IV29045
7100-02 IV27797
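
To make the defect description above concrete, here is a toy model of the "local first, then near" choice.  This is not AIX kernel code and says nothing about how ra_attach is actually implemented; pick_domain, the buggy flag, and the frame counts are all made up for illustration.  It only shows why skipping domain 0 as a "near" candidate can force paging when free memory is available.

```python
def pick_domain(local, free_frames, near, needed, buggy=False):
    """Pick a memory domain for an allocation, or None if none qualifies.

    Hypothetical model, not the AIX implementation:
      local       -- domain id of the requesting thread
      free_frames -- dict mapping domain id -> free frame count
      near        -- dict mapping domain id -> list of "near" domain ids
      needed      -- frames required by the allocation
      buggy       -- if True, domain 0 is never treated as "near"
    """
    # Prefer the domain local to the thread.
    if free_frames[local] >= needed:
        return local
    # Otherwise fall back to a near domain with enough free frames.
    for candidate in near[local]:
        if buggy and candidate == 0:
            continue  # models the defect: domain 0 is skipped
        if free_frames[candidate] >= needed:
            return candidate
    # No domain can satisfy the request; in this model that means paging.
    return None


if __name__ == "__main__":
    free = {0: 1000, 1: 10}   # domain 1 is under pressure, domain 0 is not
    near = {0: [1], 1: [0]}
    print(pick_domain(1, free, near, 100, buggy=True))   # None
    print(pick_domain(1, free, near, 100, buggy=False))  # 0
```

In the buggy case the request from domain 1 finds no usable domain even though domain 0 has plenty of free frames, which is the paging-with-free-memory symptom described above.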

****

The following AIX 6100-08 APARs address core dumps or other significant defects in the ASO daemon (Active System Optimizer).  Because the Dynamic System Optimizer module extends ASO, I assume DSO is also compromised by these defects.

IV26296 Incorrect assert in SystemUtility_is_sid_in_pidspace
IV26301 ASO coredump at gmap_delete
IV26807 ASO core at AsyncWorker_add_work
IV26808 ASO core dump in StrategyBag_getLength
IV26810 ASO loops in large page job creation
IV27163 ASO core in large page job creation
IV35517 ASO core dump in pstatus_update_hotsegs when retrying MPSS op


*****
I'm gonna throw this one on at the end because otherwise I keep losing track of it.



IV10657: SRAD LOAD BALANCING ISSUES ON SHARED LPARS APPLIES TO AIX 6100-08

If tasks are distributed unevenly and it's a problem... new memory allocations are probably also unevenly distributed, resulting in another problem :)
 
