On October 14, 2015, Cisco released Field Notice 64029, then updated it on November 3, 2015. Here’s everything you need to know about this issue, which affects Cisco UCS C220 M4 and C240 M4 servers. The field notice addresses and expands upon Cisco Bug CSCuv33991.
Lie down on the couch and tell me all your problems
For starters: Cisco UC applications deployed on these servers, which are the Trusted Reference Configurations (TRCs) used by most deployments, can face severe performance issues. The potentially affected models include:
- UCMBE7K – BE7H-M4-K9= (based on UCS C240 M4SX)
- UCMBE7K – BE7H-M4-XU= (based on UCS C240 M4SX)
- UCMBE7K – BE7M-M4-K9= (based on UCS C240 M4S2)
- UCMBE7K – BE7M-M4-XU= (based on UCS C240 M4S2)
- UCMBE6K – BE6H-M4-K9= (based on UCS C220 M4S)
- UCMBE6K – BE6H-M4-XU= (based on UCS C220 M4S)
- UCMBE6K – BE6M-M4-K9= (based on UCS C220 M4S)
- UCMBE6K – BE6M-M4-XU= (based on UCS C220 M4S)
All such servers shipped from initial production in fall 2014 through late July 2015 suffer from a RAID configuration error (an unfortunate mistake) that results in extremely poor disk performance.
Many Cisco UC applications are ultimately storage-bound, and the nature of the RAID misconfiguration leads to increasingly worse performance as additional load is put on the disk array. This poor disk performance can lead to a laundry list of issues, especially under load. Cisco’s field notice specifically mentions issues with:
- Fresh installs, upgrades, backups or call detail record exports that take several hours longer than expected (the bug ID clarifies taking six to seven times longer)
- Dropped calls that occur during high call volumes, upgrades, backups or call detail record exports
While Cisco doesn’t call them out specifically, the poor storage performance can affect other storage-bound activities leading to these other possible issues:
- Unexpected device registration drops
- Application pauses that have unpredictable and negative effects on other applications (e.g., Contact Center)
- Significantly increased execution times for bulk administration jobs, provisioning activities using AXL API calls and device firmware upgrades
These are serious problems that any affected Cisco UC customer should make plans to correct as soon as possible.
Let’s get physical, physical — with your storage and RAID configurations
At root, the misconfigurations at Cisco Manufacturing that led to the performance issues revolve around the RAID controller. It takes some pretty in-depth knowledge of storage systems to understand exactly what is wrong and “get” the impact, but let’s do the high-level overview.
Quick Catch-up: RAID stands for Redundant Array of Independent Disks. Many disks act as one logical storage device, leading to an increase in speed and/or redundancy. Cisco uses the level type RAID 5 to provide a mix of both speed and redundancy. The RAID controller “stripes” data across multiple disks in small chunks, along with a parity calculation to allow for continued operation and data reconstruction in the event of a single-disk failure.
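To make the parity idea concrete, here is a minimal sketch in Python of how a single lost strip can be rebuilt from the surviving strips plus parity. It assumes simple XOR parity (the standard RAID 5 approach) and uses toy 4-byte strips with made-up contents; a real controller does this in hardware on much larger strips.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data strips of one stripe (toy sizes and contents, for illustration only).
data_strips = [b"UCM1", b"CDR2", b"LOG3"]

# The parity strip is the XOR of all data strips in the stripe.
parity = xor_blocks(data_strips)

# Simulate losing the disk that held the second strip, then rebuild it
# by XOR-ing everything that survived (remaining data plus parity).
survivors = [data_strips[0], data_strips[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_strips[1])
```

Because XOR is its own inverse, the same routine both computes parity and reconstructs missing data. That is also why every write needs a parity update, which is part of what makes write caching so valuable on RAID 5.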
The RAID controller (supplied by Avago Technologies, which acquired LSI in May 2014) can further optimize storage access with a few optional settings. The three relevant settings are (in a nutshell):
- Read Policy: When the operating system requests to read data from storage, the RAID controller will instruct individual disks to retrieve data. The data in question may be on a single disk in the array, or may be spread across multiple disks (“striped”). Since a disk is a spinning metal platter, it is quite efficient for the read head of the disk, as it passes over particular sectors, to grab additional sectors on the assumption that the operating system may require additional sequential data. It is also quite efficient for the RAID controller to pull the entire data stripe containing the specific requested data off of all disks in the array. This data is loaded to the RAID controller cache and is thus much more readily available (from both a bandwidth and latency perspective) than another read from the disk. This can drastically increase read performance. Unfortunately, the affected servers shipped with this feature disabled.
- Write Policy: When the operating system requests to write data to storage, the RAID controller will instruct individual disks to record the data and will calculate parity for redundancy. The operating system will wait for confirmation that the data has been written to disk before moving along to its next task. The RAID controller can optimize this process by writing to its local memory cache and reporting the write to the operating system, then writing the data out to disk on its own time. This option is safest to enable when a battery backup unit (BBU) or capacitor is available to maintain the (volatile) memory cache in the event of sudden power loss. This can drastically increase write performance. Unfortunately, the affected servers shipped with this feature disabled.
- Strip Size: When the RAID controller creates an array, it must be configured with the amount of data to write to each member disk before moving on to the next disk. This is the “strip size,” which in turn determines the “stripe size” (the data laid across all disks, with parity calculation). Different values of strip (and therefore stripe) size lead to different performance characteristics. In general, larger strip/stripe sizes lead to better sequential performance, while smaller strip/stripe sizes lead to better random performance. Cisco uses a balanced approach with a 128 KB strip size. Unfortunately, the affected servers shipped with a smaller 64 KB strip size, leading to somewhat suboptimal sequential performance.
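The strip versus stripe arithmetic is easier to see with numbers. The sketch below, in Python, compares the shipped 64 KB strip size with the intended 128 KB. The disk count is a hypothetical value chosen for illustration (the array size is not stated here), and the helper names are ours, not anything from Cisco or the controller vendor.

```python
KB = 1024
MB = 1024 * KB

def stripe_data_size(strip_size, disks):
    """Data bytes in one RAID 5 stripe: one strip per disk, less one strip of parity."""
    return strip_size * (disks - 1)

def strips_touched(request_size, strip_size):
    """Strips an aligned sequential request spans (integer ceiling division)."""
    return (request_size + strip_size - 1) // strip_size

DISKS = 8          # hypothetical array size, for illustration only
REQUEST = 1 * MB   # a sequential 1 MB read or write

for strip_kb in (64, 128):
    strip = strip_kb * KB
    print(
        f"{strip_kb} KB strip: "
        f"{stripe_data_size(strip, DISKS) // KB} KB of data per stripe, "
        f"{strips_touched(REQUEST, strip)} strips per 1 MB request"
    )
```

Halving the strip size doubles the number of strips a sequential request touches, which is one way to see why the 64 KB setting gives up some sequential performance.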
Ultimately it all comes down to storage bandwidth and latency. The blog Coding Horror has an excellent and amusing explanation of “The Infinite Space Between Words” that relates to how long a disk read or write event takes from the perspective of the CPU and operating system. Anything that can be done to optimize read/write performance can have huge storage performance implications. Unfortunately, the affected servers suffer from misconfigurations that severely degrade storage performance.
I got a fever, and the only prescription is more cowbell
Cisco’s field notice describes the corrective actions that resolve these issues and breaks them into two options.
Cisco’s Option A is the easy stuff — fixing the read and write policies in the RAID management controller, which is non-destructive.
Cisco’s Option B is the hard stuff — fixing the strip size in the RAID management controller, which is destructive.
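As a rough illustration of how the three settings map onto the two options, here is a small Python sketch. The setting names and the dictionary layout are hypothetical, invented for this example; they are not the labels shown in CIMC or the field notice. The target values reflect the description above: read-ahead on, write caching on, 128 KB strip size.

```python
# Target configuration as described in this article (key names are hypothetical).
EXPECTED = {"read_ahead": True, "write_back_cache": True, "strip_size_kb": 128}

# Which option corrects each setting: A is non-destructive, B is destructive.
OPTION_FOR = {"read_ahead": "A", "write_back_cache": "A", "strip_size_kb": "B"}

def audit(settings):
    """Return {setting: option} for every setting that differs from EXPECTED."""
    return {
        name: OPTION_FOR[name]
        for name, wanted in EXPECTED.items()
        if settings.get(name) != wanted
    }

# The configuration the affected servers shipped with, per the article.
shipped = {"read_ahead": False, "write_back_cache": False, "strip_size_kb": 64}

for name, option in audit(shipped).items():
    kind = "non-destructive" if option == "A" else "destructive"
    print(f"{name}: fix with Option {option} ({kind})")
```

The values you feed into a check like this would have to come from the server’s own RAID configuration screens; the sketch only shows the decision logic.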
We have a plan
The serious performance issues described in Cisco Field Notice 64029 put the integrity of any Cisco UC solution using affected servers at significant risk.
While Cisco recommends corrective action Option A for all customers, it recommends Option B only for customers “that continue to experience issues after Option A has been implemented.” Ultimately, performing the steps in Option B brings the system up to its designed peak performance. This is important for any system with mission-critical applications or expected growth that will take it near its design performance limits.
We have a plan.
CDW has documented the process to quickly and efficiently identify affected UCS servers via both Cisco’s serial number portal and the potentially affected UCS server’s Cisco Integrated Management Controller (CIMC) web GUI. CDW has also developed and tested engineering best practices to quickly and efficiently implement both Option A and Option B recommendations to rebuild affected UCS servers. The VMware ESXi host and all application VM settings are fully preserved. These practices are lab validated and already field tested.
Next steps and making lemons into lemonade
We cannot emphasize enough the importance of this issue: The serious performance issues described in Cisco Field Notice 64029 put the integrity of any Cisco UC solution using affected servers at significant risk.
It’s important for any Cisco UC customer to take some next steps to resolve these issues:
- Verify whether your servers are affected.
- If servers are affected, work with your CDW account manager to schedule services to perform a quick and efficient remediation.
- Consider an appropriate maintenance window to have CDW engineers perform remediation tasks based on your business needs, environments and applications.
While the system is “up on the lift,” so to speak, it’s a perfect time to consider co-scheduling other UC system maintenance, such as minor application upgrades, device package installations, server and phone firmware upgrades, and voice gateway software upgrades.
Contact your account manager for additional information. If you don’t have an account manager, use this form to get connected.