On October 14, 2015, Cisco released Field Notice 64029, with updates following on November 3, 2015. Here’s everything you need to know about this issue, which affects Cisco UCS C220 M4 and C240 M4 servers. The field notice addresses and expands upon Cisco bug CSCuv33991.

What You Should Know About Cisco Field Notice 64029 for Unified Computing Systems
Cisco publishes field notices to inform customers and partners of upgrades, workarounds or other customer actions to address critical, significant, non-security-vulnerability product issues. These words — the generic boilerplate that years of blindly accepting EULAs, clicking through legalese, etc., have made into a haze — may not mean much to you, but for many Cisco UC customers, they have recently gained critical import.

Lie down on the couch and tell me all your problems

For starters: Cisco UC deployed on these servers, the Tested Reference Configurations (TRCs) used by most deployments, can face severe performance issues. The potentially affected models include:

  • UCMBE7K – BE7H-M4-K9= (based on UCS C240 M4SX)
  • UCMBE7K – BE7H-M4-XU= (based on UCS C240 M4SX)
  • UCMBE7K – BE7M-M4-K9= (based on UCS C240 M4S2)
  • UCMBE7K – BE7M-M4-XU= (based on UCS C240 M4S2)
  • UCMBE6K – BE6H-M4-K9= (based on UCS C220 M4S)
  • UCMBE6K – BE6H-M4-XU= (based on UCS C220 M4S)
  • UCMBE6K – BE6M-M4-K9= (based on UCS C220 M4S)
  • UCMBE6K – BE6M-M4-XU= (based on UCS C220 M4S)

 

All such servers shipped from initial production, starting in Fall 2014 through late July 2015, suffer from a RAID configuration error that results in extremely poor disk performance.

Many Cisco UC applications are ultimately storage bound, and the nature of the RAID misconfiguration leads to disproportionately worse performance as additional load is put on the disk array. This poor disk performance can lead to a laundry list of issues, especially under load. Cisco’s field notice specifically mentions issues with:

  • Fresh installs, upgrades, backups or call detail record exports that take several hours longer than expected (the bug report clarifies that these can take six to seven times longer)
  • Dropped calls that occur during high call volumes, upgrades, backups or call detail record exports

While Cisco doesn’t call them out specifically, the poor storage performance can affect other storage-bound activities, leading to these other possible issues:

  • Unexpected device registration drops
  • Application pauses that have unpredictable and negative effects on other applications (e.g., Contact Center)
  • Significantly increased execution times for bulk administration jobs, provisioning activities using AXL API calls and device firmware upgrades

These are serious problems that any affected Cisco UC customer should make plans to correct as soon as possible.

Let’s get physical, physical — with your storage and RAID configurations

At root, the misconfigurations at Cisco Manufacturing that led to the performance issues revolve around the RAID controller. It takes some pretty in-depth knowledge of storage systems to understand exactly what is wrong and “get” the impact, but let’s do the high-level overview.

Quick Catch-up: RAID stands for Redundant Array of Independent Disks. Many disks act as one logical storage device, providing an increase in speed and/or redundancy. Cisco uses RAID 5 to provide a mix of both speed and redundancy. The RAID controller “stripes” data across multiple disks in small chunks, along with a parity calculation that allows for continued operation and data reconstruction in the event of a single-disk failure.
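To make the parity idea concrete, here is a small, purely illustrative Python sketch (a toy three-data-strip layout, not anything resembling Cisco’s or Avago’s actual firmware). The parity strip is simply the XOR of the data strips, so any single lost strip can be rebuilt from the survivors:

    # Illustrative only: toy RAID 5 parity math on three data strips plus one parity strip.
    from functools import reduce

    def xor_parity(strips):
        """XOR equal-length strips together byte-by-byte to produce a parity strip."""
        return bytes(reduce(lambda a, b: a ^ b, byte_group) for byte_group in zip(*strips))

    # Pretend each "disk" holds one 8-byte strip of a single stripe.
    data_strips = [b"CUCM-VM1", b"CUC--VM2", b"UCCX-VM3"]
    parity = xor_parity(data_strips)

    # Simulate losing the second disk: XOR the surviving strips with the parity
    # strip to rebuild the missing data.
    rebuilt = xor_parity([data_strips[0], data_strips[2], parity])
    assert rebuilt == data_strips[1]
    print("Rebuilt strip from parity:", rebuilt)

That is the redundancy half of the story; the speed half comes from spreading reads and writes across all of the spindles at once.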

 

A high-level diagram of the RAID setup of a Cisco C220 M4-based BE6M or BE6H. The RAID setup of the C240 M4-based BE7M and BE7H is somewhat different.

 

The RAID controller (an OEM part from Avago Technologies, which acquired LSI in May 2014) can further optimize storage access with a few optional settings. The three relevant settings are (in a nutshell):

  • Read Policy: When the operating system requests to read data from storage, the RAID controller will instruct individual disks to retrieve data. The data in question may be on a single disk in the array, or may be spread across multiple disks (“striped”). Since a disk is a spinning metal platter, it is quite efficient for the read head of the disk, as it passes over particular sectors, to grab additional sectors on the assumption that the operating system may require additional sequential data. It is also quite efficient for the RAID controller to pull the entire data stripe containing the specific requested data off of all disks in the array. This data is loaded to the RAID controller cache and is thus much more readily available (from both a bandwidth and latency perspective) than another read from the disk. This can drastically increase read performance. Unfortunately, the affected servers shipped with this feature disabled.
A high-level diagram showing the impact of Read Policy: Read Ahead Always.

 

  • Write Policy: When the operating system requests to write data to storage, the RAID controller will instruct individual disks to record the data and will calculate parity for redundancy. The operating system will wait for confirmation that the data has been written to disk before moving along to its next task. The RAID controller can optimize this process by writing to its local memory cache and reporting the write to the operating system, then writing the data out to disk on its own time. This option is safest to enable when a battery backup unit (BBU) or capacitor is available to maintain the (volatile) memory cache in the event of sudden power loss. This can drastically increase write performance. Unfortunately, the affected servers shipped with this feature disabled.
A high-level diagram showing the impact of Write Policy: Write Back with BBU.

 

  • Strip Size: When the RAID controller creates an array, it must be configured with the amount of data to write to each member disk. This is the “strip size,” which in turn determines the “stripe size” (the data laid across all disks, plus the parity calculation). Different strip (and therefore stripe) sizes lead to different performance characteristics. In general, larger strip/stripe sizes favor sequential performance, while smaller strip/stripe sizes favor random performance. Cisco uses a balanced approach with a 128 KB strip size. Unfortunately, the affected servers shipped with a smaller 64 KB strip size, leading to somewhat suboptimal sequential performance; the sketch below this list puts some rough numbers to it.
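As a rough sketch of how strip size turns into stripe geometry, here is a bit of Python arithmetic. The disk count and offsets are assumptions made up for the example, not the exact layout of any particular BE6000 or BE7000 model:

    # Illustrative only: how strip size and disk count determine stripe size and
    # where a given logical offset lands in a RAID 5 array.
    DISKS = 8                    # assumed array width, for illustration only
    STRIP_SIZE_KB = 128          # Cisco's intended strip size per the field notice
    DATA_DISKS = DISKS - 1       # RAID 5 spends one strip per stripe on parity

    stripe_size_kb = STRIP_SIZE_KB * DATA_DISKS
    print(f"Full stripe holds {stripe_size_kb} KB of data")

    def locate(offset_kb):
        """Map a logical offset (in KB) to its stripe number and data-strip index."""
        stripe = offset_kb // stripe_size_kb
        strip_index = (offset_kb % stripe_size_kb) // STRIP_SIZE_KB
        return stripe, strip_index

    print(locate(900))  # -> (1, 0) with the assumed geometry

    # A 1 MB sequential transfer at the shipped 64 KB strip size spans twice as
    # many strips as it would at 128 KB, which is one reason the smaller strip
    # size drags down sequential throughput.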

Ultimately it all comes down to storage bandwidth and latency. The blog Coding Horror has an excellent and amusing explanation of “The Infinite Space Between Words” that relates to how long a disk read or write event takes from the perspective of the CPU and operating system. Anything that reduces the number of physical disk operations, or hides their latency behind the controller’s cache, has huge implications for application performance. Unfortunately, the affected servers suffer from misconfigurations that severely degrade storage performance.
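To put rough numbers on that “infinite space,” here is a back-of-the-envelope comparison using the widely cited, order-of-magnitude latency figures popularized by Jeff Dean and Peter Norvig. These are ballpark values, not measurements from the affected servers:

    # Illustrative only: approximate latencies, in nanoseconds (ballpark figures).
    LATENCY_NS = {
        "main memory reference": 100,
        "read 1 MB sequentially from memory": 250_000,
        "disk seek": 10_000_000,
        "read 1 MB sequentially from disk": 20_000_000,
    }

    seek = LATENCY_NS["disk seek"]
    ram = LATENCY_NS["main memory reference"]
    print(f"One disk seek costs roughly {seek // ram:,}x a memory reference.")

    # Every read the controller satisfies from its read-ahead cache, and every
    # write it acknowledges from its write-back cache, saves the application
    # milliseconds it would otherwise spend waiting on the spindle.

With gaps that large, the read-ahead and write-back settings above are not micro-optimizations; they decide whether the UC applications spend their time working or waiting.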

I got a fever, and the only prescription, is more cowbell

Cisco’s field notice includes the corrective actions needed to resolve these issues and breaks them into two options.

Cisco’s Option A is the easy stuff — fixing the read and write policies in the RAID management controller, which is non-destructive.

Cisco’s Option B is the hard stuff — fixing the strip size in the RAID management controller, which is destructive.

We have a plan

The serious performance issues described in Cisco Field Notice 64029 put the integrity of any Cisco UC solution using affected servers at significant risk.

While Cisco recommends corrective action Option A for all customers, they only recommend Option B for customers “that continue to experience issues after Option A has been implemented.” Ultimately, performing the steps in Option B restores the system to the peak performance it was designed for. This is important for any system running mission-critical applications or with expected growth that will take it near its design performance limits.

We have a plan.

CDW has documented the process to quickly and efficiently identify affected UCS servers via both Cisco’s serial number portal and the potentially affected UCS server’s Cisco Integrated Management Controller (CIMC) web GUI. CDW has also developed and tested engineering best practices to quickly and efficiently implement both Option A and Option B recommendations to rebuild affected UCS servers. The VMware ESXi host and all application VM settings are fully preserved. These practices are lab validated and already field tested.

Next steps and making lemons into lemonade

We cannot emphasize enough the importance of this issue: The serious performance issues described in Cisco Field Notice 64029 put the integrity of any Cisco UC solution using affected servers at significant risk.

It’s important for any Cisco UC customer to take some next steps to resolve these issues:

  • Verify whether your servers are affected.
  • If servers are affected, work with your CDW account manager to schedule services to perform a quick and efficient remediation.
  • Consider an appropriate maintenance window to have CDW engineers perform remediation tasks based on your business needs, environments and applications.

While the system is “up on the lift,” so to speak, it’s a perfect time to consider co-scheduling other UC system maintenance, such as minor application upgrades, device package installations, server and phone firmware upgrades, and voice gateway software upgrades.

Contact your account manager for additional information. If you don’t have an account manager, use this form to get connected.

 
