Three of the most common problems you’ll encounter are long system uptime, manual restarts and dated processes. This post will help you address them and strengthen your automation efforts.
Long System Uptime
Long system uptimes indicate missed maintenance, neglect and potentially critical operational exposure. System admins used to take a lot of pride in how long individual systems stayed up, but make no mistake: long uptime is a red flag. Today’s systems need to be patched, rebooted, reconfigured and scaled as a matter of routine. Automation needs to take failure into account so systems can self-heal.
Every layer of an IT infrastructure is subject to this issue. The network is typically a glaring problem child, as IT operators rarely perform regular software upgrades on network devices. We are all used to monthly Windows updates, so why would the network be any different? It needs software updates too. As automation works its way into the network, the rate of change within the network can increase, exposing old bugs, memory leaks and other inconsistencies.
Here are some tips for working out the kinks around uptime:
- Schedule reboots for any systems up more than 180 days. Plan and prepare to take a reboot at the time of your choosing instead of being the victim of a failure.
- Reboot systems before doing any upgrades. This surfaces existing problems while recovery is still simple; discovering them after the upgrade would complicate recovery.
- Consider rebuilding systems that have long uptimes so you can validate the design and prepare yourself for an Infrastructure as Code (IaC)-driven world. The rebuild process will also help you document the current known state.
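As a rough illustration of the first tip, here is a minimal Python sketch that flags hosts past the 180-day mark. The host names and the inventory dictionary are made up for the example; in practice the uptime values would come from your monitoring system or from reading /proc/uptime on each Linux host.

```python
from datetime import timedelta

MAX_UPTIME = timedelta(days=180)  # threshold from the first tip above


def parse_proc_uptime(text: str) -> timedelta:
    """Parse the contents of Linux's /proc/uptime (first field is seconds up)."""
    seconds = float(text.split()[0])
    return timedelta(seconds=seconds)


def reboot_candidates(uptimes: dict[str, timedelta],
                      limit: timedelta = MAX_UPTIME) -> list[str]:
    """Return the hosts whose uptime exceeds the limit, longest uptime first."""
    over = [(up, host) for host, up in uptimes.items() if up > limit]
    return [host for up, host in sorted(over, reverse=True)]


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = {
        "core-switch-01": timedelta(days=912),
        "web-01": timedelta(days=14),
        "db-02": parse_proc_uptime("17280000.00 65000000.00"),  # 200 days
    }
    for host in reboot_candidates(inventory):
        print(f"{host}: schedule a reboot ({inventory[host].days} days up)")
```

Sorting by longest uptime first is just one way to prioritize; you may prefer to order by business impact or maintenance window instead.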
Manual Restarts
One easy metric for your IT environment is a survey of which systems require manual restarts. Note which systems come back on their own and how important each one is to your environment. Systems that cannot be rebooted safely are not necessarily safe to automate. Resilience is a key feature of proper automation; hence it is important to fix those systems that are structural yet fragile.
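One way to turn that survey into a number is sketched below in Python. The `System` record, its fields and the sample names are assumptions made for this example, not a standard schema; the idea is simply to compute the share of systems that need a manual restart and to list the critical ones first.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    critical: bool          # structural to the environment?
    restarts_cleanly: bool  # came back on its own after the last reboot?


def survey(systems: list[System]) -> tuple[float, list[str]]:
    """Return the share of systems needing manual restarts and a fix-first list."""
    manual = [s for s in systems if not s.restarts_cleanly]
    share = len(manual) / len(systems) if systems else 0.0
    # Critical-but-fragile systems sort to the front, then by name.
    ordered = sorted(manual, key=lambda s: (not s.critical, s.name))
    return share, [s.name for s in ordered]


if __name__ == "__main__":
    # Hypothetical survey results, for illustration only.
    results = [
        System("billing", critical=True, restarts_cleanly=False),
        System("wiki", critical=False, restarts_cleanly=False),
        System("web", critical=True, restarts_cleanly=True),
        System("cache", critical=False, restarts_cleanly=True),
    ]
    share, fix_first = survey(results)
    print(f"{share:.0%} of systems need a manual restart; fix first: {fix_first}")
```

Tracking that share over time gives you a simple way to see whether the fragile systems are actually being fixed.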
The number and age of systems that can only be restarted manually might also indicate something systemic about how your staff deploys applications. You may have staff whose reflex, when a problem shows up, is to work around the issue manually rather than fix it. These manual workarounds end up hidden, and they increase cost and slow down work in progress.
My one recommendation is to form teams to scope and fix the applications and systems that require manual restarts today, and then report on their progress. Shining a light on these problem areas will help expose the opaque risks residing just below the surface of your environment.
Resistance to New Processes
Automation is going to require rethinking how your teams operate. Different teams might have different naming conventions, security policies or application design decisions, each of which may exist for any number of reasons. Letting a “We’ve always done it this way” attitude get in the way of innovation can leave your teams bogged down. Are the people who made those decisions or policies still with you? Do you still need to handcraft certain parts of your product or process? IT handcrafting is the antithesis of scalability, consistency, repeatability and other benefits you really can only get through automation. Use the opportunity to figure out how you can apply targeted automation with controls, with a vision toward eliminating legacy choices and reducing technical debt.
Uptime, restarts and old processes are just the tip of the iceberg when it comes to roadblocks on the path to automation. Do not let them hold you back. Once you begin tackling these operational behaviors, your teams will feel more empowered. Nothing says that your organization is an innovative and exciting place to work more than having stable systems and being able to take a vacation or two.