How long would you drive your brand new car around with the “Check Engine” light flashing on your dashboard? One hour? One day? One week? One month?
My guess is somewhere between one hour and one day – as hey, you paid a lot of money for that car so it sure isn’t going to let you down anytime soon.
So why would you ignore the “Check Engine” light on your critical IT infrastructure? Truth is, you are not alone. Many companies face the same issue day in, day out: there are never enough hours in the day to work through the never-ending wish list your boss throws at you while also keeping an eye on your “Check Engine” light. So you take the “She’ll be right” approach.
Over the past couple of weeks, three of our unmanaged customers have experienced major application outages. What makes this worse is that each of these outages could easily have been avoided. In all three cases their “Check Engine” light had been flashing for some time. The problem was that no one noticed it or, in one case, someone chose to ignore it.
The purpose of this blog is not to name and shame but to highlight to every CxO out there who has even a small investment in technology the importance of proactive system maintenance.
Here are my top five tips, and how we have implemented them for our managed customers:
1. Have a third-party application monitor your critical infrastructure. Start with the simple things like uptime, disk space, memory utilisation and critical services. Get this right, then expand from there. Most of these monitoring applications can also fire off remediation tasks automatically – for example, if a critical service has stopped, let the tool restart it for you. If it happens again within a predefined time, create a task for someone to look further into the cause.
- We use a Remote Monitoring and Management (RMM) tool called LabTech. There are heaps of these on the market, but we chose this tool because of its tight integration with ConnectWise, our core CRM and Service Management tool.
- Every application that we manage for our customers is monitored by LabTech. An agent is installed on the server and reports defined information and events back to a central location.
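The restart-then-escalate rule from tip 1 can be sketched in a few lines of Python. This is only an illustration of the decision logic, not how LabTech or any other RMM tool implements it: the class name `ServiceWatcher`, the one-hour window and the action labels `"restart"` and `"create_task"` are all made up for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ServiceWatcher:
    """Decides what to do each time a monitored service is found stopped."""

    # Assumed value: how soon a repeat failure counts as "happening again".
    escalation_window: timedelta = timedelta(hours=1)
    # Service name -> time of the last automatic restart.
    last_restart: dict = field(default_factory=dict)

    def on_service_stopped(self, service: str, now: datetime) -> str:
        previous = self.last_restart.get(service)
        if previous is not None and now - previous <= self.escalation_window:
            # It failed again soon after an automatic restart:
            # hand it to a human instead of restarting in a loop.
            return "create_task"
        # First failure, or the last one was long ago: restart automatically.
        self.last_restart[service] = now
        return "restart"
```

The point of the window is that one automatic restart is cheap, but a service that keeps falling over needs someone to look at the cause.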
2. Set your alert thresholds to the right level. Too much noise creates distractions; too little increases your chance of missing something.
- Most of the process is automated once the agent is installed. The thresholds are set globally depending on the application, and most of the heavy lifting is done by the agent, not a human.
- This isn’t necessarily a set-and-forget approach. We invest the time that we save back into automating more of the alerts and actions. We also retune our thresholds as new functionality or new services hit the market.
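As a rough sketch of what "set globally depending on the application" might look like, here is a small Python example with one default set of thresholds and a per-application override. The profile names, metric names and every number in it are illustrative assumptions, not our actual LabTech settings.

```python
# (warning, critical) levels per metric. All values are illustrative assumptions.
THRESHOLDS = {
    "default": {"disk_used_pct": (80, 90), "memory_used_pct": (85, 95)},
    # A database server is expected to run with high memory use,
    # so it gets a quieter memory threshold than the default.
    "database": {"memory_used_pct": (90, 97)},
}


def classify(profile: str, metric: str, value: float) -> str:
    """Return "ok", "warning" or "critical" for one reading."""
    # Start from the global defaults, then apply the application's overrides.
    levels = {**THRESHOLDS["default"], **THRESHOLDS.get(profile, {})}
    warning, critical = levels[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"
```

With these made-up numbers, 88% memory use raises a warning on an ordinary server but stays quiet on a database server, which is the kind of tuning that keeps the noise down.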
3. Make someone accountable for the alerts your monitoring tool creates. There is no point having your check engine light flashing if no one can see it.
- Internally we call this person the “Duty Manager”. This role can change from day to day, and sometimes partway through the day – but importantly, everyone knows who is performing it.
- They are responsible for reviewing what has happened over the past 24 hours and what could have been prevented. They are also responsible for actioning any alerts that couldn’t be resolved automatically.
- Their goal is to have zero downtime for our customers, and they are rewarded accordingly.
4. Start patching your systems on a regular basis. Every three years when you replace the application is not good enough. You should be applying operating system patches at a minimum every three months, but preferably every month. Have a defined release and testing process in place. Don’t forget about the devices your users use either: desktops, laptops, tablets and phones all need regular maintenance and patching.
- For our customers we apply Windows Updates monthly and major application updates quarterly.
- Why so frequent, you ask? Because we can, and because it is the right thing to do. Most of our agreements only require this to happen quarterly or six-monthly, but over-delivering here actually saves us time troubleshooting issues later. It also keeps all our customers at a consistent patch level, again saving precious troubleshooting time: it is much easier to support your clients when they are all on a defined software level. That said, if our release cycle (listed below) identifies a critical issue, we discuss remediation or workarounds with the customer.
- We have a defined release cycle too – this is understood by every single employee in our company without fail.
i. 2nd Tuesday of the month. Patches are released.
ii. 48 hours later, three of our team meet, review each KB article and agree which patches should be applied. (Not every patch makes it through - the .NET ones always create interesting debate.)
iii. 3rd Tuesday of the month – The approved patches are applied to the Insync Production environment. Yep, that’s right - we test on our prod environment. We risk taking it down every month rather than run a separate test environment. This includes application servers, front end and back end services, phones and desktop level patches. It is paramount that we understand any issues before our customers run into them. We try to model our infrastructure on our typical customer.
iv. 4th week of the month. The updates are applied to our customers’ environments. Although this is mostly automated, we still manually check that every service is functioning as expected afterwards.
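If you want to put this cycle in a calendar, the dates are easy to work out. The Python sketch below is a minimal example, with function names of my own choosing. It reads "48 hours later" as two days after release, and it treats "4th week" as starting one week after the internal rollout, which is an assumption for the example rather than a hard rule.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday() counts Monday as 0


def nth_weekday(year: int, month: int, weekday: int, n: int) -> date:
    """Return the n-th given weekday of a month (n=2 -> second occurrence)."""
    first = date(year, month, 1)
    # Days from the 1st of the month to the first occurrence of the weekday.
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))


def release_cycle(year: int, month: int) -> dict:
    """Key dates of the monthly patch cycle described above."""
    patch_day = nth_weekday(year, month, TUESDAY, 2)
    internal = nth_weekday(year, month, TUESDAY, 3)
    return {
        "patches_released": patch_day,
        "review_meeting": patch_day + timedelta(days=2),
        "internal_rollout": internal,
        # Assumption: customer rollout starts a week after the internal one.
        "customer_rollout_from": internal + timedelta(days=7),
    }
```

For March 2024, `release_cycle(2024, 3)` gives a release on the 12th, the review meeting on the 14th, the internal rollout on the 19th and customer rollouts from the 26th.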
5. Report on your infrastructure health. Set a benchmark for your system health across your organisation, then measure yourself against it. Don’t hide it: put it up on a wall that everyone can see, and include it in weekly, monthly and quarterly reports to the boss. If it’s a bad score, don’t be ashamed – aim to be better.
- LabTech has the ability to score and report on the health of every customer that we manage. There is an algorithm behind this that I won’t go into here, but it spits out a consolidated result out of 100. Our past experience tells us that anything over 80 is great. Anything under, not so great.
- We report this score to our customers monthly. No smoke and mirrors, just a straight screenshot of the value – yes, we could fudge this if we wanted, but why bother? Honesty has not lost us a customer yet. It is actually quite interesting how seriously our customers take this score – a couple actually treat it as a competition with one another.
If you would like to discuss any of the above, or have a chat about your “Check Engine” light, feel free to email or call me directly.
- Nathan Belling
- Email/Skype: email@example.com
- Phone: +61 7 3040 3601