Set It And Forget It, What Could Go Wrong?

Sifting through the tickets over my morning cup of coffee, I spot one reading “Computer beeps and won’t turn on”. Commence eye rolling. Finishing my coffee, I decide to swing by that building first and take a peek at what could be wrong. Twenty minutes later I walk in and, sure enough, after hitting the power button the fan sounds like a Harrier getting ready to take off, followed by 5 loud beeps. Rinse, repeat. I crack open the machine, which is at least 6 years old and well past retirement age, and I see dust dinosaurs! After blowing out the computer (with a lot of canned air) and re-seating the RAM, which reminds me of the OG Nintendo days, the system boots. I do a quick check to make sure things weren’t too corrupt and close out the ticket. That week I see and fix 3 computers with the same issue, so I ask, “When was the last time these machines were cleaned out?” No one could remember (so at least 3 years).

Fast forward a couple of months and we start seeing a rise in Low Disk Space messages on laptops. After cleaning up a few with a tool set I had collected over the years and some tweaks to system tools, I get some space reclaimed (mostly temp files and Windows Update cache). Again I ask, “When was the last time these machines were cleaned up or re-imaged?” The answer was the same as before. So I take to our new RMM and get some alerts set up, and to my surprise this happens more often than we have been hearing about. Time to script. I put together a cleanup script in PowerShell, then shoehorn it into Python so our RMM can handle it, and set an automated procedure to tackle these when they happen. This dramatically cuts down on the number of tickets we get for low disk space, but now we’re getting calls about systems being slow, mostly due to the cleanup scripts running frequently.
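
A minimal sketch of that kind of cleanup in Python might look like the following. It is not the script described above: the function names, the 10% free-space threshold, and the 7-day age cutoff are all assumptions chosen for illustration. It only deletes anything when free space is actually low, which is one way to keep a cleanup job from running more often than it needs to.

```python
import shutil
import time
from pathlib import Path


def free_percent(path):
    """Return the percentage of free space on the volume holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


def remove_old_files(folder, max_age_days=7):
    """Delete files under `folder` older than `max_age_days`; return bytes freed."""
    cutoff = time.time() - max_age_days * 86400
    freed = 0
    for item in Path(folder).rglob("*"):
        try:
            # Leave symlinks alone and only touch regular files.
            if item.is_symlink() or not item.is_file():
                continue
            info = item.stat()
            if info.st_mtime < cutoff:
                item.unlink()
                freed += info.st_size
        except OSError:
            # File is locked or already gone; skip it rather than fail the run.
            continue
    return freed


def cleanup_if_low(folder, threshold_percent=10.0, max_age_days=7):
    """Clean `folder` only when free space is below `threshold_percent`."""
    if free_percent(folder) >= threshold_percent:
        return 0  # Plenty of space, so do nothing.
    return remove_old_files(folder, max_age_days)
```

Pointing `cleanup_if_low` at a temp folder (for example the one returned by `tempfile.gettempdir()`) from a scheduled task would be one way to use it. Test it on a throwaway folder first, since it deletes files for real.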

Now this scenario isn’t unique to IT, or to a specific type of business. It’s become common to hear that there’s not enough ROI to perform routine maintenance on end user systems, or “we can’t lose productivity”, or my favorite, “we’ll deal with it when it happens”. If you’ve worked in a service role (particularly within an entity) you’ll know that the “we” is “you” and that “deal with it” means the possibility of getting chewed out by people on both sides.

While in the Navy I was introduced to the PMS (Planned Maintenance System), where maintenance was set by the engineering team that designed those systems. Failing to do said maintenance was the exception, and going around the system was punishable. That being said, I don’t think we need to go that far in most cases (military and medical excluded).

Every decision has a cost, whether you do regular (preventative) maintenance or not. Doing it might take away from productivity and might irk some end users, but not doing it might cost more than inconvenienced users: it might mean lost time and money. Not cleaning out those machines or cleaning up temp files means lost productivity time that is measurable, and might even lead to lost data (sorry, but since you didn’t back up those files they’re gone for good, and no, I’m not a magician who can just bring them back). The question I ask is whether skipping these regular tasks is worth the cost of a failure. If it is, great; if not, it’s time to find another option. Take for example my decision to go to an email service provider instead of running my own server. After hitting a major software upgrade and weighing the time to spin up a new server and move everything over, I decided that maintaining and running my own server wasn’t worth the cost in time and the risk of lost data.

Evaluating costs isn’t just about money, but also time and materials lost, and those can be hard to quantify until something actually fails, not to mention hard to show to superiors. One thing you can do is automate as much of that maintenance as possible. System admins and programmers are familiar with this. Take the cleanup script I wrote: it might have taken me an hour or two to compile the resources, research, test, and deploy the script, but it’s saved me hours a week in cleaning up systems so I can deal with “I can’t open the Internet!” questions.
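
That trade-off is easy to put into numbers when you need to show it to someone. A small sketch in Python, where the function names are hypothetical and the figures (2 hours to build, 3 hours saved per week, a 12-week window) are illustrative assumptions rather than measured values:

```python
def weeks_to_break_even(build_hours, hours_saved_per_week):
    """Weeks until the time saved equals the time spent building the automation."""
    if hours_saved_per_week <= 0:
        raise ValueError("hours_saved_per_week must be positive")
    return build_hours / hours_saved_per_week


def net_hours_saved(build_hours, hours_saved_per_week, weeks):
    """Hours saved over `weeks`, minus the up-front build time."""
    return hours_saved_per_week * weeks - build_hours


# Assumed figures for illustration only.
build = 2          # hours spent researching, testing, and deploying
saved = 3          # hours of manual cleanup avoided each week
print(f"Break-even after {weeks_to_break_even(build, saved):.2f} weeks")
print(f"Net hours saved over 12 weeks: {net_hours_saved(build, saved, 12)}")  # 34
```

Swap in your own estimates; even rough ones make the case for maintenance easier to argue than "it feels faster".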

At the end of the day we’ll be there to clean up either way, but there are a few things we can do to make life easier. Remember, if you’re going to propose a maintenance scheme up the chain, present the ROI, and where possible modularize and automate those tasks so you can kick back with a cup of coffee (or can of Mountain Dew) and read through the next set of tickets without thinking “not again”.

This is a new blog so I would appreciate any feedback you have on this and future posts.

One thought on “Set It And Forget It, What Could Go Wrong?”

  1. I find that “we’ll deal with it when it happens” is often coupled with “this is just a temporary fix” that justifies poor work that often ends up being a permanent fixture. Time and thoughtfulness up front saves time and soooo much stress down the road!
