IT Operations Top 10 Sins

I’m sharing my list of the Top 10 Sins in IT Operations, viewed from a process and procedure standpoint. I have been in IT in varying capacities for 20+ years, most of them as an IT Operations manager for both small and HUGE server environments, primarily in the Communications Service Provider sector. This was also posted on my LinkedIn profile here – https://www.linkedin.com/pulse/operations-top-10-sins-steve-white

 

10. Who’s on First? – Inventory, Owners, Users and Responsibilities
This sounds like a simple task to complete and maintain, but looks can be deceiving as the company grows in size or system count. Inaccurate or missing inventory causes serious head scratching during outages and prolongs system changes. On top of that, unknown basic server metadata causes unnecessary spending on Support Groups, Licensing, Power and Physical space (racks), just to name a few.

Get your Physical inventory and your Roles and Responsibilities (R&R) inventory under control and maintain them. This helps reduce the fog of war both near and far.
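To make "basic server metadata" a little more concrete, here is a minimal sketch of an inventory record and a check that flags entries with missing details. The field names (owner, support_group, rack_location, license_ref) and the sample hosts are my own illustrative assumptions, not a standard; use whatever your environment actually tracks.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Optional


@dataclass
class ServerRecord:
    """One inventory entry. Every field name here is illustrative."""
    hostname: str
    owner: Optional[str] = None          # person or team accountable for the system
    support_group: Optional[str] = None  # who gets called during an outage
    rack_location: Optional[str] = None  # physical site / rack position
    license_ref: Optional[str] = None    # licensing or support contract reference


def missing_fields(record: ServerRecord) -> List[str]:
    """Return the names of metadata fields that are empty for this record."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


def audit(inventory: List[ServerRecord]) -> Dict[str, List[str]]:
    """Map each incomplete hostname to the fields it is missing."""
    report = {}
    for record in inventory:
        gaps = missing_fields(record)
        if gaps:
            report[record.hostname] = gaps
    return report


# Made-up sample data for illustration only.
inventory = [
    ServerRecord("mail01", owner="messaging team", support_group="ops",
                 rack_location="dc1-r12", license_ref="contract-001"),
    ServerRecord("legacy-db", support_group="ops"),
]
print(audit(inventory))
```

Running a check like this on a schedule is one cheap way to keep the inventory from quietly going stale.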

9. Deployment Standards and Hardware Standards – Automation Tools and Scripts
This one pays off twice: you save time up front and on the back end over the long haul. This seems obvious, I know, but it is a challenge for small and large companies alike. Building nearly identical servers is a huge benefit for numerous reasons: faster deployment, shorter troubleshooting time, consistent changes using deploy tools or scripts, predictable performance, system (hardware) reuse and parts sparing. The list of benefits goes on and on.

“Professional” level hardware from a vendor who produces the same hardware for years on end has the same benefit as choosing the same OS version for years on end. Sometimes it’s “hard” to buy a server whose design is nearly 2 years old, but consistency is paramount.

*I don’t mean you need to stay with the same processor speed or RAM count, for example, but rather keep to the same model number line or OS version.

8. Making the News, the bad way – Security scans PCI/SOX/AUDIT
In IT, staying out of the news is harder, or should I say more important, than being in the news… at least when you are in the news for the wrong reasons. It’s important that IT leadership and team members understand what is at risk and what measures need to be taken to ensure “the doors are locked”. Quarterly external and internal scanning is critical to your IT health. These reports will not only give you a list of issues that need remediation and attention, they will also (and possibly more critically) show you what is facing the internet or the user environment that you may not have been aware of. These may be systems that were turned up or changed to correct a temporary problem and were then forgotten as time went on.

PCI scans, audits and network discovery will help keep inventory accuracy, system anomalies, patch levels, rogue systems, security, lost projects, etc. in check and on the radar. There are free tools you can run to perform these scans, and you can also pay for third-party scans from various providers, which may come with remediation support and risk assessment reports.
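As a rough sketch of how scan results feed back into inventory, the snippet below compares the hosts a discovery scan found against the hosts you have on record. It assumes you have already exported both lists as plain hostnames; the function name, the result keys and the sample hosts are hypothetical, and no particular scanning tool is implied.

```python
def reconcile(inventory_hosts, discovered_hosts):
    """Compare recorded hosts with hosts found by a scan (case-insensitive)."""
    inventory = {h.lower() for h in inventory_hosts}
    discovered = {h.lower() for h in discovered_hosts}
    return {
        # On the network but not in inventory: possible rogue or forgotten systems.
        "unknown": sorted(discovered - inventory),
        # In inventory but not seen by the scan: retired, powered off or mis-recorded.
        "not_seen": sorted(inventory - discovered),
    }


# Made-up sample data for illustration only.
recorded = ["mail01", "web01", "db01"]
scanned = ["web01", "MAIL01", "temp-fix-7"]
result = reconcile(recorded, scanned)
print(result)
```

Both lists deserve a follow-up: the "unknown" hosts are the forgotten temporary fixes, and the "not_seen" hosts are often where the inventory itself is wrong.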

7. Long lasting outages – Recovery/DR/HA implementation and testing
Long outages can cause issues well and truly beyond the actual outage. In most companies, if a system is worth building and presenting for usage it’s important and painful when down.

When a system fails, not only is the service down for an extended time (impact may vary based on users and pain), but the seed of doubt has also been planted in the minds of the community, where impressions form outside of your control. Trust comes in many forms.

If it’s worth doing once… it’s likely worth doing twice. HA and DR come in MANY forms, and there is no “one” right way, as budgets and reality are important to keep in design plans. BUT you cannot skip the DR/HA philosophy, even for the little systems, when there are so many options these days that don’t require physical server purchases. If you don’t have the budget or need for physically diverse HA/DR locations with twin hardware behind load balancers, generator and battery UPS, you still have the responsibility to plan the DR steps, publish SLAs for systems and practice outage recovery processes and communications.

7.5 Communications
I want to mention here a little point I think we (IT) frequently miss (or only partially complete). HA/DR may not be possible for many reasons. But for almost zero cost, you can send email messages to the user community when services are down and assure users the issues are being addressed. Redirect failed web apps to a “Maintenance / Outage” page, notify management of outage details (ETR, Root Cause, Next Steps) and build plans to improve notices/communications.

(This is likely an exaggerated example.) You know what it’s like to bump into a locked door when it’s always been open before… It’s a similar unexpected disruption to the user community. Put a note on the door! Give a heads up, will ya!

6. Dreaming or beyond your budget – Financially or Functionally over extending
There is an old adage, “Do more with less”. As much as this is true, so is the inverse: “You can do less with more”.

Often it’s very possible to make beneficial changes for free, and other times you can pay to make detrimental changes. Don’t become enamored with technology and begin to feel you can only improve by spending money or buying massive products. Some of the most amazing software is inexpensive or free. Additionally, large complex software or hardware solutions that can truly only be vendor supported, or that require dedicated people, can become a funding and time sink. Think about how things will be supported over time.

Make sure you right size… for your size.

5. Crying Wolf – SLAs and setting expectations
You know, everyone feels their server or service is Business Impacting and Critical to the customer. In reality, however, many of the programs and services in production were not considered relevant or important enough for high level (AKA Real) HA or DR expenditures during deployment. When every outage becomes a P1 Emergency outage, you have set yourself up for failure.

Now, I’m NOT saying that any outage is unimportant. However, if a system was designed without redundancy, the business needs to be aware of the risks and of the real-world outages that can be expected, including their length and impact.

It’s important to evaluate and express real-world MTR (mean time to repair or recover) expectations for systems and services.
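One simple way to ground that expectation in history is to average the length of past outages. The sketch below assumes MTR here means mean time to repair and that you have outage start and end times on record; the function name and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(outages):
    """Average outage duration for a list of (start, end) datetime pairs."""
    if not outages:
        raise ValueError("need at least one outage to compute an average")
    total = sum((end - start for start, end in outages), timedelta())
    return total / len(outages)


# Made-up outage history for one service: 2 h, 4 h and 30 min.
outages = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 11, 0)),
    (datetime(2024, 3, 2, 22, 0), datetime(2024, 3, 3, 2, 0)),
    (datetime(2024, 6, 17, 14, 0), datetime(2024, 6, 17, 14, 30)),
]
mttr = mean_time_to_repair(outages)
print(mttr)  # 2:10:00
```

A number like this, taken from real outages rather than hopes, is a more honest thing to put in front of the business than "it will be fixed ASAP".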

Additionally, regular maintenance windows need to be established as part of the SLA. These need to be used to keep the service up to date and healthy.

4. Complacency. – Always keep improving
Thinking you have a solid IT infrastructure is a sure sign that either you are at a company that has not changed in years, has no plans for changes or improvements and has a zero budget, OR, more likely, you are unaware of what you need to do next.

There are ALWAYS things that can be done to improve IT / Systems. Even if you have one system, you can document the past, plan for the future, train and educate, prepare for outages, meet with business owners for feedback, rework contracts, improve monitoring, etc.

Don’t ever stop. “IT’s Never Done” (Sounds like a T-shirt)

3. Not knowing what you should know. – Monitoring
This ranks high on the list because this kind of blindness is curable over a weekend. You have to know how the services are doing and what they are doing.

Simple up/down is not enough. Network, Hardware Layer, OS, Server Performance, Application Logs, Application transaction performance and growth/usage all need to be monitored.

“Normal”, “Abnormal” and “Critical” are all alarm conditions and call for different operational behaviors. Normal and Critical are obvious for most of us. The “Abnormal” alarm is one that takes real time and talent to produce.

Abnormal is typically a high rate of change, or even an unusually low rate or level. Take email, for instance. It’s easy to monitor email flow, and even message delivery times, storage space, etc. But let’s add another layer: message rates per minute, as an example. You can define what is “Abnormal” for email rates per day, hour, minute or second: too many messages, or too few. Similar candidates are IO response times, backup window length, reboot/failover time, etc.

Once your team is working on the “Abnormal” alarms you will know you are really ahead of the game and now have an “early warning” system in place.
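Here is a minimal sketch of what an “Abnormal” check on email rates could look like. It assumes you already collect messages-per-minute samples, treats a rate of zero as Critical, and flags anything more than three standard deviations from the recent average as Abnormal. Those thresholds, the function name and the sample numbers are my illustrative assumptions, not a recommendation for any specific monitoring product.

```python
from statistics import mean, stdev


def classify_rate(current, history, abnormal_sigmas=3.0):
    """Classify a messages-per-minute reading against recent history.

    history needs at least two samples so a standard deviation exists.
    """
    if current == 0:
        return "critical"  # assumption: no mail flowing at all is an outage
    baseline = mean(history)
    spread = stdev(history)
    # Too many OR too few messages both count as abnormal.
    if abs(current - baseline) > abnormal_sigmas * spread:
        return "abnormal"
    return "normal"


# Made-up messages-per-minute samples from a quiet, typical period.
history = [100, 110, 95, 105, 90, 100]

for rate in (104, 400, 20, 0):
    print(rate, classify_rate(rate, history))
```

In practice the baseline would differ by hour of day and day of week, which is a big part of why good “Abnormal” alarms take real time and talent to tune.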

2. Not knowing, what you don’t know you should know. – Experience
First off, I’m not stating you have to have a zillion years in IT to be successful. I am stating that if you have not seen something or don’t know about it, then supporting a service built on it will be very hard. This is true for anyone with any number of years in the field.

Basically, if you are not experienced in the management, repair or operational usage of something you are responsible for… you better get on it. There is not much worse than trying to fix things you know nothing about. Training, test lab systems, Google, vendor product Q/A, outage simulation: you have many ways to learn.

1. Wrong people in the wrong places. – Employee Talent / Attitude
There is not much worse than having the wrong people in the pool. A Company can be rendered immobile with ineffective, passionless, under-talented, argumentative or politically posturing people in the mix. Make sure you staff up with passion and talent. Then don’t let the passion and talent go bad on the vine by surrounding them with bad grapes.

Slow, crippled IT Operations can also become a self-fulfilling prophecy. When talented, passionate, smart people start to “feel” the Company is getting in the way of day-to-day work, they often feel individual efforts are rebuffed, unappreciated or impossible. If you make it hard to get work done, you will find the hard workers will go elsewhere and you are left with the remainder.

If you have smart and hard-working people, you need to make sure they don’t spend too much time trying to get the work approved. Remove unnecessary obstacles. People and Processes can be viewed with the same critical eye when looking for the cause of slow operations.

What do you feel I missed on this? What should I add? Let me know in the comments!