Here at serveradmins.NET, we’ve handled quite a few different tasks for customers over the years, and many of them have involved automating day-to-day tasks in large-scale hosting operations. The thought behind this is simple: if you can save a technician even two minutes on a task they do 60 times a day, you’ve just freed up two hours of their day to handle other things. Apply that to an entire tech operation of 20+ employees and you start to see the advantages quite quickly. Two hours per day per employee multiplied by 20 employees is 40 hours of “free” time you just created. That’s a whole work week right there! Chances are that no matter how slickly your operations run, there’s always an opportunity to do *something* better, and this is where the experienced admin can step in.
This is the difference between an Admin and a true Senior Admin. A Senior Admin has been in the industry long enough to recognize a good way to do things, a bad way to do things, and sometimes simply a better way to do things. A true Sr. Admin will be able to look at your operations from the top down, break down the individual components, analyze each one for weaknesses, make and prioritize a list, and then act on it.
For example, at a prior client’s site, we were brought in to streamline overall operations and “fix things”. We started by looking at the public-facing problems and digging down from there. After a bit of recon, we noticed that server restoration times were abysmal, server load averages were way too high across the board, and the machine failure rate was VERY high, way above the norm. This caused the obvious direct impact of unhappy customers and complaints, but quite a few side effects as well. Support technicians spent an inordinate amount of their time keeping customers happy. Admins spent way too much time watching server restorations. Billing saw an insane spike in chargebacks and cancellations. Unaffected customers got caught up in the flood of support/billing requests and saw their problem resolution times skyrocket. Average load on the backup servers climbed, which meant normal backup runs for the healthy machines spilled over into business hours, which in turn drove up I/O wait on those machines, and so on.
We crafted our plan of attack by looking at the most frequent cause of full server crashes. In this case, none of the disk arrays in the hosting machines were being monitored. One drive would fail, the machine would keep operating, and then 6-7 months later the next disk would go, causing a full crash of the server and loss of all data. We audited the hardware, did a large-scale sweep of array health and status, and found quite a few alarming issues: at least 9 machines with a single failed disk in the array, 3 machines running on RAID 0 arrays, and several machines with no RAID at all and ailing hard disks! These were all catastrophes just waiting to happen, so that’s where we started. Disks were replaced and NRPE RAID checking was put into place so we could be informed of drive failures and act on them immediately. One fix for quite a few problems.
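A check like this doesn’t need to be fancy. Here’s a minimal sketch of the idea (not the exact plugin we deployed) for Linux md software RAID: a degraded array shows an underscore in the status brackets of /proc/mdstat, e.g. [U_] for a two-disk mirror missing one member. The file path is a parameter so the parsing logic can be exercised against a sample file; hardware RAID controllers would need their vendor’s CLI instead.

```shell
# check_raid: minimal md-RAID health check in the NRPE/Nagios style (a sketch).
# A degraded md array shows an underscore in its status brackets in
# /proc/mdstat, e.g. [U_]. Return codes follow Nagios conventions:
# 0 = OK, 2 = CRITICAL.
check_raid() {
    mdstat="${1:-/proc/mdstat}"
    # Match status brackets containing at least one missing member, e.g. [U_], [_U].
    if grep -q '\[U*_[U_]*\]' "$mdstat"; then
        echo "CRITICAL: degraded md array found"
        return 2
    fi
    echo "OK: all md arrays healthy"
    return 0
}
```

Wired into NRPE, this means you get paged on the *first* failed disk instead of letting the array limp along until the second one takes the whole server down.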
Where I’m going with that example is that you should always be aware of not only the obvious effects a problem manifests, but all of the other problems that stem out from there. After that single fix, we moved on to server load and capacity guidelines, then on to properly defining what an ‘abusive’ customer was and putting a stop to that, and so on.
These were all aspects of the tech side of things. From an operations perspective, let’s look at another problem we tackled. We noticed during the initial recon that there were thousands of suspended accounts on the machines. After asking around a bit, we discovered that old account removal was done by hand by an outsourced support staff that was paid monthly for every server they supported. These accounts had been building up for years, eating up backup server space and live production server space, and preventing the re-use of existing machines. With a few small shell scripts, we were able to fully remove cancelled, non-paying, and similar dead accounts on a nightly basis without any manual intervention. This *very* simple fix freed a tech from “pruning” the servers whenever they ran out of disk space, and it allowed thousands of new accounts to be placed on existing hardware, dramatically cutting new hardware spend for several months and lowering equipment costs for the entire timeline of the company. Small fix, big savings from a money standpoint.
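The scripts themselves were nothing exotic. Here’s a hedged sketch of the shape they took, assuming a cPanel-style layout where each suspended account leaves a marker file named after the user in /var/cpanel/suspended; the actual removal command varies by platform, so it’s left as a commented placeholder.

```shell
# prune_suspended: nightly cleanup sketch for long-suspended hosting accounts.
# Assumes each suspended account leaves a marker file named after the user
# (cPanel-style: /var/cpanel/suspended/<user>). Only accounts suspended
# longer than the grace period are touched.
prune_suspended() {
    dir="$1"
    grace_days="${2:-30}"
    # Marker files older than the grace period identify accounts safe to remove.
    find "$dir" -type f -mtime +"$grace_days" | while read -r marker; do
        user=$(basename "$marker")
        echo "removing long-suspended account: $user"
        # removeacct "$user"   # placeholder: your platform's real removal command
    done
}
```

Drop something like this into cron to run nightly and the “pruning” job simply stops existing as a human task.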
Domain registration was another issue at this company. A high-volume shared hosting operation was registering every domain by hand, one at a time. Sometimes signups exceeded 700-800 a day! This required several techs to do nothing but sit there and register domains all day, which, if you ask me, is a pretty thankless job. With a simple modification to their Ubersmith instance, that was fully automated, and 2-3 techs were freed up to alleviate workloads on the main support and phone support systems.
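Conceptually the automation is just “drain a queue instead of typing domains in by hand.” Everything in this sketch is illustrative: the queue file and the registrar_cli command are stand-ins for however your billing system (Ubersmith, in our case) exposes pending signups and whichever registrar API or tool you actually use.

```shell
# batch_register: illustrative sketch of automating signup-time domain
# registration. "registrar_cli" is a hypothetical placeholder, not a real
# tool; swap in your registrar's actual API client.
batch_register() {
    queue_file="$1"
    # One pending domain per line; skip blanks.
    while read -r domain; do
        [ -n "$domain" ] || continue
        echo "registering: $domain"
        # registrar_cli register "$domain"   # hypothetical registrar command
    done < "$queue_file"
}
```

Run on a short interval, a loop like this turns several full-time data-entry seats back into actual support capacity.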
When you handle anything on a large scale, the best fixes are usually simple fixes. As I mentioned at the opening of this article, if you save a tech two hours a day and then scale that across a week, then a month, then a year, you can start to see the cost savings this will net you.
We’re in an industry powered by incredibly intelligent people trying to show off their skillset. All too many times I’ve seen an admin spend hours and hours of their day crafting a super complex fix to a simple problem instead of taking the quick fix and moving on. Sometimes an issue *does* require a well-crafted, complex solution, but in my time doing this, 95% of the issues that assault a company are very simple workflow or operations problems that tend to compound and build on each other.
I hope you can take something from this and apply it to your own operations. Remember, an hour a day adds up pretty quickly. 🙂