Best Practices For Infrastructure Upgrade?
An anonymous reader writes "I was put in charge of an aging IT infrastructure that needs a serious overhaul. Current services include the usual suspects, i.e. www, ftp, email, dns, firewall, DHCP — and some more. In most cases, each service runs on its own hardware, some of them for the last seven years straight. The machines still can (mostly) handle the load that ~150 people in multiple offices put on them, but there's hardly any fallback if any of the services die or an office is disconnected. Now, as the hardware must be replaced, I'd like to buff things up a bit: distributed instances of services (at least one instance per office) and a fallback/load-balancing scheme (either to an instance in another office or a duplicated one within the same). Services running on virtualized servers hosted by a single reasonably-sized machine per office (plus one for testing and a spare) seem to recommend themselves. What's your experience with virtualization of services and implementing fallback/load-balancing schemes? What's best practice for an update like this? I'm interested in your success stories and anecdotes, but also pointers and (book) references. Thanks!"
Re:Cloud Computing(TM) (Score:5, Insightful)
No, the budget question comes later.
The first questions are: What are your business's requirements regarding your IT infrastructure? How long can you do business without it? How fast does something need to be restored?
Starting with those requirements, you can start on possible designs that fit them - for example, if the requirement is that a machine must be operational again within a week of a crash, you can build computers from random spare parts and hope that they'll work. If the requirement is that it should be up and running in two days, you will need to buy servers from a Tier 1 vendor like HP or IBM with appropriate service contracts. If the requirement is that everything must be up and running again in 4 hours, you'll need backups, clusters, site resilience, replicated SAN, and so on.
The question of budget comes into play much later.
Think about the complexity of duplication (Score:5, Insightful)
there's hardly any fallback if any of the services dies or an office is disconnected. Now, as the hardware must be replaced, I'd like to buff things up a bit: distributed instances of services (at least one instance per office) and a fallback/load-balancing scheme (either to an instance in another office or a duplicated one within the same).
Is that really necessary? I know that we all would like to have bullet-proof services. However, is the network service to the various offices so unreliable that it justifies the added complexity of instantiating services at every location? Or even introducing redundancy at each location? If you were talking about thousands or tens of thousands of users at each location, it might make sense just because you would have to distribute the load in some way.
What you need to do is evaluate your connectivity and its reliability. For example: How often does each office's link actually go down, and for how long? What does an hour without each service cost the business? Could users keep working locally through an outage?
Once you answer at least those questions, then you have the information you need in order to make a sensible decision.
Get someone experienced on the boat! (Score:5, Insightful)
You know, you could've started with a few more details: what operating system are you running on the servers? What OS are the clients running? What level of service are you trying to achieve? How many people work in your shop? What's their level of expertise?
If you're asking this on Slashdot now, it means you don't have enough experience with this yet, so my first piece of advice would be to get someone involved who does: a firm with people who have lots of experience and knowledge of the platform you work on. That way you'll have backup in case something goes south, and your network design will benefit from their experience.
As for other advice, make sure you get the requirements from the higher-ups in writing. Sometimes they have ridiculous ideas regarding the availability they want and how much they're willing to pay for it.
Take your time (Score:5, Insightful)
If you're like most IT managers, you probably have a budget. Which is probably wholly inadequate for immediately and elegantly solving your problems.
Look at your company's business, and how the different offices interact with each other, and with your customers. By just upgrading existing infrastructure, you may be putting some of the money and time where it's not needed, instead of just shutting down a service or migrating it to something more modern or easier to manage. Free is not always better, unless your time has no value.
Pick a few projects to help you get a handle on the things that need more planning, and try and put out any fires as quickly as possible, without committing to a long-term technology plan for remediation.
Your objective is to make the transition as boring as possible for the end users, except for the parts where things just start to work better.
Re:I'd say (Score:2, Insightful)
I doubt that with only 150 people they would want to spend the money to have a server at every office in case that office's link went down. I agree wholeheartedly that the level of redundancy talked about is overkill. Also, will WWW, mail, DNS, etc. even work if the line is cut, regardless of whether the server is in the building?
Don't do it (Score:5, Insightful)
Complexity is bad. I work in a department of similar size. Long long ago, things were simple. But then due to plans like yours, we ended up with quadruple replicated dns servers with automatic failover and load balancing, a mail system requiring 12 separate machines (double redundant machines at each of 4 stages: front end, queuing, mail delivery, and mail storage), a web system built from 6 interacting machines (caches, front end, back end, script server, etc.) plus redundancy for load balancing, plus automatic failover. You can guess what this is like: it sucks. The thing was a nightmare to maintain, very expensive, slow (mail traveling over 8 queues to get delivered), and impossible to debug when things go wrong.
It has taken more than a year, but we are slowly converging to a simple solution. 150 people do not need multiply redundant load balanced dns servers. One will do just fine, with a backup in case it fails. 150 people do not need 12+ machines to deliver mail. A small organization doesn't need a cluster to serve web pages.
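A "one DNS server plus a backup" setup really is just a few lines of configuration. A rough sketch for BIND, with the zone name and addresses made up for illustration:

```
// On the primary (192.0.2.10), named.conf:
zone "example.internal" {
    type master;
    file "/etc/bind/db.example.internal";
    allow-transfer { 192.0.2.11; };  // let the backup pull the zone
    also-notify { 192.0.2.11; };     // tell it about changes promptly
};

// On the backup (192.0.2.11), named.conf:
zone "example.internal" {
    type slave;
    masters { 192.0.2.10; };
    file "/var/cache/bind/db.example.internal";
};
```

List both servers in the clients' resolver configuration (handed out via DHCP) and failover is automatic as far as lookups are concerned; no load balancer required.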
My advice: go for simplicity. Measure your requirements ahead of time, so you know if you really need load balanced dns servers, etc. In all likelihood, you will find that you don't need nearly the capacity you think you do, and can make do with a much simpler, cheaper, easier to maintain, more robust, and faster setup. If you can call that making do, that is.
Re:Why? (Score:3, Insightful)
> redundancy.
+5 Funny.
Trying to make your mark, eh? (Score:3, Insightful)
The system you have works solidly, and has worked solidly for seven years.
I, personally, am TOTALLY in agreement with the ethos of whoever designed it, a single box for each service.
Frankly, with the cost of modern hardware, you could triple the capacity of what you have now just by gradually swapping out for newer hardware over the next few months, and keeping the shite old boxen for fallback.
Virtualisation is, IMHO, *totally* inappropriate for 99% of cases where it is used, ditto *cloud* computing.
It sounds to me like you are more interested in making your own mark, than actually taking an objective view. I may of course be wrong, but usually that is the case in stories like this.
In my experience, everyone who tries to make their own mark actually degrades a system, and simply discounts the ways that they have degraded it as being "obsolete" or "no longer applicable".
Frankly, based on your post alone, I'd sack you on the spot, because you sound like the biggest threat to the system to come along in seven years.
These are NOT your computers, if you want a system just so, build it yourself with your own money in your own home.
This advice / opinion is of course worth exactly what it cost.
Apologies in advance if I have misconstrued your approach. (but I doubt that I have)
YMMV.
What 150 users? (Score:5, Insightful)
I'd say that everyone has mentioned the big-picture points already, except for one: what kind of users?
150 file clerks or accountants and you'll spend more time worrying about the printer that the CIO's secretary just had to have, which conveniently doesn't have reliable drivers or documentation, even if it had whatever neat feature she wanted and now can't use.
150 programmers can put a mild to heavy load on your infrastructure, depending on what kind of software they're developing and testing (more a function of what kind of environment are they coding for and how much gear they need to test it).
150 programmers and processors of data (financial, medical, geophysical, whatever) can put an extreme load on your infrastructure. Like to the point where it's easier to ship tape media internationally than fuck around with a stable interoffice file transfer solution (I've seen it as a common practice - "hey, you're going to the XYZ office, we're sending a crate of tapes along with you so you can load it onto their fileservers").
Define your environment, then you know your requirements, find the solutions that meet those requirements, then try to get a PO for it. Have fun.
Re:Cloud Computing(TM) (Score:2, Insightful)
Yes, but for example management wanting a 24/7, two-hours-to-recovery SLA while having hired a single guy with a budget of $800 will not work - this is important to get sorted out early. Management needs to know what they want and what they'll get.
Simple and straightforward = complex (Score:5, Insightful)
So let's see if I understand: you want to take a simple, straightforward, easy-to-understand architecture with no single points of failure that would be very easy to recover in the event of a problem and extremely easy to recreate at a different site in a few hours in the event of a disaster, and replace it with a vastly more complex system that uses tons of shiny new buzzwords. All to serve 150 end users for whom you have quantified no complaints related to the architecture other than it might need to be sped up a bit (or perhaps find a GUI interface for the ftp server, etc).
This should turn out well.
sPh
As far as a "distributed redundant system" goes, I strongly suggest you read Moans Nogood's essay "You Don't Need High Availability [blogspot.com]" and think very deeply about it before proceeding.
Confucius Say.. (Score:1, Insightful)
..if it aint broke..
Re:Google(tm) Cloud (Score:3, Insightful)
It is if you recommended outsourcing everything to the cloud.
Re:Cloud Computing(TM) (Score:4, Insightful)
Except of course that management ALREADY HAS that, because they've been very lucky for 7 years. Why spend money for what works? (Never mind that we can't upgrade or replace any of it because it's so old.)
I think what the article is really asking is what's a good model to start all this stuff. You're looking at one or two servers per location (or maybe even network appliances at remote sites). We read all this stuff on Slashdot and in the deluges of magazines and marketing material... where do we start to make it GO?
Re:Trying to make your mark, eh? (Score:3, Insightful)
Is it so hard to not mix up dhcpd.conf and named.conf? Do you need virtualization for that?
Let me give you a hint: YOU DON'T
Re:Cloud Computing(TM) (Score:3, Insightful)
Why would you buy a cluster that's not the same architecture? You don't know what you're talking about. VMs generally aren't used to change architecture like that. In a virtualized cluster the "OS" is just another data file too! Just point an available CPU at your file server image on the SAN and start it back up... that's smart, not lazy!
Most people need virtualization because managing crappy old apps on old server OSes is a bitch. The old busted apps are doing mission critical work, customized to the point the manufacturer won't support them, and management doesn't want to pay out for the new version... or the new version doesn't support the old equipment. The leading purpose for VMs is to get new shiny hardware with a modern OS and backup methods, and to segregate your old hard-to-maintain configurations into instances. Then the old and busted doesn't crash the core services anymore. Instances that used to be on dedicated, busted hardware that used to require a call-out can be rebooted from your couch in your jammies! (I vote VNC on iPhone as the killer admin app!) VMs include backup at the VM level, so those old machines that refused to support backup can be backed up "in spite of" the software trying to prevent it.
Re:Trying to make your mark, eh? (Score:3, Insightful)
Get real, for 150 users a WRT54 will do DNS etc....
Want a bit more poke, VIA EPIA + small flash disk.
"buy a server".. jeez, you work for IBM sales dept?
I'm responding to your comment:
I, personally, am TOTALLY in agreement with the ethos of whoever designed it, a single box for each service.
I recommended at least two boxes, for redundancy. He may need more, depending on load.
For a 150 user organization, that's nothing; most such organisations are running off a dozen servers or more, which is what the original poster in fact said. With virtualization, he'd be reducing his costs.
One per service is insane, which is what you said. If you wanted dedicated boxes for each service AND some redundancy, that's TWO per service!
Backpedaling and pretending that a WRT54 can somehow host all of the services required by a 150 user organization is doubly insane.
Astroturfing.. (Score:3, Insightful)
If MS is going to astroturf, you need to at least learn to be a bit more subtle about it. That post couldn't have been more obviously marketing drivel if it tried. Regardless of technical merit of the solution (which I can't discuss authoritatively).
The post history of the poster is even more amusingly obvious. No normal person is a shill for one specific cause in every single point of every post they ever make.
To all companies: please keep your advertising in the designated ad locations and pay for them, don't post marketing material posing as just another user.
Comment removed (Score:5, Insightful)
Re:Why? (Score:2, Insightful)
It creates a configuration nightmare. Apps with conflicting configurations.
Changes required for one app may break other apps.
Also, many OSes don't scale well.
In a majority of cases you actually get greater total aggregate performance out of the hardware by divvying it up into multiple servers, when your apps are not actually CPU-bound or I/O-bound.
Linux is like this. For example, running Apache: after a certain number of requests, the OS uses the hardware inefficiently, and can't answer nearly as many requests as it should be able to. By dividing it into 4 virtual servers instead, for your 4 CPUs, you can multiply the number of requests that can be handled by 10 or 20 fold.
You may even think you're CPU-bound on Linux when you are not: the load average may be high due to the number of Apache processes contending with each other, which can create a false impression of high CPU or I/O usage, when in fact you have a bottleneck in the app's/kernel's parallel processing capabilities.
Exchange is also like this: better to scale out 2 virtual machines with 32GB of RAM and 4x 3GHz CPUs dedicated to each, than one server with 64GB of RAM and 8x 3GHz CPUs. The latter is a beefy server but doesn't gain much from the extra resources. The two servers virtualized on one box may have much better performance than 1 physical server, if you are using Intel Nehalem CPUs and properly configure your VMs (i.e. you actually do it right and follow all recommended practices, including LUN/guest partition block alignment, and don't just use default settings).
Re:Get someone experienced on the boat! (Score:2, Insightful)
The main piece of missing information that annoys me is that part of the network service list that says "-- and some more." Half the services that were listed could be easily outsourced to any decent ISP, with cost depending on security, storage, and SLA requirements. ISP hosting or even colocation services give you cheap access to better redundant Internet links than your office will ever touch.
The other half could be done with a cheap firewall/VPN box at each site. In the age of OpenWRT, these boxes often have services like Multi-WAN, DNS, DHCP, SSL, VPN, and IDS built-in. Buy two of those, sync configuration, hook them up to a networked power switch, and script the power to shut off one and power up the other whenever a network service test fails. All that equipment is still less than the cost of a single 1U+ server with equivalent services, and any custom scripting would be for minor convenience functions -- not a service requirement. I find specialized hardware/firmware solutions are far more reliable than software/server solutions. They are also often cheap enough to keep an offline spare handy for emergency replacement.
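The networked-power-switch trick above can be scripted in a few lines. A minimal sketch, meant to be run from cron every minute on a third box: the health-check and power commands are passed in as strings because every vendor's CLI is different, and the `switchctl` name in the example comment is made up.

```shell
#!/bin/sh
# failover_once CHECK_CMD OFF_CMD ON_CMD STATE_FILE
# Run one health check; after three consecutive failures, power the
# primary off and the standby on. The failure count lives in a state file
# so the function can be invoked repeatedly from cron.
failover_once() {
    check_cmd=$1; off_cmd=$2; on_cmd=$3; state_file=$4

    fails=$(cat "$state_file" 2>/dev/null || echo 0)
    fails=${fails:-0}

    if eval "$check_cmd"; then
        echo 0 > "$state_file"          # healthy again: reset the counter
        return 0
    fi

    fails=$((fails + 1))
    echo "$fails" > "$state_file"

    if [ "$fails" -ge 3 ]; then         # debounce: one lost ping is not an outage
        eval "$off_cmd"
        eval "$on_cmd"
        echo 0 > "$state_file"
    fi
}

# Example invocation ("switchctl" stands in for your power switch's CLI;
# swap the ping for a real service test, e.g. a dig query, to cut false alarms):
# failover_once "ping -c 3 -W 2 192.0.2.1" \
#               "switchctl off 1" "switchctl on 2" /var/run/fw-fails
```

Wrapped in a one-line script and called from cron, that's the whole failover mechanism; the only custom part is whatever command your particular power switch accepts.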
Even a low-power retail NAS box could be used for complete network authentication, SSH, and SSL data services. It could probably serve an office up to 250 users, depending on simultaneous load -- 50 easy. Slap some cheap (less than $0.10/GB!) TB+ SATA drives in there, and you have multi-TB RAID storage per site, that can be rsync replicated to all nodes. Give each site their own cheap master storage node, replicated to each other. The rsync script(s) could be scheduled or event triggered, as needed. Netgear ReadyNAS boxes can also run Subversion/WebDAV/Autocommit/svnsync.
I'm betting the meat of these services are in that nebulous "and some more" area, and that those service requirements change everything.
Some brand names that carry one or more of the products mentioned above, and can be found in any Fry's or decent online store, without even having to deal with a sales rep:
Netgear
Linksys (now sometimes Cisco rebranded)
Dlink
Cradlepoint (3G/4G wireless backup!)
Apple (AirPorts are surprisingly good routers!)
Qnap
Thecus
Sans Digital
Digital Loggers, Inc.
APC
I wouldn't ever recommend Buffalo, and 3Com might be on the list if HP had not bought them recently.
Re:Cloud Computing(TM) (Score:3, Insightful)
Except of course that management ALREADY HAS that because they've been very lucky for 7 years
Whoa there - so using this logic we can assume the company has no fire insurance, etc., because they've been lucky and not had their building burn down in 7 years? Managers might not understand technical issues, but one thing managers worth the title CAN do is manage risk, i.e. balance the cost of risk mitigation against the risk. I can well imagine a company of 150 people that actually doesn't have any mission critical servers worth spending a lot on redundancy, etc. I can also imagine a company that has gotten lucky while, at the same time, the IT person(s) haven't explained IT risks/costs in proper terms because they assume the managers just aren't technical.
The original questioner definitely needs to do a proper risk / cost analysis and present it to the managers. (But right now his "ideas" are WAY too vague and not business-need driven.) A prompt, proper analysis and plan/alternate plan(s) for risk and risk avoidance is going to be seriously wanted. It will CYA for that magic moment any day now when these 7 year old systems start failing.
Re:Cloud Computing(TM) (Score:2, Insightful)
Well, it's not like I've not had any clashes with our management (or even that of some of our customers), but I've found it to be the better approach to actually talk things through in the hope of both parties getting a better understanding.
From my technical standpoint, it's very important that management knows _exactly_ what they're getting for their money. This also means saying "No". Yes, you can lose customers if you don't promise them 99.999999999% availability for $50 a month, but the real question is whether you actually wanted those customers in the first place - they may find some idiot who agrees to work with them, but that's their loss.
Even internally, as we also run our internal infrastructure, it's important to say "no" to unreasonable tasks and stupid ideas. Either management trusts you to actually do the job they've hired you to do, or they don't - then you'll need to find a new job.
Re:Cloud Computing(TM) (Score:2, Insightful)
mod parent up.
The first step is to find out what the business wants, and how much it is willing to pay. THEN you go out to find out what tech is appropriate/affordable to do it.
Ask the heads of each office, and the main business managers, what they want the tech to do now, in a year, and in three years. Do you have a business continuity plan that has to be allowed for? If you don't have a BC plan, now's a good time to have one done, before you buy a load of kit that may not do the job.
Once you have a list of business needs, and put them in a prioritised list (again the managers set the priority), you go out and look at what can do the job. Assuming you find a reasonable solution within budget, you need to plan the migration.
Protip: do not attempt to migrate everything in one go. Do it in steps, with breaks in between.
Proprotip: whatever your migration, be able to revert to the original solution in less than 8 hours - i.e. one working day.
Migration is the biggest gotcha - plan, plan and plan again. Do a dry run. Start with the least critical services. You do have backups, right? Fully tested backups, from ground zero? You do have all your network and infrastructure accurately and completely mapped out, and all configuration settings / files stored on paper and independent machines?
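The "configuration stored on independent machines" step is easy to automate. A rough sketch that tars up a config tree to somewhere off-box (the example paths are placeholders) and verifies the archive is readable before trusting it:

```shell
#!/bin/sh
# snapshot_configs SRC_DIR OUT_DIR
# Write a date-stamped tarball of a config tree to OUT_DIR (ideally a
# mount from another machine), then list-verify it.
snapshot_configs() {
    src_dir=$1      # e.g. /etc
    out_dir=$2      # e.g. an NFS mount from the backup host
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$out_dir/configs-$stamp.tar.gz"

    tar -czf "$archive" -C "$src_dir" .
    # A backup you have never read back is not a backup: make sure the
    # archive at least lists cleanly before counting on it.
    tar -tzf "$archive" > /dev/null
}
```

Run it nightly from cron before the migration starts, and do one full restore onto scratch hardware as your dry run; that exercise answers the "fully tested backups, from ground zero" question above.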
Both arguments for VM and KISS have their place - only you can decide. But when you do decide, make sure it's based on evidence, and will end up making the business better.
Don't forget Total Cost of Ownership - the shiny boxes may run faster, but will you have to hire two more techs to keep them running, or a new maintenance contract?
Don't forget training - for you, your staff, and the end users. If you're putting shiny newness in place, people will need to know how to use it, and do their jobs at least as quickly as on the old solution. No use putting in shiny web4.0 uber cloud goodness if the users end up spending an hour on a job that used to take 5 minutes, because they don't know how to use it properly, or the interface doesn't easily work with their business processes.
good luck
Many datacenters can't build out bladecenters (Score:3, Insightful)
The biggest problem I've found with blades is that you can't fill a rack with them. Several of the datacenters I've come across have been unable to fit more than one bladecenter per rack, cooling and power being the problem.
At the moment, a rack full of 1U boxes looks like the highest density to me.