Ask Slashdot: Unattended Maintenance Windows?

grahamsaa writes: Like many others in IT, I sometimes have to do server maintenance at unfortunate times. 6AM is the norm for us, but in some cases we're expected to do it as early as 2AM, which isn't exactly optimal. I understand that critical services can't be taken down during business hours, and most of our products are used 24 hours a day, but for some things it seems like it would be possible to automate maintenance (and downtime).

I have a maintenance window at about 5AM tomorrow. It's fairly simple — upgrade CentOS, remove a package, install a package, reboot. Downtime shouldn't be more than 5 minutes. While I don't think it would be wise to automate this window, I think with sufficient testing we might be able to automate future maintenance windows so I or someone else can sleep in. Aside from the benefit of getting a bit more sleep, automating this kind of thing means that it can be written, reviewed and tested well in advance. Of course, if something goes horribly wrong having a live body keeping watch is probably helpful. That said, we do have people on call 24/7 and they could probably respond capably in an emergency. Have any of you tried to do something like this? What's your experience been like?
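
The window described above is small enough to express as a single script. Below is a rough sketch in Python, assuming a CentOS box with yum and systemd; the package names are placeholders (the submission doesn't name them), and the script aborts on the first failure so the on-call person gets paged instead of being left with a half-finished upgrade:

    #!/usr/bin/env python3
    # Sketch of the window described above: update, swap one package, reboot.
    # Package names are hypothetical placeholders; any failure aborts the run.
    import subprocess
    import sys

    STEPS = [
        ["yum", "-y", "update"],                # upgrade CentOS packages
        ["yum", "-y", "remove", "old-agent"],   # hypothetical package to remove
        ["yum", "-y", "install", "new-agent"],  # hypothetical replacement
    ]

    for step in STEPS:
        print("running:", " ".join(step), flush=True)
        if subprocess.run(step).returncode != 0:
            # Stop here and leave the box up; let the on-call person decide.
            sys.exit("step failed: " + " ".join(step))

    subprocess.run(["systemctl", "reboot"])

Scheduled from a one-shot cron entry or at(1) and logged to a file, the 5 AM involvement shrinks to checking monitoring after the reboot.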
  • Puppet. (Score:4, Informative)

    by Anonymous Coward on Friday July 11, 2014 @12:25PM (#47432007)

    Learn and use Puppet.

  • Re:Murphy says no. (Score:5, Informative)

    by David_Hart ( 1184661 ) on Friday July 11, 2014 @01:00PM (#47432357)

    Here is what I have done in the past with network gear:

    1. Make sure that you have a test environment that is as close to your production environment as possible. For network gear, that means the exact same switch models with the exact same firmware and configuration. For servers, VMware is your friend...

    2. Build your script, then test and document the process as many times as necessary to make sure there are no gotchas. This is easier for network gear, since there are fewer prompts and options.

    3. Build a backup step into your script, schedule a backup with enough time to complete before your script runs, or make your script dependent on the backup job completing successfully (see the sketch at the end of this comment). A good backup is your friend. Make a local backup if you have the space.

    4. Schedule your job.

    5. Get up and check that the job completed successfully, either when it was scheduled to finish or before the first user is expected to start using the system. Leave enough time to perform a restore if necessary.

    As you can probably tell, doing this in an automated fashion takes more time and effort than babysitting the process yourself. However, it is worth it if you can apply the same process to a bunch of systems (e.g. a pile of UNIX boxes on the same version that all need the same upgrade). In our environment we have a large number of switches and similar gear all on the same version, and automation is pretty much the only option at that scale.
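
    To make steps 3 and 5 and the fleet case concrete, here is a rough Python sketch. It assumes a marker file that the backup job touches when it finishes, a hypothetical host list, and ssh key access; hosts with a missing or stale marker are skipped, and results go to a log so the morning check is just reading a file:

        #!/usr/bin/env python3
        # Sketch: run the same upgrade across many hosts, but only where last
        # night's backup marker is fresh. Hosts, marker path, and upgrade
        # command are illustrative assumptions, not details from this thread.
        import subprocess
        import time

        HOSTS = ["host1.example.com", "host2.example.com"]  # hypothetical fleet
        MARKER = "/var/backups/last-good-backup"            # touched by the backup job
        MAX_AGE = 6 * 3600                                  # seconds a marker stays "fresh"
        UPGRADE = "yum -y update"                           # stand-in for the real change

        def ssh(host, cmd):
            return subprocess.run(["ssh", host, cmd], capture_output=True, text=True)

        with open("maintenance-report.log", "w") as report:
            for host in HOSTS:
                # Step 3: depend on the backup job having completed recently.
                probe = ssh(host, "stat -c %Y " + MARKER)
                if probe.returncode != 0:
                    report.write(host + ": SKIPPED, no backup marker\n")
                    continue
                if time.time() - int(probe.stdout.strip()) > MAX_AGE:
                    report.write(host + ": SKIPPED, backup marker too old\n")
                    continue
                # Steps 4/5: do the work and record the result for the morning check.
                result = ssh(host, UPGRADE)
                report.write("%s: exit code %d\n" % (host, result.returncode))

    A cron entry or at job on a management host kicks this off at the window time; the morning check becomes reading maintenance-report.log rather than watching terminals.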

  • by thecombatwombat ( 571826 ) on Friday July 11, 2014 @01:01PM (#47432369)

    First: I do something like this all the time, and it's great. Generally, I _never_ log into production systems. Automation tools developed in pre-prod do _everything_. However, it's not just a matter of automating what a person would do manually.

    The real problem is that simple maintenance like updating a package requires downtime at all. With better redundancy, you can do 99% of normal, boring maintenance with zero downtime. If you're in this situation, you need to think about two questions:

    1) Why do my systems require downtime for this kind of thing? I should have better redundancy.
    2) How good are my dry runs in pre-prod environments? If you use a system like Puppet for *everything*, you can rehearse your Puppet code as often as you like in non-production (see the dry-run sketch at the end of this comment); then, in the maintenance window, you merge the change and simply watch it propagate to your servers. I think you'll find reliability goes way up. A person should still be around, but unexpected problems will virtually vanish.

    Address those questions, and I bet you'll find your business is happy to let you do "maintenance" at more agreeable times. It may not make sense to do it in the middle of the business day, but deploying Puppet code at 7 PM and monitoring is a lot more agreeable to me than signing on at 5 AM to run patches. I've embraced this pattern professionally for a few years now. I don't think I'd still be doing this kind of work if I hadn't.
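
    To make the dry run in point 2 concrete: puppet agent's --test mode includes --detailed-exitcodes, so a no-op rehearsal can be scripted and checked mechanically. A rough sketch with a hypothetical staging host list follows; exit code 0 means no changes, 2 means changes would be applied, 4 or 6 mean failures:

        #!/usr/bin/env python3
        # Sketch: rehearse a Puppet change in pre-prod with a no-op run before merging.
        # With --test, puppet agent returns 0 (no changes), 2 (changes pending),
        # or 4/6 (failures). The staging host list is hypothetical.
        import subprocess

        STAGING_HOSTS = ["app1.staging.example.com", "app2.staging.example.com"]

        failed = []
        for host in STAGING_HOSTS:
            run = subprocess.run(
                ["ssh", host, "sudo puppet agent --test --noop"],
                capture_output=True, text=True,
            )
            if run.returncode in (0, 2):
                status = "no changes" if run.returncode == 0 else "changes pending"
                print("%s: ok (%s)" % (host, status))
            else:
                print("%s: FAILED (exit %d)" % (host, run.returncode))
                print(run.stdout[-2000:])  # tail of the run output for quick triage
                failed.append(host)

        raise SystemExit(1 if failed else 0)

    Once the no-op run is clean in pre-prod, merging the change and letting agents pick it up on their normal run interval is most of the "maintenance window".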

  • Re:Puppet. (Score:4, Informative)

    by sjames ( 1099 ) on Friday July 11, 2014 @05:27PM (#47434447) Homepage Journal

    How, exactly, do you snapshot and test the production VM before the maintenance window and guarantee you won't affect (and by "affect", I mean anything that changes behavior in any way that is not expected by the users) any services running on that VM?

    Clone it, upgrade the clone, and make sure it works. If it does, wipe the clone, snapshot the production VM, and upgrade it. If the upgrade fails, roll back. Make sure your infrastructure is set up so the clone CAN be properly tested. Yes, sometimes you will have to do that rollback, but with an adequate test setup, frequently you won't.
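
    A rough sketch of the snapshot-then-upgrade-or-roll-back half of that flow, assuming a libvirt/KVM host where virsh is available (VMware and other hypervisors have equivalent tooling); the domain name, upgrade command, and health check are made up for illustration:

        #!/usr/bin/env python3
        # Sketch: snapshot production, apply the change, health-check, and revert
        # on failure. Assumes a libvirt/KVM host with virsh; the domain name,
        # upgrade command, and health check URL are illustrative assumptions.
        import subprocess
        import sys

        DOMAIN = "prod-app-vm"                                  # hypothetical VM name
        SNAP = "pre-maintenance"
        UPGRADE = ["ssh", "prod-app-vm", "sudo yum -y update"]  # stand-in for the change
        HEALTH = ["curl", "-sf", "http://prod-app-vm/healthz"]  # hypothetical check

        def run(cmd):
            return subprocess.run(cmd).returncode

        # Snapshot first so the rollback is a single command.
        if run(["virsh", "snapshot-create-as", DOMAIN, SNAP]) != 0:
            sys.exit("could not snapshot %s; aborting before touching anything" % DOMAIN)

        # Apply the change (the clone rehearsal happens earlier, outside this script).
        broken = run(UPGRADE) != 0 or run(HEALTH) != 0

        if broken:
            run(["virsh", "snapshot-revert", DOMAIN, SNAP])
            sys.exit("upgrade or health check failed; reverted %s to %s" % (DOMAIN, SNAP))

        print("upgrade of %s looks healthy; keep snapshot %s until the change has soaked" % (DOMAIN, SNAP))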
