Education

Distributing Unix Knowledge Among Admins? 56

chadworthman asks: "I work in a server support role with 6 other sys admins. We are all responsible for 10 to 25 servers each (various flavours of Unix), mostly grouped by project. The person who is responsible for a server is called prime. We also identify a sys admin as secondary. This system is not working out well. Most sys admins are only familiar with the environments that they are prime for, and when a prime contact is not in the office or leaves the company, the rest of us try to figure out the environment. We are currently trying to figure out the best way to transfer knowledge of environments between sys admins. We have considered a plan that would involve partnering with another co-worker while you trade knowledge, then after a certain number of months, trade with someone else. I was wondering what other techniques for knowledge transfer between sys admins Slashdotters have encountered."
This discussion has been archived. No new comments can be posted.

  • standardize (Score:2, Insightful)

    by tps12 ( 105590 )
    You need to try to adapt each environment towards a single standard that everyone then becomes familiar with. Yes, this will sacrifice some features of each platform, but that is the price you pay for greater scalability and flexibility. This is the kind of thing that made different flavors of Windows so popular with sys admins, and it's high time the Unix world followed suit.
  • Documentation (Score:3, Informative)

    by drivers ( 45076 ) on Tuesday June 18, 2002 @02:13PM (#3723375)
    Documentation. It's what you need. Some standardization would probably help too.
    • Re:Documentation (Score:3, Informative)

      by TheTomcat ( 53158 )
      We use a Wiki Wiki Web for all of our internal system (servers, network, telephone, hardware, etc) documentation.

      PHP Wiki [sf.net]

      It's great. Simple to use, easy to update, can read and edit from any web terminal on this side of our firewall. Need to find out hardware info about the Mac in the corner office? Log into the hardware tracker directly FROM said Mac.

  • Document (Score:3, Insightful)

    by photon317 ( 208409 ) on Tuesday June 18, 2002 @02:14PM (#3723385)

    Set up a knowledgebase of system information, make it versionable, and perhaps commentable, blog-style. Make it publish to a departmental web server, and have everyone document the hell out of everything there. Things that go there:

    Inventories of systems and of software
    Licenses and whatnot
    Purchase info
    Common practices docs (disk layout procedures, installation procedures, patching procedures, downtime procedures, etc)...
    etc...etc..

    You get the idea. The company shouldn't be reliant on an employee's brain as part of its business plan - document everything in such a way that if the whole staff went missing, a new staff of competent unix professionals could take over and do something useful based on your web docs.
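    A starting point for populating such a knowledgebase might be a cron-driven collector like this sketch (the file names and the idea of publishing the output to your departmental web server are assumptions, not anyone's actual setup):

```shell
# Collect a minimal per-host inventory into a text file that a cron job
# could publish on the departmental web server. Names are made up.
host=$(hostname)
out="inventory-$host.txt"
{
    echo "== $host, inventoried $(date) =="
    uname -a                                    # OS flavour and kernel
    df -k                                       # disk layout
    test -r /etc/passwd && wc -l < /etc/passwd  # rough count of accounts
} > "$out"
echo "wrote $out"
```

    Run from cron on every box, this keeps at least the raw inventory current even when nobody remembers to update the docs by hand.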
    • Re:Document (Score:3, Interesting)

      by geirt ( 55254 )
      We have just started to use a wiki [c2.com] for this purpose, and it looks good. We use MoinMoin [sourceforge.net], a wiki written in python which does versioning, and can send email notifications when a page is updated. People are documenting much more than before.
    • In particular, have your admins write down

      • problem symptom calls they receive
      • what they did to refine the diagnosis and isolate the cause
      • exactly what they did to fix it (I can see where SysV and BSD camps each need for the other to be explicit about things they each take for granted.)

      Finally, because a system like this depends on the cooperation of the sysadmins involved to be complete and detailed about this documentation, you need to heap frequent praise and monetary rewards on those that do a good job documenting their work.

      Make sure they understand that if a secondary sysadmin is able to keep their system going while they're on vacation that it is not a sign of job insecurity, but a sign of a first class sysadmin that makes sure his company keeps running even if he's on vacation.

  • Rotate your primes (Score:4, Interesting)

    by LordNimon ( 85072 ) on Tuesday June 18, 2002 @02:19PM (#3723407)
    Every two months, the primes all rotate. After a year, you will all be experts on all systems.
    • Hallelujah! Amen brotha . .

      We've tried the exhaustive documentation route here before, and it became pretty cumbersome before long. Describing all the symptoms and all the steps taken to correct something is undeniably a good idea, but in the heat of the moment, can be pretty darned tough.

      This solution benefits both the organization (because everyone is cross-trained), and benefits you all as the admins - as you will have deeper experience on a wide range of unices.

      An excellent solution.
    • by metacosm ( 45796 )
      The solution presented in the parent post is the correct one. Documentation is fine and dandy, but it doesn't come close to experience. The great thing about rotating primes is that the "true blue expert" is still around, but he is off learning something new too. Everyone puts their ego on hold and pitches in to help. This will generate well-rounded techs that can handle a broader array of issues, as a group and independently.
      • Often, when writing documentation we leave out things because we take them for granted or "well, it's just like that." Other people without that experience don't know that, and the gaps become apparent QUICK.

        Documentation AND experience. One can't take up the slack for the other being weak.
  • Look! (Score:3, Insightful)

    by itwerx ( 165526 ) on Tuesday June 18, 2002 @02:22PM (#3723425) Homepage
    It's a bird!
    It's a plane!
    It's a flock of binders!

    But seriously, documentation is key to anything like this. I know most sysadmins wail and moan and gnash their teeth at the very thought, but good documentation is almost as important as good backups!
    YMMV but it might actually be worth picking somebody as the "doc-meister" to learn ALL of the systems and have the other admins submit config changes etc. to this person on an on-going basis.
    This also helps prevent the common admin trick of just printing out tons of scripts to fill the binder and saying "See, it's all documented, right here" - except it doesn't actually help anybody understand anything!
    This way if the documentation lead can't understand it then you know a replacement admin won't either and changes can be made before it's really needed.
  • by medcalf ( 68293 )
    So make them act like a team. All admins are responsible for all servers. I am assuming that the group doesn't have a lot of time to document (most groups don't), but there are still practical ways to make it work, with minimal time taken in advance:

    1. common file system layouts (for example, all users in /home, all apps in /apps, all admin-only stuff in /admin, or whatever standards you want to use)
    2. one person (team lead) owns all of the licenses, and keeps them up to date, as well as scheduling non-reactive work
    3. if you're not responsible for the applications on the system, then everyone should be able to handle any machine, since no specialized knowledge is needed
    4. of course, specialized knowledge is still needed, because some systems have quirks. Document the quirks only (not standard routines for the whole team) both on the machine (in /admin/local/README or whatever) and on your team webserver - if you don't have one, get one
    5. keep a change log for each machine, in /admin/local and on the webserver, that describes any changes that aren't in someone's home directory and which survive a reboot - who did them, when and why
    6. make sure that standing orders (that is to say, procedures to always be followed, like how to notify clients of an outage) are posted on each machine and on the website
    7. use a common root password, known by the team lead and his manager. everyone else uses sudo su - to get to root, or some similar means. give them the root password if they need it (reinstall system, for example), then change it the next day. ideally, set up a system so that each admin saves to a different history file, so that you can tell who did what if you need to (tracking down mysterious file disappearances and such) - this isn't a tool for discipline, it's a tool for troubleshooting

    That should solve most of the problems.
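    Item 7's separate history files can be sketched in root's shell startup file, assuming admins come in via sudo, which sets SUDO_USER (note that "sudo su -" may scrub the environment depending on configuration, in which case "sudo -i" preserves it):

```shell
# In root's .profile: route each admin's command history to a separate
# file, so "who did what" survives for later troubleshooting.
# SUDO_USER is set by sudo; the fallback covers a console login.
admin="${SUDO_USER:-console}"
HISTFILE="/root/.history.$admin"
export HISTFILE
echo "this session's history goes to $HISTFILE"
```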
    • 7. use a common root password
      Um, how about . . . No.

      Having the same password on multiple machines is bad. Very bad. Especially when it's a root password. Someone compromises one box, suddenly they've compromised all of them. Not good.

      • If you have a different root password on each system, you have to write them down, which is much less secure than having one very difficult-to-crack root password. If these are external systems, I am assuming that the questioner has been smart enough to only allow ssh or similar access anyway, rather than something which passes passwords in the clear.
        • It's not a question of whether or not the passwords are being sent in cleartext. There have been holes [securityfocus.com] found [securityfocus.com] in SSH before, and there probably will be again. Plus, there's an excellent chance that SSH isn't the ONLY thing listening on these boxes. A hole in ANY service running can be enough for someone to get in. And once someone's in, it's much easier to grab root access, because it's easier to keep tabs on what's listening on ports than all the thousands of binaries that aren't. Once you've got root on a box, it's a simple matter of installing some trojaned binaries to grab passwords for you. It doesn't matter if the password's been sent in plaintext or not.

          And things can get very complicated very quickly, because again, once a malicious person has gained access to ONE of your systems, suddenly it's completely trivial to get into all the rest. If you enforce different passwords on each box, then you're containing the fire. The blackhat will still have 0wNz0r3d one of your boxes, but it's contained there, and he's got to go through the same amount of work to get into any of the others, which increases the probability of someone noticing illicit behaviour, increases the probability that this person will screw up and make a mistake, and increases the probability that he might not be able to get in at all.

          As to writing passwords down, obviously that's a problem. If people are going to be writing passwords down somewhere, you've got to have a good deal of actual, physical security if you want to be able to feel safe about it. It helps to have passwords related somehow. Pick a paragraph from a book; the first letter of each word in sentence 1 makes up the password for box 1, the second sentence goes for box 2 . . . There's many ways to relate passwords such that it's easier to remember.

          Remember, you're not just defending against a brute-force cracker or someone sniffing plaintext passwords. There's much more to it than that.

          • I can understand wanting to have different passwords on each system, but that can get complicated. Where I used to work I had to use two userids on over 160 servers (all clients which I had to access for troubleshooting) and over 10 master servers (1 userid). I could have used another document to store all the passwords, but then it's the same deal - one person gets that file (even if it's encrypted), cracks it, and then they have access to all - same deal as if I had one password. It wouldn't have been feasible for me to remember that many passwords - I have other things to remember.
            • Right, it certainly is a problem. As I mentioned in another post in this thread, I think the ideal solution would be to have some way to generate passwords based on the host name (or IP, or whatever) of the boxes you've gotta keep track of, in a non-obvious and somewhat secure way. Like, you'd have an application, password-protected itself, of course, that would have you input the name of the box. It'd then churn through a bunch of algorithms and transforms and eventually come out with a password for the box. The algorithm would obviously have to be tweakable, so you could change passwords in a uniform fashion, and the security on the program itself is paramount (you don't want just anyone getting access to the program you're using, or the algorithms used).

              I found a project called Twonz [venge.net] that does something like that. You input a "base" password, and then the name of the host, or IP, or whatever, and it computes what the actual password would be. It looks a bit incomplete for a scalable solution; as I mentioned, I'd like to have the app itself be password protected, and have the ability to mess around with the generation algorithm, but the basic bit is there . . .

              Just an idea, anyway. :)
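              The idea can be approximated with nothing more than a keyed one-way hash. A sketch (the master secret and hostname here are made-up example values, and truncating to 16 characters is an arbitrary choice):

```shell
# Derive a per-host password from one memorized master secret plus the
# hostname. Because the HMAC is one-way, a password captured on one box
# reveals neither the master secret nor any other box's password.
master='my-master-secret'     # the one thing you memorize (example value)
host='web01.example.com'      # hypothetical hostname
pw=$(printf '%s' "$host" | openssl dgst -sha256 -hmac "$master" -binary |
    base64 | cut -c1-16)
echo "$pw"
```

              The same master secret and hostname always regenerate the same password, so nothing needs to be written down.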

          • So use the same password, change it regularly and use tripwire. For mission critical systems the last thing you want is for an admin to spend 10 minutes figuring out the password before they can even start to fix a problem.
      • Having the same password on multiple machines is bad. Very bad

        Okay, so what's your solution? I've got 3000 machines (hypothetical - I've only got a couple hundred) - can I come up with a pattern I can memorize that will allow for unique passwords that require memorization of a base password and some knowledge that lets me perturb the base?

        I don't think so.

        So what - only allow sudo? So I can run all the rootly commands I want using one password. Carry them around encrypted in a PDA? I don't have much confidence in the security most encryptors seem to use.

        Unique passwords for this number of machines do not fit in my head or my wallet. Use something like S/key? Possibly...

        • I mentioned this elsewhere in the thread. I found a cool little program called Twonz [venge.net] that looks like it could be the start of a good solution for that. You just remember one "password", and then type in the name of the box, or the IP, or something else unique to the one box, and it'll "combine" the two to give you a fresh password. I haven't done more investigation than looking at that homepage, but there's only a few issues with it, as far as I can see:
          • I'd like the app itself to be password-protected, although that's not terribly necessary
          • I'd like to make sure that the transformation to get the final password is, indeed, a one-way transform. That way, given one password and the name of the box, you can't reverse engineer the "master" password.
          • It'd be nice to choose between a list of algorithms used to generate the master password, and to be able to tweak the algorithms for your own personal use.
          Anyway, it seems really good, because even if someone DOES get ahold of the program, they still won't be able to find out passwords to your systems without knowing that "master" password. So you remember that one password, and you can generate passwords for all your machines.

    • You're kidding, right?

      They'll compromise a real weak server/app, and then hit the rest of your network up
    • In addition to what's already been said about the common root password idea, it wouldn't even provide that many advantages were it not for the many, many drawbacks. Is the team really so large that it can't keep track of a separate password for each machine in an effective manner? If so, maybe you have bigger problems on your hands :)
      • Come on now..

        In all honesty, that is a good idea. Common usernames/passwords and root passwords on similarly configured machines ease administration, and make it easier to memorize what the password is.

        I admin over 30 machines. Do you think I'm going to remember 60 different passwords? (one for the user, one for root, because we all know ssh/telnet should be allowed to login directly as root, right?) Hell no. Even if I used my Palm, it would still be cumbersome. Instead, each class of server has its own username/password structure... the linux boxes are of one type, and the "other" boxes are different. This leaves me with around 10 passwords total to memorize, with one or two of the less used ones in a pgp-locked spreadsheet on my laptop.

        Memorizing 60 distinctly different login/passwords is almost impossible.

        Other things other posters have said also have merit. Have the same directory structure on as many servers as possible (not possible when you compare windows to unix, for example). Have the same set of tools available for troubleshooting on the different platforms also.... GNU tools compile on almost any unix flavor. Use that to your advantage! There's no reason to remember the different switches for the various unix versions of "df", for example, when you can install the GNU version, and have the same command with the same switches do the same things on all of your unix servers. But please, leave your OS version there, just in case. =)
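        One way to get that kind of sameness for the GNU tools, sketched below (the g-prefix for GNU utilities installed alongside the vendor tools is a common convention on commercial unices, but check your own install):

```shell
# If GNU df is installed under a g-prefix (gdf), shadow the vendor df
# with a shell function so the same flags work on every box.
if command -v gdf >/dev/null 2>&1; then
    df() { gdf "$@"; }
fi
df -k . >/dev/null && echo "df works here"
```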

        In addition, you need someone hell bent on security. Have a couple of people install and setup Nessus and scan your network once a month or so, AND FIX THE HOLES. Every network/system has holes. Different servers will have different holes to patch. Only by actively looking for them will you find them. If a server cannot be patched for whatever reason, isolate it on the network with separate password/logins from the "secure" servers, and ACL/ACI's implemented to prevent that server from being able to access other servers it doesn't absolutely need access to.

        Eternal vigilance. It is difficult to get to this point in a large network with 20+ specialized servers. But with a team that large, you should be able to do it...
        • Instead, each class of server has it's own username/password structure

          If I'm reading what you wrote correctly, you're talking about a pattern for choosing similar passwords within a class of machines, which is not only very clever but sounds like a good idea. It's not quite the same thing as identical passwords, though, which is what the parent was talking about.
        • Instead, each class of server has it's own username/password structure
          Okay, but that's a bit different from saying "Use the same root password for each machine." Making the password some function of the box name for a farm with 30+ machines would probably be okay if it's done right, but if it's something as simple as "blahblah-machinename," someone who's cracked their way to the root password of one machine might be able to figure it out and get in everywhere . . .

          If you're willing to carry PDAs around with you, it'd be pretty cool to have a program (itself passwd-protected, of course - you'll have to remember that one) that, given the name of the box, would hash the name somehow to come up with a unique password on a per-box basis. Just type in the box name and you've got the password. Obviously if anyone who wasn't supposed to could get into that program, you'd have Issues . . . I think I've seen some things on Sourceforge that do basically that.

  • For the last couple of years I have worked in just such an environment. Our biggest push the last year has been standardization of the documentation - what is in there, how it is organized, common issues each system has (does one webserver have a rogue java that requires periodic restarts, etc.), and everything in a central SECURE location that all the admins can get to at need.

    Prior to that, the system setups had to be standardized. The applications in their own directories, running as a non-root user (i.e. under an /apps filesystem). Content in a standard place (i.e. /apps/content) and so on. Startup scripts and all that other fun stuff standardized. Infrastructure, DNS and bastion hosts, all that fun stuff needs to be built.

    So, first - a standardized (yes, there will be minor differences between systems) image for administration, second - documentation. For documentation, any admin should be able to pick up the documentation, and with as little effort as possible, do what is needed.

    And last, the issue of rotation - Yes. On a bi-annual basis, move people's assignments. Secondary becomes primary, primary becomes secondary on another project, etc. Within a year or two, everybody should have had experience on multiple systems, and the one-person-knows-everything syndrome should be gone.


  • 1. Document what you've got. Make the documentation standard.
    2. Move the 'primes' around every couple of months so you all get exposure.
    3. Common install base. Make sure you can automagically install the O/S's and applications from scratch (e.g. Jumpstart on Solaris; HP-UX and AIX have their variants). If at any stage you need to type anything you've failed.
    4. Read, digest and implement "The Practice of System and Network Administration" by Limoncelli and Hogan ISBN: 0201702711. This is a great book for any admin and for me is the K&R of its subject.
    • I'll second the recommendation for "The Practice of System and Network Administration". It's the only book of its kind: independent of specific platforms or technologies, and it stresses the six key principles of systems design and support practices: simplicity, clarity, generality, automation, communication, and the basics.

      The slashdot review is here [slashdot.org], and the freshmeat.net review is here [freshmeat.net]

      Where this book falls short in details (ie: exact policy wording, exact technical details of an issue, etc...), it gives you solid references to goto to get the details.

      This book should really be required reading for anyone thinking of going into (or already in) systems administration. I've got 7 years in the field, and this book is still teaching me a thing or two.

      And no, I don't get any kickbacks for this recommendation...
  • (various flavours of Unix)

    That's where your problem lies. Pick one, and get everyone comfortable with that one.
    • Pick one, and get everyone comfortable with that one.
      ...and while you're at it, get rid of all those people who don't use the one you've picked.

      And the products they're developing, or services they're offering, too.

      And we all know about those "problem apps" that you can never seem to get running well on that platform, well just bar them from the corporate LAN. Fire anyone who insists on using them.

      Heck, we all know about that one major app which works well and runs great but doubles the size of the backups; let's get rid of that one, too. I hate having to remember to swap the tapes every Friday afternoon.

      After all, we can't let little things like efficiency, profitability, or innovation stand in the way of a clean functioning computer room, now can we?

  • Essentially, you have a learning problem: how does one efficiently propagate detailed knowledge within a group?

    Answer: Sex. Continually pair off, sharing genes or, in this case, memes.

    I was brought into a company a few months ago for six weeks to finish a demo of a networked, multithreaded peer-to-peer system written in C++. Even though I hadn't much experience with networking or threading, I was up to speed in 2-3 days because we were doing extreme programming. Our pairs swapped every few hours, ensuring that I saw a lot of the system in a short amount of time.
    In your case, I would set up double-rotations. Separate out a handful of logical groups of systems, so that you can have at least two pairs working on a particular group (if you have a ratio of 20 systems per admin and 320 systems, you could have 4 groups). The pairs swap at least twice a day.
    Every so often (e.g. 2 weeks, 1 month, 2 months), have group swaps such that half the members in a group are swapped.

    The result is that when someone leaves, there are at least 3 other people in the affected group that have more-or-less the same knowledge as the person who departed. Specific knowledge travels into different groups when you have a group swap, but only so often, which means you don't have a loss in productivity by folks getting up-to-speed all the time. Over time, best practices will be learned by everyone and a positive team culture will develop.

    Not to mention that you'll make fewer mistakes since everything happens under the watchful gaze of a partner...

    (not that mistakes will disappear entirely :-)

    Pairing just works. Why not try it out with half of the staff and see how people like it? Make sure to get caught up on how extreme programming works and figure out what you want to adopt and what isn't feasible for your particular situation.

    Jon
  • Why not a Vulcan mind meld?

    Should effectively transfer all the needed knowledge, and a little of each person's personality, but that might not be such a bad thing.
  • Like me.
    I know Solaris/HP-UX/Linux/*BSD and Win2K and Some Cisco. AND I happen to want a new job...
    Hire me please........ ;-)

    • Everyone is very technically competent in each operating system. The problem is the "quirks" of each environment (Oracle, apache, Tuxedo, Tivoli, WebLogic, etc...).
    • Not to rain on your job-seeking parade but I know all of the above plus Netware 2.x through 6.x, NT all the way back to 3.50, AT&T Sys-V (not that anybody cares any more) and DG's AOS-VS and it STILL took me a year to find a new job.
      Admittedly I have been pretty picky, I probably could have gotten a job in a month or so if I just took the first thing that came along, but who wants to do tier-one tech support?!? :(

      Thank god I've never been laid off...

      (Well, okay, I'll admit it, I was once - but I was only 12 years old. :)
      • Don't worry I always carry an umbrella... I was busy getting a business degree while you were doing Netware.
        The job market sucks BIGTIME right now. I am at a dot.com that may soon turn into a dot.bomb.
        Hopefully I will find something before that happens. I have been looking hard for about 3 months now.
        I could have had 3 or 4 positions for security IF I wanted contract work.
        Any good suggestions where to look? The job boards are pretty much worthless. I appreciate any suggestions.

        • I had to laugh when I read your response - I'm taking night classes for my MBA right now!
          But to answer your question I found FlipDog to be, hands down, the best job board!
          E.g. on Monster/Dice et al I search for "network" in the Seattle area and I get about 5 to 15 hits, most of them pretty lousy entry-level positions.
          The same search on FlipDog netted me about 150 positions, half of which were actually semi-relevant!
          The only problem I had was cycle time. FlipDog actively goes out and searches company websites for positions so a "new" position listing may actually be a week or so old and/or already filled (esp. in this economy).
          I did, however, find my new job there! (When even good local head-hunters who have worked wonders in the past couldn't.)

          Good luck!
  • Simply make it known that the next time a sysadmin is called upon to support a box for which he is designated as 'secondary', and his response is inadequate due to lack of knowledge, he will be fired.

    After the first sysadmin is fired, I guarantee you that the remaining ones, plus the new guy's successor, will very quickly come up with a system or systems which allows the proper level of knowledge-transfer.
    • After the first sysadmin is fired, I guarantee you that the remaining ones, plus the new guy's successor, will very quickly come up with a system or systems which allows the proper level of knowledge-transfer.

      First Sys-Admin: Where do we keep the good toner cartridges? I'm going to print 100 copies of my resume tonight.

      Second Sys-Admin: I've got it in this printer lp23. You can use it as soon as I'm done printing my resume.

      New Sys-Admin: Hey! Save some of that toner for me, too!

      Is that the type of knowledge transfer you mean?
  • make the environments the same.

    In the company I work for I ported all the scripts to Linux and Sun from HP. Thus we have 'sameness'. You want to build a database? It does not matter what platform you are on; the command is the same. You want to create a new dev env? It does not matter what platform you are on; the script is the same on ALL platforms.

    By creating a level of 'sameness' across all your platforms it will not matter whether the server is Sun, HP, Linux, BSD, whatever; the scripts will all be the same. Since you are talking about being an admin I'd suggest all scripts in perl or sh. The problem you may run into with perl is that perl rarely gets installed in the same place on all platforms. Thus starting a script with #!/usr/bin/perl may not work, where #!/bin/sh will. Yes, and there are coding ways around the perl issue as well.

    Granted you will have different machines that do different jobs; this is where documentation comes in. Make sure that all your stuff is documented. If someone sets up a server they need to be required to describe how this server was set up. Using the principle of sameness this cuts down on the need for lots of docs, and thus anyone can set up the server.

    Shells.. standardize on a shell. A standard login shell. More important is the standardization of what shell people use. I go with tcsh, as I like it better than ksh, csh and sh, and it is available everywhere (Sun, HP, BSD, Linux, etc). It is also feature rich. You can standardize on any shell, but make sure it is everywhere you need it to be.

    Once you have standardized on a shell, use a standard login env. Thus when you login to your BSD box it feels like your Sun box, which feels like your Linux box, etc.

    If people want to add to this have a process in place to make it happen.

    Except for system tools like SAM on HP and Red Hat's sysadmin tools, there is no reason that many other tasks cannot be done in scripts that are standard. You can even standardize on what a database server setup should include, what a web server setup should include, and have standards that are the same or different (I prefer sameness unless performance is an issue) for each flavor of UNIX.

  • I have found this [dilbert.com] to be very effective.
  • I am assuming you have heard the standardization tips throughout this conversation. However, if you are unable to standardize the environment (maybe because it's a client's environment and you have no control over it), you can try to adapt one of the concepts of Extreme Programming.

    Pair programming is a great way of doing knowledge transfer.

    Basically always work as a pair when doing problems. Also, rotate the pairs around so not everyone is stuck with the same job. Also make sure you rotate the "primes" around so that the "sys admins" can take over the prime role for a short while. (Just make sure the "prime" can be contacted if there is something really major).
  • As mentioned by other posters, documentation and job rotation is good.

    To make sure the documentation is up-to-date, accessible, etc., you can make the support calls from the secondary admins to the "prime" unpaid. After a few support calls in the night, on your holiday, etc., you will make sure that the documentation is up-to-date and accessible, and that people know where it is.

    Yes, I am evil :-)

  • Two things (Score:5, Interesting)

    by coyote-san ( 38515 ) on Tuesday June 18, 2002 @07:20PM (#3725566)
    Your company needs to do two things.

    First, fire the manager who got you into this situation. If you've "been doing it this way for years," fire the manager who left this system in place. (If that manager just left and the new guy realizes there's a problem and that's why you're asking this, then obviously there's no action taken against him.)

    I'm not being bloodthirsty here - anything short of this will leave people doubtful that upper management is serious that this is *the* biggest problem your company faces today, and people will continue to do what they've been doing for anything but the most trivial problems. Senior management needs to send an unambiguous signal that the status quo is unacceptable.

    Second, rotate the primes and secondaries as others have suggested, but with a twist. Rotate the secondaries first, and make their sole responsibility writing a list of questions - a long list of questions - about everything that "surprises" them or that needs to be documented somewhere. (An example of the latter is "what are the partitions, what are their sizes, and how was each size determined?")

    They turn over their questions to the primes who spend a few weeks documenting the answers while the secondaries cover for their old prime, and this documentation is provided to the next set of secondaries rotated in to ask questions. Lather, rinse, repeat.

    By the third time around (maybe 3 months?) you'll have documentation that actually covers almost everything someone will need to get up to speed on the peculiarities of a particular project, and the primaries can start rotating while the secondaries answer any remaining questions.

    Finally, I'm deliberately putting the emphasis on the secondaries here because one of the classic problems with your old setup is that it can cause the secondaries' skills to stagnate if the prime handles all of the "hard" or "interesting" problems. You need to give the secondaries room to grow, even if it increases your turnover rate because they're competent enough to be hired as primes at other companies.
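
    Part of that question list can even be answered mechanically. A minimal sketch of a facts-gathering script a prime could run and file with the documentation (the file naming and section list are assumptions, not an established convention):

```shell
#!/bin/sh
# Sketch: dump answers to the standard question list (kernel,
# partitions and their sizes, mounts) into a dated facts file
# that the next secondary rotated in can read.
host=$(hostname)
out="facts-${host}-$(date +%Y%m%d).txt"
{
    echo "== Host: ${host} =="
    echo "== Kernel =="
    uname -a
    echo "== Partitions and sizes =="
    df -k
    echo "== Mounted filesystems =="
    mount
} > "$out"
echo "wrote $out"
```

    It won't answer "why was it sized this way?", which is exactly the part the primes still have to write down by hand.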

  • Put the pertinent information in a WikiWikiWeb [c2.com], then any of your admins can access or add info from any web browser that can see the Wiki server. Plus the whole idea of Wiki is so neat that they'll want to use it and watch it grow.
  • [putting on grumpy old man hat] back in the day when printers were dangerous to careless fingers and compiles were sometimes measured in days, there used to be a great community in which you could learn all manner of coolness just by hanging around and keeping alert.

    Watching really competent people work is a great way to learn your skill - like playing with really good musicians, your ability will improve as you learn all manner of cool tricks.

    It used to be that you could learn a great deal by building software. That can still be true, but it's mostly a ./configure-and-forget kind of world now (building TeX from Knuth's sources is excruciatingly painful; teTeX saved TeX from self-immolation... but I digress).

    A strong leader can (and should) be very liberal with assigning projects that he knows the victim is unfamiliar with. That sort of where-do-I-begin anxiety usually leads to the right sort of questions, and gets communication flowing between your team members.

  • We fire up script(1) [ed.ac.uk] every time an admin procedure gets started, so we document every line that appears on a terminal. Later we add commentary for each of these lines, explaining its purpose, and archive the whole file.
    With this you can infer what a specific environment looks like, and how installations were done or problems were solved, among other administrative duties.
    I guess the only disadvantage is that all your administrators will have to learn how to get by without emacs and vi, since they usually don't do well with script(1). Of course, there's always ed(1) [ed.ac.uk].
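
    For non-interactive procedures you can also drive script(1) in one shot; the -c flag below is from the Linux util-linux version of script and may not exist on other Unixes:

```shell
#!/bin/sh
# Record one command's complete terminal output with script(1).
# -q suppresses script's own "Script started/done" banners;
# -c (util-linux) runs the given command instead of an interactive shell.
log=/tmp/patch-session.log
script -q -c 'uname -a' "$log"
# The log now holds everything that appeared on the terminal:
wc -l "$log"
```

    The interactive form is the same minus -c: run `script logfile`, perform the procedure, then `exit` to stop recording.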
  • A lot of suggestions in this thread basically consist of "tell the admins to document everything." Which would be fine, except for the fact that your average admin isn't really very good at writing documentation. Depending on how many admins/servers/projects you have going on, you could do well to either hire a *nix-savvy tech writer on contract or bring one on full-time.

    Too many companies try to get by with substandard docs produced by people who aren't qualified to write documentation, thinking that they're saving money by putting the job of documenting on the backs of people who are already overworked and will do whatever they can to avoid writing documentation. In the long run, such a company ends up wasting money on people reinventing the wheel or spending time fixing problems that would never have arisen if good documentation had been available in the first place.
