Logging Unexpected Shutdowns/Crashes w/ Linux?
sweede asks: "I have a dedicated server that seems to reboot more often than it should. In Windows 2000/XP (maybe NT4.0?), if your computer or server crashes it will leave an event message in the Event Viewer for you to review and see what went wrong. Is it possible to do something similar in Linux, where a power outage or an unexpected kernel panic will leave a message in /var/log/event (or whatever)? Searching Google for 'kernel trapping' doesn't give me a whole lot of info on the subject."
I'm pretty sure.. (Score:4, Informative)
And if I'm correct, if you turn on the serial console, you'll get the panic output on the serial port. Add a serial-to-IP box and you're set.
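For what it's worth, the serial console is normally turned on with a kernel boot parameter along the lines of console=ttyS0,9600n8 (port and speed are examples, adjust to your hardware). A rough sketch for checking whether a running box was booted that way; has_serial_console is a made-up helper name, and the optional argument exists only so it can be pointed at a test file:

```shell
# Hypothetical helper: succeeds if the kernel command line names a serial console.
# Reads /proc/cmdline unless another file is given as $1.
has_serial_console() {
    grep -q 'console=ttyS' "${1:-/proc/cmdline}"
}
```

Usage: has_serial_console && echo "panic output should reach the serial port"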
Re:I'm pretty sure.. (Score:4, Interesting)
Why not reserve a set place on the hard drive and write out error trap information there? There's no reason the filesystem needs to be involved at all. I'm going to guess that's what Windows does.
Re:I'm pretty sure.. (Score:4, Interesting)
100$ question: How do you break out of code inserted that might have had a bug? How do you determine what code had that bug?
Answer those, and then I'll trust Write_after_system_crash api
Re:I'm pretty sure.. (Score:2)
Re:I'm pretty sure.. (Score:2, Insightful)
I'd be happier with it writing to a floppy, serial, or other isolated subsystem. The difference between your swap partition and your root directory structure might be just 0x10000 in one of the register values, and that's considered too close to be worth risking.
YAW.
Re:I'm pretty sure.. (Score:3, Interesting)
Re:I'm pretty sure.. (Score:2, Interesting)
When it panics, it reruns the bootloader (use BIOS calls to read the first HD sector and go from there) [...]
When Linux panics, it usually has a good reason to do so. Something like a damaged descriptor table, overwritten kernel code, hardware that works wrong, and various other catastrophes. Panic means a real panic: You can not reliably use any hardware. So you can not rerun the bootloader, and you can not access the BIOS. You can only hope that a hardware watchdog card notices that the kernel has pani
Re:I'm pretty sure.. (Score:2)
Re:I'm pretty sure.. (Score:2)
I suppose I shouldn't be surprised. Linux doesn't implement a feature, ergo that feature is not worth implementing. I'm just glad the peo
Re:I'm pretty sure.. (Score:2)
I guess I didn't realize that the Linux kernel doesn't do a comparable function. I wonder why not? Solaris seems comfortable locating the swap slice and writing the kernel core dump. Is partition management so much less reliable on x86 boxes? And if so, why is that a limitation for non-x86 Linux kerne
Re:I'm pretty sure.. (Score:2, Informative)
The IDE/SCSI driver could still corrupt your data. How would you know where to write the info anyway? The kernel could easily calculate and store a block number, but could it trust the stored number after panic() is called? Maybe somebody overwrote that variable, or maybe some bad RAM caused it to spontaneously change.
Kernel panics are generally used as a last resort, when something goes really wrong and there's no s
How the big systems do it... (Score:2)
Part of the system startup would scan the dump for the logging buffers, extract the messages and append them to the log file. The file system would
Re:I'm pretty sure.. (Score:2)
Funny suggestion. That's exactly what Solaris does. It writes its kernel core dumps to the swap partition.
Re:I'm pretty sure.. (Score:1)
Re:I'm pretty sure.. (Score:3, Informative)
Re:I'm pretty sure.. (Score:3, Insightful)
If it's only 1 computer, I'd probably use a real terminal or a cheapie like your minix box or printer.
Re:I'm pretty sure.. (Score:2, Insightful)
Re:I'm pretty sure.. (Score:2)
- A.P.
Re:I'm pretty sure.. (Score:2)
Mark the filesystem as 'not cleanly shutdown' when the OS boots. If it shuts down properly, mark it 'clean shutdown'. The only remaining option, an improper shutdown, will leave the flag in the original state - 'not cleanly shutdown'.
Only thing to watch out for is a proper shutdown not updating the flag correctly, which should be rare - if you get to the 50th item on a shutdown list ("update flag"), you're pre
Easy... (Score:4, Informative)
Re:Easy... (Score:1, Insightful)
Re:Easy... (Score:4, Funny)
Re: (Score:2)
Logging Unexpected Shutdowns/Crashes w/ Linux? (Score:3, Informative)
Re:Logging Unexpected Shutdowns/Crashes w/ Linux? (Score:3, Insightful)
I'd be apt to turn on hangcheck with a 1-minute restart plus email on my servers. But it's better to know what made them fail.
Re:Logging Unexpected Shutdowns/Crashes w/ Linux? (Score:2, Informative)
Flag it. (Score:3, Interesting)
Re:Flag it. (Score:4, Interesting)
1: System starts up (say clean).
2: It marks a bit on the partition that system has been started up.
3: Usage Usage Usage
4: Send shutdown
5: System umounts cleanly. Undoes "dirty bit"
6: Power == 0
On a dirty FS, stage #5 is never hit, so when the system comes back on, it checks the bit and detects the unclean shutdown. The bit is never written during the unclean shutdown.
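The same trick works from userland with a plain flag file instead of a bit in the filesystem. A minimal sketch, assuming you can hook one function early into boot and another at the very end of an orderly shutdown; the FLAG path and both function names are made up for illustration:

```shell
# FLAG is a hypothetical location; it must live on storage that survives a reboot.
FLAG="${FLAG:-/var/lib/boot-dirty.flag}"

# Run early at boot: report how the previous run ended, then set the flag again.
check_and_mark_boot() {
    if [ -e "$FLAG" ]; then
        echo "unclean shutdown detected (previous boot: $(cat "$FLAG"))"
    else
        echo "previous shutdown was clean"
    fi
    date > "$FLAG"
}

# Run as the last step of an orderly shutdown: clear the flag.
mark_clean_shutdown() {
    rm -f "$FLAG"
}
```

Note that the very first boot after installing this reports a clean shutdown, since no flag exists yet. It only tells you that a crash or power loss happened, not why.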
In the similar problem, I see problems when NTkern crashes. How exactly does it manage to:
1: Read the partition
2: Read the program on the partition
3: Run the insert log program to add log entry
4: Still have the "blue screen"
I smell nasty data corruption waiting to happen. After all, if you can't guarantee the state of the kernel, does it really justify reading, writing, and executing on a crashed kernel????
Re:Flag it. (Score:2)
Kernel Panic on Linux? Sounds like hardware prob. (Score:3, Informative)
Re:Kernel Panic on Linux? Sounds like hardware pro (Score:5, Funny)
(one who speaks from experience)
Re:Kernel Panic on Linux? Sounds like hardware pro (Score:1)
Re:Kernel Panic on Linux? Sounds like hardware pro (Score:1)
--
Re:Kernel Panic on Linux? Sounds like hardware pro (Score:1)
Other OSes (Score:5, Informative)
IRIX will core dump to the swap partition. On the next boot it analyzes this core file, which includes various system logs, etc, and saves useful output in /var/adm/crash. You know you've done a good job when the kernel panic causes a panic, called a double panic. I used to be able to trigger those at will. Hrmm, I should test that on the current release.
AIX summarizes the likely causes of failure (power failure, someone pressed the power switch, or power supply died, etc). I've seen (but do not personally use) a similar thing with IRIX that actually assigns a percentage confidence level to its guess.
Of course, usually you know there was a power failure because your UPS told you so.... I did have one case where we had a very brief outage (or maybe just a brownout). Every machine in the building had rebooted.... except one. That RS/6000 had an eerie log message like "power failure detected". And no, it was not on a UPS. I was rather impressed.
Sadly, I don't know how to get any useful information out of Linux. And don't give me crap about it never crashing. I can prove otherwise. Too bad I can't figure out why.... Maybe a kernel developer will read this and copy some ideas from the commercial Unix vendors.
Re:Other OSes (Score:4, Interesting)
I expected the Indy to kernel panic or turn off. Instead, below the complaints about the missing ethernet cable ("en0: link carrier not detected" or similar), there was a lone status message: "Power failure detected."
No UPS, no power saving devices of any kind, only the filter caps in the power supply between the logic board and the unreliable, crufty power system of a 70 year old house at the mercy of a power strip first used on my (brand new at the time) Atari 800. The other computer on the power strip (350 P2 running RH 7.1) rebooted hard, right in the middle of heavy FS activity. I had to hit the reset button before it would come back up again, too - the brownout hung the POST.
Re:Other OSes (Score:2)
Well, that's how I'd do it if *I* was a Quality Computer Manufacturer, anyway
Re:Other OSes (Score:3, Informative)
-Ster
Re:Other OSes (Score:4, Interesting)
Re:Other OSes (Score:2, Informative)
Re:Other OSes (Score:5, Informative)
IRIX will core dump to the swap partition.
FreeBSD does this. HP/UX does this. I always assumed Linux did it too, it just wasn't turned on by default. I guess I was wrong.
As a side note, my first job out of college was to analyze core dumps from HP/UX. There's an awful lot you can learn from these things. Not just stack traces, the entire memory of the system is contained in the dump. It's time consuming, but a large portion of the time you can find out *exactly* what went wrong.
Re:Other OSes (Score:2)
From that, you can run some tests on the core files to get some info about what went wrong as well as things like stack traces & process lists. Even without analyzing that, you'll generally get some info in /var/adm/messages.
Linux should have some method of capturing what errors were generated during crashes; count it as one of those "enterprise level" features....
Fairly standard for !(Linux) (Score:2)
Solaris does the same thing. Actually, I think several commercial Unixes do this. Some even provide some basic analysis tools so that you can pore over the /var/wherever/crash dumps yourself; see which processes were running, which ones were on the CPUs when it crashed, which instruction was executing, etc.
I've always been disappo
Re:Fairly standard for !(Linux) (Score:2)
I recall that even 4.1 and 4.2 BSD had crash dumps and rudimentary crash dump analysis tools. I think it was 4.2 BSD that introduced dumping to the swap partition and the utility that snarfed the dump on reboot. (I could be wrong though: it was 20 years ago).
Warning: gratuitous punctuation detected!
Re:Other OSes (Score:2, Interesting)
At Blue Screen, it will make a dump in the swap partition if so configured. The dump can be a 64k error summary (MiniDump), kernel memory dump, or a full physical memory dump (if swap > physical memory). While there is a slim possibility that doing this might make things worse (if the code to write the dump is corrupted, or the disk driver is corrupted), it is MUCH more likely that the information written will be useful. Also, the swap partition driver is pretty s
Re:Other OSes (Score:4, Funny)
Built-in UPS (Score:2)
At work, we used to have a SUN E250 [1]. One day, the power went away. (Turned out to be a problem with the airco.) After the power came back, I checked the logs, and I saw that the machine had been writing messages like "Power lost, running on backup power", and when the power came back "Switching to AC". (The monitor wasn't protected, so there was no way to check the machine during the blackout.)
The PC in the same room didn't have backup power and shut down. It came back with no ill effects.
The RS/600
Re:Event Log? (Score:1, Informative)
Apparently, he lives in the same parallel universe I do. I suppose you think the checkbox in Startup and Recovery labeled "Write an event to the system log" is there for looks?
Re:Event Log? (Score:1)
So what? I click the little checkbox, wait for my next BSOD, and voila, I've got something useful? Head shake time, unless your idea of useful is finding out that WinDoze is the problem and you have to wait for M$ to fix it. Bottom line, I'd rather have to work a little harder to find the problem, and be able to fix it, than to
Try the Linux Kernel Crash Dump (LKCD) patches (Score:5, Informative)
Re:Try the Linux Kernel Crash Dump (LKCD) patches (Score:2, Informative)
I took a look at it a while back and it looked interesting. Just checked back at the site and there is a reasonable howto provided by IBM in the doc section which should give you some idea of what/how it works.
It's worth a look, but honestly, it sounds a LOT more like a hardware issue than software. Is the server on a UPS? If not, get it on one. Reseat all your cards, RAM, etc., then see if it crashes as regularly.
Re:Try the Linux Kernel Crash Dump (LKCD) patches (Score:2)
The more the client (owner of the server) can do himself, the better.
rc.local (Score:3, Informative)
If you want to be emailed if the system reboots, put something at the end of /etc/rc.d/rc.local, if you're using something like RedHat (SYSV init, IIRC).
Logwatch will probably let you know if the system rebooted also.
If you want a log of the kernel panic, or something else, that's a lot more complicated, as others have mentioned
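A rough sketch of the rc.local idea above, assuming a working mail(1) command that can deliver off the box. ADMIN is a placeholder address and both function names are made up:

```shell
# ADMIN is a placeholder; set it to wherever the notice should go.
ADMIN="${ADMIN:-root@localhost}"

# Build the one-line notice: node name plus current time.
boot_notice() {
    printf '%s booted at %s\n' "$(uname -n)" "$(date)"
}

# Mail the notice; needs an installed and configured mail(1).
send_boot_notice() {
    boot_notice | mail -s "$(uname -n) rebooted" "$ADMIN"
}
```

Define these in /etc/rc.d/rc.local and call send_boot_notice as its last line. This fires on every boot, planned or not, so it works best combined with a check for whether the previous shutdown was clean.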
"Linux crashes" are probably contact failure. (Score:4, Interesting)
As others have said, the "Linux crash" is probably hardware failure.
The most common cause of serious failure, if the software has been installed correctly and tested, is bad contacts. To fix the problem, just loosen the screws that hold the adapter cards, pull the cards out about 1 millimeter or 1/32 of an inch, push the cards back in fully, and re-tighten the screws. Also, pull all connectors off a similar amount, and push them back on. Do the same with the memory modules. That's all.
The scraping caused by moving the contact points a tiny amount is actually very violent on a micro scale. The scraping removes oxide that causes a contact to lose electrical conduction.
This is reliable information. I've been selling and occasionally repairing PCs since before IBM sold PCs, back in the days when personal computers cost $2300, had two diskette drives and no hard drive, and ran the CP/M operating system.
My guess is that, if you had a penny for every real crash of a stable distribution of Linux, after a few years you might still have to borrow money from your little brother to buy a piece of bubble gum.
Re:"Linux crashes" are probably contact failure. (Score:1, Flamebait)
Well the context around it is he is telling it to a Pharisee in answer to the question "Who is your neighbour?"
Well, Jesus tells the story, and when he asks the Pharisee which of the three men was his neighbour, the Pharisee answers: "The one who helped him." Y'see, the thing is that the Pharisees despise the Samaritans so much that he couldn't bring himself to even say their name.
I was just wondering if you saw any
Linux is VERY reliable. (Score:2)
Nonsense.
I have a lot of experience fixing hardware failures. Before I started doing computer work exclusively, I was an electronics design engineer. So, I'm able to understand hardware issues, and have something to contribute in that area.
Perhaps Slashdot people don't have much experience with hardware problems, and are skeptical of anyone who does, because answers to hardware problems are not usually modded up, and are often attacked.
Linux has millions of technically knowledgeable users. Tho
Re:Linux is VERY reliable. (Score:2)
Martin
Re:Linux is VERY reliable. (Score:3, Insightful)
Other people had no problems with that level.
The driver for the Realtek 8139 that came with the early 2.4 kernels used to kill the machine I first ran it on. Kill it stone dead, I had to hit reset to restart. The machine is dual-boot and worked fine under Win
When will they learn (Score:1, Offtopic)
It'll never catch the things you want... (Score:3, Interesting)
Re:It'll never catch the things you want... (Score:2)
There are tools to monitor CPU temperature under Linux. Most PC motherboards made in the last few years have included temperature monitoring. The biggest problem is that most motherboard makers include different "fudge factors" in their setup, so different mobos have different settings and finding the actual temperature can be tricky.
What I do:
Depends on what it's doing (Score:4, Informative)
Oh and here's a useful way of working out whether there was a crash or not:
last -x | grep "shutdown\|reboot"
Every reboot that doesn't have a matching shutdown was probably a crash (other than the last line).
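To avoid eyeballing the pairs, something like the following could do the matching. It is only a sketch and assumes the usual last -x layout: newest entries first, with "reboot" or "shutdown" in the first column. find_unclean_boots is a made-up name.

```shell
# Reads `last -x` style output on stdin (newest first). Two "reboot" lines
# with no "shutdown" line between them mean the older of the two boots
# never recorded a shutdown, so that line is printed.
find_unclean_boots() {
    awk '
        $1 == "reboot" {
            if (prev == "reboot") print "no shutdown recorded after boot: " $0
            prev = "reboot"
        }
        $1 == "shutdown" { prev = "shutdown" }
    '
}
```

Usage: last -x | find_unclean_boots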
Re:Depends on what it's doing (Score:1)
Don't forget the quotes
Here's how: (Score:5, Informative)
We had some early kernel 2.4 Red Hat boxes crashing like the dickens for a while. It was a kernel problem, and only when it happened on a local machine under our eyes did we realise what had happened.
2) Network syslog;
If you syslog to a central machine, not only does it make error spotting centralised and easier, but it also means you have the last gasps of the crashed machine logged on a machine that is still up.
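With a classic syslogd, forwarding is a one-line rule in /etc/syslog.conf: a selector, a tab, then @ and the name of the log host. A small sketch that prints such a rule; "loghost" and the function name are placeholders, and the receiving syslogd has to be told to accept network messages (the -r switch on sysklogd):

```shell
# Print a forward-everything rule for the given log host.
# The host name is whatever your central machine is called.
forward_rule() {
    printf '*.*\t@%s\n' "$1"
}
```

Usage, as root: forward_rule loghost >> /etc/syslog.conf, then restart syslogd. Plain syslog forwarding is UDP, so the very last messages before a crash can still get lost.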
Sam
serial console (Score:4, Informative)
If your machine crashes without a panic message, however, you're out of luck. Wait until crash dumps are available - I'm surprised this isn't a 2.6 feature. Until we get crash dumps that work 99% of the time (like on Sparc-Solaris), Linux will continue to suck. At least it sucks less than the alternatives.
sysrq disabled for some reason... (Score:2)
The "magic SysRq key" is a key sequence that allows some basic commands to be passed directly to the kernel. Kernel software developers use this interface to debug their software. Under most circumstances it can also be used to uncleanly reboot the computer, something that is otherwise difficult or expensive to do remotely.
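Whether the kernel honours SysRq at all is controlled through /proc/sys/kernel/sysrq, where 0 means disabled and any non-zero value enables some or all functions. A small sketch to check it; sysrq_enabled is a made-up name and the optional argument is only there so the check can be run against a test file:

```shell
# Succeeds if the sysrq setting is non-zero; returns 2 if it cannot be read.
sysrq_enabled() {
    value=$(cat "${1:-/proc/sys/kernel/sysrq}" 2>/dev/null) || return 2
    [ "${value:-0}" -ne 0 ]
}
```

Usage: sysrq_enabled || echo "magic SysRq is off". Writing 1 to /proc/sys/kernel/sysrq as root turns every function on, which is the security trade-off this comment is getting at.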
Anyone can dial into a modem and send a break, so if the serial console is attached to a modem we need
Re:sysrq disabled for some reason... (Score:2)
So you're saying that this is a massive security hole on every one of Sun's Sparc machines that has gone unnoticed all these years? They have the same problem.
Linux Auditing (Score:1)
Snare is capable of monitoring events such as file opens, execve's, setuid/setgid and so on, which may assist in tracking down the problem.
Red.
The cleaning team (Score:3, Funny)
Re:The cleaning team (Score:2)
"Hmm, that box in the corner is beeping. That doesn't sound good - I think I'll turn it off."
The box was a UPS.
Re:The cleaning team (Score:1)
Yes! (Score:5, Funny)
Is it possible to do something similar in Linux?
Yeah, but we have to wait until our SCO insider funnels us the code.
Linux Trace Toolkit (Score:2, Interesting)
LTT logs every system call at ns precision in a RAM buffer and then on disk. The events include, for instance, read/write/open operations, system calls, interrupts, process state, disk and internet interface operations and so on. You can add specific events by modifying your application and recompiling it with the LTT library.
LTT is not yet included in the kernel and was not chosen after the "Halloween Freez
Fan? (Score:1)
Anyway, Linux machines rarely crash in my experience, and my top suspect is usually hardware when they do.
Some ideas (Score:5, Informative)
There's also the LKCD [sourceforge.net] (Linux Kernel Crash Dumps) package:
LKCD contains kernel and user level code designed to:
A few hints (Score:3, Interesting)
You must be mistaken. (Score:2, Funny)
Now, what was your name and address again?
NetDump & LKCD (Score:1)
check out
http://www.redhat.com/support/wpapers/
another link with the lkcd patches
https://projects.clusterfs.com/lustre/Ne
How to know if a linux box went down.. (Score:1)
You won't be able to find out why without taking a stroll down
Quick tip, try checking your IRQs..
DRACO-