Solaris And Linux NFS Problems
mrgrumpy asks: "I run Debian (unstable, woody) with kernel 2.2.14 and everything dangerously up to date, and I also run a Solaris 7 box (SPARC, not Intel). Since I set them up a few weeks ago, NFS has worked fine between the two boxes (not with autofs, just mount and share). I recently applied a patch cluster from SunSolve to the Sun box, and lo and behold, NFS stopped working. The patch list included quite a few NFS fixes, along with kernel patch 106541-10. I have the home directories and a development directory from the Sun box (which serves NIS, NFS, and just about everything else) mounted on the Linux box. Most of the time nothing goes wrong. But when I run the distributed.net client on the Linux box, which needs to read and write files mounted from the Sun box, it locks up and I get messages such as 'Apr 17 14:20:31 boink kernel: nfs: task 1473 can't get a request slot...' in the logs." Can anyone figure out what's going wrong here?
"My machines are boink (GNU/Linux, 2.2.14 kernel) and splat (Solaris 7, SPARC). The 'can't get a request slot' messages appear all throughout the logs. I had read that NFS locking on Linux is not great, and I am using knfsd with NFS compiled into the kernel. If I run snoop on the Sun box, I get:
root@splat$ snoop splat and boink rpc nfs
boink.home.cyber4.org -> splat.home.cyber4.org NFS C LOOKUP2 FH=009B_lJAQ.boink
splat.home.cyber4.org -> boink.home.cyber4.org NFS R LOOKUP2 OK FH=9668
boink.home.cyber4.org -> splat.home.cyber4.org NFS C LOOKUP2 FH=009B root.lock
splat.home.cyber4.org -> boink.home.cyber4.org NFS R LOOKUP2 OK FH=9E2D
boink.home.cyber4.org -> splat.home.cyber4.org NFS C REMOVE2 FH=009B_lJAQ.boink
splat.home.cyber4.org -> boink.home.cyber4.org NFS R REMOVE2 OK
boink.home.cyber4.org -> splat.home.cyber4.org NFS C LOOKUP2 FH=009B root.lock
splat.home.cyber4.org -> boink.home.cyber4.org NFS R LOOKUP2 OK FH=9E2D
boink.home.cyber4.org -> splat.home.cyber4.org NFS C WRITE2 FH=79D9 at 0 for 4096 (retransmit)
boink.home.cyber4.org -> splat.home.cyber4.org NFS C CREATE2 FH=009B_yGAQ.boink
splat.home.cyber4.org -> boink.home.cyber4.org NFS R CREATE2 OK FH=AA9B
boink.home.cyber4.org -> splat.home.cyber4.org NFS C WRITE2 FH=AA9B at0 for 1
Usually the errors cascade into a spiral of death until I reboot the Linux box. The Linux in.lockd has no debug mode (on Solaris you can restart it with -d3), and I can't find any other way to debug it."
So at this point, grumpy needs either a fix or a way to debug in.lockd. Are there any techniques that might help in recovering from an NFS "spiral of death"?
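In the absence of lockd debug output, one rough diagnostic is to check whether a single lock request on the NFS mount ever comes back at all. The sketch below assumes Python 3 on the Linux client; the function name probe_lock, the script name lockprobe.py, and the 10-second timeout are illustrative choices, not part of any NFS tool. It takes an fcntl lock (which, on an NFS v2/v3 mount, is handled through the lock manager rather than plain READ/WRITE calls) in a forked child, so that the parent can still report a hang. A child stuck in an uninterruptible NFS wait may not die even after SIGKILL.

```python
import fcntl
import os
import signal
import sys
import time


def probe_lock(path, timeout=10.0):
    """Return True if an exclusive fcntl lock/unlock round trip on `path`
    finishes within `timeout` seconds; False if it fails or seems stuck.

    Hypothetical helper: the name and default timeout are illustrative."""
    pid = os.fork()
    if pid == 0:
        # Child: open the file, lock it, unlock it, report via exit code.
        code = 1
        try:
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
            try:
                fcntl.lockf(fd, fcntl.LOCK_EX)  # blocks until granted
                fcntl.lockf(fd, fcntl.LOCK_UN)
                code = 0
            finally:
                os.close(fd)
        finally:
            os._exit(code)  # leave immediately; never return to the caller
    # Parent: poll the child until it exits or the deadline passes.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done:
            return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
        time.sleep(0.1)
    # Still running: the lock request probably never got an answer. A child
    # stuck in an uninterruptible NFS wait may ignore even SIGKILL, so we
    # send it and return without waiting on the child.
    os.kill(pid, signal.SIGKILL)
    return False


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: lockprobe.py FILE-ON-NFS-MOUNT")
    else:
        ok = probe_lock(sys.argv[1])
        print("lock OK" if ok else "lock request failed or timed out")
```

Running it against a file in the NFS-mounted home directory, and again against a local file, can help separate "locking over NFS hangs" from "this particular client program hangs".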
Distributed.net client? (Score:3)
NFS under Linux is notorious for its locking problems. There seem to be a few options here:
You'll probably run into this problem with anything that tries to do locking over NFS. I would slap together a few scripts or small programs that do locking over NFS, to see whether it's something strange the distributed.net client is doing. Maybe it's compiled against libc5 or libc6, and that might be causing problems (wild guesses here).
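A minimal sketch of the kind of small test program described above, assuming Python 3 and a writable file on the NFS mount. Several forked processes each increment a shared counter file under an exclusive fcntl lock. If locking works, the final count equals workers * times; a lower count means lost updates, and a hang points at the lock path. The names bump, stress, and lockstress.py and the default worker/iteration counts are hypothetical.

```python
import fcntl
import os
import sys


def bump(path, times):
    """Increment the integer stored in `path` `times` times, holding an
    exclusive fcntl lock around each read-modify-write."""
    fd = os.open(path, os.O_RDWR)
    try:
        for _ in range(times):
            fcntl.lockf(fd, fcntl.LOCK_EX)
            try:
                os.lseek(fd, 0, os.SEEK_SET)
                value = int(os.read(fd, 64) or b"0")
                os.lseek(fd, 0, os.SEEK_SET)
                os.ftruncate(fd, 0)
                os.write(fd, str(value + 1).encode())
                os.fsync(fd)  # push the write out before releasing the lock
            finally:
                fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def stress(path, workers=4, times=200):
    """Run `workers` forked processes that each bump the counter `times`
    times; return (expected, actual). actual < expected means lost updates."""
    with open(path, "w") as f:
        f.write("0")
    pids = []
    for _ in range(workers):
        pid = os.fork()
        if pid == 0:
            code = 1
            try:
                bump(path, times)
                code = 0
            finally:
                os._exit(code)  # child never returns into stress()
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    with open(path) as f:
        return workers * times, int(f.read())


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: lockstress.py FILE-ON-NFS-MOUNT")
    else:
        expected, actual = stress(sys.argv[1])
        print("expected", expected, "got", actual)
```

If the count comes out right on a local disk but short (or the run hangs) on the Sun-exported directory, the problem is likely in NFS locking itself rather than in the distributed.net client.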
good luck.
darren
you need.. (Score:1)
Linux NFS Project (Score:2)
http://sourceforge.net/project/?group_id=14
Linux NFS support is gradually maturing, but there still seem to be some incompatibilities lurking. Following the NFS mailing list is a good way to keep on top of things, and it would be the best place to post this question if none of the web pages answer it.