Bug 133817 - nbd fails after a while (strange output from module when loading ...)
Summary: nbd fails after a while (strange output from module when loading ...)
Status: RESOLVED FIXED
Alias: None
Product: SUSE Linux 10.1
Classification: openSUSE
Component: Kernel
Version: unspecified
Hardware: Other
OS: SuSE Linux 10.0
Priority: P5 - None
Severity: Normal
Target Milestone: ---
Assignee: Pavel Machek
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-11-15 11:59 UTC by Forgotten User abccHJSkz0
Modified: 2006-06-12 10:14 UTC
0 users

See Also:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments

Description Forgotten User abccHJSkz0 2005-11-15 11:59:54 UTC
Hi!! I used nbd on SuSE 9.3 (the shipped and updated kernels, in several versions, all from SuSE) with nbd-server and nbd-client compiled by myself, without any problems. I was glad to see that the nbd tools are provided as a package with 10.0.
But there are some minor and major problems:

The nbd server is set up on an updated 10.0, and nbd-client runs on an updated 10.0 too. Starting the server and client is no problem, and mounting nbd0 is no problem either. But after a while nbd stops for no apparent reason (tested with different file systems on the export side:
ext2/3 on a disk partition, squashfs in a file), and the client hangs because no more packets are exchanged over the network (checked with tcpdump). The same happens with the packaged tools (version 2.7.4) and with tools freshly compiled from the project homepage (version 2.8.2).

A minor problem occurs when loading the module:
nbd: registered device at major 43
nbd0: Request when not-ready
end_request: I/O error, dev nbd0, sector 4294965120
printk: 102 messages suppressed.
Buffer I/O error on device nbd0, logical block 536870640
nbd0: Request when not-ready
end_request: I/O error, dev nbd0, sector 4294965120
Buffer I/O error on device nbd0, logical block 536870640
nbd0: Request when not-ready
end_request: I/O error, dev nbd0, sector 4294964992
Buffer I/O error on device nbd0, logical block 536870624
nbd0: Request when not-ready
end_request: I/O error, dev nbd0, sector 4294964992
Buffer I/O error on device nbd0, logical block 536870624
nbd0: Request when not-ready
end_request: I/O error, dev nbd0, sector 4294964992
Buffer I/O error on device nbd0, logical block 536870624
nbd0: Request when not-ready
end_request: I/O error, dev nbd0, sector 4294964848
Buffer I/O error on device nbd0, logical block 536870606
nbd0: Request when not-ready
end_request: I/O error, dev nbd0, sector 4294964848
Buffer I/O error on device nbd0, logical block 536870606
nbd1: Request when not-ready
end_request: I/O error, dev nbd1, sector 4294965120
Buffer I/O error on device nbd1, logical block 536870640
nbd1: Request when not-ready
...
...and so on until nbd15 is reached. There was another bug (#) reporting problems with nbd and udev; they seem to have been worked on but not solved :-( I suspect a kernel problem (not the tools), given the different setups I tried (SuSE 9.3 has no problem at all).
Comment 1 Forgotten User abccHJSkz0 2005-11-15 12:00:38 UTC
The other bug-ID was #76874 ...
Comment 2 Lars Marowsky-Bree 2005-11-15 12:38:02 UTC
The messages during loading come from hotplug scanning the block device.

Kay, can you take a look at how to disable that?
Comment 3 Kay Sievers 2005-11-15 13:38:26 UTC
Care to replace line 182 in /etc/udev/rules.d/50-udev.rules:
  KERNEL=="ram*|loop*|fd*", GOTO="persistent_end"
with
  KERNEL=="ram*|loop*|fd*|nbd*", GOTO="persistent_end"

and see if the messages go away?
Comment 4 Forgotten User abccHJSkz0 2005-11-15 17:54:00 UTC
Sorry, same result :-(

I added KERNEL=="ram*|loop*|fd*|nbd*", GOTO="persistent_end"
to the named file and loaded the module via "modprobe nbd" afterwards ...

nbd: registered device at major 43
nbd0: Request when not-ready
end_request: I/O error, dev nbd0, sector 4294965120
Buffer I/O error on device nbd0, logical block 536870640
nbd0: Request when not-ready
end_request: I/O error, dev nbd0, sector 4294965120
Buffer I/O error on device nbd0, logical block 536870640
nbd0: Request when not-ready
end_request: I/O error, dev nbd0, sector 4294964992
Buffer I/O error on device nbd0, logical block 536870624
nbd0: Request when not-ready
end_request: I/O error, dev nbd0, sector 4294964992
Buffer I/O error on device nbd0, logical block 536870624
nbd0: Request when not-ready
end_request: I/O error, dev nbd0, sector 4294964992
...

but that is only the minor part of the problem :-(
Comment 5 Forgotten User abccHJSkz0 2005-11-25 15:36:04 UTC
Hi!! It seems to be a more generic kernel problem: a student of mine is programming an alternative network block device (called dnbd). It operates over UDP instead of TCP and will be able to use more than one server in operation (for failover). I get exactly the same problems as with the standard nbd
kernel module: after a while the client slows down, and finally the network block device stops operating completely. So it does not seem to be a problem specific to one network block device (both perform rather well under the updated 9.3 kernels; I never had the problems there that I have with the 10.0 kernel!!)

Sorry, I forgot to click "provides the needed information" last time, so I'm doing it this round :-) The information provided here might give some more insight ...
Comment 6 Forgotten User abccHJSkz0 2006-02-04 19:10:10 UTC
I just tested it again with the updated kernel on both sides (server and client); both are 10.0 installations. The client still hangs, though not as early as described above: depending on the type of machine, some hang after a few megabytes of reads, while others take longer (more than 50 MB ...). There was, or still seems to be, an issue with the scheduler!?

Kernel: Linux lsfks04 2.6.13-15.7-smp #1 SMP Tue Nov 29 14:32:29 UTC 2005 i686 athlon i386 GNU/Linux
(different machines: servers with 64-bit and 32-bit architectures, clients likewise, but running only 32-bit binaries)
Comment 7 Kay Sievers 2006-02-08 19:13:46 UTC
So, can we close the bug? I have no idea what to do here.
Comment 8 Forgotten User abccHJSkz0 2006-02-08 22:50:35 UTC
I will check it with newer kernels (I have now figured out how to pivot_root/run-init from the initial ramdisk). Hopefully the issue is fixed in them ...
Comment 9 Forgotten User abccHJSkz0 2006-02-09 18:37:48 UTC
Identical problem with the SuSE 10.1 Beta3 version. I'm using kernel "Linux lsfks20 2.6.16-rc1-git3-7-default #1 Mon Jan 30 21:52:12 UTC 2006 i686 i686 i386 GNU/Linux". I read somewhere that there was a problem with some scheduler that makes the nbd-client/kernel hang after a while :-( The server is
a patched SuSE 10.0. Should I open a new bug for SuSE 10.1!?
Comment 10 Kay Sievers 2006-02-10 02:17:03 UTC
Reassigning. "nbd-client/kernel hangs after a while" - I have no idea what's going wrong here.
Comment 11 Forgotten User abccHJSkz0 2006-02-12 15:39:28 UTC
I did exhaustive tests:

nbd hangs after a while unless the kernel command-line option "elevator=noop" is given (I have not tested other elevators yet). It hangs with both ext2 and squashfs (the latter has problems of its own: it has to be patched so that it does not run into trouble with the scheduler itself).

With the elevator option given, I did not encounter any problems, at least in the tests I did (around 500 MB sent over the network).
Comment 12 Greg Kroah-Hartman 2006-02-16 05:22:49 UTC
Jens, looks like a block layer issue...
Comment 13 Jens Axboe 2006-02-16 09:57:38 UTC
I don't think nbd is supported, but I'll reassign to Pavel to take a look since it's his baby. Dirk, you should try and see if elevator=anticipatory works for you as well or if that hangs too.

Looking at the nbd request handling, it is doing an illegal down() inside the request_fn. nbd should be offloading the actual transmit to a work queue handler.
Comment 14 Forgotten User abccHJSkz0 2006-02-16 21:18:32 UTC
I just checked that: elevator=anticipatory seems to work like noop. The cfq scheduler (the default) seems to deadlock the system in many cases (the kernel could still be stopped with SysRq-O) ...
Comment 15 Jens Axboe 2006-02-17 15:24:00 UTC
I have checked in a CFQ fix that should make it work for you. Closing bug, please retest with the next beta kernel (not beta4, the one after that).
Comment 16 Mada Sailaja 2006-06-12 10:14:41 UTC
*** Bug 183818 has been marked as a duplicate of this bug. ***