|
Bugzilla – Full Text Bug Listing |
| Summary: | ipw2200 hanging when switching network settings with scpm | ||
|---|---|---|---|
| Product: | [openSUSE] SUSE LINUX 10.0 | Reporter: | Forgotten User ZhJd0F0L3x <forgotten_ZhJd0F0L3x> |
| Component: | Kernel | Assignee: | Olaf Kirch <okir> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Major | ||
| Priority: | P5 - None | CC: | deckel, jbohac |
| Version: | Final | ||
| Target Milestone: | --- | ||
| Hardware: | i586 | ||
| OS: | Other | ||
| Whiteboard: | |||
| Found By: | Component Test | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Bug Depends on: | |||
| Bug Blocks: | 97395 | ||
| Attachments: |
sysrq-t of the "no X input, gkrellm hangs" case
sysrq-t of another "no X input, gkrellm hangs" case
Proposed patch
sysrq-t of "gkrellm/ip/rcnetwork hangs" case
sysrq-t with 2.6.15-20060105152643-default |
||
|
Description
Forgotten User ZhJd0F0L3x
2005-11-11 18:06:48 UTC
Created attachment 57140 [details]
sysrq-t of the "no X input, gkrellm hangs" case
Olaf: Assigning this to you. If this is nothing for you, please assign it back to us.

This looks like a problem with the ipw2200 driver. Investigating.

A number of processes in state D seem to be stuck on the rtnl_lock: kdm, smpppd, gkrellm, smpppd-ifcfg.
Others are blocked in flush_cpu_workqueue: kded, konsole, ssh. These are waiting for the worker thread events/0.
events/0 is trying to run the linkwatch_queue (net/core/link_watch.c), and tries to take the rtnl lock but blocks as well.
The rtnl lock in turn is being held by wpa_supplicant, which is in the middle of some ioctl, asking the driver to rescan the network. Apparently it's waiting for the driver to complete that scan, but holds the ipw->sem semaphore while doing so.
On the other hand, any significant work in ipw2200 is done by the ipw2200/0 worker thread. And that worker thread is blocked on the ipw->sem semaphore...
The code in ipw_request_direct_scan where wpa_supplicant blocks is this:
    if (priv->status & (STATUS_SCANNING | STATUS_SCAN_ABORTING)) {
        err = wait_event_interruptible(priv->wait_state,
                                       !(priv->status & (STATUS_SCANNING |
                                                         STATUS_SCAN_ABORTING)));
        if (err) {
            IPW_DEBUG_HC("aborting direct scan");
            goto done;
        }
    }
Interestingly, there are two ipw2200 messages at the top of the log saying
kernel: ipw2200: Firmware error detected. Restarting.
Maybe this is related? (OTOH it seems as if restarting the driver will
call ipw_down, which will in turn clear ipw->status.)
Assigning to maintainer of wireless-tools, cc'ing the 802.11 Jiris :)
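As an editorial illustration (not part of the original report), the wait chain described in the analysis above can be modelled as a small wait-for graph in userspace C. The task names come from the analysis; the `waits_on` table, `RUNNABLE`, and `is_stuck()` are hypothetical names invented for this sketch, and the model assumes each task waits on at most one other task, which simplifies the real locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Tasks named in the analysis; the numbering is arbitrary. */
enum { WPA_SUPPLICANT, IPW2200_WORKER, EVENTS_0, KDED, GKRELLM, NUM_TASKS };

/* Marker for a task that is not waiting on anyone. */
#define RUNNABLE (-1)

/* waits_on[t] is the task that must make progress before t can continue. */
static const int waits_on[NUM_TASKS] = {
    [WPA_SUPPLICANT] = IPW2200_WORKER, /* waits for the scan to complete, holding rtnl and the driver semaphore */
    [IPW2200_WORKER] = WPA_SUPPLICANT, /* blocked on the driver semaphore */
    [EVENTS_0]       = WPA_SUPPLICANT, /* linkwatch needs the rtnl lock */
    [KDED]           = EVENTS_0,       /* blocked in flush_cpu_workqueue */
    [GKRELLM]        = WPA_SUPPLICANT, /* needs the rtnl lock */
};

/*
 * Follow the wait chain starting at `task`. With n tasks, a chain that
 * has not reached a runnable task after n hops must have revisited a
 * task, so it is caught in (or leads into) a cycle.
 */
static bool is_stuck(const int *graph, int n, int task)
{
    assert(task >= 0 && task < n);
    for (int hops = 0; hops < n; hops++) {
        task = graph[task];
        if (task == RUNNABLE)
            return false;
    }
    return true;
}
```

In this model, calling `is_stuck(waits_on, NUM_TASKS, KDED)` reports a stuck chain, while a copy of the table in which the ipw2200/0 worker is marked `RUNNABLE` (as if the semaphore were released before sleeping) lets every chain terminate.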
deckel is seeing similar things with madwifi.

Last week, i was still trying to use the r8169 driver and this hangup happened much more often. Now i have given up on the 8169 crap^Wdriver and use a xircom_tulip_cb card and the hangups are much less common. This might of course be a coincidence, but i thought i'd mention it :-)

madwifi-ng to be exact. Machine is crashing every time I start an ipsec connection while using wireless with WPA-PSK.

It's probably not related to madwifi. Maybe firmware is not responding with HOST_NOTIFICATION_STATUS_SCAN_COMPLETED properly. Also those "Firmware error detected" messages indicate a firmware problem. Are you using the 2.3 version of the firmware (/lib/firmware/ipw-2.3-*)? (Unfortunately, upgrading the km_wlan package does not trigger updating of the ipw-firmware package, which confused me recently.)

root@susi:/dev> l /lib/firmware/ipw*
-rw-r--r-- 1 root root 209190 2005-11-09 03:03 /lib/firmware/ipw2100-1.3.fw
-rw-r--r-- 1 root root 201138 2005-11-09 03:03 /lib/firmware/ipw2100-1.3-i.fw
-rw-r--r-- 1 root root 196458 2005-11-09 03:03 /lib/firmware/ipw2100-1.3-p.fw
-rw-r--r-- 1 root root   6472 2005-11-09 03:03 /lib/firmware/ipw-2.2-boot.fw
-rw-r--r-- 1 root root 166960 2005-11-09 03:03 /lib/firmware/ipw-2.2-bss.fw
-rw-r--r-- 1 root root  16334 2005-11-09 03:03 /lib/firmware/ipw-2.2-bss_ucode.fw
-rw-r--r-- 1 root root 161568 2005-11-09 03:03 /lib/firmware/ipw-2.2-ibss.fw
-rw-r--r-- 1 root root  16312 2005-11-09 03:03 /lib/firmware/ipw-2.2-ibss_ucode.fw
-rw-r--r-- 1 root root   6472 2005-11-09 03:03 /lib/firmware/ipw-2.3-boot.fw
-rw-r--r-- 1 root root 166960 2005-11-09 03:03 /lib/firmware/ipw-2.3-bss.fw
-rw-r--r-- 1 root root  16334 2005-11-09 03:03 /lib/firmware/ipw-2.3-bss_ucode.fw
-rw-r--r-- 1 root root 161568 2005-11-09 03:03 /lib/firmware/ipw-2.3-ibss.fw
-rw-r--r-- 1 root root  16312 2005-11-09 03:03 /lib/firmware/ipw-2.3-ibss_ucode.fw
-rw-r--r-- 1 root root 165028 2005-11-09 03:03 /lib/firmware/ipw-2.3-sniffer.fw
-rw-r--r-- 1 root root  16344 2005-11-09 03:03 /lib/firmware/ipw-2.3-sniffer_ucode.fw
-rw-r--r-- 1 root root   6472 2005-11-09 03:03 /lib/firmware/ipw-2.4-boot.fw
-rw-r--r-- 1 root root 168344 2005-11-09 03:03 /lib/firmware/ipw-2.4-bss.fw
-rw-r--r-- 1 root root  16334 2005-11-09 03:03 /lib/firmware/ipw-2.4-bss_ucode.fw
-rw-r--r-- 1 root root 162884 2005-11-09 03:03 /lib/firmware/ipw-2.4-ibss.fw
-rw-r--r-- 1 root root  16312 2005-11-09 03:03 /lib/firmware/ipw-2.4-ibss_ucode.fw
-rw-r--r-- 1 root root 168344 2005-11-09 03:03 /lib/firmware/ipw-2.4-sniffer.fw
-rw-r--r-- 1 root root  16344 2005-11-09 03:03 /lib/firmware/ipw-2.4-sniffer_ucode.fw
I hope the driver is taking the correct firmware :-)

I have a feeling that this is partly an ipsec problem, partly a network driver problem, and some drivers are just more susceptible to the bug than others. I have seen this very often back when i tried to use r8169; since i have given up on that and switched to xircom, it happens only occasionally.

Created attachment 57571 [details]
sysrq-t of another "no X input, gkrellm hangs" case
it happened again.
A short description of my setup:
- 3 profiles, managed with scpm
- home, static ip, xircom and ipw2200 (wpa-psk) both managed by ifplugd
- road, dhcp, xircom and ipw2200 managed by ifplugd
- suse, static ip, xircom and ipw2200 (with wpa-enterprise) managed by ifplugd
last time i used ipsec was this morning in profile "road", via ppp/bluetooth/UMTS. Then i switched to the suse-profile, no problems.
This evening i switched back to the "road" profile, the switch seemed to work but when i arrived at home, i noticed that the keyboard was dead in X and gkrellm was hanging (since about profile switch time).
So it looks like using ipsec once "damages" something that will sooner or later lead to a malfunction of the networking functions.
As it seems not to be WLAN specific, I'm reassigning to kernel-maintainers.

Guys, did anyone actually read my analysis in comments #4 and #5? This is clearly a bug in the ipw2200 driver. The latest sysrq-t in comment #10 shows exactly the same lockup in ipw2200.

Created attachment 57645 [details]
Proposed patch
NB: I agree with Jiri Benc that the basic problem is that the driver waits for HOST_NOTIFICATION_STATUS_SCAN_COMPLETED which doesn't seem to come. In this respect, the proposed patch above is a workaround. But I'm not entirely certain whether this is also just a consequence of holding priv->sem while waiting for the notification to arrive... it would surely help to build ipw2200 with debugging enabled and look at the trace when the problem happens.

i built a module with the patch. the module is built with IPW_DEBUG=Y but using the debug module parameter is impossible since it is vomiting constantly into syslog :-)

Probably this is a more correct patch for this problem: http://sourceforge.net/mailarchive/forum.php?thread_id=8988005&forum_id=38938

Definitely worth testing. But I think there's a more general problem here. Anything on a work queue shouldn't block for that long, because it will take down most of the system as well. All the ipw2200_bg_foo handlers should be rewritten to not block on the semaphore forever, but to use down_trylock() and if that fails, use schedule_delayed_work to reschedule this action for a later time.

i also believe that other network drivers have similar problems. Once i am sure that ipw2200 is no longer causing the problem, i'll start trying to use r8169 again and i'd bet the fun will begin again.

Created attachment 57834 [details]
sysrq-t of "gkrellm/ip/rcnetwork hangs" case
I am pretty sure i am using ipw2200 with Olaf's patch, but it hung again.
Just in case this may be something completely different, here is what i did:
- scpm switch
- when network was stopped, i pulled out the CardBus nic (i don't need it on the road anyway)
- scpm hung on "starting network", i checked with ps and the usual suspects were hanging in state D: gkrellm, ip
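As an editorial aside, the suggestion made earlier in this thread (have the ipw2200_bg_foo handlers use down_trylock() and reschedule instead of blocking) can be sketched in userspace C. This is only a model under stated assumptions: `model_sem`, `model_priv`, and `model_bg_handler()` are hypothetical stand-ins, a C11 `atomic_flag` replaces the kernel semaphore, and returning `WORK_RESCHEDULED` stands in for a call to schedule_delayed_work.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the driver semaphore; NOT the kernel API. */
typedef struct {
    atomic_flag held;
} model_sem;

/* Try to take the lock without ever sleeping; true means acquired. */
static bool model_trylock(model_sem *s)
{
    return !atomic_flag_test_and_set(&s->held);
}

static void model_unlock(model_sem *s)
{
    atomic_flag_clear(&s->held);
}

typedef enum { WORK_DONE, WORK_RESCHEDULED } work_result;

/* Hypothetical driver state: the lock plus two counters for observation. */
typedef struct {
    model_sem sem;
    int retries;   /* times the handler backed off instead of blocking */
    int completed; /* times the handler actually did its work */
} model_priv;

static void model_priv_init(model_priv *priv)
{
    atomic_flag_clear(&priv->sem.held);
    priv->retries = 0;
    priv->completed = 0;
}

/*
 * Work-queue handler modelled on the suggestion: if the lock is busy,
 * return immediately and ask to be run again later, so the shared
 * worker thread is never parked behind the lock holder.
 */
static work_result model_bg_handler(model_priv *priv)
{
    assert(priv);
    if (!model_trylock(&priv->sem)) {
        priv->retries++;
        return WORK_RESCHEDULED; /* the caller would requeue with a delay */
    }
    priv->completed++;           /* the real work would happen here */
    model_unlock(&priv->sem);
    return WORK_DONE;
}
```

The point of the design is that a busy lock costs one retry rather than a blocked worker thread; whether a retry delay is acceptable for each real handler is a separate question this sketch does not answer.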
It's similar, but not the same. wpa_supplicant is still stuck in ipw_request_direct_scan, holding the rtnl_lock, but events/0 isn't blocked anymore - it's running (it would be useful to do a couple of sysrq-p as well to see what it's doing).

And as you say, the scenario is a different one as well! This hang did not happen with ipsec activity, but after pulling out the card.

Please try this on top of the previous patch (in ipw_abort_scan):
     if (priv->status & STATUS_SCAN_ABORTING) {
         IPW_DEBUG_HC("Ignoring concurrent scan abort request.\n");
         return;
     }
     priv->status |= STATUS_SCAN_ABORTING;
+    wake_up_interruptible(&priv->wait_state);
     err = ipw_send_scan_abort(priv);
i used ipsec before :-) The last hangs also did not happen when using ipsec, but _after_ using ipsec at the next profile switch. I also have seen solid lockups while establishing the ipsec connection, but not since i stopped using r8169. I applied the ipw_abort_scan patch and will try again.

Any new hangups with these patches? Otherwise I'd propose to close this as resolved fixed.

no. There are different hangups now (hard lockups, no sysrq possible) but those look unrelated. As soon as i can track them down, i'll open a new bug.

can we _pretty please_ get this in the KOTD? This is hitting me several times a day and makes developing / testing network-switching applications like NetworkManager even more annoying :-) So i'd really like to see this bug as resolved fixed as soon as the patches are in the kernel :-)

I looked into this again, and one thing that puzzles me is why the
SCAN_COMPLETED notification never seems to arrive.
One way for this to happen would be that the ipw_irq_tasklet gets stuck
when the HW reports a firmware error:
    ipw_irq_tasklet (IPW_INTA_BIT_FATAL_ERROR aka Firmware error)
      notify_wx_assoc_event
        wireless_send_event
          netlink_broadcast
            ...
              netlink_broadcast_deliver (calls sk->sk_data_ready on rtnl socket)
                rtnetlink_rcv
                  rtnl_lock
But it doesn't look like this is what's happening here. It seems that in all
three sysrq-t's ksoftirqd is in cond_resched() after the call to do_softirq.
Patch is in CVS. Please verify that this indeed fixes your problem. Thanks.

I'll grab a kernel as soon as it falls into the kotd dir (and i have faster than 53.4kbit/s network again) and test it. If it doesn't fail for a week, i'll close the bug :-)

does not help. This time i am sure that there is no ppp / ipsec etc. involved at all, so i'm de-obfuscating the subject of this bug. Wired NIC is tg3 (if this has anything to do with it). This time i did the following:
- scpm switch "test" # my "on the road"-scheme
- scpm switch "home". it hung while "checking for services that need to be restarted", ps ax showed "ip link show..." was in state D
(should i also provide ps auxf of the hanging cases?)
Attaching sysrq-t...

Created attachment 62290 [details]
sysrq-t with 2.6.15-20060105152643-default

and just to answer the obvious question :-)
seife@strolchi:~> uname -a
Linux strolchi 2.6.15-20060105152643-default #1 Thu Jan 5 15:26:43 UTC 2006 i686 i686 i386 GNU/Linux
seife@strolchi:~> rpm -q --changelog kernel-default-2.6.15-20060105152643 | head
* Do Jan 05 2006 - okir@suse.de
- patches.fixes/ipw2200-lockup-fix: ipw2200 - release semaphore when sleeping in ipw_request_direct_scan (133513).
* Do Jan 05 2006 - olh@suse.de
- add kernel-kdump for i386, x86_64 and ppc64, with minimal config

just another bit of information that escaped me before: The card was unassociated at the time it hung. Profile "test" has "no authentication / encryption; dhcp client" configured. Profile "home" has "WPA-PSK, static IP address" configured. My access point only allows WPA-PSK so the card was probably still scanning for a network (and dhcpcd trying to send DHCP requests like mad :-)

An updated patch is in CVS now. Instead of trying to be clever, it just returns EAGAIN if it finds there's a scan in progress. You may need to fix wpa_supplicant to deal gracefully with this.
Please test.

It would be interesting to run ipw2200 with debugging enabled - maybe this helps pin-point why the scan completion is never signalled to the waiting process. Full debugging may be too much; a combination of these may do it:
#define IPW_DL_ERROR         (1<<0)
#define IPW_DL_WARNING       (1<<1)
#define IPW_DL_INFO          (1<<2)
#define IPW_DL_NOTIF         (1<<10)
#define IPW_DL_SCAN          (1<<11)
#define IPW_DL_ASSOC         (1<<12)
#define IPW_DL_DROP          (1<<13)
#define IPW_DL_FW            (1<<16)
#define IPW_DL_RF_KILL       (1<<17)
#define IPW_DL_FW_ERRORS     (1<<18)

*** Bug 139332 has been marked as a duplicate of this bug. ***

i added "debug=474119" to the module parameters and will check for messages in syslog.

This fix definitely should go upstream since shortly after inserting the latest and greatest ipw2200-1.0.10 i got from Joe, it hung again, so the problem is not fixed upstream in newer versions. Our patched version works fine so far for me.

I submitted the patch, no reaction so far. I'll try again once I have a confirmation from you that the patch really fixes the bug.

Any news? Does the patch fix the bug for you?

i have not seen it again, so it looks like it is fixed. I'll reopen if it happens again.

Mainline has this fix, but the ipw2200 update I did for Intel removed it again. I fixed the patch now, but beta2 will be broken wrt this. Just for the record.

Ok. I also did see it hang on a 2.6.16-rc1-git3-3 i got from trenn to test his ACPI madness. But i did not dare to report anything with an alien kernel with unknown patch status ;-)) |
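As an editorial check on the thread above, the value passed as "debug=474119" appears to be simply the bitwise OR of the ten debug levels that were suggested. The sketch below reproduces that arithmetic; the `IPW_DL_*` definitions are copied from the thread, while `SUGGESTED_DEBUG_MASK` and `suggested_debug_mask()` are names invented here for illustration.

```c
#include <assert.h>

/* Debug level bits exactly as quoted in the thread. */
#define IPW_DL_ERROR         (1<<0)
#define IPW_DL_WARNING       (1<<1)
#define IPW_DL_INFO          (1<<2)
#define IPW_DL_NOTIF         (1<<10)
#define IPW_DL_SCAN          (1<<11)
#define IPW_DL_ASSOC         (1<<12)
#define IPW_DL_DROP          (1<<13)
#define IPW_DL_FW            (1<<16)
#define IPW_DL_RF_KILL       (1<<17)
#define IPW_DL_FW_ERRORS     (1<<18)

/* Hypothetical name: the OR of all ten suggested levels. */
#define SUGGESTED_DEBUG_MASK \
    (IPW_DL_ERROR | IPW_DL_WARNING | IPW_DL_INFO | \
     IPW_DL_NOTIF | IPW_DL_SCAN | IPW_DL_ASSOC | IPW_DL_DROP | \
     IPW_DL_FW | IPW_DL_RF_KILL | IPW_DL_FW_ERRORS)

/* 7 (bits 0-2) + 15360 (bits 10-13) + 458752 (bits 16-18) = 474119 */
static_assert(SUGGESTED_DEBUG_MASK == 474119,
              "mask should match the debug= value used in the thread");

/* Runtime accessor so the value can also be printed or compared. */
static unsigned long suggested_debug_mask(void)
{
    return SUGGESTED_DEBUG_MASK;
}
```

Dropping a level from the OR (for example `IPW_DL_DROP`, if its output proves too noisy) gives a correspondingly smaller value to pass as the module parameter.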