Bug 133513 - ipw2200 hanging when switching network settings with scpm
Status: RESOLVED FIXED
Duplicates: 139332
Alias: None
Product: SUSE LINUX 10.0
Classification: openSUSE
Component: Kernel
Version: Final
Hardware: i586 Other
Priority: P5 - None, Severity: Major
Target Milestone: ---
Assignee: Olaf Kirch
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 97395
Reported: 2005-11-11 18:06 UTC by Forgotten User ZhJd0F0L3x
Modified: 2006-01-25 12:56 UTC
CC List: 2 users

See Also:
Found By: Component Test
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
sysrq-t of the "no X input, gkrellm hangs" case (76.44 KB, text/plain)
2005-11-11 18:15 UTC, Forgotten User ZhJd0F0L3x
sysrq-t of another "no X input, gkrellm hangs" case (100.65 KB, text/plain)
2005-11-16 21:23 UTC, Forgotten User ZhJd0F0L3x
Proposed patch (837 bytes, patch)
2005-11-17 14:40 UTC, Olaf Kirch
sysrq-t of "gkrellm/ip/rcnetwork hangs" case (118.25 KB, text/plain)
2005-11-21 06:30 UTC, Forgotten User ZhJd0F0L3x
sysrq-t with 2.6.15-20060105152643-default (97.79 KB, text/plain)
2006-01-09 06:33 UTC, Forgotten User ZhJd0F0L3x

Description Forgotten User ZhJd0F0L3x 2005-11-11 18:06:48 UTC
I am seeing this quite frequently since i use the bullpen / turnpike novell vpn client in different "flavours": 
- sometimes the kernel just hangs hard during initialization of the ipsec connection, in which case the "famous last words" are:
Nov  7 23:08:52 susi racoon: INFO: accept a request to establish IKE-SA: 195.135.221.15
Nov  7 23:08:52 susi racoon: INFO: initiate new phase 1 negotiation: 10.161.19.38[500]<=>195.135.221.15[500]
Nov  7 23:08:52 susi racoon: INFO: begin Aggressive mode.
Nov  7 23:08:52 susi racoon: ERROR:  SHA final length:20 Payload len till now:6
Nov  7 23:08:52 susi racoon: INFO: Selected NAT-T version: (null)
Nov  7 23:08:52 susi racoon: INFO: NAT detected: ME PEER
Nov  7 23:08:52 susi racoon: INFO: KA list add: 10.161.19.38[500]->195.135.221.15[500]
Nov  7 23:08:52 susi racoon: INFO: ISAKMP-SA established 10.161.19.38[500]-195.135.221.15[500] spi:1b7ae76b7c2bf6fb:55b03bfe5074b2ed
Nov  7 23:08:55 susi racoon: INFO: respond new phase 2 negotiation: 10.161.19.38[0]<=>195.135.221.15[0]
Nov  7 23:08:55 susi racoon: ERROR: failed to get sainfo.
Nov  7 23:08:55 susi racoon: ERROR: failed to get sainfo.
Nov  7 23:08:55 susi racoon: ERROR: failed to pre-process packet.
Nov  7 23:08:55 susi racoon: ERROR: failed to get sainfo.
Nov  7 23:11:00 susi syslog-ng[4984]: syslog-ng version 1.6.8 starting
At this point, only sysrq-B or sysrq-O are effective; sync and unmount no longer do anything.

- sometimes, everything works fine, but if i (an hour later or so) try to switch with scpm to another network profile, everything accessing /proc/net or something like that hangs: ip, ifconfig, everything in state D. I can usually reboot more or less cleanly (with the use of sysrq-e to get rid of the network stop scripts)

- sometimes the machine just hangs "partially": X loses its keyboard, gkrellm (a system monitor that also has a network throughput monitor) hangs, but the mouse still works and i can usually reboot somehow via the kde logout menu and with the help of sysrq-e etc. I have a sysrq-T trace of this case which i will attach.

Note that i am usually connecting via usb-bluetooth => rfcomm => GPRS

This is really nasty, and i have no real idea on what to provide for debugging, so please tell me what you need (i am quite sure i can reproduce it soon).
Comment 1 Forgotten User ZhJd0F0L3x 2005-11-11 18:15:30 UTC
Created attachment 57140 [details]
sysrq-t of the "no X input, gkrellm hangs" case
Comment 2 Michael Gross 2005-11-14 17:53:12 UTC
Olaf: Assigning this to you. If this is nothing for you, please assign it back to us.
Comment 3 Olaf Kirch 2005-11-15 14:06:49 UTC
This looks like a problem with the ipw2200 driver. Investigating.
Comment 4 Olaf Kirch 2005-11-15 14:43:13 UTC
A number of processes in state D seem to be stuck on the rtnl_lock:
kdm, smpppd, gkrellm, smpppd-ifcfg

Others are blocked in flush_cpu_workqueue: kded, konsole, ssh.
These are waiting for the worker thread events/0.

events/0 is trying to run the linkwatch_queue (net/core/link_watch.c),
and tries to take the rtnl lock but blocks as well.

The rtnl lock in turn is being held by wpa_supplicant which is in the
middle of some ioctl, asking the driver to rescan the network.
Apparently it's waiting for the driver to complete that scan,
but holds the ipw->sem semaphore while doing so.

On the other hand, any significant work in ipw2200 is done by the ipw2200/0
worker thread. And that worker thread is blocked on the ipw->sem semaphore...
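
The dependency chain described in this comment can be written down as a small wait-for graph. The following is only a userspace sketch, not driver code: the edges are taken from the analysis above (one representative task per group), while the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One edge of the wait-for graph: `task` cannot run until `blocker` does. */
struct wait_edge {
    const char *task;
    const char *blocker;
    const char *via;   /* what the task is waiting on */
};

/* Edges taken from the sysrq-t analysis; one representative per group. */
static const struct wait_edge wait_graph[] = {
    { "gkrellm",        "wpa_supplicant", "rtnl_lock" },
    { "kded",           "events/0",       "flush_cpu_workqueue" },
    { "events/0",       "wpa_supplicant", "rtnl_lock" },
    { "wpa_supplicant", "ipw2200/0",      "scan completion" },
    { "ipw2200/0",      "wpa_supplicant", "ipw->sem" },
};

#define N_EDGES (sizeof(wait_graph) / sizeof(wait_graph[0]))

/* Return the task that `task` is waiting for, or NULL if it is runnable. */
const char *blocker_of(const char *task)
{
    for (size_t i = 0; i < N_EDGES; i++)
        if (strcmp(wait_graph[i].task, task) == 0)
            return wait_graph[i].blocker;
    return NULL;
}

/*
 * Each task waits on at most one other task, so a chain that is still
 * blocked after more hops than there are edges must have entered a cycle.
 */
int is_deadlocked(const char *start)
{
    const char *t = start;

    assert(start != NULL);
    for (size_t hops = 0; hops <= N_EDGES; hops++) {
        t = blocker_of(t);
        if (t == NULL)
            return 0;   /* chain ends at a runnable task */
    }
    return 1;
}
```

Every task in the graph ends up in the wpa_supplicant / ipw2200/0 cycle, which matches the observation that unrelated processes such as kded and gkrellm hang as well.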
Comment 5 Olaf Kirch 2005-11-15 15:07:44 UTC
The code in ipw_request_direct_scan where wpa_supplicant blocks
is this:

        if (priv->status & (STATUS_SCANNING | STATUS_SCAN_ABORTING)) {
                err = wait_event_interruptible(priv->wait_state,
                                !(priv->status & (STATUS_SCANNING |
                                                  STATUS_SCAN_ABORTING)));
                if (err) {
                        IPW_DEBUG_HC("aborting direct scan");
                        goto done;
                }
        }

Interestingly, there are two ipw2200 messages at the top of the log saying
kernel: ipw2200: Firmware error detected.  Restarting.
Maybe this is related? (OTOH it seems as if restarting the driver will
call ipw_down, which will in turn clear ipw->status.)

Assigning to maintainer of wireless-tools, cc'ing the 802.11 Jiris :)
Comment 6 Forgotten User ZhJd0F0L3x 2005-11-15 22:23:26 UTC
deckel is seeing similar things with madwifi.

Last week, i was still trying to use the r8169 driver and this hangup happened much more often. Now i have given up on the 8169 crap^Wdriver and use a xircom_tulip_cb card and the hangups are much less common. This might of course be a coincidence, but i thought i'd mention it :-)
Comment 7 Christian Deckelmann 2005-11-16 11:27:08 UTC
madwifi-ng to be exact.
Machine is crashing every time I start an ipsec connection while using wireless with WPA-PSK.

Comment 8 Jiri Benc 2005-11-16 16:24:32 UTC
It's probably not related to madwifi.

Maybe the firmware is not responding with HOST_NOTIFICATION_STATUS_SCAN_COMPLETED properly. Also, those "Firmware error detected" messages indicate a firmware problem. Are you using version 2.3 of the firmware (/lib/firmware/ipw-2.3-*)? (Unfortunately, upgrading the km_wlan package does not trigger an update of the ipw-firmware package, which confused me recently.)
Comment 9 Forgotten User ZhJd0F0L3x 2005-11-16 19:54:38 UTC
root@susi:/dev> l /lib/firmware/ipw*
-rw-r--r--  1 root root 209190 2005-11-09 03:03 /lib/firmware/ipw2100-1.3.fw
-rw-r--r--  1 root root 201138 2005-11-09 03:03 /lib/firmware/ipw2100-1.3-i.fw
-rw-r--r--  1 root root 196458 2005-11-09 03:03 /lib/firmware/ipw2100-1.3-p.fw
-rw-r--r--  1 root root   6472 2005-11-09 03:03 /lib/firmware/ipw-2.2-boot.fw
-rw-r--r--  1 root root 166960 2005-11-09 03:03 /lib/firmware/ipw-2.2-bss.fw
-rw-r--r--  1 root root  16334 2005-11-09 03:03 /lib/firmware/ipw-2.2-bss_ucode.fw
-rw-r--r--  1 root root 161568 2005-11-09 03:03 /lib/firmware/ipw-2.2-ibss.fw
-rw-r--r--  1 root root  16312 2005-11-09 03:03 /lib/firmware/ipw-2.2-ibss_ucode.fw
-rw-r--r--  1 root root   6472 2005-11-09 03:03 /lib/firmware/ipw-2.3-boot.fw
-rw-r--r--  1 root root 166960 2005-11-09 03:03 /lib/firmware/ipw-2.3-bss.fw
-rw-r--r--  1 root root  16334 2005-11-09 03:03 /lib/firmware/ipw-2.3-bss_ucode.fw
-rw-r--r--  1 root root 161568 2005-11-09 03:03 /lib/firmware/ipw-2.3-ibss.fw
-rw-r--r--  1 root root  16312 2005-11-09 03:03 /lib/firmware/ipw-2.3-ibss_ucode.fw
-rw-r--r--  1 root root 165028 2005-11-09 03:03 /lib/firmware/ipw-2.3-sniffer.fw
-rw-r--r--  1 root root  16344 2005-11-09 03:03 /lib/firmware/ipw-2.3-sniffer_ucode.fw
-rw-r--r--  1 root root   6472 2005-11-09 03:03 /lib/firmware/ipw-2.4-boot.fw
-rw-r--r--  1 root root 168344 2005-11-09 03:03 /lib/firmware/ipw-2.4-bss.fw
-rw-r--r--  1 root root  16334 2005-11-09 03:03 /lib/firmware/ipw-2.4-bss_ucode.fw
-rw-r--r--  1 root root 162884 2005-11-09 03:03 /lib/firmware/ipw-2.4-ibss.fw
-rw-r--r--  1 root root  16312 2005-11-09 03:03 /lib/firmware/ipw-2.4-ibss_ucode.fw
-rw-r--r--  1 root root 168344 2005-11-09 03:03 /lib/firmware/ipw-2.4-sniffer.fw
-rw-r--r--  1 root root  16344 2005-11-09 03:03 /lib/firmware/ipw-2.4-sniffer_ucode.fw

I hope the driver is taking the correct firmware :-)
I have a feeling that this is partly an ipsec problem, partly a network driver problem, and some drivers are just more susceptible to the bug than others. I saw this very often back when i tried to use r8169; since i gave up on that and switched to xircom, it happens only occasionally.
Comment 10 Forgotten User ZhJd0F0L3x 2005-11-16 21:23:18 UTC
Created attachment 57571 [details]
sysrq-t of another "no X input, gkrellm hangs" case

it happened again.
A short description of my setup:
- 3 profiles, managed with scpm
  - home, static ip, xircom and ipw2200 (wpa-psk) both managed by ifplugd
  - road, dhcp, xircom and ipw2200 managed by ifplugd
  - suse, static ip, xircom and ipw2200 (with wpa-enterprise) managed by ifplugd

last time i used ipsec was this morning in profile "road", via ppp/bluetooth/UMTS. Then i switched to the suse-profile, no problems.
This evening i switched back to the "road" profile, the switch seemed to work but when i arrived at home, i noticed that the keyboard was dead in X and gkrellm was hanging (since about profile switch time).

So it looks like using ipsec once "damages" something that will sooner or later lead to a malfunction of the networking functions.
Comment 11 Joachim Gleissner 2005-11-17 11:50:16 UTC
As it seems not to be WLAN specific, I'm reassigning to kernel-maintainers.
Comment 12 Olaf Kirch 2005-11-17 14:36:46 UTC
Guys, did anyone actually read my analysis in comments #4 and #5?
This is clearly a bug in the ipw2200 driver.

The latest sysrq-t in comment #10 shows exactly the same lockup in ipw2200.
Comment 13 Olaf Kirch 2005-11-17 14:40:45 UTC
Created attachment 57645 [details]
Proposed patch
Comment 14 Olaf Kirch 2005-11-17 15:05:10 UTC
NB: I agree with Jiri Benc that the basic problem is that
the driver waits for HOST_NOTIFICATION_STATUS_SCAN_COMPLETED which
doesn't seem to come.

In this respect, the proposed patch above is a workaround.

But I'm not entirely certain whether this is also just a consequence of
holding priv->sem while waiting for the notification to arrive... it
would surely help to build ipw2200 with debugging enabled and look at
the trace when the problem happens.
Comment 15 Forgotten User ZhJd0F0L3x 2005-11-18 10:07:08 UTC
i built a module with the patch. the module is built with IPW_DEBUG=Y but using the debug module parameter is impossible since it is vomiting constantly into syslog :-)
Comment 16 Jiri Benc 2005-11-18 11:17:47 UTC
Probably this is a more correct patch for this problem: http://sourceforge.net/mailarchive/forum.php?thread_id=8988005&forum_id=38938
Comment 17 Olaf Kirch 2005-11-18 11:29:20 UTC
Definitely worth testing. But I think there's a more general problem
here. Anything on a work queue shouldn't block for that long, because
it will take down most of the system as well. All the ipw2200_bg_foo
handlers should be rewritten to not block on the semaphore forever, but
to use down_trylock() and if that fails, use schedule_delayed_work to
reschedule this action for a later time.
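
A rough userspace sketch of the try-lock-and-requeue pattern suggested above. It is not the driver's code: an atomic_flag stands in for the semaphore, a counter stands in for schedule_delayed_work(), and all names (dev_trylock, bg_handler, struct bg_work) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the driver semaphore (hypothetical). */
static atomic_flag dev_lock = ATOMIC_FLAG_INIT;

/*
 * Returns true if the lock was taken. Note the kernel's down_trylock()
 * uses the opposite convention: it returns 0 on success.
 */
bool dev_trylock(void)
{
    return !atomic_flag_test_and_set(&dev_lock);
}

void dev_unlock(void)
{
    atomic_flag_clear(&dev_lock);
}

struct bg_work {
    int runs;       /* times the work body actually executed */
    int requeues;   /* times the work was deferred instead of blocking */
};

enum bg_result { BG_DONE, BG_REQUEUED };

/* Work handler that never sleeps on the lock. */
enum bg_result bg_handler(struct bg_work *w)
{
    assert(w != NULL);

    if (!dev_trylock()) {
        /* In the kernel this is where schedule_delayed_work() would go. */
        w->requeues++;
        return BG_REQUEUED;
    }

    w->runs++;      /* the real work would happen here */
    dev_unlock();
    return BG_DONE;
}
```

The point of the design is that the worker thread returns immediately when the lock is busy, so other items queued behind it are not stalled.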
Comment 18 Forgotten User ZhJd0F0L3x 2005-11-18 11:42:50 UTC
i also believe that other network drivers have similar problems. Once i am sure that ipw2200 is no longer causing the problem, i'll start trying to use r8169 again and i'd bet the fun will begin again.
Comment 19 Forgotten User ZhJd0F0L3x 2005-11-21 06:30:19 UTC
Created attachment 57834 [details]
sysrq-t of "gkrellm/ip/rcnetwork hangs" case

I am pretty sure i am using ipw2200 with Olaf's patch, but it hung again.
Just in case this may be something completely different, here is what i did:
- scpm switch
- when network was stopped, i pulled out the CardBus nic (i don't need it on the road anyway)
- scpm hung on "starting network", i checked with ps and the usual suspects were hanging in state D: gkrellm, ip
Comment 20 Olaf Kirch 2005-11-21 09:28:48 UTC
It's similar, but not the same. wpa_supplicant is still stuck in
ipw_request_direct_scan, holding the rtnl_lock, but events/0 isn't
blocked anymore - it's running (it would be useful to do a couple of sysrq-p
as well to see what it's doing)

And as you say, the scenario is a different one as well! This hang did not
happen with ipsec activity, but after pulling out the card.
Comment 21 Olaf Kirch 2005-11-21 09:34:03 UTC
Please try this on top of the previous patch (in ipw_abort_scan):

        if (priv->status & STATUS_SCAN_ABORTING) {
                IPW_DEBUG_HC("Ignoring concurrent scan abort request.\n");
                return;
        }
        priv->status |= STATUS_SCAN_ABORTING;
+       wake_up_interruptible(&priv->wait_state);

        err = ipw_send_scan_abort(priv);
Comment 22 Forgotten User ZhJd0F0L3x 2005-11-21 13:27:01 UTC
i used ipsec before :-)
The last hangs also did not happen when using ipsec, but _after_ using ipsec at the next profile switch.
I also have seen solid lockups while establishing the ipsec connection, but not since i stopped using r8169.

I applied the ipw_abort_scan patch and will try again.
Comment 23 Olaf Kirch 2005-12-19 10:48:38 UTC
Any new hangups with these patches? Otherwise I'd propose to close this
as resolved fixed.
Comment 24 Forgotten User ZhJd0F0L3x 2005-12-19 12:27:45 UTC
no. There are different hangups now (hard lockups, no sysrq possible) but those look unrelated. As soon as i can track them down, i'll open a new bug.
Comment 25 Forgotten User ZhJd0F0L3x 2006-01-05 12:20:36 UTC
can we _pretty please_ get this in the KOTD? This is hitting me several times a day and makes developing / testing network-switching applications like NetworkManager even more annoying :-)

So i'd really like to see this bug as resolved fixed as soon as the patches are in the kernel :-)
Comment 27 Olaf Kirch 2006-01-05 15:07:23 UTC
I looked into this again, and one thing that puzzles me is why the
SCAN_COMPLETED notification never seems to arrive.

One way for this to happen would be that the ipw_irq_tasklet gets stuck
when the HW reports a firmware error:

        ipw_irq_tasklet (IPW_INTA_BIT_FATAL_ERROR aka Firmware error)
        notify_wx_assoc_event
        wireless_send_event
        netlink_broadcast
        ...
        netlink_broadcast_deliver (calls sk->sk_data_ready on rtnl socket)
        rtnetlink_rcv
        rtnl_lock

But it doesn't look like this is what's happening here. It seems in all
three sysrq-t's ksoftirqd is in cond_resched() after the call to do_softirq.
Comment 28 Olaf Kirch 2006-01-05 15:21:32 UTC
Patch is in CVS. Please verify that this indeed fixes your problem.
Comment 29 Forgotten User ZhJd0F0L3x 2006-01-05 15:36:22 UTC
Thanks. I'll grab a kernel as soon as it falls into the kotd dir (and i have faster than 53.4kbit/s network again) and test it. If it doesn't fail for a week, i'll close the bug :-)
Comment 30 Forgotten User ZhJd0F0L3x 2006-01-09 06:28:41 UTC
does not help.
This time i am sure that there is no ppp / ipsec etc. involved at all, so i'm de-obfuscating the subject of this bug. Wired NIC is tg3 (if this has anything to do with it).

This time i did the following:
- scpm switch "test" # my "on the road"-scheme
- scpm switch "home".
it hung while "checking for services that need to be restarted", ps ax showed "ip link show..." was in state D (should i also provide ps auxf of the hanging cases?)

Attaching sysrq-t...
Comment 31 Forgotten User ZhJd0F0L3x 2006-01-09 06:33:19 UTC
Created attachment 62290 [details]
sysrq-t with 2.6.15-20060105152643-default

and just to answer the obvious question :-)

seife@strolchi:~> uname -a
Linux strolchi 2.6.15-20060105152643-default #1 Thu Jan 5 15:26:43 UTC 2006 i686 i686 i386 GNU/Linux
seife@strolchi:~> rpm -q --changelog kernel-default-2.6.15-20060105152643 | head
* Do Jan 05 2006 - okir@suse.de
- patches.fixes/ipw2200-lockup-fix: ipw2200 - release semaphore
  when sleeping in ipw_request_direct_scan (133513).

* Do Jan 05 2006 - olh@suse.de
- add kernel-kdump for i386, x86_64 and ppc64, with minimal config
Comment 32 Forgotten User ZhJd0F0L3x 2006-01-09 06:37:43 UTC
just another bit of information that escaped me before: The card was unassociated at the time it hung.
Profile "test" has "no authentication / encryption; dhcp client" configured.
Profile "home" has "WPA-PSK, static IP address" configured.
My access point only allows WPA-PSK so the card was probably still scanning for a network (and dhcpcd trying to send DHCP requests like mad :-)
Comment 33 Olaf Kirch 2006-01-09 16:04:41 UTC
An updated patch is in CVS now. Instead of trying to be clever, it
just returns EAGAIN if it finds there's a scan in progress. You may need
to fix wpa_supplicant to deal gracefully with this. Please test.
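
To illustrate what dealing gracefully could look like, here is a small self-contained model of both sides. It is a sketch under assumptions, not the actual patch or wpa_supplicant code: the status bit values, struct fake_dev and all function names are invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative bit values only; the driver's real values may differ. */
#define STATUS_SCANNING       (1u << 0)
#define STATUS_SCAN_ABORTING  (1u << 1)

struct fake_dev {
    unsigned status;
    int scans_started;
};

/* Driver side: refuse immediately instead of sleeping with locks held. */
int request_scan(struct fake_dev *d)
{
    assert(d != NULL);
    if (d->status & (STATUS_SCANNING | STATUS_SCAN_ABORTING))
        return -EAGAIN;
    d->status |= STATUS_SCANNING;
    d->scans_started++;
    return 0;
}

/* Simulates the running scan finishing between two attempts. */
void finish_scan(struct fake_dev *d)
{
    d->status &= ~(STATUS_SCANNING | STATUS_SCAN_ABORTING);
}

/*
 * Caller side: retry a bounded number of times. `between` is whatever
 * the caller does while waiting (sleep, process events); may be NULL.
 */
int request_scan_with_retry(struct fake_dev *d, int max_tries,
                            void (*between)(struct fake_dev *))
{
    int err = -EAGAIN;

    for (int i = 0; i < max_tries && err == -EAGAIN; i++) {
        err = request_scan(d);
        if (err == -EAGAIN && between != NULL)
            between(d);
    }
    return err;
}
```

With a bounded retry the caller gives up with -EAGAIN after max_tries attempts rather than blocking indefinitely.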

It would be interesting to run ipw2200 with debugging enabled - maybe 
this helps pin-point why the scan completion is never signalled to the
waiting process. Full debugging may be too much, a combination of
these may do it:

#define IPW_DL_ERROR         (1<<0)
#define IPW_DL_WARNING       (1<<1)
#define IPW_DL_INFO          (1<<2)
#define IPW_DL_NOTIF         (1<<10)
#define IPW_DL_SCAN          (1<<11)
#define IPW_DL_ASSOC         (1<<12)
#define IPW_DL_DROP          (1<<13)
#define IPW_DL_FW            (1<<16)
#define IPW_DL_RF_KILL       (1<<17)
#define IPW_DL_FW_ERRORS     (1<<18)
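
A small helper for turning a selection of these levels into the number to pass as the module's debug parameter. The defines are copied from the list above; the two function names are made up for this example. ORing all ten levels gives 474119 (0x73c07).

```c
#include <assert.h>
#include <stdint.h>

#define IPW_DL_ERROR         (1<<0)
#define IPW_DL_WARNING       (1<<1)
#define IPW_DL_INFO          (1<<2)
#define IPW_DL_NOTIF         (1<<10)
#define IPW_DL_SCAN          (1<<11)
#define IPW_DL_ASSOC         (1<<12)
#define IPW_DL_DROP          (1<<13)
#define IPW_DL_FW            (1<<16)
#define IPW_DL_RF_KILL       (1<<17)
#define IPW_DL_FW_ERRORS     (1<<18)

/* Mask with every level from the list above enabled. */
uint32_t ipw_debug_mask_all_listed(void)
{
    return IPW_DL_ERROR | IPW_DL_WARNING | IPW_DL_INFO |
           IPW_DL_NOTIF | IPW_DL_SCAN | IPW_DL_ASSOC | IPW_DL_DROP |
           IPW_DL_FW | IPW_DL_RF_KILL | IPW_DL_FW_ERRORS;
}

/* Returns 1 if every bit of `flag` is set in `mask`, else 0. */
int ipw_debug_mask_has(uint32_t mask, uint32_t flag)
{
    assert(flag != 0);
    return (mask & flag) == flag;
}
```

A narrower selection, for example only IPW_DL_SCAN | IPW_DL_FW_ERRORS, can be computed the same way if the full set is still too noisy.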
Comment 34 Olaf Kirch 2006-01-10 15:03:52 UTC
*** Bug 139332 has been marked as a duplicate of this bug. ***
Comment 35 Forgotten User ZhJd0F0L3x 2006-01-12 11:59:36 UTC
i added "debug=474119" to the module parameters and will check for messages in syslog.
This fix definitely should go upstream: shortly after inserting the latest and greatest ipw2200-1.0.10 i got from Joe, it hung again, so the problem is not fixed in newer upstream versions.
Our patched version works fine so far for me.
Comment 36 Olaf Kirch 2006-01-12 12:21:44 UTC
I submitted the patch, no reaction so far. I'll try again once I have
a confirmation from you that the patch really fixes the bug.
Comment 37 Olaf Kirch 2006-01-25 10:57:11 UTC
Any news? Does the patch fix the bug for you?
Comment 38 Forgotten User ZhJd0F0L3x 2006-01-25 11:37:16 UTC
i have not seen it again, so it looks like it is fixed.
I'll reopen if it happens again.
Comment 39 Olaf Kirch 2006-01-25 12:32:34 UTC
Mainline has this fix, but the ipw2200 update I did for Intel removed
it again. I fixed the patch now, but beta2 will be broken wrt this.
Just for the record.
Comment 40 Forgotten User ZhJd0F0L3x 2006-01-25 12:56:49 UTC
Ok. I also did see it hang on a 2.6.16-rc1-git3-3 i got from trenn to test his ACPI madness. But i did not dare to report anything with an alien kernel with unknown patch status ;-))