Bug 115200 - IPsec crashes kernel
Summary: IPsec crashes kernel
Status: RESOLVED DUPLICATE of bug 117749
Alias: None
Product: SUSE Linux 10.1
Classification: openSUSE
Component: Kernel (show other bugs)
Version: Beta 2
Hardware: All All
: P5 - None : Blocker (vote)
Target Milestone: ---
Assignee: Marius Tomaschewski
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-09-04 09:14 UTC by Andreas Schwab
Modified: 2006-02-16 15:54 UTC (History)
6 users (show)

See Also:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Oops (1.95 KB, text/plain)
2005-09-05 14:43 UTC, Olaf Kirch
Details
Oops - cube in transport mode (1.65 KB, text/plain)
2005-09-05 22:15 UTC, Marius Tomaschewski
Details
patch-2.6.13-rc2-git3-21 (15.59 KB, patch)
2005-09-06 23:58 UTC, Olaf Hering
Details | Diff
ipsec-crash-bug115200.patch (1.64 KB, patch)
2005-09-07 07:48 UTC, Olaf Hering
Details | Diff
Possible patch (619 bytes, patch)
2005-09-07 16:07 UTC, Olaf Kirch
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Andreas Schwab 2005-09-04 09:14:15 UTC
The kernel crashes during startup of IPsec when trying to transmit packages 
over the tunnel before it is fully set up.
Comment 1 Andreas Schwab 2005-09-04 10:04:27 UTC
It appears that just enough load is needed to trigger the bug any time, not 
only during startup. 
Comment 2 Andreas Jaeger 2005-09-04 12:47:18 UTC
Is this on PowerPC?
Comment 3 Andreas Schwab 2005-09-04 14:23:14 UTC
Perhaps related: 
<http://marc.theaimsgroup.com/?l=linux-kernel&m=112578092026623&w=2> 
Comment 4 Olaf Kirch 2005-09-05 07:37:18 UTC
Could you provide one or more oopses please? 
Comment 5 Andreas Schwab 2005-09-05 07:55:30 UTC
There were none, not even xmon did say anything. 
Comment 6 Marius Tomaschewski 2005-09-05 09:38:34 UTC
I've contacted Olaf Hering and we can use lychee (attached to HMC)
to test it again on PPC.

BTW: I'm running beta4 since 4 days on my gateway (ipsec + nat-t)
and have no problems with it. It is a PII-Dual machine.

On i386 it seems to work without problems. The only exception is,
that a default setting of "plutowait=yes" was changed to "no" in
the new openswan version. This causes, that the tunnels are now
created on demand and the transmission of the first package per
tunnel fails (e.g. ping reports "temporary unavaliable" error).
Setting it back to "plutowait=yes" workarrounds the problem.
Comment 7 Lars Müller 2005-09-05 09:39:24 UTC
Same with my Beta 4 x86 laptop.
Comment 8 Andreas Schwab 2005-09-05 09:41:48 UTC
plutowait does not make any difference. 
Comment 9 Marius Tomaschewski 2005-09-05 09:51:58 UTC
Yes, plutowait changes the pluto behaviour (user space) only
and it seems, there is a kernel problem.
Comment 10 Lars Müller 2005-09-05 10:11:12 UTC
The system running openswan is behind a SuSE Linux 9.3 router/ gateway and I
have to do IPsec NAT-T as the internal network is a private.

I also have plutowait=yes in the config.

sysrq is active,  but doesn't help in the situation.  I have to turn off the system.

/etc/sysconfig/syslog:KERNEL_LOGLEVEL=7
Comment 11 Andreas Schwab 2005-09-05 10:15:32 UTC
NAT-T also does not make a difference. 
Comment 12 Marius Tomaschewski 2005-09-05 12:05:33 UTC
I've reproduced the problem, but not on the system where I've tryed to do it :-)

My ipsec gw to suse-ras is crashed, while I've tryed to "torture" a test system.
As I wrote, it transfered about 1,2GB in last 4 days and today I've rsynced the
beta4 CD ISOs without any problems:

# l beta4-cds.log
-rw-r--r--  1 root root 292 Sep  5 12:55 beta4-cds.log
# tail -2 beta4-cds.log
wrote 116 bytes  read 3594998622 bytes  596729.81 bytes/sec
total size is 3594559488  speedup is 1.00

About 15min after the sync was finished, the machine stopped. No reaction on
SYSRQ, NUM-Lock, not possible to wake up the screen saver on the text console.

A try to boot the system failed, the last boot message I was able to read was:
  Boot logging started on /dev/tty1(/dev/console) ...
after this two messages was displayed in a loop:
  ERROR: no hit for procs_blocked
  ERROR: no hit for procs_running
A reiserfsck --rebuild-tree /dev/sda3 (normal reiserfsck displayed no errors)
has put 13 files into /lost+found...

There are no log messages about any error. The last message from Pluto is, that
it has received a "Delete SA payload: replace IPSEC State" and tryed to reinit
the connection.
Comment 13 Marius Tomaschewski 2005-09-05 12:14:20 UTC
The test machines: PPC with Beta4 kernel+openswan vs. 9.3-i386 + openswan-2.4.0
are still working fine with load average: 4.13, 2.35, 1.64 on the PPC machine.
Comment 14 Marius Tomaschewski 2005-09-05 12:33:06 UTC
Jiri, can you take a look on this please?

There are 3 machines that we can use: g167.suse.de (currently 9.3-i386),
lychee.suse.de and cube.suse.de, both PPC64 and attached to HMC recheable
via "vncviewer boettger.suse.de:42".

lychee.suse.de currently runs a tunnel to g167 and has enabled the xmon
debugger.
Comment 15 Marius Tomaschewski 2005-09-05 12:47:15 UTC
Now, there is also a tunnel between g167 and cube.suse.de...
Comment 16 Olaf Kirch 2005-09-05 14:36:22 UTC
"ERROR: no hit for procs_blocked" seems to come from /sbin/startpar  
trying to read /proc/stat. It seems something is hosing the system  
memory big time.  
  
This is probably related to the L-K discussion referenced in comment #3,  
as that seems to be a case of bad memory corruption as well (the fact that  
it died in sha1_update is probably a red herring).  
Comment 17 Olaf Kirch 2005-09-05 14:43:06 UTC
Okay, I have an oops now. Simply did a ping -i 0.01 -s 4096 10.10.103.167 
on cube. 
Comment 18 Olaf Kirch 2005-09-05 14:43:42 UTC
Created attachment 48814 [details]
Oops
Comment 19 Olaf Kirch 2005-09-05 14:53:32 UTC
This is 100% reproduceable. It oopsed in the exact same location, with the 
exact same backtrace.  
 
(The oops attached above is a bit mangled though - my mouse sometimes pastes 
things twice; in this case please ignore everything after the first 
syscall_exit) 
Comment 20 Olaf Kirch 2005-09-05 14:58:51 UTC
The way an skbuff is laid out in memory, we have an skb head pointing 
to some memory that holds the packet data, followed by some "shared info". 
If we write past the end of the packet data, we clobber this shared info, 
which among other things contains an array of struct page pointers. 
Calling put_page on random "addresses" is a great way to corrupt your 
memory. 
Comment 21 Olaf Kirch 2005-09-05 15:01:39 UTC
Now we just need to find out if this is the tunnel mode stuff writing 
past the end of the packet, or the ESP code. 
 
Marius, could you check if this happens in transport mode as well? 
Or in tunnel mode, but AH mode only? 
 
I will be in a phone call in 2 minutes from now, so feel free to take 
cube. 
Comment 22 Marius Tomaschewski 2005-09-05 22:15:17 UTC
Created attachment 48858 [details]
Oops - cube in transport mode
Comment 23 Marius Tomaschewski 2005-09-06 08:29:21 UTC
If I change the config to
        auth=ah
        type=tunnel

no Oops happens, the packages are transfered:

02:52:16.762251 00:00:1c:b5:6f:96 > 00:02:55:6f:97:29, ethertype IPv4 (0x0800),
length 98: IP (tos 0x0, ttl  64, id 15086, offset 1480, flags [none], length:
84) 10.10.103.167 > 10.10.1.180: ah
02:52:16.762670 00:00:1c:b5:6f:96 > 00:02:55:6f:97:29, ethertype IPv4 (0x0800),
length 1514: IP (tos 0x0, ttl  64, id 15087, offset 0, flags [+], length: 1500)
10.10.103.167 > 10.10.1.180: AH(spi=0xed7cccc2,sumlen=16,seq=0x10da):
ESP(spi=0xcb3fa21b,seq=0x10da)
Comment 24 Olaf Hering 2005-09-06 16:26:22 UTC
it happens also with vanilla kernel, mailed netdev.
Comment 25 Olaf Hering 2005-09-06 19:28:09 UTC
2.6.12 works, 2.6.13-rc3 crashed like that:

 cpu 0x1: Vector: 300 (Data Access) at [c00000004b92f220]
    pc: c000000000211950: .memcpy+0x1d0/0x280
    lr: d0000000001a87e4: .esp_output+0x3f4/0x560 [esp4]   
    sp: c00000004b92f4a0
   msr: 8000000000009032
   dar: 4c766234
 dsisr: 42000000
  current = 0xc00000004b9c8030
  paca    = 0xc000000000517800
    pid   = 3949, comm = ping
enter ? for help
1:mon> t
[link register   ] d0000000001a87e4 .esp_output+0x3f4/0x560 [esp4]
[c00000004b92f4a0] d0000000001a8548 .esp_output+0x158/0x560 [esp4] (unreliable)
[c00000004b92f580] c0000000003b04e8 .xfrm4_output+0x238/0x450
[c00000004b92f640] c00000000036c188 .ip_push_pending_frames+0x4d8/0x590
[c00000004b92f700] c00000000038fbac .raw_sendmsg+0x66c/0x840
[c00000004b92f830] c00000000039b7cc .inet_sendmsg+0x6c/0xb0 
[c00000004b92f8d0] c00000000032a41c .sock_sendmsg+0x16c/0x1c0
[c00000004b92fae0] c00000000032b4c4 .sys_sendmsg+0x204/0x340
[c00000004b92fd10] c00000000034da24 .compat_sys_sendmsg+0x14/0x30
[c00000004b92fd90] c00000000034e400 .compat_sys_socketcall+0x290/0x2b0
[c00000004b92fe30] c00000000000d600 syscall_exit+0x0/0x18
--- Exception: c01 (System Call) at 0000000007f518f8
SP (ffbd38a0) is in userspace





 looking further.
Comment 26 Andreas Schwab 2005-09-06 21:38:40 UTC
This is a different bug that was fixed in rc4, see 
<http://marc.theaimsgroup.com/?l=linux-kernel&m=112183465507337&w=2>. 
Comment 27 Olaf Hering 2005-09-06 23:18:27 UTC
its in the first 34 of ~olh/patch*
Comment 28 Olaf Hering 2005-09-06 23:58:35 UTC
Created attachment 49008 [details]
patch-2.6.13-rc2-git3-21

this causes the oops
Comment 29 Olaf Hering 2005-09-07 07:43:47 UTC
The patch Herbert posted to the netdev list fixes the crash.
Odd, I can ping ok with vanilla 2.6.13. But I cant with our current cvs + the
patch from Herbert.
cube runs the "fixed" kernel. Please check why ping doesnt work, while it does
with vanilla 2.6.13.


g224:~ # ssh root@g167.suse.de "rcipsec restart"
g224:~ # ssh root@cube.suse.de "rcipsec restart ; sync"
g224:~ # ssh root@cube.suse.de "sync ; sleep 1 ; ping -i 0.01 -s 4096 10.10.103.167"
Comment 30 Olaf Hering 2005-09-07 07:48:59 UTC
Created attachment 49017 [details]
ipsec-crash-bug115200.patch

patch from Herbert which fixes the crash in put_page
Comment 31 Olaf Kirch 2005-09-07 07:58:20 UTC
re comment #29: g167 replies to the echo request, cube seems to unwrap it 
properly, but something's still wrong... 
Comment 32 Olaf Kirch 2005-09-07 10:49:16 UTC
The unwrapped and reassembled packet has a bad IP checksum. Still checking  
Comment 33 Olaf Kirch 2005-09-07 11:15:30 UTC
Marius, can you confirm that this works properly with 9.3 talking to 9.3? 
 
Comment 34 Olaf Kirch 2005-09-07 11:38:30 UTC
When I use "ping -s 1500", I can see two packets coming into cube. 
 
Packet1: Fragment, 1480 octets, checksum 1234 
Packet2, Fragment, 48 bytes, offset 1480, checksum ABCD 
 
This is then reassembled into one big packet of 1508 bytes; simply 
copying the checksum from the first packet. 
Comment 35 Olaf Hering 2005-09-07 11:42:12 UTC
can I reboot into a vanilla kernel? just type reboot if yes
Comment 36 Olaf Hering 2005-09-07 11:51:39 UTC
works with vanilla kernel, must be one of our patches.
Comment 37 Olaf Kirch 2005-09-07 11:53:24 UTC
interesting... 
Comment 38 Olaf Kirch 2005-09-07 12:01:57 UTC
Can you please try reverting the following patches: 
 
        patches.suse/netfilter-02-output-hooks 
        patches.suse/netfilter-03-input-hooks 
        patches.suse/netfilter-04-policy-lookup 
        patches.suse/netfilter-05-policy-checks 
        patches.suse/netfilter-06-policy-match 
 
Comment 39 Olaf Hering 2005-09-07 12:39:53 UTC
yes, this helps.

-       patches.suse/netfilter-02-output-hooks
-       patches.suse/netfilter-03-input-hooks
-       patches.suse/netfilter-04-policy-lookup
-       patches.suse/netfilter-05-policy-checks
-       patches.suse/netfilter-06-policy-match
++okir  patches.suse/netfilter-02-output-hooks
++okir  patches.suse/netfilter-03-input-hooks
++okir  patches.suse/netfilter-04-policy-lookup
++okir  patches.suse/netfilter-05-policy-checks
++okir  patches.suse/netfilter-06-policy-match
Comment 40 Marius Tomaschewski 2005-09-07 13:00:54 UTC
(In reply to comment #33)
> Marius, can you confirm that this works properly with 9.3 talking to 9.3? 
>  

Yes, it works fine on 9.3 (openswan-2.4.0rc4 => openswan-2.2.0 in this case):

root@Nirwana:~ # uname -a
Linux Nirwana 2.6.11.4-21.8-smp #1 SMP Tue Jul 19 12:42:37 UTC 2005 i686 i686
i386 GNU/Linux

root@Nirwana:~ # rpm -q openswan
openswan-2.4.0-4.1

root@Nirwana:~ # ping -i 0.01 -s 4096 ras.suse.de
PING ras.suse.de (10.0.9.7) 4096(4124) bytes of data.
4104 bytes from 10.0.9.7: icmp_seq=1 ttl=62 time=98.8 ms
[...]

--- ras.suse.de ping statistics ---
30 packets transmitted, 6 received, 80% packet loss, time 361ms
rtt min/avg/max/mdev = 98.807/192.206/286.146/64.127 ms, pipe 24

root@Nirwana:~ # ping -i 0.01 -s 4096 app-team
PING app-team.suse.de (10.10.2.53) 4096(4124) bytes of data.
4104 bytes from 10.10.2.53: icmp_seq=1 ttl=62 time=96.2 ms
[...]

--- app-team.suse.de ping statistics ---
19 packets transmitted, 3 received, 84% packet loss, time 232ms
rtt min/avg/max/mdev = 96.263/130.920/166.074/28.503 ms, pipe 13


I've tryed other sizes (1024, 1500, 2048, ...) as well.
Comment 41 Marius Tomaschewski 2005-09-07 13:05:17 UTC
Ahmm.. the ipsec peer is suse-ras (sles9) in above case.
Comment 42 Marius Tomaschewski 2005-09-07 14:06:58 UTC
NO, it doesn't. I've tested it on 9.3-i386 <=> 9.3-i386 now using
different openswan versions:

  openswan 2.4.0rc4   vs. 2.4.0rc4
  openswan 2.2.0+fix  vs. openswan 2.2.0+fix (enckeylen, Bug #95863)
  openswan 2.2.0(DVD) vs. openswan 2.2.0(DVD)

root@Xanthos:~ # uname -a
Linux Xanthos 2.6.11.4-21.8-smp #1 SMP Tue Jul 19 12:42:37 UTC 2005 i686 i686
i386 GNU/Linux
Trinity:~ # uname -a
Linux Trinity 2.6.11.4-21.7-smp #1 SMP Thu Jun 2 14:23:14 UTC 2005 i686 i686
i386 GNU/Linux

=> using -s 1024 works, using (1500 or) 4096 not - the peer (xanthos) does
  not response to the ping, independently from the openswan version:

16:04:05.451754 00:40:05:40:91:02 > 00:e0:81:24:21:ca, ethertype IPv4 (0x0800),
length 1514: IP (tos 0x0, ttl  64, id 40279, offset 0, flags [+], length: 1500)
192.168.110.5 > 192.168.110.1: ESP(spi=0x8f1cbcd2,seq=0xbdd3)
16:04:05.451939 00:40:05:40:91:02 > 00:e0:81:24:21:ca, ethertype IPv4 (0x0800),
length 86: IP (tos 0x0, ttl  64, id 40279, offset 1480, flags [none], length:
72) 192.168.110.5 > 192.168.110.1: esp
16:04:05.452384 00:40:05:40:91:02 > 00:e0:81:24:21:ca, ethertype IPv4 (0x0800),
length 1514: IP (tos 0x0, ttl  64, id 40278, offset 0, flags [+], length: 1500)
192.168.110.5 > 192.168.110.1: icmp 1480: echo request seq 1
16:04:05.452488 00:40:05:40:91:02 > 00:e0:81:24:21:ca, ethertype IPv4 (0x0800),
length 1514: IP (tos 0x0, ttl  64, id 40280, offset 0, flags [+], length: 1500)
192.168.110.5 > 192.168.110.1: ESP(spi=0x8f1cbcd2,seq=0xbdd4)
16:04:05.452506 00:40:05:40:91:02 > 00:e0:81:24:21:ca, ethertype IPv4 (0x0800),
length 86: IP (tos 0x0, ttl  64, id 40280, offset 1480, flags [none], length:
72) 192.168.110.5 > 192.168.110.1: esp
16:04:05.452506 00:40:05:40:91:02 > 00:e0:81:24:21:ca, ethertype IPv4 (0x0800),
length 1514: IP (tos 0x0, ttl  64, id 40278, offset 1480, flags [+], length:
1500) 192.168.110.5 > 192.168.110.1: icmp
16:04:05.453105 00:40:05:40:91:02 > 00:e0:81:24:21:ca, ethertype IPv4 (0x0800),
length 1230: IP (tos 0x0, ttl  64, id 40281, offset 0, flags [none], length:
1216) 192.168.110.5 > 192.168.110.1: ESP(spi=0x8f1cbcd2,seq=0xbdd5)
16:04:05.453105 00:40:05:40:91:02 > 00:e0:81:24:21:ca, ethertype IPv4 (0x0800),
length 1178: IP (tos 0x0, ttl  64, id 40278, offset 2960, flags [none], length:
1164) 192.168.110.5 > 192.168.110.1: icmp
16:04:05.453105 00:40:05:40:91:02 > 00:e0:81:24:21:ca, ethertype IPv4 (0x0800),
length 4138: IP (tos 0x0, ttl  64, id 40278, offset 0, flags [none], length:
4124, bad cksum 5a73 (->7033)!) 192.168.110.5 > 192.168.110.1: icmp 4104: echo
request seq 1
16:04:10.451210 00:40:05:40:91:02 > 00:e0:81:24:21:ca, ethertype ARP (0x0806),
length 60: arp who-has 192.168.110.1 tell 192.168.110.5
16:04:10.451238 00:e0:81:24:21:ca > 00:40:05:40:91:02, ethertype ARP (0x0806),
length 42: arp reply 192.168.110.1 is-at 00:e0:81:24:21:ca
Comment 43 Marius Tomaschewski 2005-09-07 14:12:18 UTC
But no crash happens here.
Comment 44 Olaf Kirch 2005-09-07 16:03:30 UTC
Good. So the problem is that these policy patches cause the reassembled
packet to be routed through ip_rcv once more, which checks the IP csum.
This needs an added ip_send_check somewhere (preferably somewhere in
xfrm_input.c).

I will not have any time to debug this tomorrow, so if someone would
like to try this, be my guest. Jiri?
Comment 45 Olaf Kirch 2005-09-07 16:07:24 UTC
Created attachment 49085 [details]
Possible patch

What the heck. Heres a patch you may want to try.
Comment 46 Jiri Bohac 2005-09-07 18:07:27 UTC
Sorry, I won't be able to look into this until next week.
Comment 48 Olaf Kirch 2005-09-08 07:04:20 UTC
I'm pretty positive the patch above should fix it. If it doesn't, I'll
get back to this tomorrow.
Comment 49 Olaf Hering 2005-09-08 07:51:12 UTC
this patch fixes it.
Comment 50 Olaf Kirch 2005-09-08 14:45:42 UTC
Great! Thanks for verifying this. 
I committed the patch to the HEAD branch. 
Comment 51 Lars Müller 2005-09-19 10:30:52 UTC
I still have problems with the ipsec and RC4.  After some time (10-15 minutes)
my system got frozen.  And this only happens if I use ipsec.
Comment 52 Olaf Hering 2005-09-19 11:12:25 UTC
thats nice.
open a new bug. this one is about some other and already fixed crash.
Comment 53 Andreas Schwab 2005-10-09 10:43:01 UTC
Still not fixed. 
Comment 54 Andreas Schwab 2005-10-10 09:42:56 UTC
Yes. 
Comment 55 Olaf Kirch 2005-10-10 11:26:06 UTC
Please continue the discussion in bug #117749. 
Comment 56 Andreas Schwab 2005-10-10 12:05:14 UTC
No, that's a different bug. 
Comment 57 Olaf Kirch 2005-10-10 12:14:43 UTC
I'm willing to be educated... 
  
Please describe what symptoms you are seeing. 
Comment 58 Andreas Schwab 2005-10-10 12:37:07 UTC
See comment #1. 
Comment 59 Lars Müller 2005-10-24 14:33:21 UTC
I've updated the system to 2.6.14-rc5-20051021130504-default.

No crash till now but again several times 'Resource temporarily unavailable'.
Comment 60 Olaf Kirch 2005-11-11 14:16:29 UTC
I'm sorry, but with all the other stuff going on I currently don't have
the time to debug this properly.

Marius and/or Jiri, can you look into this please and try to reproduce?
Comment 61 Marius Tomaschewski 2005-11-25 19:05:33 UTC
I've tryed it in the last two weeks. What happens, is not directly a crash
but a system lockup/deadlock. And it depends on ipsec / ipsec triggers it.
Most of my tests was with openswan with NAT traversal behind a router, but
it happens with racoon as well.

Without ipsec the machines are running stable.

I've tryed it with the original 10.0 kernel-smp-2.6.13-15, the current one
(kernel-smp-2.6.13-15.4) from /work/CDs/all/full-10.0-i386 with enabled nmi_watchdog=1 as well as with kernel-debug-2.6.13-15.3 with kdb=on.

I never seen any Oops.

The kind of the lockup differs every time. Sometimes it happens directly
while ipsec establishes the SAs, sometimes it works fine for multiple
hours and "stops without any reason".

Interesting is, that the lockups are different to each other. Every time,
the machine closes all connections (e.g. a ssh to it) and serial console
and keyboard does not respond any more.

In many cases it is also not possible to ping the machine. But sometimes
(rarely) I can still access the internet behind the machine, except the
tunnelled networks and the machine itself.

For example, yesterday night I've transferred 18GB over the tunnel and
then the tunnel expired because the IP changes on the router.

In the hope, the current kernel-smp-2.6.13-15.4 works better, I restarted
the tunnel and contacted Lars via IRC to tell him about. And then, after
one half hour the lockup happened again.
Nothing worked (keyboard, serial console was not responding) and I phoned
Uwe to write Lars, that I've a lockup again. He wrote it and after ~2 min
the message he wrote, appeared in my IRC client.
This means, that the tunnel "works", but the machine is busy with something.

Strange.
Comment 62 Christoph Thiel 2006-01-25 16:06:55 UTC
What's the status of this bug? Isn't this a dup of bug #117749?
Comment 63 Lars Müller 2006-01-25 16:56:38 UTC
See comment #56
Comment 64 Christoph Thiel 2006-01-25 17:44:47 UTC
Alright, I see...
Comment 65 Olaf Kirch 2006-01-30 16:24:41 UTC
I committed a patch to CVS that may fix this problem. Please try the
upcoming KOTD.
Comment 66 Andreas Jaeger 2006-02-06 19:17:15 UTC
Christoph, is this fixed?
Comment 67 Christoph Thiel 2006-02-06 19:20:10 UTC
No, it isn't. With the patch from attachement #49085, I didn't had any crashes -- but that patch had some bad side effects in ipv6 AFAIK, that's why it's disabled right now. Olaf?
Comment 68 Christoph Thiel 2006-02-08 12:14:55 UTC
I'v been using/testing the latests patch in STABLE

* Mo Feb 06 2006 - mason@suse.de
- patches.fixes/xfrm-endless-loop: Fix infinite loop in
  xfrm_lookup (117749, 115200).  Add error_nopol: back

and didn't had a single crash since then. Just now I saw this messages in /var/log/messages:

Feb  8 13:09:56 t41p klogd: xfrm_lookup bailing out after 11 loops
Feb  8 13:09:56 t41p klogd:  [<c0277dac>] xfrm_lookup+0x9c/0x473
Feb  8 13:09:56 t41p klogd:  [<c024fc51>] ip_output+0x1e3/0x242
Feb  8 13:09:56 t41p klogd:  [<c0115912>] default_wake_function+0x0/0xc
Feb  8 13:09:56 t41p klogd:  [<c0276d49>] xfrm4_output_finish+0x8a/0x36a
Feb  8 13:09:56 t41p klogd:  [<c0248738>] ip_route_output_flow+0x77/0x8b
Feb  8 13:09:56 t41p klogd:  [<c02643bb>] raw_sendmsg+0x2e6/0x70f
Feb  8 13:09:56 t41p klogd:  [<c0263f61>] raw_recvmsg+0x6e/0x171
Feb  8 13:09:56 t41p klogd:  [<c026af1e>] inet_sendmsg+0x35/0x3f
Feb  8 13:09:56 t41p klogd:  [<c0229dde>] sock_sendmsg+0xd2/0xec
Feb  8 13:09:56 t41p klogd:  [<c0127a23>] autoremove_wake_function+0x0/0x2d
Feb  8 13:09:56 t41p klogd:  [<c0127a23>] autoremove_wake_function+0x0/0x2d
Feb  8 13:09:56 t41p klogd:  [<c0280df5>] _spin_unlock_irq+0x5/0x7
Feb  8 13:09:56 t41p klogd:  [<c027fc2b>] schedule+0x4dd/0x4f7
Feb  8 13:09:56 t41p klogd:  [<c01ae0c3>] copy_from_user+0x54/0x80
Feb  8 13:09:56 t41p klogd:  [<c022fa5a>] verify_iovec+0x3e/0x70
Feb  8 13:09:56 t41p klogd:  [<c022a209>] sys_sendmsg+0x186/0x1df
Feb  8 13:09:56 t41p klogd:  [<c022aa72>] sys_recvmsg+0x1a1/0x1b5
Feb  8 13:09:56 t41p klogd:  [<c0240002>] tc_ctl_tclass+0x1b8/0x1dc
Feb  8 13:09:56 t41p klogd:  [<c0139672>] get_page_from_freelist+0x6f/0x2c4
Feb  8 13:09:56 t41p klogd:  [<c012a110>] ktime_get+0x10/0x2c
Feb  8 13:09:56 t41p klogd:  [<c0280df5>] _spin_unlock_irq+0x5/0x7
Feb  8 13:09:56 t41p klogd:  [<c027fbe7>] schedule+0x499/0x4f7
Feb  8 13:09:56 t41p klogd:  [<c0139510>] __free_pages+0x3f/0x107
Feb  8 13:09:56 t41p klogd:  [<c0280e34>] _spin_lock_irqsave+0x9/0xd
Feb  8 13:09:56 t41p klogd:  [<c022b236>] sys_socketcall+0x183/0x19e
Feb  8 13:09:56 t41p klogd:  [<c011baa1>] sys_gettimeofday+0x27/0x53
Feb  8 13:09:56 t41p klogd:  [<c0102a5b>] sysenter_past_esp+0x54/0x79
Comment 69 Thorsten Kukuk 2006-02-15 14:54:45 UTC
this messages should have vanished with current kernels, can somebody please
retest if this is now fixed?
Comment 70 Dirk Mueller 2006-02-15 14:59:03 UTC
afaik the upstream patch hasn't been merged into the SUSE tree yet. 
Comment 71 Christoph Thiel 2006-02-15 14:59:30 UTC
For what I can tell it seems to be fixed -- it didn't crash at least after this change:

* Do Feb 09 2006 - okir@suse.de
- patches.fixes/xfrm-endless-loop: Updated with upstream patch (117749, 115200).


But I still have recurring "connection hangs", where we run into some kind of timeout (with ras-beta), which of course doesn't have much todo with this bug ;)

Feel free to close FIXED. 
Comment 72 Olaf Kirch 2006-02-16 15:54:30 UTC
Until someone proves me otherwise, I consider this as a duplicte of 117749

*** This bug has been marked as a duplicate of 117749 ***