Bug 146136

Summary: LTC21180-Unable to initialize qeth/dasd devices with CONFIG_DEBUG_SLAB enabled
Product: [openSUSE] SUSE Linux 10.1 Reporter: Jan Blunck <jblunck>
Component: Kernel Assignee: Frank Pavlic <pavlic>
Status: VERIFIED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: big-iron, gjlynx, hannsj_uhl, ihno
Version: Beta 2   
Target Milestone: ---   
Hardware: S/390-64   
OS: Other   
Whiteboard:
Found By: Development Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: slab2616.diff
x3270 log

Description Jan Blunck 2006-01-27 12:12:03 UTC
When building the kernel with CONFIG_DEBUG_SLAB the dasd and qeth devices fail to come up. AFAIR there have been some alignment issues with CONFIG_DEBUG_SLAB on s390. I don't know if this is really a qeth/dasd problem or a generic problem with buffers and IO.

Frank, I still assign that one to you since we already discussed that issue in bug 144973 (LTC20908- qeth: Kernel oops (NULL pointer dereference)).

The problem with the dasds is that sometimes they are not detected at all. This varies from rebuild to rebuild.

The problem with qeth is as follows, when setting the device online:

s390vm01:/sys/bus/ccwgroup/drivers/qeth/0.0.0700 # echo 1 > online
echo 1 > online
qdio : received check condition on establish queues on irq 0.0.4 (cs=x20, ds=xc).
qdio : received check condition on activate queues on device 0.0.0702 (cs=x20, ds=xe).
qeth: Recovery of device 0.0.0700 started ...
qeth: Device 0.0.0700 could not be recovered!
qeth: sense data available on channel 0.0.0700.
qeth:  cstat 0x0
 dstat 0xE
qeth: irb: 00 c2 60 17  0c ba 10 48  0e 00 10 00  00 80 00 00
qeth: irb: 01 02 00 00  00 00 00 00  00 00 00 00  00 00 00 00
qeth: sense data: 02 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00
qeth: sense data: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00
qeth: Recovery of device 0.0.0700 started ...
qdio : got interrupt for queues in state 3 on device 0.0.0702?!
qdio : got interrupt for queues in state 3 on device 0.0.0702?!
qeth: Initialization in hardsetup failed! rc=-5
qeth: Retrying to do IDX activates.
qdio : got interrupt for queues in state 3 on device 0.0.0702?!
s390vm01:/sys/bus/ccwgroup/drivers/qeth/0.0.0700 # qeth: Retrying to do IDX activates.
qdio : got interrupt for queues in state 3 on device 0.0.0702?!
qeth: Retrying to do IDX activates.
qdio : got interrupt for queues in state 3 on device 0.0.0702?!
qeth: Initialization in hardsetup failed! rc=-62
qeth: Device 0.0.0700 could not be recovered!
Comment 1 LTC BugProxy 2006-01-31 15:50:11 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Owner|gjlynx@us.ibm.com           |h.carstens@de.ibm.com




------- Additional Comments From pavlic@de.ibm.com  2006-01-31 10:48 EDT -------
Reassigning this bugzilla to Heiko ...

Frank 
Comment 2 LTC BugProxy 2006-02-14 19:15:25 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Owner|h.carstens@de.ibm.com       |pavlic@de.ibm.com
           Severity|normal                      |low




------- Additional Comments From h.carstens@de.ibm.com(prefers email via heiko.carstens@de.ibm.com)  2006-02-14 14:09 EDT -------
Without deeper knowledge of QDIO I'm not able to debug this. Frank, why are
these check conditions generated? 
Comment 3 LTC BugProxy 2006-04-10 12:00:26 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |cborntra@de.ibm.com




------- Additional Comments From cborntra@de.ibm.com  2006-04-10 07:55 EDT -------
I have no clue about the dasds (you don't mean fcp devices?) but it seems that I 
have found the alignment problem in qdio.c.

The qib structure must be aligned to 256 bytes AND is embedded in the 
qdio_irq structure. That alignment cannot be guaranteed with slab debugging.
If you force the qdio_irq structure onto a page boundary, qeth works for me with 
CONFIG_DEBUG_SLAB. See the patch below (the whitespace is broken due to cut 
and paste into this bugzilla)

diff -u -p -r1.3 qdio.c
--- drivers/s390/cio/qdio.c     4 Apr 2006 07:25:26 -0000       1.3
+++ drivers/s390/cio/qdio.c     10 Apr 2006 11:51:54 -0000
@@ -1637,7 +1637,7 @@ next:

        }
        kfree(irq_ptr->qdr);
-       kfree(irq_ptr);
+       free_page((unsigned long) irq_ptr);
 }

 static void
@@ -2984,7 +2984,7 @@ qdio_allocate(struct qdio_initialize *in
        qdio_allocate_do_dbf(init_data);

        /* create irq */
-       irq_ptr=kmalloc(sizeof(struct qdio_irq), GFP_KERNEL | GFP_DMA);
+       irq_ptr=(void *) get_zeroed_page(GFP_KERNEL | GFP_DMA);

        QDIO_DBF_TEXT0(0,setup,"irq_ptr:");
        QDIO_DBF_HEX0(0,setup,&irq_ptr,sizeof(void*));
@@ -2994,14 +2994,13 @@ qdio_allocate(struct qdio_initialize *in
                return -ENOMEM;
        }

-       memset(irq_ptr,0,sizeof(struct qdio_irq));

        init_MUTEX(&irq_ptr->setting_up_sema);

        /* QDR must be in DMA area since CCW data address is only 32 bit */
        irq_ptr->qdr=kmalloc(sizeof(struct qdr), GFP_KERNEL | GFP_DMA);
        if (!(irq_ptr->qdr)) {
-               kfree(irq_ptr);
+               free_page((unsigned long) irq_ptr);
                QDIO_PRINT_ERR("kmalloc of irq_ptr->qdr failed!\n");
                return -ENOMEM;
                }

Let me know if this patch works. 
Comment 4 LTC BugProxy 2006-04-10 12:01:07 UTC
Created attachment 77517 [details]
slab2616.diff
Comment 5 LTC BugProxy 2006-04-10 12:01:09 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Owner|pavlic@de.ibm.com           |cborntra@de.ibm.com




------- Additional Comments From cborntra@de.ibm.com  2006-04-10 07:58 EDT -------
 
patch against 2.6.16 which makes qeth work with CONFIG_DEBUG_SLAB

please test this patch and let me know if it works. we can then make an official
patch. 
Comment 6 Hanns-Joachim Uhl 2006-04-25 13:26:51 UTC
Hello Jan,
I am assigning this bugzilla back to you ...
... can you please test whether the attached patch resolves the problem in this bugzilla ..?
Thanks in advance for your support.
Comment 7 Jan Blunck 2006-04-26 17:53:48 UTC
Created attachment 80349 [details]
x3270 log

Thanks for the patch. I have good and bad news. The QDIO issue seems to be fixed. But the SLAB debugger found the following bug in our latest kernel:

Unable to handle kernel pointer dereference at virtual kernel address 6b6b6b6b6b6b6000 
Oops: 0038 [#1] 
CPU:    0    Not tainted 
Process ifup (pid: 1211, task: 00000000011a2150, ksp: 000000000d56bd88) 
Krnl PSW : 0704200180000000 0000000010a983f6 (qeth_hard_start_xmit+0x1dda/0x2218
 [qeth]) 
Krnl GPRS: 0000000000000006 6b6b6b6b6b6b6ba5 0000000000000001 0000000000000000 
           0000000010a97c94 0000000000000001 000000000f8a8000 000000000f8a9c10 
           0000000000e57000 000000000f8a0000 0000000000000000 000000000f5b9bd0 
           0000000010a8e000 0000000010abb948 0000000010a97c94 00000000012b1b20 
Krnl Code: bf 43 10 06 a7 84 00 13 58 50 f0 f0 12 55 a7 84 00 0e 58 10  
Call Trace: 
([<0000000010a97c94>] qeth_hard_start_xmit+0x1678/0x2218 [qeth]) 
 [<00000000003907fa>] qdisc_restart+0x13e/0x280 
 [<0000000000375376>] dev_queue_xmit+0x496/0x718 
 [<0000000010a443f8>] mld_sendpack+0x32c/0x4ec [ipv6] 
 [<0000000010a49196>] mld_ifc_timer_expire+0x316/0x3c0 [ipv6] 
 [<00000000001549f4>] run_timer_softirq+0x660/0x704 
 [<0000000000149950>] __do_softirq+0x6c/0x108 
 [<000000000010f226>] do_softirq+0xba/0xf4 
 [<0000000000110034>] ext_no_vtime+0x16/0x1a 
 [<00000000001c01ae>] do_wp_page+0x10e/0x4e0 
([<00000000001c0184>] do_wp_page+0xe4/0x4e0) 
 [<00000000001c753c>] __handle_mm_fault+0xcc4/0xdcc 
 [<0000000000101a98>] do_protection_exception+0x1c0/0x450 
 [<000000000010f95a>] sysc_return+0x0/0x10 
 [<0000020000180138>] 0x20000180138 
 
 <0>Kernel panic - not syncing: Fatal exception in interrupt 
01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from
 CPU 00.
00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 00103A44
Comment 8 Jan Blunck 2006-04-27 13:46:07 UTC
First look at the problem:

This seems to be in qeth_send_packet() (line 4506 in qeth_main.c):

		rc = qeth_do_send_packet_fast(card, queue, skb, hdr,
					      elements_needed, ctx);
	if (!rc){
		card->stats.tx_packets++;
		card->stats.tx_bytes += tx_bytes;
#ifdef CONFIG_QETH_PERF_STATS
		if (skb_shinfo(skb)->tso_size &&      <======= here
		   !(large_send == QETH_LARGE_SEND_NO)) {
			card->perf_stats.large_send_bytes += skb->len;
			card->perf_stats.large_send_cnt++;
		}
 		if (skb_shinfo(skb)->nr_frags > 0){
			card->perf_stats.sg_skbs_sent++;
			/* nr_frags + skb->data */
			card->perf_stats.sg_frags_sent +=
				skb_shinfo(skb)->nr_frags + 1;
		}
#endif /* CONFIG_QETH_PERF_STATS */
	}
	if (ctx != NULL) {
		/* drop creator's reference */
		qeth_eddp_put_context(ctx);

I looked into all the skb handling in qeth_do_send_packet(_fast) but I don't see why the shinfo is already freed.

Frank, can you take a look?
Comment 9 Jan Blunck 2006-05-07 23:52:45 UTC
Christian,

thanks for the fix. The problem with CONFIG_DEBUG_SLAB is fixed by your patch.
The second problem is fixed by IBM Codestream linux-2.6.16 october2005 patch 02-19, thanks to Frank.

Both patches are in CVS.
Comment 10 LTC BugProxy 2006-05-08 11:00:13 UTC
----- Additional Comments From cborntra@de.ibm.com  2006-05-08 06:57 EDT -------
Yes, I have found the same problem in qeth with slab debugging. qeth should work 
now with slab debugging.
The slab debugging fix is now upstream as well. 
Comment 11 LTC BugProxy 2006-05-16 15:30:21 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ACCEPTED                    |CLOSED
             Impact|------                      |RAS




------- Additional Comments From cborntra@de.ibm.com  2006-05-16 11:27 EDT -------
The slab debugging fix is in SLES10 RC1. 
Comment 12 Ihno Krumreich 2006-06-06 15:29:50 UTC
Closed.