Bug 154291

Summary: PCI: Multiple Domains Are Not Supported ( kernel warning/error )
Product: [openSUSE] SUSE LINUX 10.0
Reporter: Ric Johnson <fhj-52>
Component: Kernel
Assignee: Greg Kroah-Hartman <gregkh>
Status: RESOLVED WONTFIX
QA Contact: E-mail List <qa-bugs>
Severity: Critical
Priority: P5 - None
CC: fhj-52
Version: Final
Keywords: release_note
Target Milestone: ---
Hardware: x86-64
OS: SuSE Linux 10.0
Whiteboard:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: List of supported OS for GA-2CEWH-RH mainboard

Description Ric Johnson 2006-03-01 08:47:52 UTC
Description : Devices on any PCI domain other than PCI0 cannot be accessed.

Steps to replicate : Install SL10.0 on mainboard with more than one PCI domain.

Actual Results : Unable to access PCI-X slots/devices, etc.

Expected Results : Normal use of all devices on mainboard.

Explanation:
Installation fails to use all PCI* on systems that have multiple PCI domains on the mainboard.  In my case PCI1 and PCI2 are not available.  That means I have no access to devices in those domains, not the least of which are the PCI-X slots.
The reason is that there has been no support in the 'stock' linux kernel for multiple domains on the x86 arch until recently.
In that respect, one might consider whether this is a bug or a feature.  Since this is a basic system problem and is not about adding some special device, I can only suggest it is a bug. It is a blocker for anyone who wants to use Novell/SuSE Linux with newer x86 mainboards that have PCI, PCIe & PCI-X slots and devices.

Support for multiple PCI domains on the x86 architecture is needed for 
compatibility with current and future computer mainboards.

Background, available fixes, etc:
Multiple PCI domains are also known as segments by ACPI. These are used by 
newer mainboards for support of the multiple types of PCI bus that now exist on 
the x86 arch, including the PCI-X as well as the (non-bus) PCIe.

Support for the x86 arch was only recently patched into the kernel (Garzik
patch in December 2005), and the patch has been updated even more recently.
The patches are in A. Morton's  -mm  tree. Info is here:
http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc4/2.6.16-rc4-mm2/announce.txt
Patches are available from 
http://kernel.org/pub/linux/kernel/people/akpm/
The breakouts (if you want to look at the patches) are here:
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc4/2.6.16-rc4-mm2/broken-out/
Patches included are:
 gregkh-pci-pci-fix-the-x86-pci-domain-support-fix.patch
 gregkh-pci-x86-pci-domain-support-a-humble-fix.patch   
 gregkh-pci-x86-pci-domain-support-struct-pci_sysdata.patch 
 gregkh-pci-x86-pci-domain-support-the-meat.patch  
 revert-gregkh-pci-x86-pci-domain-support-the-meat.patch 

They are listed using prefix '' gregkh-pci-x86- '' -> signed-off by Greg K.H.  Of course that info changes quickly; there may be new updates already.

From what I can determine, this is now fixed(?) in  2.6.16-rc4-mm2.bz2 : the
struct-pci_sysdata patches add '' CONFIG_PCI_DOMAINS '' to the kernel
config, among other things.

It is not just a 64-bit OS problem, since the 32-bit versions either fail to install with the same error message or install with reduced capacity.

Support for multiple PCI domains for the x86 arch is needed as soon as possible
to ensure compatibility for installation and use of this linux OS. (otherwise I have to go back to Redmond...)

In my case, this is a Gigabyte GA-2CEWH-RH Dual Opteron server/workstation mainboard with Phoenix BIOS.
Here is some brief, specific info from the boot log:
 ...
 <6>ACPI: PCI Root Bridge [PCI1] (0001:80)
 <4>PCI: Multiple domains not supported
 <4>    ACPI-0279: *** Warning: Bus 0001:80 not present in PCI namespace
 <4>    ACPI-0167: *** Warning: Invalid ACPI-PCI context for parent device PCI1
 <4>    ACPI-0167: *** Warning: Invalid ACPI-PCI context for parent device PCI1
 ...
 <6>ACPI: PCI Root Bridge [PCI2] (0002:40)
 <4>PCI: Multiple domains not supported
 <4>    ACPI-0279: *** Warning: Bus 0002:40 not present in PCI namespace
 <4>    ACPI-0167: *** Warning: Invalid ACPI-PCI context for parent device PCI2
 <4>    ACPI-0167: *** Warning: Invalid ACPI-PCI context for parent device PCI2
 ...

That's the gist of it. The longer output shows that (ACPI) PCI Interrupt Links are disabled.
I can supply the hardware configuration from YaST2 if/when needed.
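
For anyone triaging a similar log, here is a minimal sketch (Python, not part of the original report) that pairs each ACPI root bridge line with the "Multiple domains not supported" message that follows it. It assumes the bracketed bridge name is followed by a hexadecimal domain:bus pair, as in the excerpt above; the function name and the shortened sample log are made up for illustration.

```python
import re

# Matches lines such as "<6>ACPI: PCI Root Bridge [PCI1] (0001:80)".
# Assumption: the parenthesised pair is hexadecimal domain:bus.
ROOT_BRIDGE = re.compile(
    r"ACPI: PCI Root Bridge \[(\w+)\] \(([0-9a-fA-F]{4}):([0-9a-fA-F]{2})\)"
)
REJECTED = "PCI: Multiple domains not supported"


def rejected_root_bridges(log_text):
    """Return (name, domain, bus) for each root bridge that is
    followed by the rejection message before the next bridge line."""
    rejected = []
    last_bridge = None
    for line in log_text.splitlines():
        match = ROOT_BRIDGE.search(line)
        if match:
            last_bridge = (
                match.group(1),
                int(match.group(2), 16),
                int(match.group(3), 16),
            )
        elif REJECTED in line and last_bridge is not None:
            rejected.append(last_bridge)
            last_bridge = None
    return rejected


# Shortened sample built from the lines quoted in this report.
SAMPLE = """\
<6>ACPI: PCI Root Bridge [PCI1] (0001:80)
<4>PCI: Multiple domains not supported
<6>ACPI: PCI Root Bridge [PCI2] (0002:40)
<4>PCI: Multiple domains not supported
"""

if __name__ == "__main__":
    for name, domain, bus in rejected_root_bridges(SAMPLE):
        print(f"{name}: domain {domain:04x}, bus {bus:02x} rejected")
```

On a real system the text would come from the saved boot log or dmesg output rather than the inline sample.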
Comment 1 Olaf Kirch 2006-03-01 09:50:23 UTC
Hi Ric,

this should be supported in the upcoming SL10.1/SLES10 kernel. We do not
backport hardware support to older Suse Linux kernels though. As such I
need to close this as WONTFIX. Please do give the SL10.1 beta a try.
Comment 2 Greg Kroah-Hartman 2006-03-01 17:20:16 UTC
Olaf, sorry, but no, this is not supported in SL10.1 or SLES10, so trying
out a new kernel will not help at all.

The problem is that we have some initial attempts at doing this, but when running
them, they break other types of systems (see lkml for the bug reports), so the
majority of this patch is backed out in the -mm series due to that.

It will take some access to one of these machines (local access is preferred), and
some time to get it all working properly.  And that will have to happen after
SLES10 is released.

So sorry, Linux currently does not support this kind of hardware very well yet.
Comment 3 Ric Johnson 2006-03-01 19:16:23 UTC
First, thanks. You guys are great. :)
Did you know that I initially wanted SuSE as my Linux of choice 6 years ago but did not choose it, primarily because the license model was too confusing for a newbie?  I am sure you did not, but I mention that because I have some years of Linux experience (primarily RPM distro), have built RPMs & kernels, successfully, in the past and am available to work on this very significant problem.
I don't claim to be a kernel expert and my "SuSE" experience is extremely low but I think I still have enough enthusiasm to make up the difference.

I am rather upset with Gigabyte because they advertise this mobo as compatible with both 64-bit and 32-bit flavors of Linux, in particular SuSE( as well as RH ).  I mention that because in email conversations last week they said they would work on it.  They should provide some assistance if/when needed. 

Olaf Kirch : 
Please do not close as won't fix.  This is a current as well as future problem for Novell & SuSE, as well as the Linux Operating System for x86 arch in general.  I am willing to post/transfer-to as a 10.1 bug if that is what is needed( but I am not running the beta at this time ).  Closing as 'wont fix' has ramifications that are not good.

Greg Kroah-Hartman :
I had hoped that Gigabyte would contact you, Greg, as well as Garzik and Morton about getting the support into the kernel. I guess not yet, huh?
I recognize part of the problem is that these are new, cutting-edge mainboards and few in number in the Linux world.  However this one is available but it is the only 64-bit system here. There are other 32-bit x86 (P, PII & PIII) systems that could be used for testing...
I read about AM's problem on the older system in his mailing list post and that it might possibly have buggy ACPI(?).  I have read everything I could find about this issue, which, frankly, is not much, so I have been hesitant to just jump into the water, so to speak.  I really do not know, yet, still, what is under the surface.  I am not a developer (although I do C) and certainly not a kernel hacker.

The point here, really, is that I am asking what is the best way to proceed to get this done?  

( Please email me, if needed. )
Comment 4 Greg Kroah-Hartman 2006-03-01 19:25:53 UTC
Ok, I'll take this one
Comment 5 Greg Kroah-Hartman 2006-03-01 19:28:48 UTC
Yes, it would be best if Gigabyte were to contact me (my email address is very
easy to find as the kernel PCI maintainer.)  I would be more than willing to
work together with them to get Linux working properly on their machines.

The fact that they are advertising that it all works is odd, I'd be interested
in finding out what they are doing to get that to work for these devices.

And no, the problem isn't buggy ACPI; it is with other platforms that
have multiple PCI domains (NUMA boxes from IBM) and that currently work just fine
with Linux.  We can't break them with this patch, as you can understand.

I'll mark this LATER, to remind me if someone contacts me about this in the
future.
Comment 6 Greg Kroah-Hartman 2006-03-01 19:29:35 UTC
marking LATER to remind me in the future.
Comment 7 Ric Johnson 2006-03-01 20:15:34 UTC
Thanks, :) .
Gigabyte(GBT) advertises this mobo on their front page,  http://us.giga-byte.com/ and just ended a February promotion.  It is, presumably, their flagship AMD  Dual Opteron(TM) graphics workstation/server product.  It supports NUMA too.
There is a PDF doc of "supported" OS via http://us.giga-byte.com/Server/Support/OSSupport/OSSupport_ServerBoard_GA-2CEWH.htm which I am going to upload here( for reference ).

Did not see the IBM problem(I'm not receiving lkml) & agree, of course, that a patch should not, generally, bust other things.

GBT support has been responsive to other issues.  I'll ping GBT ...
Comment 8 Ric Johnson 2006-03-01 20:23:44 UTC
Created attachment 70864 [details]
List of supported OS for GA-2CEWH-RH mainboard

OS Compatibility for GA-2CEWH 
(Updated PDF version)
Comment 9 Ric Johnson 2006-03-06 19:15:24 UTC
I did ping GBT and they have responded that this info is being forwarded to "HQ" for the BIOS team to use. I hope they have made contact.

As for me, I have the recent kernels(2.6.15.5 & 2.6.16-rc5) as well as the -mm patchset(2.6.16-rc5-mm2) to attempt build and use for testing.
It has been delayed 'cause of a little setup problem with SuSE due to lack of clear and valuable info on the *proper* setup for building in the SuSE environment(i.e., I have newbie-itis, :) ) ... a couple more days, probably.
Comment 10 Ric Johnson 2006-03-15 01:50:57 UTC
I am speaking from
#> uname -r
2.6.16-rc6-mm1-smp
and it has made NO difference. (I did not have any better results with previous kernels ... this is the freshest.)
 
Something is a bit haywire because CONFIG_PCI_DOMAINS was NOT available during the config. There were multiple configs for multiple attempts, thanks (mostly) to a missing object file that was not really missing: it was not supposed to be in the make in the first place, and I kept trying to find the sucker.
It is in the patch. Why is it not in the config?

I'll forgo printing out the confirmation data that it does not work ( PCI: Multiple domains not supported ), i.e. that it does not provide access to PCI-X or to the rest of my mainboard, since it is probably a (kernel) config oversight.

What could it be?


I have saved logs if needed... kernel as well as PCI debug is in too.
Comment 11 Ric Johnson 2006-03-15 14:31:06 UTC
I took a (closer) look at what we built. The
CONFIG_PCI_DOMAINS=y
seems to be available for just about every arch on the planet Earth EXCEPT i386 and x86_64 (amd64).
Why and how do I fix that so I can test it on this Dual Opteron workstation?
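
A quick way to confirm what a given build actually got is to look the symbol up in the generated .config. The sketch below (Python, illustrative only; the function name is made up) relies on the standard .config conventions: "CONFIG_FOO=y" for a set option and "# CONFIG_FOO is not set" for an option that is offered but disabled. A symbol that is absent altogether, which is what the observation above suggests for i386 and x86_64, comes back as None.

```python
def config_state(config_text, symbol):
    """Look up CONFIG_<symbol> in the text of a kernel .config file.

    Returns the assigned value (for example "y" or "m"), "n" if the
    option is listed as "is not set", or None if it never appears.
    """
    prefix = f"CONFIG_{symbol}="
    unset_line = f"# CONFIG_{symbol} is not set"
    for raw in config_text.splitlines():
        line = raw.strip()
        if line == unset_line:
            return "n"
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


if __name__ == "__main__":
    # Tiny inline samples; a real check would read the .config file
    # from the top of the kernel build directory.
    with_domains = "CONFIG_PCI=y\nCONFIG_PCI_DOMAINS=y\n"
    without_domains = "CONFIG_PCI=y\nCONFIG_PCI_MSI=y\n"
    print(config_state(with_domains, "PCI_DOMAINS"))
    print(config_state(without_domains, "PCI_DOMAINS"))
```

A result of None only shows that the option was never offered for that build; it does not say why.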
Comment 12 Greg Kroah-Hartman 2006-03-16 00:00:10 UTC
Andrew has dropped the patches from his tree (well, he has a revert for them)
as they caused too many problems.  I'm about to drop them too.  If you want,
I can email them to you offline, or point you to them on the web.
Comment 13 Ric Johnson 2006-03-16 07:43:56 UTC
Thanks for answering. :)

Any others besides these?:
gregkh-pci-pci-fix-the-x86-pci-domain-support-fix.patch
gregkh-pci-x86-pci-domain-support-a-humble-fix.patch
gregkh-pci-x86-pci-domain-support-struct-pci_sysdata.patch
gregkh-pci-x86-pci-domain-support-the-meat.patch

I have those from the ' Broken-out ' at Morton's site.

You said there are "Too many problems"? - Is that on the lkml? 
Please do point me to any public discussion; I have to make some kind of informed decision.
 
I am a little confused as to why a patch, the only patch I could find that ever existed for x86 Multiple PCI Domains(MPD), is just going to be dropped rather than fixed.  It seems rather important since even Win2k supports MPD.

I suppose that also means that GBT never contacted you. Yes/no?

And thanks for the clarification.  I was a bit bugged since I thought I had done all that work incorrectly.
You may certainly email me if you prefer.
Comment 14 Ric Johnson 2006-03-16 20:27:14 UTC
There are some odd things with the comments in the sysdata & meat patches. They seem to be missing the closing "*/" in some places so I guess I'll need the real thing to make sure there is not a C&P error somehow or something else.
(I could not find them at the ' .../people/jgarzik/patches ' ...)
I am a simple, by-the-book, sometimes-if/when-necessary programmer so I do not know what/why the sysdata & meat patches do that.
Comment 15 Ric Johnson 2006-03-16 21:54:37 UTC
I got it. (I have not been doing enough patches to be familiar with format). 
If you will verify to me that the complete patchset is as I listed I will try to do something here... ( this GBT mainboard is _worthless_ to me w/o linux at full tilt. )
Comment 16 Greg Kroah-Hartman 2006-03-16 23:17:24 UTC
No, no one has contacted me.

And the patches are being dropped, as no one is working on them to fix the
remaining issues.

They can all be found at:
  http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/bad/pci-domain/

Look at the file, "series" in that directory for the order in which the patches
should be applied.

good luck.
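
As a rough sketch of what "the order in the series file" means in practice (Python, illustrative only): the code assumes the common quilt-style layout of one patch per line, with blank lines and "#" comments ignored. The real contents of the series file in that directory are not reproduced here; the sample file names and their order are placeholders, not the actual series.

```python
def patches_in_order(series_text):
    """Return patch file names in the order a quilt-style series lists them.

    Assumption: one patch per line, optional "#" comments, blank lines
    ignored; anything after the file name (such as "-p1") is dropped.
    """
    patches = []
    for raw in series_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            patches.append(line.split()[0])
    return patches


# Placeholder series text; names and order are hypothetical.
SAMPLE_SERIES = """\
# pci domain patches (example only)
first-example.patch
second-example.patch -p1

third-example.patch   # trailing comment
"""

if __name__ == "__main__":
    # Print the commands one would run from the top of the kernel tree.
    for name in patches_in_order(SAMPLE_SERIES):
        print(f"patch -p1 < {name}")
```

The point is only that the patches go on top to bottom, in the listed order, before the kernel is configured and built.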
Comment 17 Ric Johnson 2006-03-17 15:00:06 UTC
Thanks, :).

Are the "remaining issues" collated anywhere?

I cannot predict the future but MS Win2k (& later) support the multiple PCI domains so I am relatively sure that this will be an ongoing issue, with, at least, advanced mobos using multiple types of PCI.  


And thanks for the luck, too -will need it as it seems that GBT is being more difficult than necessary.  Vendors ... :rollseyes:
Comment 18 Greg Kroah-Hartman 2006-03-17 17:20:02 UTC
No, the issues aren't collected anywhere, except that some boxes that work fine
today are still crashing with these patches (we can't allow that.)

And sure, other operating systems probably support this just fine, I know, but
unless someone does the work here, it's not going to happen for Linux, that
is just how this project works :)
Comment 19 Stephan Kulow 2008-06-25 09:36:01 UTC
mass reopening all SuSE Linux bugs that are set to REMIND+LATER to change the resolution to WONTFIX (adapting to new policy)
Comment 22 Stephan Kulow 2008-06-25 09:53:44 UTC
Closing old LATER+REMIND bugs as WONTFIX - if you still plan to work on it, feel free to reopen and set to ASSIGNED.

In case the report saw repeated reopen comments, it's due to bugzilla timing out on the huge request ;(