|
Bugzilla – Full Text Bug Listing |
| Summary: | getaddrinfo() breaks when resolv.conf points at buggy nameservers | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE 11.1 | Reporter: | Freek de Kruijf <freek> |
| Component: | Basesystem | Assignee: | Stephan Kulow <coolo> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Critical | ||
| Priority: | P4 - Low | CC: | aharrison, aj, behlert, chris.beckett, chusty, coolo, dmueller, forgotten_JtaKqlU8J9, forgotten_oT5fNj878H, kukuk, lnussel, roger, t.zell, uli.2001, vojtech |
| Version: | Beta 4 | ||
| Target Milestone: | --- | ||
| Hardware: | Other | ||
| OS: | Other | ||
| Whiteboard: | maint:released:11.1:22237 | ||
| Found By: | --- | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Bug Depends on: | |||
| Bug Blocks: | 618050 | ||
| Attachments: |
bugzilla.tbz
strace of wget
strace of nscd when ping failed
nscd strace
nsswitch.conf
screen shot of unresolved ping
logfile while doing the pings
nscd trace unable to resolve dns -=terry=-
nscd after install glibc from "comment 34" -=terry=-
nscd trace after installing glibc from comment #83 -=terry=- |
||
|
Description
Freek de Kruijf
2008-11-05 18:19:46 UTC
Doesn't sound like a KDE problem if it also happens with YaST.

I managed to have my wifi working and I have the same problem. Not only dig works but I can ping www.suse.com in the console. Below is the conversation to show more clearly the problem. Note that the problem is not restricted to GUI applications, also curl has this problem.

freek@linux:~> date
za nov 8 16:52:22 CET 2008

During installation the following command is given to test the network connection:

freek@linux:~> curl --silent --show-error --max-time 45 --connect-timeout 30 'http://www.suse.com'
curl: (6) Couldn't resolve host 'www.suse.com'
freek@linux:~> date
za nov 8 16:52:32 CET 2008
freek@linux:~> dig www.suse.com
; <<>> DiG 9.5.0-P2 <<>> www.suse.com
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51844
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;www.suse.com. IN A
;; ANSWER SECTION:
www.suse.com. 292 IN A 195.135.220.3
;; Query time: 12 msec
;; SERVER: 10.0.0.138#53(10.0.0.138)
;; WHEN: Sat Nov 8 16:52:37 2008
;; MSG SIZE rcvd: 46
freek@linux:~> date
za nov 8 16:52:40 CET 2008
freek@linux:~> ping www.suse.com
PING www.suse.com (195.135.220.3) 56(84) bytes of data.
64 bytes from turing.suse.de (195.135.220.3): icmp_seq=1 ttl=53 time=23.2 ms
64 bytes from turing.suse.de (195.135.220.3): icmp_seq=2 ttl=53 time=23.0 ms
64 bytes from turing.suse.de (195.135.220.3): icmp_seq=3 ttl=53 time=22.5 ms
64 bytes from turing.suse.de (195.135.220.3): icmp_seq=4 ttl=53 time=23.1 ms
64 bytes from turing.suse.de (195.135.220.3): icmp_seq=5 ttl=53 time=23.1 ms
64 bytes from turing.suse.de (195.135.220.3): icmp_seq=6 ttl=53 time=23.6 ms
^C
--- www.suse.com ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5019ms
rtt min/avg/max/mdev = 22.540/23.111/23.628/0.341 ms
freek@linux:~>

Works fine for me. Does it work if you disable IPv6 in the YaST network module and reboot?

Ah! I was wondering why it does work now. And indeed I disabled IPv6.

Can you please paste your /sbin/ifconfig output, your /etc/resolv.conf, /etc/host.conf and /etc/nsswitch.conf?

Created attachment 251174 [details]
bugzilla.tbz
Tar file with the 4 requested files.
Please note that the problem is in the IPv6 setting.
If IPv6 is enabled, curl, Konqueror etc. can't resolve names like www.suse.com
After disabling IPv6 and rebooting, there is no problem.
Enabling IPv6 and rebooting brings the problem back.
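
The IPv6 on/off behaviour described above can be checked from a script by comparing lookups per address family. The following is a minimal Python sketch, not a tool used in this report: `resolve` and `compare_lookups` are hypothetical helper names, and the `resolver` parameter is an assumption added only so the logic can be exercised without a network. `AI_ADDRCONFIG` is the standard getaddrinfo() flag that limits results to address families for which the host has an address configured.

```python
import socket


def resolve(host, family=socket.AF_UNSPEC, flags=0, resolver=socket.getaddrinfo):
    """Return the sorted addresses found for host, or an error string on failure."""
    try:
        infos = resolver(host, None, family, socket.SOCK_STREAM, 0, flags)
    except socket.gaierror as exc:
        return f"failed: {exc}"
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def compare_lookups(host, resolver=socket.getaddrinfo):
    """Run the same lookup with different family/flag combinations (hypothetical helper)."""
    return {
        "AF_UNSPEC": resolve(host, socket.AF_UNSPEC, 0, resolver),
        "AF_UNSPEC+AI_ADDRCONFIG": resolve(
            host, socket.AF_UNSPEC, socket.AI_ADDRCONFIG, resolver
        ),
        "AF_INET": resolve(host, socket.AF_INET, 0, resolver),
    }
```

On an affected machine, `compare_lookups("www.suse.com")` would be expected to report failures for the AF_UNSPEC lookups while the AF_INET lookup succeeds, although the exact outcome depends on the nameserver in use.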
*** Bug 441868 has been marked as a duplicate of this bug. ***

I don't know how to disable IPv6 in the kernel loaded from the installation DVD. Is there a boot option? For the generated kernel, see comment #7

From the logs in bug 441868 it looks like the error happens after configuring network. So I guess you can disable IPv6 in the very first network configuration proposal, before performing the connection test.

I did a new installation again the naive way, so I did not change the default of automatic configuration. This did not give me the option to change the IPv6 setting. Next I did change the option of automatic configuration and indeed I have the option to change the IPv6 setting. However here I got the warning that I have to reboot for this option to take effect. So immediately after that, the network test and the downloading of the release notes did NOT succeed. So please solve the problem with IPv6 enabled, so the test succeeds and downloading the release notes is still possible.

Problem still exists in Beta 5

rc1 should hopefully have glibc-2.9 with some extra libresolv bugfixes related to this - it seems to me that they should fix this problem too.

Are there any updates? Did you hit the bug again?

Cc'ing Stephan: This was the only [*] known major issue around glibc-2.10. To sum up: With some broken nameservers, e.g. in some DSL modems, getaddrinfo() does not work; turning off IPv6 or using different nameservers works around the issue. This appeared to be fixed just before final glibc-2.10, but according to the latest reports (http://sourceware.org/bugzilla/show_bug.cgi?id=7060) this issue might still exist in final glibc-2.10. However, lack of bugreports against rc1 seems to indicate that this issue is not that widespread. Upstream is touching the resolver-related code again lately, but does not hint whether/which bugs that should fix and the communication is difficult as usual. I might try to look at this issue, but touching the resolver code might be risky. We could also just document the workaround in release notes, but it's hard to say if this will end up being a real issue.

[*] Also, nscd is generally unstable. I've already fixed two bugs in nscd (unrelated to each other) but each time I fix a bug, it only means nscd runs long enough to uncover more. Since in 11.0, nscd wasn't stable either, I think it's fine to stabilize nscd in later updates, though the two bugs I have already fixed should be a major increase in stability.

The problem is still in RC1 both x86 and x86_64. First I have to disable IPv6 before IP works from applications. As mentioned before ping and dig do work when IPv6 is enabled.

NEEDINFO on coolo about how to handle this for 11.1. We should of course try to get this fixed for SLES11 anyway, I will try to investigate this in detail later.

Both bugs are reported from Freek, so it sounds unlikely it will be such a generic problem that we need to scare people into disabling ipv6. We'll update the release notes when I turn out to be wrong :)

Indeed the problem comes from the nameserver in my ADSL modem. When I fill in, by hand, an outside nameserver the problem does not exist anymore. However, only with 11.1, I have this problem.

Yes, this seems the same as that Red Hat bug. I was planning to revert to the separate DNS queries behavior as well, but probably only in January.

I'm having problems that sound related to this bug report. I've been trying to get help in the mailing list, to no avail. Thread url: http://lists.opensuse.org/opensuse/2008-12/msg00879.html In a nutshell, everything worked fine with 10.2. I performed a fresh install of 11.1rc1, now lots and lots of dns resolution problems. Same name servers used, ended up even restoring a copy of my resolv.conf from my 10.2 box, still no joy. Firefox, konq, ssh, telnet, whois, etc, all fail most of the time when trying to resolve names, always requiring lots of retries.
Yet the dig command never fails. Any suggestions of other things I can try? Any other info I can provide besides what I posted to the list?

Additional info I forgot to include. One reason I think this bug is related to my problems is that the whois command fails complaining about the exact system call in this bug title: getaddrinfo(whois.crsnic.net): Name or service not known.

Sorry but this is a different issue if disabling IPv6 does not help; please open a separate bugreport.

Done. Bug id 462769 opened.

http://www.suse.de/~pbaudis/bug-441947/ now contains test glibc build that should fix all these issues - please test; after testing, we will provide glibc update.

Summary of my current understanding of the problem based on various dumps, for newcomers: The gethostbyname4() lookup method is problematic since it fires out both the A and AAAA DNS queries in parallel and over the same socket. This should work in theory, but it turns out that many cheap DSL modems and similar devices have buggy DNS servers - if the AAAA query arrives too quickly after the A query, the server will generate only a single reply with the A query id but returning an error for the AAAA query; we get stuck waiting for the second reply. For gethostbyname4() users affected, disabling IPv6 in the system might work around the issue; unfortunately it only helps with applications using AI_ADDRCONFIG (e.g. Firefox); some (notably e.g. Pidgin) neglect to do that. Real fix should be using separate ports for the A and AAAA queries.

*** Bug 442572 has been marked as a duplicate of this bug. ***

Petr, thank you for the test packages and your explanation of the problem. I will test these as soon as possible on a few devices at home and at the office tomorrow. One of the reasons I bought a [very] cheap router was to see the effects that Guy was seeing at home for this bug. Previously I was using a Linux based router that did its own DNS lookup and caching, thus hiding the issue from me (my modem/ISP exhibits the issue you describe).

Am I correct in assuming that the fix you provide in your test packages effectively falls us back to the routine used in SLED 10 SP2 or openSUSE 11.0 and older? In other words, this fix shouldn't be seen as a regression but neither as a progression? Of course a "real fix" would be preferred, but 11.2 is certainly the most likely candidate for that, and not SLE11. Additionally, this bug warrants an online update for openSUSE 11.1 users. ... Now for a matter of preserving affected persons and priorities, I am adjusting properties of this bug. Again, thanks Petr, and I'll have confirmation to follow.

Just a quick reminder for those who will test Petr's package: make sure you turn IPv6 back ON in YaST if you disabled it earlier to help mitigate this issue.

Okay, confirmed this fix is working for me on one of the machines (an MSi Wind Netbook) that was hurt the most by this bug (slow resolves + application level timeouts colluding due to the overall poor performance of the machine). I'll check tomorrow at the office. Thanks Petr!

Petr, thank you so much. I am not in a position to test the fix from home as I am away. I trust that Aaron was able to duplicate my environment since we use the same ISP in Boston. Thank you again.

*** Bug 462769 has been marked as a duplicate of this bug. ***

Petr: would you please make x86_64 packages available for testing. Two machines affected by this at the office are x86_64.

Guy: do you have any accessible hardware in your office you would like me to confirm this fix on today?

I put the x86_64 packages I generated here if anyone wants to grab them... http://www.metrocast.net/~aharrison/openSuSE-11.1/

I can confirm this fix is working on 3 other i586 machines.

I can confirm this fix is working on 2 x86_64 machines.

*** Bug 465538 has been marked as a duplicate of this bug. ***

*** Bug 442572 has been marked as a duplicate of this bug. ***

From bug 442572:
------- Comment #70 From Kelli Frame novellonly 2009-01-10 14:11:14 MST -------
These packages have made things worse for me. I now have to load all web pages at least twice in order to get them to load. The first time I hit a page, it is unavailable (e.g., Address Not Found - Firefox can't find the server at www.google.com.). If I reload one or more times, it will finally load. Before installing these packages my web browsing and resolution was fine. Now it's broken in addition to Pidgin. As for my Pidgin problem, I can now load AIM about half the time, but no other services (GWIM, MSN, Google).
--------

Kelli, can you please verify if this happens with nscd turned off? Does this problem happen with wget as well? Do things work for you if you replace the nameserver lines in /etc/resolv.conf with just
nameserver 62.24.64.27
temporarily? Can you post a strace -s 4096 of the wget while it is broken?

Created attachment 264896 [details]
strace of wget
Note that since installing the new packages, I started having problems both at work and at home. I am answering now for what happens on the Novell network, and will answer for home later tonight. When I turned nscd off, everything worked perfectly. When I turned it back on again, wget behaves the same as Pidgin and Firefox. I've attached the strace of trying to wget to http://www.nytimes.com. It took 2-3 tries for it to be successful.

Same experience on home wireless network as at Novell. If I turn off nscd, everything works. If not, Pidgin does not work and I have to reload web pages 2-3 times before they load.

Petr, any idea? Could this be caused by the avahi-daemon?

*** Bug 463015 has been marked as a duplicate of this bug. ***

I'm the one who reported bug 463015, which is a duplicate of this one. I have one system with openSUSE 11.1 from the retail box, without wireless, where I can get whatever information you need. If you need a fresh reinstall I can do it. Thanks for working on this bug -=terry=-

Petr, any update? Any additional information required?

I can reproduce this here on a fresh installation of SLED 11 RC-candidate. And it gets interesting: A ping to an address worked, slogin immediately after that did not. On a different shell a few minutes later the slogin was successful. I'll attach a screenshot. I did also a few straces of nscd while the ping was not working, and when it worked. Will attach them in a minute.

Created attachment 265375 [details]
strace of nscd when ping failed
strace -s 4096 -f -ooutput.txt -p <nscd>
when ping fails
Created attachment 265376 [details]
nscd strace
strace, this time when ping works due to resolved address.
It seems like bad interaction between NSS modules, somehow sometimes nscd never tries the DNS NSS module. What's in nsswitch.conf? Still cannot reproduce anywhere...

Created attachment 265489 [details]
nsswitch.conf
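
For context on why the `hosts:` line matters for the question above: NSS consults the listed services in order, and a bracketed action such as `[NOTFOUND=return]` stops the chain when the preceding service reports that status. The sketch below is a simplified Python model of that ordering, assuming the default actions documented in nsswitch.conf(5) (return on SUCCESS, continue otherwise). The function names `parse_hosts_line` and `lookup` are hypothetical, only the single-criterion `[STATUS=action]` form is handled, and the per-service answers are supplied by hand rather than taken from real modules.

```python
# Default reaction to each lookup status, per nsswitch.conf(5).
DEFAULT_ACTIONS = {
    "SUCCESS": "return",
    "NOTFOUND": "continue",
    "UNAVAIL": "continue",
    "TRYAGAIN": "continue",
}


def parse_hosts_line(line):
    """Parse e.g. 'hosts: files mdns4_minimal [NOTFOUND=return] dns' into [(service, actions)]."""
    _, _, spec = line.partition(":")
    chain = []
    for token in spec.split():
        if token.startswith("["):
            # An action item modifies the service listed just before it.
            status, _, action = token.strip("[]").partition("=")
            chain[-1][1][status.upper()] = action.lower()
        else:
            chain.append((token, dict(DEFAULT_ACTIONS)))
    return chain


def lookup(chain, answers):
    """Walk the chain; answers maps service -> (status, value). Returns the deciding triple."""
    last = (None, "UNAVAIL", None)
    for service, actions in chain:
        status, value = answers.get(service, ("UNAVAIL", None))
        last = (service, status, value)
        if actions[status] == "return":
            break
    return last
```

In this model, a NOTFOUND from `mdns4_minimal` ends the lookup before `dns` is tried, while an UNAVAIL from the same service lets the lookup fall through to `dns`.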
(I couldn't reproduce the problem yet. If anyone can reproduce it consistently, please install nscd from http://www.suse.de/~pbaudis/bug-441947-debug/ which contains extra debug prints and attach nscd logs associated with successful/unsuccessful resolution.)

You may need to change /etc/nscd.conf and activate the logfile-line to get the logfile :)

Ok, I did some testing here: 2 addresses were successfully resolved, the third one failed. (Screenshot and log will be attached in a minute). The screenshot shows that after
3135: Haven't found "e106.suse.de" in hosts cache!
(line 2222) nothing more is printed to the log - nothing seems to happen - I would have expected an 'add entry to cache'-line coming - which does not happen.

Created attachment 265603 [details]
screen shot of unresolved ping
Created attachment 265604 [details]
logfile while doing the pings
Anything else I can provide you? After some time I have to reboot to clear the cache again, then I am able to reproduce it again with various numbers.

Ok, did some more tests. After changing
hosts: files mdns4_minimal [NOTFOUND=return] dns
in /etc/nsswitch.conf to
hosts: files dns
I was no longer able to reproduce this. Could someone else confirm this?

Some background to Stefan's comment:
On SLED nss_mdns4{,_minimal} is installed by default, so that it is always used. And if it cannot resolve a hostname for whatever reason, the resolver of glibc will never be used. So all glibc changes have no effect.
So this change disables nss_mdns4/avahi; you can reach the same effect by uninstalling these packages.
I don't think that's an accurate description. glibc resolver will not be used only if mdns4 lookup succeeded but it was determined for sure that this hostname does not exist. If the hostname is out of scope, avahi not functional or whatever, UNAVAILABLE will be returned instead of NOTFOUND and dns will be still used. It is not clear to me how this interacts with the DNS changes in glibc though and how this could have broken. Stephan, do you have some further insights wrt. _being_ able to reproduce this bug?

So the problem is not in GETAI requests but GETHOSTBYNAME requests, which is even more unexpected. Perhaps something is returning EAGAIN that shouldn't, but why? I started mbuild with more debug info, but it would still be helpful if I could get my hands on a machine where this failure actually shows up. In the meantime, can you latrace nscd -d? (available in obs)

Petr, I installed the nscd from your site. I enabled the log file in /etc/nscd.conf. The log file in /var/log/nscd.log is 0? I ran nscd -k and then ran nscd -d; the attached file is from this output.
1. I ping google.com: no problems
2. I host google.com: no problems
3. I run firefox and it could resolve dns
4. I run evolution and try to get its mail and it failed.
Note: I'll be available another 30 min and then back in about 8 hours. -=terry=-

Created attachment 265649 [details]
nscd trace unable to resolve dns -=terry=-
Changing hosts: files dns as Stefan did, did not make any difference on my machine.

Teruel, please see comment 25 and comment 34, these glibc packages should fix your issue.

Teruel: You also need to restart nscd after changing the config.

Petr and Stefan
1. I installed the glibc with the package from the link in comment 34.
2. I removed the DNS servers from yast and I checked the /etc/resolv.conf and now the only nameserver is the router. Of course ipv6 is enabled. I rebooted.
3. Firefox and Evolution work!!
4. I opened xterm, stopped nscd with nscd -K and restarted it with nscd -d
5. I pinged and ran the programs and I will attach the trace.
Please give me instructions if I can help you with anything. Probably the following does not belong here: whenever you include these changes in the public release do not forget to update the livecd, which does not work and I believe is related to this problem (it works with routers that do not exhibit this problem)

Created attachment 265748 [details]
nscd after install glibc from "comment 34" -=terry=-

I believe this is working for me. These were my steps:
1. Install SLED11 RC2 build 0015 (32-bit)
2. verify the issue still occurs.
3. Install the glibc rpms from comment #25.
4. Edit the /etc/nsswitch.conf file per comment #66
5. restart nscd
6. I can browse as expected now without reloading each page several times.
It seems that I have to restart nscd every time the machine boots.
Teruel, thank you for your test. We have all the information needed and the rest needs to be gathered by me by hands-on tests on our machines anyway.

To sum up recent development: I could reproduce the issue both with and without the mdns4 in nsswitch.conf. The culprit is not there. Instead, it turns out that within nscd, the initial resolver state has no nameservers loaded! I'm not sure about the reason, why getaddrinfo() works and why disabling glibc-2.10-dns-fixpack.diff helps, but I had no time at all to research based on this discovery yet - unfortunately, before I thought this patch is completely harmless, and it still appears so to me. :-(

Anyway, taylor-pbaudis-2 mbuild, soon available at http://www.suse.de/~pbaudis/bug-441947-2/ is glibc with this patch disabled and it appears the issue is not reproducible with this mbuild. Please confirm.

I think I have found the real reason for the problems, submitted at http://sourceware.org/bugzilla/show_bug.cgi?id=9753 - it passes my tests. After RC2 is out, for RC3 I would like to re-enable the patch again, with the extra hunk included in the sourceware bugzilla.

Petr
1. I installed all the glibc from your site comment #83
2. I did 3 ping
3. I load firefox
4. I load evolution and got the e-mail.
Everything seems to be working well. I attach nscd trace (nscd_d3.txt)

Created attachment 265846 [details]
nscd trace after installing glibc from comment #83 -=terry=-

Petr, Thxs for your great work. I am going to set up the rest of the machines with opensuse 11.1 now this point is almost fixed :). If I can help I am here. Have all of you a good weekend

Thanks Petr! Petr's glibc patch works on both of my x86 and x86_64 machines. Firefox connects and YaST can connect to all repositories. Thanks!
I'm affected by this slightly differently, pretty much as is described in this Fedora bug: https://bugzilla.redhat.com/show_bug.cgi?id=474800 With the existing nscd running and Petr's glibc rpm from comment #83 my problems go away on x86_64. I've not tried on x86.

Great news, thanks! Submitting for RC3. I'm going to test another stability update for nscd for a day or so and then submit a 11.1 glibc update too, so that this bug can be closed.

*** Bug 467898 has been marked as a duplicate of this bug. ***

*** Bug 462675 has been marked as a duplicate of this bug. ***

SLE11 is submitted. We are currently waiting for clearance on releasing 11.1 update, so lowering priority to P5.

Coolo, update for 11.1 approved?

Sure, but usually maint-coord approves things

*** Bug 464560 has been marked as a duplicate of this bug. ***

not sure we need to remaster, but if we do we take this

*** This bug has been marked as a duplicate of bug 469307 ***

Apologies if I don't understand the bug process here, but this is now marked resolved as a duplicate of a bug that is WONTFIX. That reads as though this fix won't be applied, which I'm sure is wrong. Working on the assumption that I have got the wrong end of the stick, is there any kind of timescale that I should expect it to appear? Thanks!

Ad #104: Resolved as part of #103. General problem is fixed, we're working on an update of glibc, this is tracked via bug #387202

Update released for: glibc, glibc-debuginfo, glibc-debugsource, glibc-devel, glibc-html, glibc-i18ndata, glibc-info, glibc-locale, glibc-obsolete, glibc-profile, nscd
Products: openSUSE 11.1 (debug, i586, i686, ppc, ppc64, x86_64)

I just installed this update from the update repo and my name resolution stopped working, both with and without nscd.
Here is the last part of a strace of ping not involving nscd.
open("/lib64/libnss_dns.so.2", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \20\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=23112, ...}) = 0
mmap(NULL, 2117896, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1c8b971000
fadvise64(3, 0, 2117896, POSIX_FADV_WILLNEED) = 0
mprotect(0x7f1c8b976000, 2093056, PROT_NONE) = 0
mmap(0x7f1c8bb75000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x4000) = 0x7f1c8bb75000
close(3) = 0
mprotect(0x7f1c8bb75000, 4096, PROT_READ) = 0
munmap(0x7f1c8c6f3000, 122674) = 0
write(2, "ping: unknown host papaya.edfac."..., 44ping: unknown host papaya.edfac.usyd.edu.au
) = 44
exit_group(2) = ?
Here is the last part of a strace of ping involving nscd:
connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
sendto(3, "\2\0\0\0\4\0\0\0\31\0\0\0papaya.edfac.usyd.ed"..., 37, MSG_NOSIGNAL, NULL, 0) = 37
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP}], 1, 5000) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
read(3, "\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0\1\0\0\0", 32) = 32
close(3) = 0
write(2, "ping: unknown host papaya.edfac."..., 44ping: unknown host papaya.edfac.usyd.edu.au
) = 44
exit_group(2) = ?
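
The 32-byte reply that nscd returns in the trace above can be decoded to see what the daemon actually answered. The following Python sketch assumes the reply is glibc's `hst_response_header` (eight little-endian 32-bit integers, with field names as in nscd/nscd-client.h); that layout is an assumption on my part and should be checked against the glibc sources for the version in question. `decode_hst_header` is a hypothetical helper name.

```python
import struct

# Assumed layout of hst_response_header: eight 32-bit integers.
FIELDS = (
    "version", "found", "h_name_len", "h_aliases_cnt",
    "h_addrtype", "h_length", "h_addr_list_cnt", "error",
)


def decode_hst_header(raw):
    """Unpack the first 32 bytes of an nscd hosts reply into a dict (x86 byte order)."""
    values = struct.unpack("<8i", raw[:32])
    return dict(zip(FIELDS, values))


# The bytes shown by strace in the read() call on the nscd socket.
reply = bytes.fromhex(
    "02000000 00000000 00000000 00000000 ffffffff ffffffff 00000000 01000000"
)
header = decode_hst_header(reply)
```

Under that assumed layout, `header["found"]` is 0 and `header["error"]` is 1, which is the value of HOST_NOT_FOUND in netdb.h, so nscd itself reported the name as not found rather than timing out.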
My name server is bind9, running on the same machine. It is set up as a non-forwarding server. Queries to it using dig work fine.
Host info: x86_64 arch, openSUSE 11.1, up to date with all patches up till today.
When I reverted to the previous version of glibc, name resolution worked again.
I fear that if this bug happens to others too, this new package may prevent other users from getting further updates, painting them into a corner.
Can you please paste your /etc/resolv.conf?

Here it is. I admit I never got netconfig to work. Should I have removed that last comment line? It worked fine before even with it.

## /etc/resolv.conf file autogenerated by netconfig!
#
# Before you change this file manually, consider to define the
# static DNS configuration using the following variables in the
# /etc/sysconfig/network/config file:
# NETCONFIG_DNS_STATIC_SEARCHLIST
# NETCONFIG_DNS_STATIC_SERVERS
# NETCONFIG_DNS_FORWARDER
# or disable DNS configuration updates via netconfig by setting:
# NETCONFIG_DNS_POLICY=''
#
# See also the netconfig(8) manual page and other documentation.
#
# Note: Manual change of this file disables netconfig too, but
# may get lost when this file contains comments or empty lines
# only, the netconfig settings are same with settings in this
# file and in case of a "netconfig update -f" call.
#
### Please remove (at least) this line when you modify the file!
127.0.0.1
search ken.com.au

You should have "nameserver 127.0.0.1" there. Does that help?

Yes, that was unintentional, I simply forgot when editing it by hand. The old glibc must have let me get away with it. It works fine with the update now. Sorry to have wasted your time. Thanks. |
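
A missing `nameserver` keyword like the one found above is easy to overlook by eye. As an illustration only, here is a small Python sketch that flags such lines; `check_resolv_conf` is a hypothetical helper, and the keyword list is the one documented in resolv.conf(5). It is not how glibc parses the file.

```python
# Configuration keywords documented in resolv.conf(5).
KNOWN_KEYWORDS = {"nameserver", "domain", "search", "sortlist", "options"}


def check_resolv_conf(text):
    """Return (nameservers, problems) for resolv.conf-style text."""
    nameservers, problems = [], []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Blank lines and lines starting with '#' or ';' are comments.
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] not in KNOWN_KEYWORDS:
            problems.append(f"line {lineno}: unrecognized keyword {fields[0]!r}")
        elif fields[0] == "nameserver":
            if len(fields) < 2:
                problems.append(f"line {lineno}: nameserver without an address")
            else:
                nameservers.append(fields[1])
    if not nameservers:
        problems.append("no nameserver entries found")
    return nameservers, problems
```

Run against the file pasted in this report, the sketch would report the bare `127.0.0.1` line as an unrecognized keyword and note that no nameserver entries were found; with `nameserver 127.0.0.1` in place it reports no problems.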