Bug 462769 - dns resolution problems - possible getaddrinfo() bugs
Status: RESOLVED DUPLICATE of bug 441947
Alias: None
Product: openSUSE 11.1
Classification: openSUSE
Component: Other
Version: Final
Hardware: x86-64 openSUSE 11.1
Importance: P5 - None : Major with 10 votes
Target Milestone: ---
Assignee: E-mail List
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-12-29 15:04 UTC by Andy Harrison
Modified: 2009-01-09 18:51 UTC

See Also:
Found By: Community User
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
The test program discussed in the comment above. (1.31 KB, text/x-csrc)
2008-12-30 21:52 UTC, Luca Gugelmann

Description Andy Harrison 2008-12-29 15:04:07 UTC
I discussed my problem originally in thread: http://lists.opensuse.org/opensuse/2008-12/msg00879.html

I thought my problem might be related to bug id 441947, but was advised to create a new bug report since disabling ipv6 did not solve my problem.

In a nutshell, everything worked fine with 10.2.  I performed a fresh install of 11.1rc1 (my zypper updates are current as of today), and now I have lots and lots of dns resolution problems.  I am using the same name servers I used previously.  I even ended up restoring a copy of my resolv.conf from my 10.2 box, still no joy.  (I use this exact same resolv.conf file on 75+ servers here at my location.)  Firefox, konq, ssh, telnet, zypper, whois, etc, all fail most of the time when trying to resolve names, always requiring lots of retries before successfully being able to resolve a name.  Yet the dig command *never* fails, even when I try to dig the address in question in the middle of my retries.

In particular, the whois command fails complaining about the exact system call in the title of bug 441947:  getaddrinfo(whois.crsnic.net): Name or service not known.

I've rebooted several times to make sure the system was fresh after making my changes.  I've combed over the logs.  For troubleshooting, I changed my resolv.conf to point to a single name server instead of two.  My dns servers handle the load of tens of thousands of customers, so if there were a problem with them, believe me, my doing a fresh linux install on my workstation would not be how I first find out about dns trouble.

I have disabled ipv6.  I also commented out the ipv6 related entries in /etc/hosts just to make sure they weren't causing a problem.  I have disabled all firewall, selinux and apparmor services.  I use neither dhcp nor NetworkManager.  I have the problem with or without nscd running.

I have pared down my nsswitch.conf so now it just contains:

# grep '^[^#]' /etc/nsswitch.conf
passwd: files ldap
group:  files ldap
hosts:  files dns
networks:       files
services:       files
protocols:      files
rpc:    files
ethers: files
netmasks:       files
netgroup:       files
publickey:      files
bootparams:     files
automount:      files nis
aliases:        files

My host.conf file is stock:

# grep '^[^#]' /etc/host.conf
order hosts, bind
multi on

Here's some info on my network configuration:

# grep '^[^#]' /etc/sysconfig/network/config
DEFAULT_BROADCAST="+"
GLOBAL_POST_UP_EXEC="yes"
GLOBAL_PRE_DOWN_EXEC="yes"
CHECK_DUPLICATE_IP="no"
DEBUG="no"
USE_SYSLOG="yes"
CONNECTION_SHOW_WHEN_IFSTATUS="no"
CONNECTION_CHECK_BEFORE_IFDOWN="no"
CONNECTION_CLOSE_BEFORE_IFDOWN="no"
CONNECTION_UMOUNT_NFS_BEFORE_IFDOWN="no"
CONNECTION_SEND_KILL_SIGNAL="no"
MANDATORY_DEVICES=""
WAIT_FOR_INTERFACES="20"
FIREWALL="no"
LINKLOCAL_INTERFACES="eth*[0-9]|tr*[0-9]|wlan[0-9]|ath[0-9]"
IFPLUGD_OPTIONS="-f -I -b"
NETWORKMANAGER="no"
NM_ONLINE_TIMEOUT="0"
NETCONFIG_MODULES_ORDER="dns-resolver dns-bind dns-dnsmasq nis ntp-runtime"
NETCONFIG_DNS_FORWARDER="resolver"
NETCONFIG_DNS_STATIC_SEARCHLIST="example.net foo.example.net"
NETCONFIG_DNS_STATIC_SERVERS="10.10.10.181 10.10.10.240"
NETCONFIG_NTP_POLICY=""
NETCONFIG_NTP_STATIC_SERVERS=""
NETCONFIG_NIS_POLICY=""
NETCONFIG_NIS_SETDOMAINNAME="yes"
NETCONFIG_NIS_STATIC_DOMAIN=""
NETCONFIG_NIS_STATIC_SERVERS=""
NETCONFIG_DNS_POLICY=""

In desperation, I did add the repository I found in another bug report http://download.opensuse.org/repositories/home:/mtomaschewski:/Factory/openSUSE_Factory/ to zypper and upgraded sysconfig, but that did nothing.

Please let me know what other information I can provide.
Comment 1 Andy Harrison 2008-12-30 16:01:57 UTC
Lending further credence to the idea that this may be a bug related to ipv6, I went through my /etc/ssh/ssh_config and ~/.ssh/config files and made sure that the AddressFamily keywords all had an argument of "inet" instead of "any" and now the ssh command is successful 100% of the time when resolving names.  Other commands such as whois continue to fail.
Comment 2 Luca Gugelmann 2008-12-30 21:50:38 UTC
I have the same problem and it seems indeed to be getaddrinfo() related. Specifically, DNS resolution fails when ai_family is set to AF_UNSPEC in the hints structure (the third argument to getaddrinfo). AF_INET works as intended, and my understanding is that AF_UNSPEC should at least return the IPv4 address instead of failing.

Attached is a small test program, which I hope shows the problem.

The output on my side:
> ./dnstest novell.com
AF_INET:
  130.57.5.70
AF_INET6:
  getaddrinfo: Name or service not known
AF_UNSPEC:
  getaddrinfo: Name or service not known

compare with localhost (which does not go through a dns server):
> ./dnstest localhost
AF_INET:
  127.0.0.1
  127.0.0.1
AF_INET6:
  ::1
AF_UNSPEC:
  127.0.0.1
  ::1

I've been through much of the same troubleshooting as above, no success. ipv6 is disabled on my system (to at least have most of the gui internet apps work).
Comment 3 Luca Gugelmann 2008-12-30 21:52:56 UTC
Created attachment 262821
The test program discussed in the comment above.

gcc dnstest.c -o dnstest
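
The attachment itself is not reproduced on this page. As a rough idea of what such a test looks like, here is a minimal sketch written for illustration only: it is not the attached file, and the helper name lookup() and the SOCK_STREAM hint are assumptions. It queries the same host once per address family with ai_flags left at zero, which is the case discussed in comment 2.

```c
/* Illustrative sketch only; NOT the attached dnstest.c. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper: resolve `host` for one address family and print
 * every address returned. Returns the getaddrinfo() status (0 = success). */
static int lookup(const char *host, int family, const char *label)
{
    struct addrinfo hints, *res = NULL, *p;
    char buf[INET6_ADDRSTRLEN];
    int rc;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = family;        /* AF_INET, AF_INET6 or AF_UNSPEC */
    hints.ai_socktype = SOCK_STREAM; /* assumption: avoids one entry per socket type */
    hints.ai_flags = 0;              /* AI_ADDRCONFIG deliberately not set */

    printf("%s:\n", label);
    rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc != 0) {
        printf("  getaddrinfo: %s\n", gai_strerror(rc));
        return rc;
    }

    for (p = res; p != NULL; p = p->ai_next) {
        const void *addr;

        if (p->ai_family == AF_INET)
            addr = &((struct sockaddr_in *)p->ai_addr)->sin_addr;
        else if (p->ai_family == AF_INET6)
            addr = &((struct sockaddr_in6 *)p->ai_addr)->sin6_addr;
        else
            continue;

        if (inet_ntop(p->ai_family, addr, buf, sizeof buf) != NULL)
            printf("  %s\n", buf);
    }

    freeaddrinfo(res);
    return 0;
}

int main(int argc, char **argv)
{
    /* Falls back to localhost when no hostname is given. */
    const char *host = argc > 1 ? argv[1] : "localhost";

    lookup(host, AF_INET, "AF_INET");
    lookup(host, AF_INET6, "AF_INET6");
    lookup(host, AF_UNSPEC, "AF_UNSPEC");
    return 0;
}
```

Because of the SOCK_STREAM hint, the number of lines printed per family may differ from the output quoted in comment 2.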
Comment 4 Luca Gugelmann 2008-12-30 22:15:15 UTC
Further testing showed that once every few dozen queries the AF_UNSPEC case returns a correct answer. I tried to reproduce it and looking at the wireshark logs stumbled upon the following behavior:

- on an AF_INET query a request for the A record goes out and the correct answer comes in. Everything ok.

- on an AF_INET6 query a request for the AAAA record goes out and "not implemented" is the router's answer (as expected). (This is repeated 4 times.)

- on an AF_UNSPEC query a request for the A record goes out, then a request for the AAAA record goes out, then the answer for the AAAA query comes in (not implemented) and finally the answer to the A query. Note that my router answers the queries in reverse order. In this case getaddrinfo fails. Once in a while the order in which the answers come in is correct (I'm on a wireless network, so I assume sometimes the first packet is delayed). When the order of the answers is consistent with the order of the queries (that is answer to A first, AAAA later) getaddrinfo returns the correct ip.
Comment 5 Luca Gugelmann 2008-12-31 15:57:48 UTC
I'm no longer at my parents' house and the problem disappeared. Switching to a different router apparently fixes the problem, without requiring any configuration changes. This definitely points towards getaddrinfo choking on the answers from some broken(?) dns servers.

Comment 6 Andy Harrison 2009-01-08 20:59:18 UTC
I tried the dnstest program attached by Luca and got the same failures.  I tried installing the factory repository at http://download.opensuse.org/repositories/Base:/build/standard/ and seeing if those updates would help (in case they included glibc updates), but no joy.  So, as a work-around, I installed a recursion-only instance of named locally and pointed my resolv.conf to 127.0.0.1.  Works well enough.  If I can assist with further troubleshooting of the actual problem, I'd be happy to assist.
Comment 7 Petr Baudis 2009-01-09 02:51:51 UTC
The problem here is that getaddrinfo() still tries to resolve IPv6 AAAAs if IPv6 is disabled on your system - does ./dnstest localhost show AF_INET6 results if IPv6 is turned off? Can you paste your ip addr show output? lsmod | grep ipv6? 

Either IPv6 disabling is not working properly or there is a bug in getaddrinfo() IPv6 auto-detection.
Comment 8 Petr Baudis 2009-01-09 03:20:45 UTC
Oh, I have just noticed - your getaddrinfo() call in ./dnstest has no AI_ADDRCONFIG in the ai_flags field - could you set it there instead of zero and try again?

To clarify, we will skip AAAA queries only if AI_ADDRCONFIG flag is used and no IPv6 interfaces are available. Not all applications use AI_ADDRCONFIG, but what is confusing is that your firefox still does not work with IPv6 disabled since it definitely should use AI_ADDRCONFIG.
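
For anyone who wants to compare the two cases side by side, here is a minimal sketch, assuming a hypothetical helper named resolve_unspec() (not part of the attached dnstest): it does one AF_UNSPEC lookup with AI_ADDRCONFIG unset and one with it set, and prints the getaddrinfo() status for each.

```c
/* Illustrative sketch only; resolve_unspec() is a made-up helper name. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Resolve `host` with AF_UNSPEC, optionally setting AI_ADDRCONFIG.
 * Returns the getaddrinfo() status (0 = success). */
static int resolve_unspec(const char *host, int use_addrconfig)
{
    struct addrinfo hints, *res = NULL;
    int rc;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (use_addrconfig)
        hints.ai_flags |= AI_ADDRCONFIG; /* restrict results to configured address families */

    rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc == 0)
        freeaddrinfo(res);
    return rc;
}

int main(int argc, char **argv)
{
    /* Falls back to localhost when no hostname is given. */
    const char *host = argc > 1 ? argv[1] : "localhost";
    int flag;

    for (flag = 0; flag <= 1; flag++) {
        int rc = resolve_unspec(host, flag);
        printf("AI_ADDRCONFIG %s: %s\n",
               flag ? "set" : "unset",
               rc == 0 ? "ok" : gai_strerror(rc));
    }
    return 0;
}
```

On an affected system one would expect the "unset" case to fail intermittently for DNS names while the "set" case succeeds, but that depends on the resolver and the network, so treat the output as a diagnostic hint rather than proof.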
Comment 9 Andy Harrison 2009-01-09 15:49:27 UTC
Apologies, I shouldn't have included firefox in this bug.  I was too liberal in my cutting and pasting of previous communications.  I'm not sure what I did to get firefox working correctly and even though it was showing these symptoms immediately after initial o/s installation, firefox was one of the first apps to start working smoothly for me when I started troubleshooting the problem.
Comment 10 Andy Harrison 2009-01-09 15:56:52 UTC
I have ipv6 disabled.  Here's the proof:

# lsmod | grep -i ipv6
#
# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet 127.0.0.2/8 brd 127.255.255.255 scope host secondary lo
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 00:03:ba:f0:ce:50 brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.104/20 brd 192.168.15.255 scope global eth0
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:50:04:d2:73:7d brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:50:04:62:0a:00 brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 00:03:ba:f0:ce:51 brd ff:ff:ff:ff:ff:ff
    inet 172.24.1.55/23 brd 172.24.1.255 scope global eth3



As for the dnstest program, I'm barely a read-only C programmer, so hopefully I did this correctly.  I changed hints.ai_flags to...

hints.ai_flags |= AI_ADDRCONFIG;

...and recompiled.  AF_UNSPEC results are successful 100% of the time now.


I grabbed the glibc src rpm you attached to bug 441947 and I'm compiling it now.
Comment 11 Luca Gugelmann 2009-01-09 17:27:52 UTC
Setting AI_ADDRCONFIG produces correct results with AF_UNSPEC queries here too. 

Further, I tested the glibc from bug 441947 (of which this bug can now probably be considered a duplicate) and the problem has disappeared regardless of whether AI_ADDRCONFIG is set or not.
Comment 12 Andy Harrison 2009-01-09 18:51:36 UTC
Confirmed, glibc-2.9-5 from bug 441947 fixed the problem for me.

*** This bug has been marked as a duplicate of bug 441947 ***