Bugzilla – Bug 530440
mysterious delay in processing of DNS replies in aria2c
Last modified: 2009-09-09 14:16:32 UTC
Here's a case of funny timeouts of 5-6 seconds happening with DNS lookups done by aria2c. It happens on Factory (zypper dup from today). I have collected quite some debugging info from the user who observed it. From what I saw, it doesn't look like a problem in DNS setup. Indeed, from looking at strace and tcpdump it seems that DNS replies are sent rather immediately from the nameserver, but aria2c still keeps polling for data. Another peculiarity is that each DNS lookup seems to happen twice. The symptom is that aria2c seems to sleep for e.g. 10 seconds before starting to download stuff. Could be seen with both aria2c 1.4.1 and 1.5.1. The same doesn't seem to be reproducible with wget, host, curl. Also, it doesn't happen on other systems: - openSUSE 11.0 - openSUSE Factory from mid of June - openSUSE Factory from mid of June, after zypper dup from today (still the old system running)
I'm the one with the troubles. If you need more info / testing I'll be happy to do so.
After rebooting my updated Factory, I still can't reproduce the hangs here. Too bad. Anyway, here's what data I have collected so far: Tcpdump showing the pauses between individual flows: 16:44:11.720504 IP 192.168.1.100.40990 > 192.168.1.254.domain: 5036+ A? ftp.hosteurope.de. (35) 16:44:11.720516 IP 192.168.1.100.40990 > 192.168.1.254.domain: 21345+ AAAA? ftp.hosteurope.de. (35) 16:44:11.763775 IP 192.168.1.254.domain > 192.168.1.100.40990: 21345 0/1/0 (86) 16:44:16.724809 IP 192.168.1.100.40990 > 192.168.1.254.domain: 5036+ A? ftp.hosteurope.de. (35) 16:44:16.725913 IP 192.168.1.100.52457 > 192.168.1.254.domain: 41816+ A? ftp.hosteurope.de. (35) 16:44:16.725923 IP 192.168.1.100.52457 > 192.168.1.254.domain: 33685+ AAAA? ftp.hosteurope.de. (35) 16:44:16.768285 IP 192.168.1.254.domain > 192.168.1.100.52457: 41816 1/0/0 A[|domain] 16:44:21.729850 IP 192.168.1.100.40990 > 192.168.1.254.domain: 5036+ A? ftp.hosteurope.de. (35) 16:44:21.730270 IP 192.168.1.100.52457 > 192.168.1.254.domain: 41816+ A? ftp.hosteurope.de. (35) 16:44:21.771732 IP 192.168.1.254.domain > 192.168.1.100.40990: 5036 1/0/0 A[|domain] 16:44:21.771770 IP 192.168.1.100.40990 > 192.168.1.254.domain: 21345+ AAAA? ftp.hosteurope.de. (35) 16:44:21.776133 IP 192.168.1.254.domain > 192.168.1.100.52457: 41816 1/0/0 A[|domain] 16:44:21.776155 IP 192.168.1.100.52457 > 192.168.1.254.domain: 33685+ AAAA? ftp.hosteurope.de. (35) 16:44:21.812710 IP 192.168.1.254.domain > 192.168.1.100.40990: 21345 0/1/0 (86) 16:44:26.735176 IP 192.168.1.100.55218 > 192.168.1.254.domain: 7019+ A? sunsite.rwth-aachen.de. (40) 16:44:26.735193 IP 192.168.1.100.55218 > 192.168.1.254.domain: 15183+ AAAA? sunsite.rwth-aachen.de. (40) 16:44:26.783206 IP 192.168.1.254.domain > 192.168.1.100.55218: 15183 1/1/0 (127) 16:44:31.734925 IP 192.168.1.100.55218 > 192.168.1.254.domain: 7019+ A? sunsite.rwth-aachen.de. (40) 16:44:31.776658 IP 192.168.1.254.domain > 192.168.1.100.55218: 7019 2/0/0[|domain] 16:44:31.776696 IP 192.168.1.100.55218 > 192.168.1.254.domain: 15183+ AAAA? sunsite.rwth-aachen.de. (40) 16:44:31.837851 IP 192.168.1.254.domain > 192.168.1.100.55218: 15183 1/1/0 (127) 16:44:31.838132 IP 192.168.1.100.35705 > 192.168.1.254.domain: 31718+ A? ftp.halifax.rwth-aachen.de. (44) 16:44:31.838147 IP 192.168.1.100.35705 > 192.168.1.254.domain: 914+ AAAA? ftp.halifax.rwth-aachen.de. (44) 16:44:31.881902 IP 192.168.1.254.domain > 192.168.1.100.35705: 31718 1/0/0 (60) 16:44:36.842907 IP 192.168.1.100.35705 > 192.168.1.254.domain: 31718+ A? ftp.halifax.rwth-aachen.de. (44) 16:44:36.902900 IP 192.168.1.254.domain > 192.168.1.100.35705: 31718 1/0/0 (60) 16:44:36.902938 IP 192.168.1.100.35705 > 192.168.1.254.domain: 914+ AAAA? ftp.halifax.rwth-aachen.de. (44) 16:44:36.946850 IP 192.168.1.254.domain > 192.168.1.100.35705: 914 0/1/0 (98) 16:44:36.947133 IP 192.168.1.100.37436 > 192.168.1.254.domain: 21271+ A? ftp.uni-bremen.de. (35) 16:44:36.947150 IP 192.168.1.100.37436 > 192.168.1.254.domain: 9133+ AAAA? ftp.uni-bremen.de. (35) 16:44:37.008985 IP 192.168.1.254.domain > 192.168.1.100.37436: 9133 1/1/0 CNAME[|domain] 16:44:41.951993 IP 192.168.1.100.37436 > 192.168.1.254.domain: 21271+ A? ftp.uni-bremen.de. (35) 16:44:41.993507 IP 192.168.1.254.domain > 192.168.1.100.37436: 21271 2/0/0 CNAME[|domain] 16:44:41.993553 IP 192.168.1.100.37436 > 192.168.1.254.domain: 9133+ AAAA? ftp.uni-bremen.de. (35) 16:44:42.037741 IP 192.168.1.254.domain > 192.168.1.100.37436: 9133 1/1/0 CNAME[|domain] 16:44:42.038017 IP 192.168.1.100.39617 > 192.168.1.254.domain: 60322+ A? ftp.rz.uni-wuerzburg.de. (41) 16:44:42.038032 IP 192.168.1.100.39617 > 192.168.1.254.domain: 2988+ AAAA? ftp.rz.uni-wuerzburg.de. (41) 16:44:42.097972 IP 192.168.1.254.domain > 192.168.1.100.39617: 60322 1/0/0 (57) 16:44:47.042966 IP 192.168.1.100.39617 > 192.168.1.254.domain: 60322+ A? ftp.rz.uni-wuerzburg.de. (41) 16:44:47.084376 IP 192.168.1.254.domain > 192.168.1.100.39617: 60322 1/0/0 (57) 16:44:47.084408 IP 192.168.1.100.39617 > 192.168.1.254.domain: 2988+ AAAA? ftp.rz.uni-wuerzburg.de. (41) 16:44:47.128776 IP 192.168.1.254.domain > 192.168.1.100.39617: 2988 0/1/0 (95) strace (from strace -f -tt -o aria2c.strace aria2c --file-allocation=none http://download.opensuse.org/distribution/11.1/repo/oss/suse/x86_64/kernel-source-2.6.27.7-9.1.x86_64.rpm): 11961 16:18:49.023661 stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=820, ...}) = 0 11961 16:18:49.023725 open("/etc/resolv.conf", O_RDONLY) = 10 11961 16:18:49.023770 fstat(10, {st_mode=S_IFREG|0644, st_size=820, ...}) = 0 11961 16:18:49.023825 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f310e69e000 11961 16:18:49.023865 read(10, "### /etc/resolv.conf file autoge"..., 4096) = 820 11961 16:18:49.023915 read(10, "", 4096) = 0 11961 16:18:49.023952 close(10) = 0 11961 16:18:49.023989 munmap(0x7f310e69e000, 4096) = 0 11961 16:18:49.024029 uname({sys="Linux", node="Goliath", ...}) = 0 11961 16:18:49.024132 socket(PF_INET, 0x802 /* SOCK_??? */, IPPROTO_IP) = 10 11961 16:18:49.024179 connect(10, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.1.254")}, 28) = 0 11961 16:18:49.024231 poll([{fd=10, events=POLLOUT}], 1, 0) = 1 ([{fd=10, revents=POLLOUT}]) 11961 16:18:49.024277 sendto(10, "I\233\1\0\0\1\0\0\0\0\0\0\3ftp\vtu-chemnitz\2de\0"..., 36, MSG_NOSIGNAL, NULL, 0) = 36 11961 16:18:49.024345 poll([{fd=10, events=POLLIN|POLLOUT}], 1, 5000) = 1 ([{fd=10, revents=POLLOUT}]) 11961 16:18:49.024390 sendto(10, "9A\1\0\0\1\0\0\0\0\0\0\3ftp\vtu-chemnitz\2de\0"..., 36, MSG_NOSIGNAL, NULL, 0) = 36 11961 16:18:49.024455 poll([{fd=10, events=POLLIN}], 1, 4999) = 1 ([{fd=10, revents=POLLIN}]) 11961 16:18:49.067486 ioctl(10, FIONREAD, [94]) = 0 11961 16:18:49.067544 recvfrom(10, "9A\201\200\0\1\0\0\0\1\0\0\3ftp\vtu-chemnitz\2de\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.1.254")}, [16]) = 94 11961 16:18:49.067618 poll([{fd=10, events=POLLIN}], 1, 4956) = 0 (Timeout) 11961 16:18:54.062894 poll([{fd=10, events=POLLOUT}], 1, 0) = 1 ([{fd=10, revents=POLLOUT}]) 11961 16:18:54.062953 sendto(10, "I\233\1\0\0\1\0\0\0\0\0\0\3ftp\vtu-chemnitz\2de\0"..., 36, MSG_NOSIGNAL, NULL, 0) = 36 11961 16:18:54.063030 poll([{fd=10, events=POLLIN}], 1, 5000) = 1 ([{fd=10, revents=POLLIN}]) 11961 16:18:54.104056 ioctl(10, FIONREAD, [52]) = 0 11961 16:18:54.104101 recvfrom(10, "I\233\201\200\0\1\0\1\0\0\0\0\3ftp\vtu-chemnitz\2de\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.1.254")}, [16]) = 52 11961 16:18:54.104172 poll([{fd=10, events=POLLOUT}], 1, 4958) = 1 ([{fd=10, revents=POLLOUT}]) 11961 16:18:54.104217 sendto(10, "9A\1\0\0\1\0\0\0\0\0\0\3ftp\vtu-chemnitz\2de\0"..., 36, MSG_NOSIGNAL, NULL, 0) = 36 11961 16:18:54.104278 poll([{fd=10, events=POLLIN}], 1, 4958) = 1 ([{fd=10, revents=POLLIN}]) 11961 16:18:54.148234 ioctl(10, FIONREAD, [94]) = 0 11961 16:18:54.148278 recvfrom(10, "9A\201\200\0\1\0\0\0\1\0\0\3ftp\vtu-chemnitz\2de\0"..., 1996, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.1.254")}, [16]) = 94 11961 16:18:54.148350 close(10) = 0 11961 16:18:54.148437 write(4, "2009-08-12 16:18:54.148416 INFO "..., 105) = 105 11961 16:18:54.148508 write(4, "2009-08-12 16:18:54.148494 INFO "..., 74) = 74 The strace rather looks as if aria2c is still polling for data even though it already got the DNS response. Not sure what to make of this.
Created attachment 312326 [details] aria2 log file as well as strace with timestamps
strace/tcpdump log says ipv4 address was returned but ipv6 query result was not returned. I think aria2c waited for the ipv6 query result, but it didn't come back before timeout(5 or 10seconds?). If strace/tcpdump log of wget doesn't show ipv6 dns queries, I think the problem is the slow look up of IPv6 address.
Aaah, that makes sense. That were actually two queries. I didn't understand this. Thanks a lot, Tatsuhiro! Stephan, can you reproduce the delay with "host -t AAAA <hostname>" maybe?
Unfortunately I can't: stephan@Goliath:~> time host -t AAAA ipv6.google.com ipv6.google.com is an alias for ipv6.l.google.com. ipv6.l.google.com has IPv6 address 2001:4860:a003::68 real 0m0.055s user 0m0.002s sys 0m0.002s (besides that I disabled ipv6 during installation so I would expect it not to bother with it anyways)
Created attachment 312719 [details] urlgrabber binary tcpdump of port 53 Those delays also exist with urlgrabber (used by osc). A tcpdump for urlgrabber http://download.opensuse.org/distribution/11.1/repo/oss/suse/x86_64/kernel-source-2.6.27.7-9.1.x86_64.rpm is attached
aria2 uses getaddrinfo function to resolve address(it supports asynchronous dns lookup, but according to the strace log, it is not compiled in, so I assume it is disabled). I searched about getaddrinfo and slow DNS lookup and found that it was caused by slow AAAA record lookup. It seems that even if ipv6 is disabled by the system configuration, getaddrinfo by default lookup AAAA records along with A. Some name server are very slow to resolve IPv6 addresses(sorry I don't know why). If name server is slow for AAAA lookup, I think all we can do is not issuing AAAA request and only issue A lookup. I think AI_ADDRCONFIG flag for getaddrinfo might help. Its man page said that: If hints.ai_flags includes the AI_ADDRCONFIG flag, then IPv4 addresses are returned in the list pointed to by result only if the local system has at least one IPv4 address configured, and IPv6 addresses are only returned if the local system has at least one IPv6 address configured. I'll upload a patch to use AI_ADDRCONFIG flag.
Created attachment 313126 [details] Patch to use AI_ADDRCONFIG flag
Thanks a lot for the patch Tatsuhiro but sadly it doesn't change anything. I'm also more inclined to believe that this isn't directly a fault of aria2 but some general issue of 11.2 since there are delays in other programs as well and the same aria2 version works just fine on 10.3.
Thanks for quick response, Stephan. I agree that this is a general issue of 11.2. There are nothing we application developer can do here. I am still concerned about IPv6/DNS lookup and found an interesting page: http://udrepper.livejournal.com/20948.html The interesting part is: "In glibc 2.9 I already implemented an improvement to the DNS NSS module which optimizes the lookup of IPv4 and IPv6 addresses for the same host. (snip) "The problem with this change was that there are broken DNS servers and broken firewall configurations which prevented the two results from being received successfully. Some broken DNS servers (especially those in cable modems etc) only send one reply. For this reason Fedora had this change disabled in F10. "The drawback is that a timeout is needed to detect the broken servers or firewalls. This delay is experienced once per process start and could be noticeable. But the broken setups of the few people affected must not prevent the far larger group of people with working setups to experience the advantage of the parallel lookup. I think this is possibly a cause of the issue: In previous 2 tcpdump logs, I observed that IPv4 and IPv6 DNS lookups were done but the name server responded only IPv4 request and there were delays. 11.2 uses glibc2.10 since Milestone 3. The above article says workaround for the delays: * Install a caching name server on this machine or somewhere on the local network. bind is known to work correctly. * Run nscd on the local machine. In this case the delay is incurred once per system start (i.e., at the first lookup nscd performs). * Add “single-request” to the options in /etc/resolv.conf. This selects the compatibility mode from the start.
Cheers Tatsuhiro, you totally hit the mark with that guess. After adding "options single-request" to my resolv.conf it works again as nicely as in earlier versions. Thanks a lot once more for your help, I guess I would have needed a million years to stumble about this ;D
Is it not a bug that this special configuration is needed, which nobody knows about, and which we can't expect normal users to find out about? I would think so. So even the issue found a resolution in this particular case, the issue is still there. And as far as I understand it, it does't occur too rarely. If it occurs rarely, then it might make sense to ignore it. But I can assure that it has at least cost me several hours to investigate. The same is true for Stephan and Tatsuhiro I think. Petr, I would like to hear your opinion about this as glibc maintainer. Tatsuhiro, we have to thank you very much for digging into this and coming up with the analysis, and solutions.
This should in fact occur fairly rarely, but of course it's hard to measure such things. Something we could do is add single-request to /etc/resolv.conf by default, but ideally, users should never notice since we run nscd by default; I assume it crashed soon for this user, but switching to unscd by default for 11.2 should help much, I have to consult with coolo if we still have time for that. I'm closing the bug for now at any rate; we will need more reports to see if this is a real-world issue anyway to consider changing the defaults.
Ah yes, I forgot to ask about nscd (which I would expect to be running per default). And it would be interesting to know which make/model the little router is - Stephan?
@Petr: FWIW: I just tried with nscd instead of "options single-request" and it works fine so I´m happy. @Peter: It is a "Fiberline CAS 2040" which is also pretty old (iirc ~10 years or so) so I think the chances that this works better with newer stuff are pretty good.
Thanks. This is relieving. So there is NO indication that this would affect millions of Germans who all use the same piece of hardware, which is currently sold by the biggest German ISP. Good!
Sorry, but I just stumbled about an additional issue regarding this I didn't think of earlier: I just did a NET install which took more than 4 hours and of those 4 hours it at least spent 3 sitting there and not doing anything cause of those stupid timeouts. So, it is perfectly understandable that this workaround isn't made the default and I'm perfectly happy to change it myself on an installed system but could you please be so kind to add said workaround as default __during the installation__ of a __NET install__? (If I understood it right the disadvantages for people who don't need this are barely noticeable while it plain sucks to have >400% install time).
Ok, then unscd should be running during the installation.
running now
Thanks, Steffen!