Bugzilla – Bug 115169
wwwoffle can not process some web pages
Last modified: 2005-09-05 12:37:20 UTC
There is a URL parsing bug in wwwoffle. Unfortunately the author of wwwoffle (Andrew M. Bishop) has not fixed this bug. Here is my mail exchange with Andrew about the bug.

-----------------------------------------------------------------------
Subject: wwwoffle-bug: URL-parsing does not work for www.altavista.com
From: Bjoern Voigt <bjoern@cs.tu-berlin.de>
Date: Thu, 26 May 2005 22:06:02 +0200 (CEST)
To: "Andrew M. Bishop" <amb@gedanken.demon.co.uk>

Hello Andrew!

In summer last year I wrote to you because of problems in the German translation. Unfortunately, I have not yet found the time to fix them.

But I found a bug, which may interest you.

I relatively often use the Altavista search engine:

http://www.altavista.com/

Unfortunately, it's not possible to use this site with wwwoffle (since some versions, tested with 2.8d (Linux) and 2.8e (FreeBSD, Linux)).

You can easily see the problem, if you

a) type a search word into the search field in http://www.altavista.com/

b) click on one of the links in this page

I tried to debug the problem. The main problem is that wwwoffle does some URL re-encodings. After these re-encodings the original URL path component differs from the encoded form.

An example. There is an example link from http://de.altavista.com/ (I changed it a little bit, because I do not know if the URL contains private information):

http://av.rds.yahoo.com/_ylt=A9ibyDZZCEq4AklmSLaMX;_ylu=X3oDBvNjNnZmYzBHBndANhdl93ZWJfaG9tZQRzZWMDdGFicw--/SIG=11nr22kc/EXP=111216420/**http%3a//de.altavista.com/dir/default

wwwoffle transforms it to (sniffed both with gdb and with ethereal):

GET /_ylt=A9ibyDZZCEq4AklmSLaMX;_ylu=X3oDBvNjNnZmYzBHBndANhdl93ZWJfaG9tZQRzZWMDdGFicw--/SIG=11nr22kc/EXP=111216420/**http://de.altavista.com/dir/default HTTP/1.1

Do you see the difference? "http%3a//" is transformed to "http://".

I tried to find the problem in the source code.

Please take a look at my attached patch.
It changes the function

char *URLRecodeFormArgs(const char *str)

in wwwoffle-2.8e/src/miscencdec.c. I made ":" a disallowed character. This fixes the problem with Altavista.

Another question is: Why do you re-encode the path components of a URL? I think, if a client sends a malformed URL, wwwoffle could probably forward it. But this can be a difficult question.

Greetings,
Björn

-----------------------------------------------------------------------
Subject: Re: wwwoffle-bug: URL-parsing does not work for www.altavista.com
From: amb@gedanken.demon.co.uk (Andrew M. Bishop)
Date: 05 Jun 2005 16:04:14 +0100
To: Bjoern Voigt <bjoern@cs.tu-berlin.de>

Hi,

>> But I found a bug, which may interest you.
>>
>> I relatively often use the Altavista search engine:
>>
>> http://www.altavista.com/
>>
>> Unfortunately, it's not possible to use this site with wwwoffle (since
>> some versions, tested with 2.8d (Linux) and 2.8e (FreeBSD, Linux)).
>>
>> You can easily see the problem, if you
>>
>> a) type a search word into the search field in
>> http://www.altavista.com/
>>
>> b) click on one of the links in this page
>>
>> I tried to debug the problem. The main problem is that wwwoffle does
>> some URL re-encodings. After these re-encodings the original URL path
>> component differs from the encoded form.
>>
>> An example. There is an example link from http://de.altavista.com/ (I
>> changed it a little bit, because I do not know if the URL contains
>> private information):
>>
>> http://av.rds.yahoo.com/_ylt=A9ibyDZZCEq4AklmSLaMX;_ylu=X3oDBvNjNnZmYzBHBndANhdl93ZWJfaG9tZQRzZWMDdGFicw--/SIG=11nr22kc/EXP=111216420/**http%3a//de.altavista.com/dir/default
>>
>> wwwoffle transforms it to (sniffed both with gdb and with ethereal):
>>
>> GET /_ylt=A9ibyDZZCEq4AklmSLaMX;_ylu=X3oDBvNjNnZmYzBHBndANhdl93ZWJfaG9tZQRzZWMDdGFicw--/SIG=11nr22kc/EXP=111216420/**http://de.altavista.com/dir/default
>> HTTP/1.1
>>
>> Do you see the difference? "http%3a//" is transformed to "http://".
>>
>> I tried to find the problem in the source code.
>>
>> Please take a look at my attached patch. It changes the function
>>
>> char *URLRecodeFormArgs(const char *str)
>>
>> in wwwoffle-2.8e/src/miscencdec.c. I made ":" a disallowed
>> character. This fixes the problem with Altavista.

I am very surprised that you are having this problem.

If the server is following the URL specifications then it should be decoding the '%3A' in the URL before it uses it. This means that there is no difference between a ':' and a '%3A' in the original URL when it actually uses the path to find the file to send.

The ':' is not being used for a reserved purpose in the part of the URL where it is encoded. I make sure that even though I am not encoding a reserved character there is no chance that it can be wrongly interpreted.

I am not 100% sure where the problem is, WWWOFFLE or AltaVista, but my opinion is that AltaVista is misinterpreting what WWWOFFLE is sending and I will not be applying this patch. Obviously the patch helps you, so feel free to carry on using it. I have not had other reports that this is a problem and it has been this way for a long time in WWWOFFLE.

>> Another question is: Why do you re-encode the path components of a
>> URL? I think, if a client sends a malformed URL, wwwoffle could probably
>> forward it. But this can be a difficult question.

There is a README file 'README.URL' that describes why and when WWWOFFLE performs URL encoding. Basically the problem is that if you don't do it then you end up with more than one URL that is actually the same thing. This means that you can end up with multiple copies of the same web page in the cache or not be able to access a previously cached page.

Now that you have made this change you might find that you cannot access previously cached pages because the URL is different although you enter exactly the same thing in the browser.

Andrew.
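
Andrew's argument that '%3A' and ':' name the same thing once the server decodes the path can be illustrated with a minimal sketch. The helpers below (hex_value, percent_decode, same_after_decoding) are hypothetical names made up for this illustration and are not taken from the WWWOFFLE sources. The sketch decodes every %XX escape, which is only a fair comparison for characters such as ':' that carry no reserved meaning at that position.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = tolower(c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode %XX escapes from in into out (outsz bytes, including the NUL).
   Malformed escapes are copied through unchanged.
   Returns 0 on success, -1 if out is too small. */
static int percent_decode(const char *in, char *out, size_t outsz)
{
    size_t o = 0;
    while (*in) {
        int hi, lo;
        if (o + 1 >= outsz) return -1;
        if (in[0] == '%' && (hi = hex_value((unsigned char)in[1])) >= 0
                         && (lo = hex_value((unsigned char)in[2])) >= 0) {
            out[o++] = (char)(hi * 16 + lo);
            in += 3;
        } else {
            out[o++] = *in++;
        }
    }
    if (outsz == 0) return -1;
    out[o] = '\0';
    return 0;
}

/* Returns 1 if both paths are identical once decoded, 0 otherwise
   (also 0 if either path does not fit the fixed-size buffers). */
static int same_after_decoding(const char *a, const char *b)
{
    char da[512], db[512];
    if (percent_decode(a, da, sizeof da) != 0) return 0;
    if (percent_decode(b, db, sizeof db) != 0) return 0;
    return strcmp(da, db) == 0;
}
```

With this sketch, same_after_decoding() returns 1 for "/**http%3a//de.altavista.com/dir/default" and "/**http://de.altavista.com/dir/default". A server that acts on the raw, undecoded string would still see two different URLs, which is what the behaviour reported above suggests.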
-----------------------------------------------------------------------

And here is my patch:

-----------------------------------------------------------------------
--- wwwoffle-2.8e/src/miscencdec.c.orig 2005-05-26 21:00:22.000000000 +0200
+++ wwwoffle-2.8e/src/miscencdec.c 2005-05-26 22:05:16.000000000 +0200
@@ -140,7 +140,7 @@
 The characters in the range 0x00-0x1f and 0x7f-0xff are always disallowed.
 The '%' character is always disallowed because it is the quote character.
 RFC 1738 section 2.2 calls " <>"#%{}|\^~[]`" unsafe characters, I make an exception for "|~".
-RFC 1738 section 2.2 calls ";/?:@=&" reserved characters, I make an exception for "/:".
+RFC 1738 section 2.2 calls ";/?:@=&" reserved characters, I make an exception for "/".
 RFC 1866 section 8.2.1 says that ' ' is replaced by '+'.
 I disallow "'" because it may lead to confusion.
 The unencoded characters "&=;" on the input are left unencoded on the output
@@ -150,7 +150,7 @@
 static char allowed[257]=
 "                                "  /* 0x00-0x1f "                                " */
-" !  $   ()* ,-./0123456789:     "  /* 0x20-0x3f " !"#$%&'()*+,-./0123456789:;<=>?" */
+" !  $   ()* ,-./0123456789      "  /* 0x20-0x3f " !"#$%&'()*+,-./0123456789;<=>?" */
 " ABCDEFGHIJKLMNOPQRSTUVWXYZ    _"  /* 0x40-0x5f "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_" */
 " abcdefghijklmnopqrstuvwxyz | ~ "  /* 0x60-0x7f "`abcdefghijklmnopqrstuvwxyz{|}~ " */
 "                                "  /* 0x80-0x9f "                                " */
-----------------------------------------------------------------------
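
The effect of the patch can be sketched as follows. This is a simplified illustration, not the WWWOFFLE code: encode_path and the two PUNCT_* sets are hypothetical names, the punctuation sets are read off the allowed[] table in the patch, and the upper-case hex digits in the output are an assumption. The special handling of "&=;" and of ' ' becoming '+' mentioned in the source comment is left out.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Punctuation left unencoded, read off the allowed[] table (assumption:
   letters and digits are always allowed, as the table shows). */
#define PUNCT_ORIGINAL "!$()*,-./:_|~"   /* before the patch: ':' allowed   */
#define PUNCT_PATCHED  "!$()*,-./_|~"    /* after the patch: ':' disallowed */

/* Hypothetical simplified encoder: copies allowed characters and writes
   every other byte as %XX. in is expected to be an already decoded path.
   Returns 0 on success, -1 if out (outsz bytes) is too small. */
static int encode_path(const char *in, const char *allowed_punct,
                       char *out, size_t outsz)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;

    for (; *in; in++) {
        unsigned char c = (unsigned char)*in;
        if (c < 0x80 && (isalnum(c) || strchr(allowed_punct, c) != NULL)) {
            if (o + 1 >= outsz) return -1;      /* 1 char + NUL */
            out[o++] = (char)c;
        } else {
            if (o + 3 >= outsz) return -1;      /* 3 chars + NUL */
            out[o++] = '%';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0f];
        }
    }
    if (outsz == 0) return -1;
    out[o] = '\0';
    return 0;
}
```

Encoding the decoded path "/**http://de.altavista.com/dir/default" with PUNCT_ORIGINAL leaves it unchanged, which matches the GET request quoted above. With PUNCT_PATCHED the same input becomes "/**http%3A//de.altavista.com/dir/default", so the ':' reaches the server in encoded form again.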
The points made by the author convince me that I don't want to apply this patch in the SUSE tree. Don't convince me, convince the author.