Bug 115169

Summary: wwwoffle can not process some web pages
Product: [openSUSE] SUSE LINUX 10.0 Reporter: Björn Voigt <bjoernv>
Component: Network Assignee: Mads Martin Joergensen <mmj>
Status: RESOLVED WONTFIX QA Contact: E-mail List <qa-bugs>
Severity: Major    
Priority: P5 - None    
Version: Beta 4   
Target Milestone: ---   
Hardware: x86   
OS: All   
Whiteboard:
Found By: Beta-Customer Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description Björn Voigt 2005-09-03 19:55:08 UTC
There is a URL parsing bug in wwwoffle. Unfortunately the author of wwwoffle
(Andrew M. Bishop) has not fixed this bug.

Here is my e-mail exchange with Andrew about the bug.
-----------------------------------------------------------------------
Subject: wwwoffle-bug: URL-parsing does not work for www.altavista.com
From: Bjoern Voigt <bjoern@cs.tu-berlin.de>
Date: Thu, 26 May 2005 22:06:02 +0200 (CEST)
To: "Andrew M. Bishop" <amb@gedanken.demon.co.uk>

Hello Andrew!

Last summer I wrote to you about problems in the German translation.
Unfortunately, I have not yet found the time to fix them.

But I found a bug, which may interest you.

I relatively often use the Altavista search engine: 

  http://www.altavista.com/

Unfortunately, this site has not worked with wwwoffle for several versions
(tested with 2.8d (Linux) and 2.8e (FreeBSD, Linux)).

You can easily see the problem, if you

  a) type a search word into the search field in 
     http://www.altavista.com/

  b) click on one of the links in this page

I tried to debug the problem. The main problem is that wwwoffle re-encodes
URLs. After this re-encoding, the URL path component that is sent differs
from the original one.

Here is an example link from http://de.altavista.com/ (I changed it slightly
because I do not know whether the URL contains private information):

  
  http://av.rds.yahoo.com/_ylt=A9ibyDZZCEq4AklmSLaMX;_ylu=X3oDBvNjNnZmYzBHBndANhdl93ZWJfaG9tZQRzZWMDdGFicw--/SIG=11nr22kc/EXP=111216420/**http%3a//de.altavista.com/dir/default

wwwoffle transforms it to (sniffed both with gdb and with ethereal):

   GET /_ylt=A9ibyDZZCEq4AklmSLaMX;_ylu=X3oDBvNjNnZmYzBHBndANhdl93ZWJfaG9tZQRzZWMDdGFicw--/SIG=11nr22kc/EXP=111216420/**http://de.altavista.com/dir/default HTTP/1.1

Do you see the difference? "http%3a//" is transformed to "http://". 

I tried to find the problem in the source code. 

Please take a look at my attached patch. It changes the function 

   char *URLRecodeFormArgs(const char *str)

in wwwoffle-2.8e/src/miscencdec.c. I made ":" a disallowed character. This
fixes the problem with Altavista. 

Another question is: why do you re-encode the path components of a URL? I
think that if a client sends a malformed URL, wwwoffle could probably forward
it unchanged. But this may be a difficult question.

Greetings, Björn
-----------------------------------------------------------------------
Subject: Re: wwwoffle-bug: URL-parsing does not work for www.altavista.com
From: amb@gedanken.demon.co.uk (Andrew M. Bishop)
Date: 05 Jun 2005 16:04:14 +0100
To: Bjoern Voigt <bjoern@cs.tu-berlin.de>

Hi,


>> But I found a bug, which may interest you.
>> 
>> I relatively often use the Altavista search engine: 
>> 
>>   http://www.altavista.com/
>> 
>> Unfortunately, this site has not worked with wwwoffle for several
>> versions (tested with 2.8d (Linux) and 2.8e (FreeBSD, Linux)).
>> 
>> You can easily see the problem, if you
>> 
>>   a) type a search word into the search field in 
>>      http://www.altavista.com/
>> 
>>   b) click on one of the links in this page
>> 
>> I tried to debug the problem. The main problem is that wwwoffle
>> re-encodes URLs. After this re-encoding, the URL path component that
>> is sent differs from the original one.
>> 
>> Here is an example link from http://de.altavista.com/ (I changed it
>> slightly because I do not know whether the URL contains private
>> information):
>> 
>>   http://av.rds.yahoo.com/_ylt=A9ibyDZZCEq4AklmSLaMX;_ylu=X3oDBvNjNnZmYzBHBndANhdl93ZWJfaG9tZQRzZWMDdGFicw--/SIG=11nr22kc/EXP=111216420/**http%3a//de.altavista.com/dir/default
>> 
>> wwwoffle transforms it to (sniffed both with gdb and with ethereal):
>> 
>>    GET /_ylt=A9ibyDZZCEq4AklmSLaMX;_ylu=X3oDBvNjNnZmYzBHBndANhdl93ZWJfaG9tZQRzZWMDdGFicw--/SIG=11nr22kc/EXP=111216420/**http://de.altavista.com/dir/default HTTP/1.1
>> 
>> Do you see the difference? "http%3a//" is transformed to "http://". 
>> 
>> I tried to find the problem in the source code. 
>> 
>> Please take a look at my attached patch. It changes the function 
>> 
>>    char *URLRecodeFormArgs(const char *str)
>> 
>> in wwwoffle-2.8e/src/miscencdec.c. I made ":" a disallowed
>> character. This fixes the problem with Altavista. 


I am very surprised that you are having this problem.  If the server
is following the URL specifications then it should be decoding the
'%3A' in the URL before it uses it.  This means that there is no
difference between a ':' and a '%3A' in the original URL when it
actually uses the path to find the file to send.

The ':' is not being used for a reserved purpose in the part of the
URL where it is encoded.  I make sure that even though I am not
encoding a reserved character there is no chance that it can be
wrongly interpreted.

I am not 100% sure where the problem is, WWWOFFLE or AltaVista, but my
opinion is that AltaVista is misinterpreting what WWWOFFLE is sending
and I will not be applying this patch.  Obviously the patch helps you,
so feel free to carry on using it.  I have not had other reports that
this is a problem, and it has been this way for a long time in
WWWOFFLE.


>> Another question is: why do you re-encode the path components of a
>> URL? I think that if a client sends a malformed URL, wwwoffle could
>> probably forward it unchanged. But this may be a difficult question.


There is a README file 'README.URL' that describes why and when
WWWOFFLE performs URL encoding.  Basically the problem is that if you
don't do it then you end up with more than one URL that is actually
the same thing.  This means that you can end up with multiple copies
of the same web page in the cache or not be able to access a
previously cached page.

Now that you have made this change you might find that you cannot
access previously cached pages because the URL is different although
you enter exactly the same thing in the browser.

Andrew.
-----------------------------------------------------------------------
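To illustrate Andrew's point that ':' and '%3A' name the same path once the
server has decoded it, here is a minimal sketch in C. It is not WWWOFFLE code:
the names hex_value, percent_decode and same_after_decoding are made up for
this illustration, and it only handles plain %XX escapes.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from WWWOFFLE): numeric value of one hex digit.
   The caller must already have checked the character with isxdigit(). */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower((unsigned char)c) - 'a' + 10;
}

/* Hypothetical helper: decode %XX escapes into a newly allocated string.
   Malformed escapes are copied through unchanged. The caller frees the
   result; NULL is returned if allocation fails. */
char *percent_decode(const char *str)
{
    size_t i, j = 0, len;
    char *out;

    assert(str != NULL);
    len = strlen(str);
    out = malloc(len + 1);   /* decoding never makes the string longer */
    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        if (str[i] == '%' &&
            isxdigit((unsigned char)str[i + 1]) &&
            isxdigit((unsigned char)str[i + 2])) {
            out[j++] = (char)(hex_value(str[i + 1]) * 16 + hex_value(str[i + 2]));
            i += 2;
        } else {
            out[j++] = str[i];
        }
    }
    out[j] = '\0';
    return out;
}

/* Returns 1 if the two paths are identical after decoding, 0 otherwise
   (including when an allocation fails). */
int same_after_decoding(const char *a, const char *b)
{
    char *da = percent_decode(a);
    char *db = percent_decode(b);
    int same = (da != NULL && db != NULL && strcmp(da, db) == 0);

    free(da);
    free(db);
    return same;
}
```

For the tail of the link from the first mail,
same_after_decoding("/**http%3a//de.altavista.com/dir/default",
"/**http://de.altavista.com/dir/default") returns 1. A server that compares
the raw, undecoded path instead would see two different strings.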

And here is my patch:
-----------------------------------------------------------------------
--- wwwoffle-2.8e/src/miscencdec.c.orig	2005-05-26 21:00:22.000000000 +0200
+++ wwwoffle-2.8e/src/miscencdec.c	2005-05-26 22:05:16.000000000 +0200
@@ -140,7 +140,7 @@
    The characters in the range 0x00-0x1f and 0x7f-0xff are always disallowed.
    The '%' character is always disallowed because it is the quote character.
    RFC 1738 section 2.2 calls " <>"#%{}|\^~[]`" unsafe characters, I make an exception for "|~".
-   RFC 1738 section 2.2 calls ";/?:@=&" reserved characters, I make an exception for "/:".
+   RFC 1738 section 2.2 calls ";/?:@=&" reserved characters, I make an exception for "/".
    RFC 1866 section 8.2.1 says that ' ' is replaced by '+'.
    I disallow "'" because it may lead to confusion.
    The unencoded characters "&=;" on the input are left unencoded on the output
@@ -150,7 +150,7 @@
 
  static char allowed[257]=
  "                                "  /* 0x00-0x1f "                                " */
- " !  $   ()* ,-./0123456789:     "  /* 0x20-0x3f " !"#$%&'()*+,-./0123456789:;<=>?" */
+ " !  $   ()* ,-./0123456789      "  /* 0x20-0x3f " !"#$%&'()*+,-./0123456789;<=>?" */
  " ABCDEFGHIJKLMNOPQRSTUVWXYZ    _"  /* 0x40-0x5f "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_" */
  " abcdefghijklmnopqrstuvwxyz | ~ "  /* 0x60-0x7f "`abcdefghijklmnopqrstuvwxyz{|}~ " */
  "                                "  /* 0x80-0x9f "                                " */
-----------------------------------------------------------------------
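The effect of removing ':' from the allowed table can be sketched as follows.
This is a simplified illustration, not the real URLRecodeFormArgs():
encode_except and the ALLOWED_* macros are hypothetical names, the input is
assumed to be already percent-decoded, and the special handling of "&=;" and
of ' ' versus '+' mentioned in the source comment is left out. Emitting
upper-case hex digits is also an assumption of the sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Characters left unencoded, read off the table in the patch above.
   ALLOWED_ORIGINAL keeps ':' as in wwwoffle-2.8e, ALLOWED_PATCHED drops it.
   The macro names are made up for this sketch. */
#define ALLOWED_COMMON \
    "!$()*,-./0123456789" \
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ_" \
    "abcdefghijklmnopqrstuvwxyz|~"
#define ALLOWED_ORIGINAL ALLOWED_COMMON ":"
#define ALLOWED_PATCHED  ALLOWED_COMMON

/* Hypothetical function: percent-encode every byte of an already decoded
   string that is not listed in `allowed`. Returns a newly allocated string
   that the caller frees, or NULL if allocation fails. */
char *encode_except(const char *str, const char *allowed)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t i, j = 0, len;
    char *out;

    assert(str != NULL && allowed != NULL);
    len = strlen(str);
    out = malloc(3 * len + 1);   /* worst case: every byte becomes %XX */
    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)str[i];

        if (strchr(allowed, c) != NULL) {
            out[j++] = (char)c;           /* allowed: copy as-is */
        } else {
            out[j++] = '%';               /* disallowed: emit %XX */
            out[j++] = hex[c >> 4];
            out[j++] = hex[c & 0x0f];
        }
    }
    out[j] = '\0';
    return out;
}
```

With ALLOWED_ORIGINAL the decoded tail "/**http://de.altavista.com/dir/default"
comes back unchanged, which matches the request seen on the wire. With
ALLOWED_PATCHED the ':' is escaped again and the result contains "http%3A//".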
Comment 1 Mads Martin Joergensen 2005-09-05 12:37:20 UTC
The points made by the author convince me that I don't want to apply this
patch in the SUSE tree. Don't convince me, convince the author.