Bug 153557 - Apache's directory auto-index can not display UTF-8 filenames correctly
Status: RESOLVED FIXED
Alias: None
Product: openSUSE 10.2
Classification: openSUSE
Component: Network
Version: unspecified
Hardware: PC Other
Priority: P5 - None, Severity: Normal
Target Milestone: ---
Assignee: E-mail List
QA Contact: E-mail List
 
Reported: 2006-02-25 18:45 UTC by Björn Voigt
Modified: 2007-08-31 11:01 UTC

See Also:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
the patch which I mentioned (2.27 KB, patch)
2006-12-21 09:44 UTC, Peter Poeml

Description Björn Voigt 2006-02-25 18:45:26 UTC
File names are encoded in UTF-8 by default in newer versions of SuSE Linux. However, the Apache directory auto-index (mod_autoindex) cannot display UTF-8 file names correctly, because SuSE's Apache package sends directory auto-indexes in ISO-8859-1 (the default character set).
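
The symptom can be reproduced outside Apache. The following is a minimal Python sketch, assuming the client decodes the index page as ISO-8859-1; the file name is a made-up example and not taken from an actual listing.

```python
# Hypothetical file name; any non-ASCII name shows the same effect.
name = "Bj\u00f6rn.txt"

# Bytes as stored on a UTF-8 filesystem. The sketch assumes they reach
# the client essentially unchanged inside the generated index page.
raw = name.encode("utf-8")

# What a client shows when it decodes those bytes as ISO-8859-1:
# the two-byte sequence for "o umlaut" turns into two separate characters.
as_latin1 = raw.decode("iso-8859-1")

# What a client shows when the response is labelled charset=utf-8.
as_utf8 = raw.decode("utf-8")

print(as_latin1)
print(as_utf8)
```

The first line printed is the garbled form, the second the intended name.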

I found this problem in SuSE 10.0 Final (package apache2-2.0.54-10.3) and SuSE 10.1 Beta5 (package apache2-2.2.0-8). 

There is a preprocessor macro APR_HAS_UNICODE_FS in Apache 2.0 and 2.2. It is currently defined as '0' but can be set to '1' in httpd-2.0.54/srclib/apr/include/apr.h.

There is no file 'apr.h' in package apache2-2.2.0-8. This is because this package depends on a separate APR package (libapr1-1.2.2-4). libapr1 contains the file apr.h.

I have not tested the setting "#define APR_HAS_UNICODE_FS 1".

But probably it works fine. Take a look at this code from httpd-2.2.0/modules/generators/mod_autoindex.c:

#if APR_HAS_UNICODE_FS
    ap_set_content_type(r, "text/html;charset=utf-8");
#else
    ap_set_content_type(r, "text/html");
#endif
Comment 1 Peter Poeml 2006-03-02 15:13:47 UTC
This is making an assumption:
That filenames are in fact encoded in UTF-8.

The filesystem has been able to handle those for a long time, but that
doesn't mean that they are used. It is still possible to have filenames
in iso-8859-* or other encodings on disk.

It can be solved by appropriate configuration: add 
AddDefaultCharset UTF-8
for instance in Directory context.

Am I missing something?
Comment 2 Björn Voigt 2006-03-08 12:39:59 UTC
Adding "AddDefaultCharset UTF-8" solves my filename problem, but with an annoying side effect: with this setting, all HTML files (those without an explicit charset in the HTML header) and all TXT files are also delivered with charset UTF-8 in the HTTP header. Since the standard encoding for HTML is ISO-8859-1 and most people write HTML in ISO-8859-1, it's likely that SuSE users:
- write HTML files in ISO-8859-1
- write filenames in UTF-8 (standard locale setting)

Unfortunately I must decide between two bad alternatives with the current apache2 packages:

Alternative 1 (with "AddDefaultCharset UTF-8"):
Correct display of UTF-8 file names in directory indexes and incorrect display of ISO-8859-1 HTML files.

Alternative 2 (standard settings):
Correct display of ISO-8859-1 HTML files and incorrect display of UTF-8 file names.

Adding both charsets like 

AddDefaultCharset UTF-8
AddDefaultCharset ISO-8859-1

does not work (the last setting has precedence).

I still think that changing the Apache build configuration ("#define APR_HAS_UNICODE_FS 1") is better for most SuSE users than the bad alternatives above.

Of course, I would welcome a better solution within the Apache code. For instance, it would be a good idea for the Apache developers to add a directive like "AddDirectoryIndexCharset UTF-8" and/or auto-detection of the filesystem charset.
Comment 3 Peter Poeml 2006-05-10 16:14:41 UTC
Mike, would you be so kind as to comment? What do you think?
Comment 4 Mike Fabian 2006-10-27 13:59:21 UTC
I dislike "AddDefaultCharset UTF-8" very much because this makes
it impossible to write any HTML pages in other encodings.

Even if the HTML pages use a charset setting in the HTML header,
for example:

    <html>
      <head> <title> </title>
        <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=big5">
      </head>

this won't work because the charset from the HTTP header is always
preferred.
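
The precedence rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not of any particular browser; the function names `header_charset` and `effective_charset` are hypothetical.

```python
from email.message import Message


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


def effective_charset(content_type, meta_charset):
    """The HTTP header charset wins; the META charset is only a fallback."""
    return header_charset(content_type) or meta_charset


# With AddDefaultCharset UTF-8 the META tag is ignored.
print(effective_charset("text/html; charset=UTF-8", "big5"))
# With AddDefaultCharset Off the META tag decides.
print(effective_charset("text/html", "big5"))
```

The first call yields the header charset, the second falls back to the charset from the META tag.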

Therefore I always use "AddDefaultCharset Off" in my apache2 setup
and make sure that all pages specify the charset in the HTML header.

Of course I prefer HTML pages in UTF-8 and most of the pages I write
are in UTF-8. But using other encodings for special pages should be
possible. At least I need it for some test pages. If the HTTP
header says "UTF-8" and this overrides everything, I cannot even
have a single test page in a different encoding.

Therefore I agree with Björn that directory indices should get
special treatment and should be treated as UTF-8 on SuSE Linux.

As Peter says, other encodings are possible in filenames, but
we have been using UTF-8 as the default in SuSE Linux for a long
time already, therefore one should assume that file names are in
UTF-8. File names which are not should be converted.

Having file names in mixed encodings is asking for trouble. Web
pages in different encodings can have a charset header, but
file names cannot carry a tag which says which encoding is used
in the file name.

Björn also suggested that auto detection of the filesystem charset
might be a good idea. I don't think this will really work as
auto detection of legacy encodings is a difficult problem which
can never work in all cases. One can only use heuristics which
work sometimes but not always.

UTF-8 is easily auto-detectable though, therefore Apache could send
UTF-8 in the HTTP header for a directory index if all files in this
directory have UTF-8 encoded file names and fall back to "something else"
if some files in it do not have UTF-8 encoded file names.

I like the idea of "AddDirectoryIndexCharset something". This could be used
together with UTF-8 auto-detection to specify the "something else"
which should be used if not all file names are UTF-8 encoded.

As it can be easily checked whether all file names are UTF-8 encoded or not,
it is probably a good idea to let apache do that and always send UTF-8 in
the HTTP header in that case, ignoring "AddDirectoryIndexCharset ...".
"AddDirectoryIndexCharset ..." would then only specify the charset to be used
if the auto-detection finds that the directory contains file names which
are not UTF-8 encoded. Probably one should change the
name a bit though, e.g. "AddDirectoryIndexCharsetFallback ...".
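
The check described above could look roughly like the following Python sketch. It is not Apache code; the function names and the `fallback` parameter (standing in for the proposed "AddDirectoryIndexCharsetFallback") are hypothetical, and the default fallback value is just an assumption.

```python
import os


def is_utf8(raw):
    """Return True if the byte string is valid UTF-8 (pure ASCII counts)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def index_charset(names, fallback="ISO-8859-1"):
    """Pick the charset for a directory index from raw file names.

    UTF-8 is chosen only if every name is valid UTF-8; otherwise the
    configured fallback is used for the whole listing.
    """
    return "UTF-8" if all(is_utf8(n) for n in names) else fallback


def index_charset_for_dir(path, fallback="ISO-8859-1"):
    """Apply index_charset() to the entries of a real directory."""
    # Passing bytes to os.listdir() returns the names as raw bytes,
    # so no decoding happens before the check.
    return index_charset(os.listdir(os.fsencode(path)), fallback)


print(index_charset([b"Bj\xc3\xb6rn.txt", b"readme"]))  # all names valid UTF-8
print(index_charset([b"Bj\xf6rn.txt", b"readme"]))      # one ISO-8859-1 name
```

A single file name that is not valid UTF-8 switches the whole listing to the fallback, which matches the "all files in this directory" condition above.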

Comment 5 Peter Poeml 2006-12-20 16:55:34 UTC
Thank you (both of you) for the detailed comments.

The APR_HAS_UNICODE_FS in libapr1 is a Windows-only thing.

mod_autoindex is the only place where httpd uses the define.

Would it help to first make "text/html;charset=utf-8" the default,
and then try to add AddDirectoryIndexCharset upstream?
I think so.
Comment 6 Peter Poeml 2006-12-21 09:43:52 UTC
I am adding the attached patch to the build service apache2 packages,
which sets the default to utf-8, and at the same time adds a directive
AddDirectoryIndexCharset which allows overriding the default.
It is maybe not enough for upstream inclusion yet, because I need to
find out about possible platforms where this (change of) behaviour does
not make sense. I need to think about this further and discuss it with
upstream.
But I think that the behaviour is what you want now (short of
auto-detection), so if you could try out the packages from 
http://software.opensuse.org/download/Apache/ and provide feedback I
would be most grateful.
Comment 7 Peter Poeml 2006-12-21 09:44:47 UTC
Created attachment 110653 [details]
the patch which I mentioned
Comment 8 Peter Poeml 2007-03-14 10:49:07 UTC
status: need to submit patch again with documentation patch
Comment 9 Peter Poeml 2007-03-23 06:43:39 UTC
status: patch with documentation patch sent upstream
Comment 10 Peter Poeml 2007-04-25 18:05:26 UTC
No news from upstream.
We might consider disabling the patch for now, until it is included
upstream, so we don't become incompatible at some point.

BTW, the bug I opened at apache.org is this one:
http://issues.apache.org/bugzilla/show_bug.cgi?id=42105
Comment 11 Peter Poeml 2007-07-13 11:13:58 UTC
There is no news on this.
Comment 12 Peter Poeml 2007-08-30 23:08:13 UTC
News!




From: bugzilla@apache.org
To: poeml@suse.de
Subject: DO NOT REPLY [Bug 42105]  - Patch for mod_autoindex to set the character set
Date: Thu, 30 Aug 2007 14:45:51 -0700 (PDT)

http://issues.apache.org/bugzilla/show_bug.cgi?id=42105





------- Additional Comments From wrowe@apache.org  2007-08-30 14:45 -------
Something similar was created to add IndexOptions Type=content/type Charset=foo

and will be available in the next 2.0 and 2.2 releases of httpd.

We are a bit premature to presume a utf-8 on unix-ish systems, because by
definition they are bytestreams.  But that said, OS/X made it explicit that
filenames are UTF-8, so we follow your suggestion on at least one 'unix' :)

Thank you for your report!




Comment 13 Peter Poeml 2007-08-31 09:07:45 UTC
Bill talks about a patch from trunk which has been backported to the
2.2.x and 2.0.x branches.

+  *) mod_autoindex: Add in ContentType and Charset options to
+     IndexOptions directive. This allows the admin to explicitly
+     set the content-type and charset of the generated page.
+     [Jim Jagielski]

The 2.2.x backport is here:

URL: http://svn.apache.org/viewvc?rev=570962&view=rev
Log:
Merge r570532, r570535, r570558 from trunk:
IndexOptions ContentType=text/html Charset=UTF-8 magic.


I strongly suggest dropping my patch
httpd-2.2.3-AddDirectoryIndexCharset.patch before the 10.3 release, because
it adds a new directive AddDirectoryIndexCharset which is not going to
go upstream.  Since the problem was solved in a different way upstream,
we would become incompatible.

Comment 14 Peter Poeml 2007-08-31 09:12:27 UTC
Christoph, please consider this for 10.3. This long-standing issue
spontaneously resolved itself last night. It would be good if we dropped my
patch and added the 2.2 backport instead, which is going to ship with
2.2.6 once that is released. 

We should drop my patch at any rate. Adding the other one instead would
be cool so that there is a replacement for the functionality, but it's
not obligatory that we do so...
Comment 15 Christoph Thiel 2007-08-31 09:17:07 UTC
Sure, go for it.
Comment 16 Peter Poeml 2007-08-31 11:01:12 UTC
I submitted a fixed package to Factory.