Bug 98496 - hypermail encoding problems
Summary: hypermail encoding problems
Status: RESOLVED FIXED
Alias: None
Product: openSUSE 10.3
Classification: openSUSE
Component: Other
Version: Alpha 0plus
Hardware: All All
Importance: P5 - None : Normal
Target Milestone: ---
Assignee: Hendrik Vogelsang
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-07-26 13:52 UTC by Mike Fabian
Modified: 2007-02-15 12:03 UTC

See Also:
Found By: Development
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
mbox (334.81 KB, application/octet-stream)
2005-07-26 16:56 UTC, Mike Fabian
64bit.patch (3.21 KB, patch)
2007-02-01 16:12 UTC, Mike Fabian

Description Mike Fabian 2005-07-26 13:52:11 UTC
Hypermail 2.1.8 produces garbled output when a mail archive contains
mails whose subjects use different encodings.

I'll attach sample files and screen shots to reproduce the problem.
Comment 1 Mike Fabian 2005-07-26 13:54:29 UTC
Reassigned to Anna Bernathova <anicka@novell.com>, maintainer of
our hypermail package.
Comment 2 Mike Fabian 2005-07-26 14:49:56 UTC
As an example of the bug, see:

http://lists.suse.com/archive/suse-linux-ja/2005-Jul/

You can see that the subjects of the first 5 mails on that page are
displayed correctly (in UTF-8). But from the 6th mail on, all subjects
are garbled. This is because the encoding of the subjects of these
mails is ISO-2022-JP.

That means the generated .html page contains text in two different
encodings, which cannot work because a browser decodes the whole page
with a single encoding.

You can see that if you switch the encoding of your browser manually
from UTF-8 to ISO-2022-JP and back. When using UTF-8, you see the
subjects at the top correctly, when using ISO-2022-JP, you see the
mails at the bottom correctly. But never both.
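
The effect can be reproduced outside hypermail. The following is a
minimal Python sketch, not hypermail code; the sample subject and the
render() helper are made up for illustration. It builds a "page" with
one UTF-8 line and one ISO-2022-JP line and decodes the whole page with
a single charset, the way a browser does:

```python
# Illustration only: the sample subject and render() are hypothetical.
subject = "日本語の件名"

# One page containing the same subject in two different encodings.
page = subject.encode("utf-8") + b"\n" + subject.encode("iso2022_jp")

def render(page_bytes, browser_charset):
    """Decode the whole page with a single charset, as a browser does."""
    return page_bytes.decode(browser_charset, errors="replace").split("\n")

as_utf8 = render(page, "utf-8")
as_jis = render(page, "iso2022_jp")

# For each charset setting, report which of the two lines came out intact.
print("UTF-8 setting:      ", [line == subject for line in as_utf8])
print("ISO-2022-JP setting:", [line == subject for line in as_jis])
```

Under each setting only the line that was encoded in that charset
should compare equal to the original subject.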

Comment 3 Mike Fabian 2005-07-26 14:52:45 UTC
When looking at the sources of hypermail 2.1.8 and 2.2.0, I found that
there is very little support for different encodings, basically only
for ISO-8859-1 and ISO-2022-JP. And even that appears to be partly broken.

Comment 4 Mike Fabian 2005-07-26 16:54:40 UTC
I updated hypermail to 2.2.0 in STABLE.

It doesn't fix the problem reported in this bug, but if we
try to fix it we should probably start with the latest version.
Comment 5 Mike Fabian 2005-07-26 16:56:21 UTC
Created attachment 43434
mbox

Unix mailbox file for testing hypermail and reproducing the bug.
Comment 6 Mike Fabian 2005-07-26 16:58:04 UTC
A test archive which shows the bug can be created as follows,
using the "mbox" file attached to comment #5:

mfabian@magellan:/tmp/hypermail-test$ hypermail -m mbox -d test-archive
mfabian@magellan:/tmp/hypermail-test$ ls test-archive/
0000.html  0013.html  0026.html  0039.html  0052.html  0065.html
0001.html  0014.html  0027.html  0040.html  0053.html  0066.html
0002.html  0015.html  0028.html  0041.html  0054.html  0067.html
0003.html  0016.html  0029.html  0042.html  0055.html  0068.html
0004.html  0017.html  0030.html  0043.html  0056.html  0069.html
0005.html  0018.html  0031.html  0044.html  0057.html  0070.html
0006.html  0019.html  0032.html  0045.html  0058.html  0071.html
0007.html  0020.html  0033.html  0046.html  0059.html  0072.html
0008.html  0021.html  0034.html  0047.html  0060.html  attachment.html
0009.html  0022.html  0035.html  0048.html  0061.html  author.html
0010.html  0023.html  0036.html  0049.html  0062.html  date.html
0011.html  0024.html  0037.html  0050.html  0063.html  index.html
0012.html  0025.html  0038.html  0051.html  0064.html  subject.html
mfabian@magellan:/tmp/hypermail-test$
Comment 7 Mike Fabian 2005-07-26 17:03:16 UTC
Now view the file "index.html" created as explained in comment #6 with
a browser, e.g. Firefox.

Switch the browser between UTF-8 encoding and ISO-2022-JP encoding.

When using UTF-8, the first 8 subjects are displayed correctly,
all other subjects are garbage.

When using ISO-2022-JP encoding, the subjects up to the 8th subject
are displayed as garbage and the subjects below the 8th subject
become partly readable.

Note the *partly*: even after switching the browser to ISO-2022-JP, the
ISO-2022-JP encoded subjects become only partly readable.

Comment 8 Mike Fabian 2005-07-26 17:16:06 UTC
The partly broken ISO-2022-JP can be fixed by using the option
"iso2022jp = On" in ~/.hmrc:

mfabian@magellan:~$ cat .hmrc
iso2022jp = On
mfabian@magellan:~$

That doesn't help when index.html contains a mixture of subjects with
different encodings.

When looking at the archive

http://lists.suse.com/archive/suse-linux-ja/

most months are OK because UTF-8 encoded subjects are still rare
in Japanese mails. The most common encoding for Japanese mails
is still ISO-2022-JP and for most months in that archive,
this happened to be the only encoding which was used in subjects.

But each time somebody uses a different encoding for a subject,
e.g. UTF-8, the archive for that month will end up at least partly
as garbage.
Comment 9 Mike Fabian 2005-07-26 17:18:43 UTC
I think the best fix is to improve hypermail to convert all the output
to UTF-8 *always*. That is the only way to get a single
target encoding for index.html, date.html, etc. even if many different
encodings are used in the subjects of the original mails.

Changing hypermail to do that does not appear to be really difficult,
but it is still a lot of tedious work.
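
To illustrate the idea (this is not hypermail's C code, only a sketch
using Python's standard email package; the helper name header_to_utf8
is an assumption), every RFC 2047 encoded header can be decoded to
Unicode first and then written out in one target encoding. The sample
value is the ISO-8859-4 encoded word found in the attached mbox by the
grep in comment 15:

```python
from email.header import decode_header, make_header

def header_to_utf8(raw_header):
    """Decode an RFC 2047 encoded header value and re-encode it as UTF-8."""
    # decode_header() splits the value into (bytes, charset) chunks;
    # make_header() joins the chunks into a single Unicode string.
    text = str(make_header(decode_header(raw_header)))
    return text.encode("utf-8")

# Whatever charset the encoded word declares, the result is UTF-8 bytes.
utf8_bytes = header_to_utf8("=?ISO-8859-4?Q?Sanj=F2?=")
print(utf8_bytes.decode("utf-8"))
```

Because every header passes through the same function, index.html and
the other index pages could then declare UTF-8 regardless of which
encodings the original mails used.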

Comment 10 Mike Fabian 2005-07-28 09:20:47 UTC
I reported the problem on the hypermail mailing list:

http://www.hypermail-project.org/archive/05/07/2715.html

and there was one response so far:

http://www.hypermail-project.org/archive/05/07/2716.html

Maybe I'll have to write a patch.
Comment 11 Anna Maresova 2005-08-11 14:24:55 UTC
What's the situation so far?

Are you going to write the patch? I have looked at it a bit and it seems to be
quite complicated.
Comment 12 Mike Fabian 2005-08-16 10:08:16 UTC
There was one response by Daigo Matsubara:

http://www.hypermail-project.org/archive/05/07/2719.html

From: Daigo Matsubara <daigo@w3.org>
Subject: Re: hypermail encoding problems
To: mfabian@suse.de
Cc: "Peter C. McCluskey" <pcm@rahul.net>,
	hypermail@hypermail-project.org, Daigo Matsubara <daigo@w3.org>
Date: Fri, 29 Jul 2005 12:51:19 +0900
Organization: World Wide Web Consortium/Keio University
Gnus-Warning: This is a duplicate of message <nm8iryu1d8o.wl@w3.mag.keio.ac.jp>
User-Agent: Wanderlust/2.10.1 (Watching The Wheels) SEMI/1.14.6 (Maruoka) FLIM/1.14.6 (Marutamachi) APEL/10.6 Emacs/21.4 (i386-pc-linux-gnu) MULE/5.0 (SAKAKI)
Content-Type: text/plain; charset=ISO-2022-JP


At Thu, 28 Jul 2005 11:15:02 +0200,
Mike FABIAN wrote:
> 
> "Peter C. McCluskey" <pcm@rahul.net> さんは書きました:
> 
> >  mfabian@suse.de (Mike FABIAN) writes:
> >>I think the best fix is to improve hypermail to convert all the output
> >>to UTF-8 *always*. That is the only possibility to get a single target
> >>encoding for index.html, date.html, etc. even if many different
> >>encodings are used in the subjects of the original mails.
> >>
> >>Are there any plans to add such support for UTF-8 to hypermail in the
> >>near future?
> >
> >  I suspect you are right about what should be done.
> >
> >  Daigo Matsubara <daigo@w3.org> has reportedly made some changes to
> > support UTF-8 which haven't been checked in yet. If he isn't addressing
> > the problem you mention, then it's unlikely that anyone has plans to
> > (in which case I would encourage you to submit a patch).
> 
> Matsubara San,
> 
> does your patch address the problem? If not I'll try to make a patch.

Hi Mike,

I had implemented roughly, it is working on my testbed.

My strategy is:

1) convert every headers to UTF-8 at first. I modified
   mdecodeRFC2047() to do it.

2) call print_main_header() with UTF-8 charset, then Hypermail outputs
   indexes in UTF-8.

3) in each message, each references (subjects of other messages in
   thread) are encoded in numeric reference. message body is not
   converted to UTF-8.

I'm still considering about 3). I was suggested by I18N experts to
make everything in UTF-8, but I'm wondering about that because we
still have a lot of software which is not UTF-8 friendly.

But, at least, it solves encoding issue you mentioned, I think.

I'm trying to show my code ASAP to have review. Ideas/thoughts are
welcome.

Thanks,
-- 
Daigo Matsubara / W3C Systems Team / mailto:daigo@w3.org
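
Point 3) of the mail above mentions encoding subjects as numeric
references. The sketch below only illustrates what such references look
like; it is not taken from Daigo Matsubara's patch, and the helper name
to_numeric_references is an assumption. The result is pure ASCII, so it
is valid inside a page of any ASCII-compatible encoding:

```python
import html

def to_numeric_references(text):
    """Replace every non-ASCII character with a decimal &#NNNN; reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

encoded = to_numeric_references("Re: 日本語")
print(encoded)                 # Re: &#26085;&#26412;&#35486;
print(html.unescape(encoded))  # Re: 日本語
```

Note that this helper does not escape "&", "<" or ">"; a real HTML
generator would have to do that separately.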

Comment 13 Mike Fabian 2005-08-16 10:09:28 UTC
I replied:

http://www.hypermail-project.org/archive/05/07/2720.html

From: Mike FABIAN <mfabian@suse.de>
Subject: Re: hypermail encoding problems
To: Daigo Matsubara <daigo@w3.org>
Cc: "Peter C. McCluskey" <pcm@rahul.net>,  hypermail@hypermail-project.org
Date: Fri, 29 Jul 2005 15:22:34 +0200
Reply-to: mfabian@suse.de
User-Agent: Gnus/5.1006 (Gnus v5.10.6) XEmacs/21.5 (corn, linux)
Content-Type: text/plain; charset=iso-2022-jp

Daigo Matsubara <daigo@w3.org> さんは書きました:

> I had implemented roughly, it is working on my testbed.
>
> My strategy is:
>
> 1) convert every headers to UTF-8 at first. I modified
>    mdecodeRFC2047() to do it.
>
> 2) call print_main_header() with UTF-8 charset, then Hypermail outputs
>    indexes in UTF-8.
>
> 3) in each message, each references (subjects of other messages in
>    thread) are encoded in numeric reference. message body is not
>    converted to UTF-8.
>
> I'm still considering about 3). I was suggested by I18N experts to
> make everything in UTF-8, but I'm wondering about that because we
> still have a lot of software which is not UTF-8 friendly.

In my experience, it is not much of a problem anymore, all common
browsers seem to support UTF-8 well enough.

> But, at least, it solves encoding issue you mentioned, I think.

> I'm trying to show my code ASAP to have review. Ideas/thoughts are
> welcome.

Yes, please post your code. I'd like to try it as well.

-- 
Mike FABIAN   <mfabian@suse.de>   http://www.suse.de/~mfabian
睡眠不足はいい仕事の敵だ。
Comment 14 Mike Fabian 2005-08-16 10:12:13 UTC
Anna> Are you going to write the patch? I have looked at it a bit and
Anna> it seems to be quite complicated.

Let's wait a while to see whether Daigo Matsubara posts his patch.
I don't want to duplicate his work.

Anyway I'm too busy now because of SuSE Linux 10.0.

After the SuSE Linux 10.0 release, if there is still no patch
available, I'll try to write a patch myself.

Comment 15 Mike Fabian 2007-01-31 19:48:44 UTC
There has been upstream work to fix the i18n problems which has been
committed to CVS, but there has been no new stable release yet.

Therefore I tried with a CVS snapshot from today.

Unfortunately, it dumps core when run on the test case described
in comments #5, #6, and #7:

mfabian@magellan:/tmp/bugzilla-98496$ gdb hypermail
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...
Using host libthread_db library "/lib64/libthread_db.so.1".
(gdb) run -m mbox -d test-archive
Starting program: /usr/bin/hypermail -m mbox -d test-archive

Program received signal SIGBUS, Bus error.
0x00002ba5cbeb4362 in __gconv_transform_internal_utf8 () from /lib64/libc.so.6
(gdb) bt
#0  0x00002ba5cbeb4362 in __gconv_transform_internal_utf8 () from /lib64/libc.so.6
#1  0x00002ba5cc3d47d3 in gconv () from /usr/lib64/gconv/ISO8859-4.so
#2  0x00002ba5cbead948 in __gconv () from /lib64/libc.so.6
#3  0x00002ba5cbeacecf in iconv () from /lib64/libc.so.6
#4  0x0000000000417cdd in i18n_convstring (string=0x672520 "",
    fromcharset=0x7fffdf4b4a40 "ISO-8859-4", tocharset=0x437a59 "UTF-8",
    len=0x7fffdf4b4f88) at string.c:135
#5  0x000000000040ba4a in parsemail (mbox=<value optimized out>,
    use_stdin=<value optimized out>, readone=0, increment=0,
    dir=0x668840 "test-archive/", inlinehtml=1, startnum=0) at parse.c:885
#6  0x0000000000405741 in main (argc=5, argv=0x0) at hypermail.c:644
#7  0x00002ba5cbeac944 in __libc_start_main () from /lib64/libc.so.6
#8  0x00000000004025d9 in _start ()
(gdb) quit
The program is running.  Exit anyway? (y or n) y
mfabian@magellan:/tmp/bugzilla-98496$ grep -ir iso-8859-4 .
./mbox: FLIM/1.14.7 (=?ISO-8859-4?Q?Sanj=F2?=) APEL/10.6 MULE XEmacs/21.5 (beta18)
./mbox: FLIM/1.14.7 (=?ISO-8859-4?Q?Sanj=F2?=) APEL/10.6 MULE XEmacs/21.5 (beta18)
mfabian@magellan:/tmp/bugzilla-98496$ 
Comment 16 Mike Fabian 2007-02-01 16:12:26 UTC
Created attachment 116862
64bit.patch

The crashes are caused by code which is not correct on 64-bit systems;
the main problem is the use of "int" where "size_t" should be used.

The patch I used to fix it is attached here.
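
As a general illustration of this class of bug (not a reproduction of
the attached patch): iconv() takes its byte counters as size_t
pointers, so handing it the address of an int makes it read and write
sizeof(size_t) bytes where only sizeof(int) bytes exist. The sketch
below mimics that with Python's ctypes; the Locals structure and the
function name are hypothetical:

```python
import ctypes

class Locals(ctypes.Structure):
    # Hypothetical layout: an int length followed by an unrelated int.
    _fields_ = [("length", ctypes.c_int), ("neighbor", ctypes.c_int)]

def store_via_size_t_pointer(value):
    """Write `value` through a size_t* that really points at an int."""
    frame = Locals(length=0, neighbor=0x1234)
    wrong = ctypes.cast(ctypes.pointer(frame), ctypes.POINTER(ctypes.c_size_t))
    wrong[0] = value  # writes sizeof(size_t) bytes, not sizeof(int)
    return frame.length, frame.neighbor

print(ctypes.sizeof(ctypes.c_int), ctypes.sizeof(ctypes.c_size_t))
print(store_via_size_t_pointer(5))
```

Where size_t is as wide as int, the neighboring field keeps its value;
where size_t is wider than int, as on x86_64, the write spills into the
neighbor. That is why such code can appear to work on 32-bit systems
and fail on 64-bit ones.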
Comment 17 Mike Fabian 2007-02-01 16:34:28 UTC
Fixed package submitted to STABLE and to the openSUSE build service.
Comment 18 Mike Fabian 2007-02-01 16:38:19 UTC
Moved the bug to openSUSE 10.3 to make it public.
Comment 19 Mike Fabian 2007-02-01 16:54:34 UTC
64-bit fix submitted upstream.
Comment 20 Mike Fabian 2007-02-01 16:59:15 UTC
Reassigned to Hendrik Vogelsang <hvogel@novell.com> to regenerate the broken
mail archives such as http://lists.suse.com/archive/suse-linux-ja/ if possible.
Comment 21 Hendrik Vogelsang 2007-02-15 12:00:19 UTC
The archives of suse-linux-ja are gone. The archives are now at lists.opensuse.org, which uses MHonArc instead of hypermail. Please check whether it's OK there.
Comment 22 Mike Fabian 2007-02-15 12:02:47 UTC
It is OK there. Thank you very much.
Comment 23 Mike Fabian 2007-02-15 12:03:23 UTC
Closing as FIXED.