Bugzilla – Bug 98496
hypermail encoding problems
Last modified: 2007-02-15 12:03:23 UTC
Hypermail 2.1.8 has many problems with mail archives when the archive contains mails with different subject encodings. I'll attach sample files and screen shots to reproduce the problem.
Reassigned to Anna Bernathova <anicka@novell.com>, maintainer of our hypermail package.
As an example of the bug, see: http://lists.suse.com/archive/suse-linux-ja/2005-Jul/ The subjects of the first 5 mails on that page are displayed correctly (in UTF-8), but from the 6th mail on, all subjects are garbled, because the subjects of these mails are encoded in ISO-2022-JP. The generated .html page therefore contains text in two different encodings, which cannot work, since a browser applies a single encoding to the whole page. You can see this by switching the encoding of your browser manually from UTF-8 to ISO-2022-JP and back: with UTF-8 the subjects at the top are displayed correctly, with ISO-2022-JP the subjects at the bottom are, but never both.
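The effect described above can be reproduced outside a browser. The following is only a minimal sketch: the sample subject and the function names are made up for illustration and are not taken from the archive or from hypermail. It builds a two-line "page" with the same subject in both encodings and decodes the whole page with one encoding at a time.

```python
# Hypothetical sample subject; not taken from the real archive.
SUBJECT = "日本語"


def build_mixed_page(subject: str) -> bytes:
    """Put the same subject on two lines: UTF-8 first, ISO-2022-JP second."""
    return subject.encode("utf-8") + b"\n" + subject.encode("iso-2022-jp")


def view_as(page: bytes, encoding: str) -> list[str]:
    """Decode the whole page with a single encoding, as a browser would."""
    return page.decode(encoding, errors="replace").split("\n")


page = build_mixed_page(SUBJECT)
as_utf8 = view_as(page, "utf-8")
as_jis = view_as(page, "iso-2022-jp")

# Only one of the two lines is readable under each encoding.
print("UTF-8 view:      ", [line == SUBJECT for line in as_utf8])
print("ISO-2022-JP view:", [line == SUBJECT for line in as_jis])
```

Whichever encoding is chosen, the line written in the other encoding comes out as escape sequences or replacement characters.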
When looking at the sources of hypermail 2.1.8 and 2.2.0, I found that there is very little support for different encodings, basically only for ISO-8859-1 and ISO-2022-JP. And even that appears to be partly broken.
I updated hypermail to 2.2.0 in STABLE. It doesn't fix the problem reported in this bug, but if we try to fix it we should probably start with the latest version.
Created attachment 43434: mbox
Unix mailbox file for testing hypermail and reproducing the bug.
A test archive which shows the bug can be created as follows, using the "mbox" file attached to comment #5:

mfabian@magellan:/tmp/hypermail-test$ hypermail -m mbox -d test-archive
mfabian@magellan:/tmp/hypermail-test$ ls test-archive/
0000.html  0013.html  0026.html  0039.html  0052.html  0065.html
0001.html  0014.html  0027.html  0040.html  0053.html  0066.html
0002.html  0015.html  0028.html  0041.html  0054.html  0067.html
0003.html  0016.html  0029.html  0042.html  0055.html  0068.html
0004.html  0017.html  0030.html  0043.html  0056.html  0069.html
0005.html  0018.html  0031.html  0044.html  0057.html  0070.html
0006.html  0019.html  0032.html  0045.html  0058.html  0071.html
0007.html  0020.html  0033.html  0046.html  0059.html  0072.html
0008.html  0021.html  0034.html  0047.html  0060.html  attachment.html
0009.html  0022.html  0035.html  0048.html  0061.html  author.html
0010.html  0023.html  0036.html  0049.html  0062.html  date.html
0011.html  0024.html  0037.html  0050.html  0063.html  index.html
0012.html  0025.html  0038.html  0051.html  0064.html  subject.html
mfabian@magellan:/tmp/hypermail-test$
Now view the file "index.html" created as explained in comment #6 with a browser, e.g. Firefox. Switch the browser between UTF-8 encoding and ISO-2022-JP encoding. When using UTF-8, the first 8 subjects are displayed correctly, all other subjects are garbage. When using ISO-2022-JP encoding, the subjects up to the 8th subject are displayed as garbage and the subjects below the 8th subject become partly readable. Note the *partly*, even if switching the browser to ISO-2022-JP, the ISO-2022-JP encoded subjects become only partly readable.
The partly broken ISO-2022-JP can be fixed by using the option "iso2022jp = On" in ~/.hmrc:

mfabian@magellan:~$ cat .hmrc
iso2022jp = On
mfabian@magellan:~$

That doesn't help when index.html contains a mixture of subjects with different encodings. When looking at the archive http://lists.suse.com/archive/suse-linux-ja/ most months are OK because UTF-8 encoded subjects are still rare in Japanese mails. The most common encoding for Japanese mails is still ISO-2022-JP and for most months in that archive, this happened to be the only encoding which was used in subjects. But each time somebody uses a different encoding for a subject, e.g. UTF-8, the archive for that month will end up at least partly as garbage.
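To find out whether a given month is affected, one could list the charsets declared in the RFC 2047 encoded words of the Subject headers. This is a rough sketch using Python's standard email.header module, not part of hypermail; the function name and the sample headers are invented for illustration.

```python
from email.header import decode_header


def subject_charsets(raw_subjects):
    """Return the set of charsets declared in RFC 2047 encoded words."""
    found = set()
    for raw in raw_subjects:
        for _chunk, charset in decode_header(raw):
            if charset:  # None means the chunk was not an encoded word
                found.add(charset.lower())
    return found


# Made-up sample headers; the two encoded words both carry "日本語".
samples = [
    "=?UTF-8?B?5pel5pys6Kqe?=",
    "=?ISO-2022-JP?B?GyRCRnxLXDhsGyhC?=",
    "plain ASCII subject",
]

charsets = subject_charsets(samples)
print(sorted(charsets))
if len(charsets) > 1:
    print("mixed subject encodings: the index page would be partly garbled")
```

For a real archive the raw Subject values could be read with the standard mailbox.mbox class, skipping messages that have no Subject header.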
I think the best fix is to improve hypermail to convert all the output to UTF-8 *always*. That is the only possibility to get a single target encoding for index.html, date.html, etc. even if many different encodings are used in the subjects of the original mails. Changing hypermail to do that appears to be not really difficult but still a lot of tedious work.
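The idea can be sketched in a few lines of Python. This only illustrates the approach and is not hypermail code: decode_subject, encoded_word and render_index are hypothetical names, and the sample subjects are made up. Each subject is decoded from whatever charset it declares into Unicode, and the index page is then written in UTF-8 only.

```python
import base64
import html
from email.header import decode_header


def decode_subject(raw: str) -> str:
    """Decode an RFC 2047 header value into a Unicode string."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Assumption: chunks without a declared charset are ASCII.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(chunk)
    return "".join(parts)


def encoded_word(text: str, charset: str) -> str:
    """Hypothetical helper: build a base64 encoded word as demo input."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{payload}?="


def render_index(raw_subjects) -> bytes:
    """Render a tiny index page whose bytes are UTF-8 throughout."""
    items = "\n".join(
        f"<li>{html.escape(decode_subject(s))}</li>" for s in raw_subjects
    )
    page = f'<meta charset="utf-8">\n<ul>\n{items}\n</ul>\n'
    return page.encode("utf-8")


raw = [encoded_word("日本語", "utf-8"), encoded_word("日本語", "iso-2022-jp")]
page = render_index(raw)
print(page.decode("utf-8"))
```

Both subjects end up readable on the same page because the source charset only matters while decoding. A charset name that Python does not know would raise LookupError here, so a real implementation would need a fallback.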
I reported the problem on the hypermail mailing list: http://www.hypermail-project.org/archive/05/07/2715.html and there was one response so far: http://www.hypermail-project.org/archive/05/07/2716.html Maybe I'll have to write a patch.
What's the situation so far? Are you going to write the patch? I have looked at it a bit and it seems to be quite complicated.
There was one response by Daigo Matsubara: http://www.hypermail-project.org/archive/05/07/2719.html

From: Daigo Matsubara <daigo@w3.org>
Subject: Re: hypermail encoding problems
To: mfabian@suse.de
Cc: "Peter C. McCluskey" <pcm@rahul.net>, hypermail@hypermail-project.org, Daigo Matsubara <daigo@w3.org>
Date: Fri, 29 Jul 2005 12:51:19 +0900
Organization: World Wide Web Consortium/Keio University
Gnus-Warning: This is a duplicate of message <nm8iryu1d8o.wl@w3.mag.keio.ac.jp>
User-Agent: Wanderlust/2.10.1 (Watching The Wheels) SEMI/1.14.6 (Maruoka) FLIM/1.14.6 (Marutamachi) APEL/10.6 Emacs/21.4 (i386-pc-linux-gnu) MULE/5.0 (SAKAKI)
Content-Type: text/plain; charset=ISO-2022-JP

At Thu, 28 Jul 2005 11:15:02 +0200, Mike FABIAN wrote:
>
> "Peter C. McCluskey" <pcm@rahul.net> さんは書きました:
>
> > mfabian@suse.de (Mike FABIAN) writes:
> >>I think the best fix is to improve hypermail to convert all the output
> >>to UTF-8 *always*. That is the only possibility to get a single target
> >>encoding for index.html, date.html, etc. even if many different
> >>encodings are used in the subjects of the original mails.
> >>
> >>Are there any plans to add such support for UTF-8 to hypermail in the
> >>near future?
> >
> > I suspect you are right about what should be done.
> >
> > Daigo Matsubara <daigo@w3.org> has reportedly made some changes to
> > support UTF-8 which haven't been checked in yet. If he isn't addressing
> > the problem you mention, then it's unlikely that anyone has plans to
> > (in which case I would encourage you to submit a patch).
>
> Matsubara San,
>
> does your patch address the problem? If not I'll try to make a patch.

Hi Mike,

I had implemented roughly, it is working on my testbed.

My strategy is:

1) convert every headers to UTF-8 at first. I modified
   mdecodeRFC2047() to do it.

2) call print_main_header() with UTF-8 charset, then Hypermail outputs
   indexes in UTF-8.

3) in each message, each references (subjects of other messages in
   thread) are encoded in numeric reference. message body is not
   converted to UTF-8.

I'm still considering about 3). I was suggested by I18N experts to
make everything in UTF-8, but I'm wondering about that because we
still have a lot of software which is not UTF-8 friendly.

But, at least, it solves encoding issue you mentioned, I think.
I'm trying to show my code ASAP to have review. Ideas/thoughts are
welcome.

Thanks,
-- 
Daigo Matsubara / W3C Systems Team / mailto:daigo@w3.org
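Point 3) of the strategy quoted above relies on numeric character references, which are plain ASCII and therefore display the same whatever charset the surrounding page declares. A small Python sketch of that idea follows; the function name and the sample subject are made up, and this is not Daigo Matsubara's code.

```python
import html


def to_numeric_refs(text: str) -> str:
    """Escape HTML specials, then turn non-ASCII into &#NNNN; references."""
    escaped = html.escape(text)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


ref = to_numeric_refs("Re: 日本語 <test>")
print(ref)                 # Re: &#26085;&#26412;&#35486; &lt;test&gt;
print(html.unescape(ref))  # Re: 日本語 <test>
```

The price is larger and less readable HTML source, which may be why the proposal uses it only for the cross-references inside each message.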
I replied: http://www.hypermail-project.org/archive/05/07/2720.html

From: Mike FABIAN <mfabian@suse.de>
Subject: Re: hypermail encoding problems
To: Daigo Matsubara <daigo@w3.org>
Cc: "Peter C. McCluskey" <pcm@rahul.net>, hypermail@hypermail-project.org
Date: Fri, 29 Jul 2005 15:22:34 +0200
Reply-to: mfabian@suse.de
User-Agent: Gnus/5.1006 (Gnus v5.10.6) XEmacs/21.5 (corn, linux)
Content-Type: text/plain; charset=iso-2022-jp

Daigo Matsubara <daigo@w3.org> さんは書きました:

> I had implemented roughly, it is working on my testbed.
>
> My strategy is:
>
> 1) convert every headers to UTF-8 at first. I modified
>    mdecodeRFC2047() to do it.
>
> 2) call print_main_header() with UTF-8 charset, then Hypermail outputs
>    indexes in UTF-8.
>
> 3) in each message, each references (subjects of other messages in
>    thread) are encoded in numeric reference. message body is not
>    converted to UTF-8.
>
> I'm still considering about 3). I was suggested by I18N experts to
> make everything in UTF-8, but I'm wondering about that because we
> still have a lot of software which is not UTF-8 friendly.

In my experience, it is not much of a problem anymore, all common
browsers seem to support UTF-8 well enough.

> But, at least, it solves encoding issue you mentioned, I think.
> I'm trying to show my code ASAP to have review. Ideas/thoughts are
> welcome.

Yes, please post your code. I'd like to try it as well.

-- 
Mike FABIAN <mfabian@suse.de> http://www.suse.de/~mfabian
睡眠不足はいい仕事の敵だ。
Anna> Are you going to write the patch? I have looked at it a bit and
Anna> it seems to be quite complicated.

Let's wait a while to see whether Daigo Matsubara posts his patch. I don't want to duplicate his work. Anyway, I'm too busy now because of SuSE Linux 10.0. After the SuSE Linux 10.0 release, if there is still no patch available, I'll try to write a patch myself.
There has been upstream work to fix the i18n problems which has been committed to CVS, but there has been no new stable release yet. Therefore I tried with a CVS snapshot from today. Unfortunately it dumps core when tried with the testcase I explained in comments #5, #6, and #7:

mfabian@magellan:/tmp/bugzilla-98496$ gdb hypermail
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...
Using host libthread_db library "/lib64/libthread_db.so.1".
(gdb) run -m mbox -d test-archive
Starting program: /usr/bin/hypermail -m mbox -d test-archive

Program received signal SIGBUS, Bus error.
0x00002ba5cbeb4362 in __gconv_transform_internal_utf8 () from /lib64/libc.so.6
(gdb) bt
#0 0x00002ba5cbeb4362 in __gconv_transform_internal_utf8 () from /lib64/libc.so.6
#1 0x00002ba5cc3d47d3 in gconv () from /usr/lib64/gconv/ISO8859-4.so
#2 0x00002ba5cbead948 in __gconv () from /lib64/libc.so.6
#3 0x00002ba5cbeacecf in iconv () from /lib64/libc.so.6
#4 0x0000000000417cdd in i18n_convstring (string=0x672520 "", fromcharset=0x7fffdf4b4a40 "ISO-8859-4", tocharset=0x437a59 "UTF-8", len=0x7fffdf4b4f88) at string.c:135
#5 0x000000000040ba4a in parsemail (mbox=<value optimized out>, use_stdin=<value optimized out>, readone=0, increment=0, dir=0x668840 "test-archive/", inlinehtml=1, startnum=0) at parse.c:885
#6 0x0000000000405741 in main (argc=5, argv=0x0) at hypermail.c:644
#7 0x00002ba5cbeac944 in __libc_start_main () from /lib64/libc.so.6
#8 0x00000000004025d9 in _start ()
(gdb) quit
The program is running. Exit anyway? (y or n) y
mfabian@magellan:/tmp/bugzilla-98496$ grep -ir iso-8859-4 .
./mbox: FLIM/1.14.7 (=?ISO-8859-4?Q?Sanj=F2?=) APEL/10.6 MULE XEmacs/21.5 (beta18)
./mbox: FLIM/1.14.7 (=?ISO-8859-4?Q?Sanj=F2?=) APEL/10.6 MULE XEmacs/21.5 (beta18)
mfabian@magellan:/tmp/bugzilla-98496$
Created attachment 116862: 64bit.patch
The crashes are caused by code which is not correct on 64-bit systems; the main problem is the use of "int" where "size_t" should be used. The patch I used to fix it is attached here.
Fixed package submitted to STABLE and to the openSUSE build service.
Moved the bug to openSUSE 10.3 to make it public.
64 bit fix submitted upstream.
Reassign to Hendrik Vogelsang <hvogel@novell.com> to re-generate the broken mail archives like http://lists.suse.com/archive/suse-linux-ja/ if possible.
The archives on suse-linux-ja are gone. Archives are now at lists.opensuse.org, which does not use hypermail but mhonarc. Please have a look whether it's OK there.
It is OK there. Thank you very much.
Closing as FIXED.