Bug 78338 - "~" U+FF5E doesn't display with konqueror text area on KDE.
Status: RESOLVED WONTFIX
Alias: None
Product: openSUSE 10.2
Classification: openSUSE
Component: X11 Applications
Version: Final
Hardware: Other All
Importance: P5 - None : Normal
Target Milestone: ---
Assignee: Mike Fabian
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-04-18 09:55 UTC by Shinkichi Yamazaki
Modified: 2008-12-10 08:44 UTC

See Also:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Testing image using Firefox and Konqueror (96.70 KB, image/x-png)
2005-04-18 09:57 UTC, Shinkichi Yamazaki
U+301C and U+FF5E font shape image (1.53 KB, image/x-png)
2005-04-19 05:58 UTC, Shinkichi Yamazaki

Description Shinkichi Yamazaki 2005-04-18 09:55:36 UTC
I will attach an image file that shows how to reproduce this bug.

The upper image is from Konqueror, the lower one is from Firefox, both in a
KDE session. The Japanese candidate window is opened with Anthy, and the
selected Japanese character in the candidate window is "~" U+FF5E in both the
Konqueror and the Firefox image.
 1. The "~" U+FF5E should appear in the Anthy Japanese candidate window.
    Note: This behavior appears only in a KDE session, not in a Gnome
session.


The Japanese character "~" U+FF5E selected in the Anthy candidate window
should be displayed in the target text area after it is committed from the
candidate window.
  2. In Firefox, the committed "~" U+FF5E is displayed in the textarea;
     this is the expected behavior. In Konqueror, however, "~" U+FF5E is
     not displayed in the textarea.


I suspect the issue is in the font code mapping of Qt or in the font itself.
Comment 1 Shinkichi Yamazaki 2005-04-18 09:57:08 UTC
Created attachment 34676 [details]
Testing image using Firefox and Konqueror
Comment 2 Mike Fabian 2005-04-18 10:29:04 UTC
Are you really *sure* that the character in question is "~" U+FF5E?

Could it be that it is "〜" U+301C?

"〜" U+301C is missing in IPAGothic but available in "Sazanami Gothic".

Qt3 currently can only use one font for a certain region. GTK2, however,
can fall back to other fonts for single glyphs.

I guess that is what we are seeing here:

Qt3 uses *only* IPAGothic which lacks "〜" U+301C.
GTK2 uses IPAGothic plus some other font like "Sazanami Gothic"
for the glyphs which are missing in IPAGothic.
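
The difference between the two toolkits can be pictured with a toy model. This is only a sketch: it does not use Qt or GTK code, the function names are made up for illustration, and the coverage sets are hypothetical apart from the one fact stated above (IPAGothic lacks U+301C, "Sazanami Gothic" has it).

```python
# Toy model of the two font-selection strategies described above.
# Coverage sets are hypothetical; only the U+301C entries follow the comment.
COVERAGE = {
    "IPAGothic": {"\u304b", "\u3089"},                  # lacks U+301C
    "Sazanami Gothic": {"\u304b", "\u3089", "\u301c"},  # has U+301C
}

def render_single_font(text, font):
    """One font for the whole run: a missing glyph stays missing (None)."""
    return [(ch, font if ch in COVERAGE[font] else None) for ch in text]

def render_with_fallback(text, fonts):
    """Per-glyph fallback: use the first listed font that covers each character."""
    result = []
    for ch in text:
        chosen = next((f for f in fonts if ch in COVERAGE[f]), None)
        result.append((ch, chosen))
    return result

text = "\u304b\u3089\u301c"  # "kara" in hiragana followed by U+301C
single = render_single_font(text, "IPAGothic")
fallback = render_with_fallback(text, ["IPAGothic", "Sazanami Gothic"])
print(single)
print(fallback)
```

In the single-font case the last character has no glyph at all, while the fallback case picks it up from the second font and keeps IPAGothic for everything else.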
Comment 3 Shinkichi Yamazaki 2005-04-18 11:22:47 UTC
"~" U+FF5E appears in the Japanese candidate window when I type 'kara' after
activating Japanese input mode with 'ctrl+space'. That is how I got "~" U+FF5E
displayed. "〜" U+301C does not appear when typing 'kara'.

In Firefox, when I copy "〜" U+301C and paste it into gnome-terminal, a glyph
with a shape similar to "~" U+FF5E is displayed, but it is not actually U+FF5E.
I guess it is a glyph from a fallback font, that is, from a different font.

When viewing this Bugzilla entry with Konqueror, "〜" U+301C is missing.

Displaying the HTML content of your comment and entering text into a text area
as in my comment might fail for different reasons.


Comment 4 Mike Fabian 2005-04-18 12:43:40 UTC
syamazaki> "~" U+FF5E appears in Japanese candidate window by typing
syamazaki> 'kara' after activate Japanese input mode with
syamazaki> 'ctrl+space'. then I made display "~" U+FF5E like as the
syamazaki> way.  "〜" U+301C does not appear with 'kara' typing.

I get "〜" U+301C when typing 'kara' with scim-anthy.
And I also get "〜" U+301C when typing '~' with scim-anthy.

Are you really sure this is different for you?

I cannot understand why scim-anthy should behave differently for you and
me.

Comment 5 Mike Fabian 2005-04-18 12:45:28 UTC
syamazaki> Seeing this bugzilla ID with Konqueror, "〜" U+301C will missing.

Yes, because IPAGothic doesn't have that character.
And, contrary to GTK2, Qt3 cannot use a fallback font for a single glyph.
Comment 6 Shinkichi Yamazaki 2005-04-19 05:57:25 UTC
I just noticed that the handling of U+301C and U+FF5E by Anthy and Wnn8 input
is interesting. I think typing 'kara' in Japanese should actually produce
U+FF5E, but both Anthy and Wnn8 produce U+301C when 'kara' is typed.
I am not sure whether this is also affected by the fallback font settings on
KDE and Gnome, but the behavior when typing 'kara' with Anthy and Wnn8 is the
same in both KDE and Gnome sessions: Anthy and Wnn8 produce U+301C.

Regarding the glyphs: the U+301C glyph is not used; the U+FF5E glyph is used
in its place. I will attach an image of the U+301C and U+FF5E glyph shapes.


I suppose that uninstalling the IPA font is one workaround for this issue on
KDE at the moment.
Comment 7 Shinkichi Yamazaki 2005-04-19 05:58:55 UTC
Created attachment 34767 [details]
U+301C and U+FF5E font shape image
Comment 8 Mike Fabian 2005-04-19 09:58:49 UTC
Whether the glyphs look like those in your attachment in comment #7 seems
to depend on the font used.

"Sazanami Gothic", "Gnu Unifont":

    U+301C and U+FF5E look very similar, both have a hill at
    the left and a valley at the right.

"MS Gothic":

    U+301C and U+FF5E differ, U+301C has a valley at the left
    and a hill at the right, which is just the opposite of
    U+FF5E.

When checking the sample glyphs in the Unicode book, I found that they
look more like those in "MS Gothic". That means the glyphs for U+301C in
"Sazanami Gothic" and "Gnu Unifont" are probably not good.

Comment 9 Mike Fabian 2005-04-19 10:24:23 UTC
syamazaki> I just noticed that the handling U+301C and U+FF5E from
syamazaki> Anthy and Wnn8 input are interesting, I think Japanese
syamazaki> typing 'kara' should address U+FF5E actually, but U+301C is
syamazaki> applied from both Anthy and Wnn8 input if type 'kara'.

Same with Canna (checked with Canna via kinput2 and Canna via XEmacs)

If you are sure this is wrong, then maybe we should fix it
in the input methods?

But why do *all* input methods do it that way? Maybe it
is correct after all?

syamazaki> I am not sure this also the affection from fallback font
syamazaki> setting on KDE and Gnome.

What the input methods insert has nothing to do with the
fonts.

syamazaki> but the 'kara' typing behavior with Anthy and wnn8 are same
syamazaki> on both KDE and Gnome session, U+301C will be addressed
syamazaki> from Anthy and wnn8.

Yes, this has nothing to do with KDE, Gnome, or the fonts used.

syamazaki> Regarding the font glyph of U+301C is not used but U+FF5E
syamazaki> font glyph is used instead of U+301C font glyph. I will
syamazaki> attach U+301C and U+FF5E font shape image.

You mean that Gnome uses a glyph which looks like U+FF5E as a fallback
for the missing U+301C in IPAGothic? That is only because it
probably uses "Sazanami Gothic" as a fallback and U+FF5E and U+301C
look very similar in "Sazanami Gothic".

syamazaki> I suppose that uninstall IPA font is one of workaround for
syamazaki> this issue on KDE at this moment.

Yes, but that is not a nice workaround at all because then
"Sazanami Gothic" will be used for everything and "Sazanami Gothic"
is quite ugly.
Comment 10 Mike Fabian 2005-04-19 12:35:21 UTC
I talked with Iwai San during lunch and he probably had the right
idea:

The input methods all work in EUC-JP internally. The results are then
converted to Unicode/UTF-8.

Example: When looking into the source of the dictionary used by Anthy
(gcanna.ctd in the Anthy sources), I find the following entry:

から #T35*263 空 #KJ*255 〜 #M5r*135 絡 #KYmime*131 辛 #S5*130 枯ら #T35*124 殻 #M5*122 から #T35*115 唐 #KJ*10 唐 #KJ 辛 殼 縢

EUC-JP encoded of course.

The codepoint of this 〜 in EUC-JP is "A1C1".

Now when glibc/iconv converts this to Unicode, the resulting Unicode
code point is U+301C.

That's why all input methods output this character as U+301C when
running in a UTF-8 locale.

The question is now:

   Is it correct to convert A1C1 in EUC-JP to U+301C or should
   it rather be converted to U+FF5E instead?
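
The conversion described in this comment can be checked outside the input methods. The sketch below uses Python's built-in `euc_jp` codec as a stand-in; it is not glibc iconv, so it only shows that this codec's table makes the same choice for A1C1.

```python
import unicodedata

raw = bytes.fromhex("a1c1")     # the EUC-JP bytes quoted above
decoded = raw.decode("euc_jp")  # Python's own codec, not glibc iconv

# Print the code point and its Unicode character name.
for ch in decoded:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

With this codec the two bytes decode to a single character, U+301C WAVE DASH, which matches the behaviour described for glibc.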



Comment 11 Mike Fabian 2005-04-19 12:38:54 UTC
Adding Markus Kuhn to the CC: because he knows a lot about encodings.

Markus, do you have any idea whether the mapping of A1C1 in EUC-JP to
U+301C is correct or not?

Comment 13 Mike Fabian 2005-04-19 15:38:53 UTC
Some observations when converting this character with iconv to make the problem clearer:

mfabian@magellan:~$ perl -e 'binmode STDOUT, ":encoding(utf8)"; print "\x{301c}" ;' | iconv -f utf-8 -t euc-jp | hex
0000  a1 c1                                             ..
mfabian@magellan:~$ perl -e 'binmode STDOUT, ":encoding(utf8)"; print "\x{FF5e}" ;' | iconv -f utf-8 -t euc-jp | hex
0000  8f a2 b7
mfabian@magellan:~$

i.e. when converting from UTF-8 to EUC-JP with iconv, these characters are converted differently.

But when converting to EUC-JP-MS instead:

mfabian@magellan:~$ perl -e 'binmode STDOUT, ":encoding(utf8)"; print "\x{301c}" ;' | iconv -f utf-8 -t euc-jp-ms | hex
0000  a1 c1                                             ..
mfabian@magellan:~$ perl -e 'binmode STDOUT, ":encoding(utf8)"; print "\x{FF5E}" ;' | iconv -f utf-8 -t euc-jp-ms  | hex
0000  a1 c1                                             ..
mfabian@magellan:~$

Both Unicode characters are mapped to the same target code point.

That means converting back and forth between UTF-8 and EUC-JP-MS is not lossless, which you can
also see here:

mfabian@magellan:~$ perl -e 'binmode STDOUT, ":encoding(utf8)"; print "\x{FF5E}" ;' | iconv -f utf-8 -t euc-jp | iconv -f euc-jp-ms -t ucs-2be | hex
0000  ff 5e                                             .^
mfabian@magellan:~$ perl -e 'binmode STDOUT, ":encoding(utf8)"; print "\x{301C}" ;' | iconv -f utf-8 -t euc-jp | iconv -f euc-jp-ms -t ucs-2be | hex
0000  ff 5e                                             .^
mfabian@magellan:~$ 

But converting between UTF-8 and EUC-JP is lossless in both directions:

mfabian@magellan:~$ perl -e 'binmode STDOUT, ":encoding(utf8)"; print "\x{FF5E}" ;' | iconv -f utf-8 -t euc-jp | iconv -f euc-jp -t ucs-2be | hex
0000  ff 5e                                             .^
mfabian@magellan:~$ perl -e 'binmode STDOUT, ":encoding(utf8)"; print "\x{301C}" ;' | iconv -f utf-8 -t euc-jp | iconv -f euc-jp -t ucs-2be | hex
0000  30 1c                                             0.
mfabian@magellan:~$
Comment 14 Mike Fabian 2005-04-19 18:33:00 UTC
I have built experimental packages of scim-anthy to fix this problem.
They are here:

/work/built/mbuild/magellan-mfabian-396/9.3-i386/scim-anthy-0.3.1-3.1.i586.rpm
/work/built/mbuild/magellan-mfabian-396/9.3-i386/scim-anthy-0.3.1-3.1.src.rpm
/work/built/mbuild/magellan-mfabian-396/9.3-x86_64/scim-anthy-0.3.1-3.1.src.rpm
/work/built/mbuild/magellan-mfabian-396/9.3-x86_64/scim-anthy-0.3.1-3.1.x86_64.rpm

Please try!

As the changelog says

    - Bugzilla #78338:
      use conversion from EUC-JP-MS to instead of from EUC-JP.
      Fixes the problem reported in bug #78338 as long as SCIM is
      *not* used via XIM (i.e. via GTK_IM_MODULE, QT_IM_MODULE, or
      the SCIM support in mlterm (mlterm --im=scim).

these packages fix the problem only when scim-anthy is *not* used via XIM!

The patch I applied to scim-anthy looks like this:

diff -ru scim-anthy-0.3.1.orig/src/scim_anthy_imengine.cpp scim-anthy-0.3.1/src/scim_anthy_imengine.cpp
--- scim-anthy-0.3.1.orig/src/scim_anthy_imengine.cpp	2005-01-27 02:53:09.000000000 +0100
+++ scim-anthy-0.3.1/src/scim_anthy_imengine.cpp	2005-04-19 19:45:01.000000000 +0200
@@ -157,7 +157,7 @@
     if (lang.length () >= 2)
         set_languages (lang);
 
-    if (!m_iconv.set_encoding ("EUC-JP"))
+    if (!m_iconv.set_encoding ("EUC-JP-MS"))
         return;
 
     /* config */
diff -ru scim-anthy-0.3.1.orig/src/scim_anthy_preedit.cpp scim-anthy-0.3.1/src/scim_anthy_preedit.cpp
--- scim-anthy-0.3.1.orig/src/scim_anthy_preedit.cpp	2004-12-19 11:39:25.000000000 +0100
+++ scim-anthy-0.3.1/src/scim_anthy_preedit.cpp	2005-04-19 19:44:54.000000000 +0200
@@ -96,7 +96,7 @@
     anthy_context_set_encoding (m_anthy_context, ANTHY_EUC_JP_ENCODING);
 #endif /* HAS_ANTHY_CONTEXT_SET_ENCODING */
 
-    if (!m_iconv.set_encoding ("EUC-JP"))
+    if (!m_iconv.set_encoding ("EUC-JP-MS"))
         return;
 
     set_table (m_typing_method, m_period_style, m_comma_style, m_space_type);


This has the effect that U+FF5E is inserted and not U+301C when "kara"
is typed in scim-anthy.

If you like that change, I think we could apply a similar hack to X11
which would fix the problem for all Japanese input methods using XIM
(Wnn7/8, Canna+kinput2, ...)

I think this would be possible if we change the EUC-JP -> UTF-8
conversion in X11 to map the EUC-JP "a1c1" codepoint to U+FF5E.
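
The effect of such a remapping can be sketched in a few lines of Python. This is not how the scim-anthy patch works (the patch only changes the encoding name passed to iconv); the helper below is hypothetical and imitates the result for this one code point by decoding with Python's `euc_jp` codec and then substituting U+FF5E for U+301C.

```python
# Hypothetical helper: decode EUC-JP, then substitute FULLWIDTH TILDE for WAVE DASH.
WAVE_DASH_TO_FULLWIDTH_TILDE = str.maketrans({"\u301c": "\uff5e"})

def decode_euc_jp_ms_style(data: bytes) -> str:
    """Decode EUC-JP bytes, turning A1C1 into U+FF5E instead of U+301C."""
    return data.decode("euc_jp").translate(WAVE_DASH_TO_FULLWIDTH_TILDE)

# Hiragana "kara" followed by the A1C1 bytes from the dictionary entry.
candidate = "\u304b\u3089".encode("euc_jp") + bytes.fromhex("a1c1")
result = decode_euc_jp_ms_style(candidate)
print([f"U+{ord(c):04X}" for c in result])
```

All other characters pass through unchanged; only the byte pair A1C1 ends up as U+FF5E.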

Comment 15 Shinkichi Yamazaki 2005-04-20 06:41:36 UTC
I think mapping A1C1 in EUC-JP to U+301C is the correct and standard mapping.

I tried input with the updated experimental scim-anthy package.
Typing 'kara' with Anthy produced U+FF5E.
I think this result is natural, and it is nice for entering Japanese text.

But if the X11 mapping converted the "a1c1" codepoint to U+FF5E, it would
affect many applications that users have created in the past, so the X11
mapping should stay as it is.

However, it is slightly strange that typing 'kara' outputs U+301C.
Even if a U+301C glyph exists on the system and is displayed when 'kara' is
typed, that is also strange.
So I think changing the dictionary file so that 'kara' produces U+FF5E is a
nice solution for this issue.
Comment 16 Mike Fabian 2005-04-20 08:15:39 UTC
syamazaki> so I think dictionary file changing is nice solution for this issue, 
syamazaki> that 'kara' address U+FF5E.

It is not possible to change it in the dictionary file because the
dictionaries of all current Japanese input methods are EUC-JP
encoded. I believe it would be nicer if the dictionary files were
UTF-8 encoded because it would make things like this possible, but
currently they are not and this is not so easy to change.

Comment 17 Shinkichi Yamazaki 2005-04-20 08:42:34 UTC
How about mapping the "a1c1" codepoint to U+FF5E only in Anthy from now on?
If that is possible, I believe it would give users better usability.
I cannot think of a case where the U+301C glyph is required in Japanese typing.
Comment 18 Mike Fabian 2005-04-20 10:42:59 UTC
Which Unicode character is inserted when typing kara in Microsoft Windows?
Comment 19 Shinkichi Yamazaki 2005-04-20 11:01:32 UTC
On Windows XP, typing 'kara' inserts the Unicode character U+FF5E.
Comment 20 Mike Fabian 2005-04-20 12:32:40 UTC
Then we should probably try to make the Japanese input methods
on Linux behave the same way.

I don't yet know how best to do that. The experimental patch to
scim-anthy from comment #14 causes problems when running in ja_JP.eucJP
locale. I.e. this is not a perfect solution yet.

I need to investigate more.
Comment 21 Markus Kuhn 2005-04-25 13:25:47 UTC
The holy book of Unicode, Version 4.0, page 682, has this to say on U+301C WAVE
DASH:

"This character was encoded to match JIS C 6226-1978 1-33 'wave dash' [EUC 0xa1
0xc1]. Subsequent revisions of the JIS standard and industry practice have
settled on JIS 1-33 as being the full-width tilde character [Unicode U+FF5E]."

In the light of this, I guess there is an out-of-date mapping table being used
somewhere here. In particular the line

  0x8160 0x2141 0x301C # WAVE DASH

in

  http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT

may be the source of grief here, which is why the Unicode Consortium retired
this mapping table back in August 2001.

See also

  http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/ReadMe.txt
  http://www.debian.or.jp/~kubota/unicode-symbols.html
  http://www.debian.or.jp/~kubota/unicode-symbols-map2.html

for details and email addresses of experts.
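
For readers who want to see how the quoted table row relates to the EUC-JP bytes discussed earlier in this bug, here is a small sketch. It assumes the usual column order of JIS0208.TXT (Shift_JIS code, JIS X 0208 code, Unicode code point) and parses only the one row quoted above.

```python
line = "0x8160 0x2141 0x301C # WAVE DASH"  # table row quoted above

# Split the three hex columns from the trailing comment.
fields, _, comment = line.partition("#")
sjis, jis, ucs = (int(tok, 16) for tok in fields.split())

# EUC-JP encodes a JIS X 0208 code by setting the high bit of each byte.
euc = bytes([(jis >> 8) | 0x80, (jis & 0xFF) | 0x80])

print(f"Shift_JIS {sjis:04X}  JIS {jis:04X}  EUC-JP {euc.hex().upper()}  "
      f"U+{ucs:04X} {comment.strip()}")
```

The JIS code 0x2141 becomes the EUC-JP byte pair A1 C1, so this row is the one that decides what A1C1 turns into.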
Comment 22 Mike Fabian 2007-01-12 15:01:31 UTC
See also bug #233491 for another case where this buggy conversion in glibc
caused a problem.
Comment 23 Stephan Kulow 2007-09-29 09:07:02 UTC
moving forward a bit, but I assume this is rather a dishonest WONTFIX
Comment 24 Marcus Meissner 2008-12-10 08:44:39 UTC
10.2 has expired now, so let's close it as WONTFIX