Bugzilla – Full Text Bug Listing

| Summary: | "～" U+FF5E doesn't display with konqueror text area on KDE. | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE 10.2 | Reporter: | Shinkichi Yamazaki <shinkichi.yamazaki> |
| Component: | X11 Applications | Assignee: | Mike Fabian <mfabian> |
| Status: | RESOLVED WONTFIX | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P5 - None | CC: | lmuelle, mgk25, tiwai |
| Version: | Final | ||
| Target Milestone: | --- | ||
| Hardware: | Other | ||
| OS: | All | ||
| Whiteboard: | |||
| Found By: | Other | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | Testing image using Firefox and Konqueror; U+301C and U+FF5E font shape image | ||
Created attachment 34676
Testing image using Firefox and Konqueror

Are you really *sure* that the character in question is "～" U+FF5E? Could it be that it is "〜" U+301C? "〜" U+301C is missing in IPAGothic but available in "Sazanami Gothic". Qt3 currently can only use one font for a certain region. GTK2, however, can fall back to other fonts for single glyphs. I guess that is what we are seeing here: Qt3 uses *only* IPAGothic, which lacks "〜" U+301C. GTK2 uses IPAGothic plus some other font like "Sazanami Gothic" for the glyphs which are missing in IPAGothic.

"～" U+FF5E appears in the Japanese candidate window when typing 'kara' after activating Japanese input mode with 'ctrl+space'. That is how I got "～" U+FF5E displayed. "〜" U+301C does not appear when typing 'kara'.
On Firefox, when I try to display "〜" U+301C by copying and pasting it into gnome-terminal, a glyph with a shape similar to "～" U+FF5E is displayed, but it is not actually U+FF5E. I guess it comes from a fallback font of a different type.
When viewing this Bugzilla entry with Konqueror, "〜" U+301C is missing. The display problem in the HTML content of your comments and the problem with text entered into a text area from my comments might have different causes.

syamazaki> "～" U+FF5E appears in the Japanese candidate window when
syamazaki> typing 'kara' after activating Japanese input mode with
syamazaki> 'ctrl+space'. That is how I got "～" U+FF5E displayed.
syamazaki> "〜" U+301C does not appear when typing 'kara'.

I get "〜" U+301C when typing 'kara' with scim-anthy. And I also get "〜" U+301C when typing '~' with scim-anthy. Are you really sure this is different for you? I cannot understand why scim-anthy should behave differently for you and me.

syamazaki> When viewing this Bugzilla entry with Konqueror, "〜" U+301C is missing.

Yes, because IPAGothic doesn't have that character. And, contrary to GTK2, Qt3 cannot use a fallback font for a single glyph.
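Since the two characters are easy to confuse on screen, one way to settle which one a piece of text really contains is to print its code points. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for this illustration and is not part of any tool mentioned in this report.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# The two look-alike characters discussed in this report.
for code_point, name in describe("\u301c\uff5e"):
    print(code_point, name)
```

Pasting the committed text from a text area into such a script shows whether the input method produced WAVE DASH or FULLWIDTH TILDE, independently of the font used for display.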
I just noticed that the handling of U+301C and U+FF5E by Anthy and Wnn8 input is interesting. I think typing 'kara' in Japanese should actually produce U+FF5E, but both Anthy and Wnn8 produce U+301C when 'kara' is typed. I am not sure whether this is also affected by the fallback font settings on KDE and Gnome. But the behavior when typing 'kara' with Anthy and Wnn8 is the same in both KDE and Gnome sessions: Anthy and Wnn8 produce U+301C.
As for the glyph, the U+301C glyph is not used; the U+FF5E glyph is used instead. I will attach an image of the U+301C and U+FF5E glyph shapes.
I suppose that uninstalling the IPA font is one workaround for this issue on KDE at the moment.

Created attachment 34767
U+301C and U+FF5E font shape image
Whether the glyphs look like those in your attachment in comment #7 seems to depend on the font used.

"Sazanami Gothic", "Gnu Unifont": U+301C and U+FF5E look very similar, both have a hill at the left and a valley at the right.
"MS Gothic": U+301C and U+FF5E differ, U+301C has a valley at the left and a hill at the right, which is just the opposite of U+FF5E.

When checking the sample glyphs in the Unicode book, I found that they look more like those in "MS Gothic". That means the glyphs for U+301C in "Sazanami Gothic" and "Gnu Unifont" are probably not good.

syamazaki> I just noticed that the handling of U+301C and U+FF5E by
syamazaki> Anthy and Wnn8 input is interesting. I think typing 'kara'
syamazaki> in Japanese should actually produce U+FF5E, but both Anthy
syamazaki> and Wnn8 produce U+301C when 'kara' is typed.

Same with Canna (checked with Canna via kinput2 and Canna via XEmacs). If you are sure this is wrong, then maybe we should fix it in the input methods? But why do *all* input methods do it that way? Maybe it is correct after all?

syamazaki> I am not sure whether this is also affected by the fallback
syamazaki> font settings on KDE and Gnome.

What the input methods insert has nothing to do with the fonts.

syamazaki> But the behavior when typing 'kara' with Anthy and Wnn8 is
syamazaki> the same in both KDE and Gnome sessions: Anthy and Wnn8
syamazaki> produce U+301C.

Yes, this has nothing to do with KDE, Gnome, or the fonts used.

syamazaki> As for the glyph, the U+301C glyph is not used; the U+FF5E
syamazaki> glyph is used instead. I will attach an image of the U+301C
syamazaki> and U+FF5E glyph shapes.

You mean that Gnome uses a glyph which looks like U+FF5E as a fallback for the missing U+301C in IPAGothic? That is only because it probably uses "Sazanami Gothic" as a fallback and U+FF5E and U+301C look very similar in "Sazanami Gothic".
syamazaki> I suppose that uninstalling the IPA font is one workaround
syamazaki> for this issue on KDE at the moment.

Yes, but that is not a nice workaround at all because then "Sazanami Gothic" will be used for everything and "Sazanami Gothic" is quite ugly.

I talked with Iwai San during lunch and he probably had the right idea: The input methods all work in EUC-JP internally. The results are then converted to Unicode/UTF-8.

Example: When looking into the source of the dictionary used by Anthy (gcanna.ctd in the Anthy sources), I find the following entry:

から #T35*263 空 #KJ*255 〜 #M5r*135 絡 #KYmime*131 辛 #S5*130 枯ら #T35*124 殻 #M5*122 から #T35*115 唐 #KJ*10 唐 #KJ 辛 殼 縢

EUC-JP encoded of course. The code point of this 〜 in EUC-JP is "A1C1". Now when glibc/iconv converts this to Unicode, the resulting Unicode code point is U+301C. That's why all input methods output this character as U+301C when running in a UTF-8 locale.

The question is now: Is it correct to convert A1C1 in EUC-JP to U+301C or should it rather be converted to U+FF5E instead?

Adding Markus Kuhn to the CC: because he knows a lot about encodings. Markus, do you have any idea whether the mapping of A1C1 in EUC-JP to U+301C is correct or not?

Some observations when converting this character with iconv to make the problem clearer:
mfabian@magellan:~$ perl -e 'binmode STDOUT, ":encoding(utf8)"; print "\x{301c}" ;' | iconv -f utf-8 -t euc-jp | hex
0000 a1 c1 ..
mfabian@magellan:~$ perl -e 'binmode STDOUT, ":encoding(utf8)"; print "\x{FF5e}" ;' | iconv -f utf-8 -t euc-jp | hex
0000 8f a2 b7
mfabian@magellan:~$
i.e. when converting from UTF-8 to EUC-JP with iconv, these characters are converted differently.
But when converting to EUC-JP-MS instead:
mfabian@magellan:~$ perl -e 'binmode STDOUT, ":encoding(utf8)"; print "\x{301c}" ;' | iconv -f utf-8 -t euc-jp-ms | hex
0000 a1 c1 ..
mfabian@magellan:~$ perl -e 'binmode STDOUT, ":encoding(utf8)"; print "\x{FF5E}" ;' | iconv -f utf-8 -t euc-jp-ms | hex
0000 a1 c1 ..
mfabian@magellan:~$
Both Unicode characters are mapped on the same target code point.
That means converting back and forth between UTF-8 and EUC-JP-MS is not lossless, which you can
also see here:
mfabian@magellan:~$ perl -e 'binmode STDOUT, ":encoding(utf8)"; print "\x{FF5E}" ;' | iconv -f utf-8 -t euc-jp | iconv -f euc-jp-ms -t ucs-2be | hex
0000 ff 5e .^
mfabian@magellan:~$ perl -e 'binmode STDOUT, ":encoding(utf8)"; print "\x{301C}" ;' | iconv -f utf-8 -t euc-jp | iconv -f euc-jp-ms -t ucs-2be | hex
0000 ff 5e .^
mfabian@magellan:~$
But converting between UTF-8 and EUC-JP is lossless in both directions:
mfabian@magellan:~$ perl -e 'binmode STDOUT, ":encoding(utf8)"; print "\x{FF5E}" ;' | iconv -f utf-8 -t euc-jp | iconv -f euc-jp -t ucs-2be | hex
0000 ff 5e .^
mfabian@magellan:~$ perl -e 'binmode STDOUT, ":encoding(utf8)"; print "\x{301C}" ;' | iconv -f utf-8 -t euc-jp | iconv -f euc-jp -t ucs-2be | hex
0000 30 1c 0.
mfabian@magellan:~$
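The round-trip behaviour seen in the transcripts above can be summarised in a small model. This is only a sketch: the four tables below are transcribed by hand from the iconv output in this comment and cover just the two characters under discussion. They are not the real glibc tables, and the names (`EUC_JP_ENCODE`, `lossless`, and so on) are invented for the illustration.

```python
# Encoding direction (Unicode code point -> bytes), as observed with iconv above.
EUC_JP_ENCODE = {0x301C: b"\xa1\xc1", 0xFF5E: b"\x8f\xa2\xb7"}
EUC_JP_MS_ENCODE = {0x301C: b"\xa1\xc1", 0xFF5E: b"\xa1\xc1"}

# Decoding direction (bytes -> Unicode code point), as observed with iconv above.
EUC_JP_DECODE = {b"\xa1\xc1": 0x301C, b"\x8f\xa2\xb7": 0xFF5E}
EUC_JP_MS_DECODE = {b"\xa1\xc1": 0xFF5E, b"\x8f\xa2\xb7": 0xFF5E}


def round_trips(code_point, encode, decode):
    """True if encoding and then decoding gives back the same code point."""
    return decode[encode[code_point]] == code_point


def lossless(encode, decode):
    """True if every code point in the encode table survives a round trip."""
    return all(round_trips(cp, encode, decode) for cp in encode)


print("EUC-JP lossless:   ", lossless(EUC_JP_ENCODE, EUC_JP_DECODE))
print("EUC-JP-MS lossless:", lossless(EUC_JP_MS_ENCODE, EUC_JP_MS_DECODE))
```

In this model only U+301C fails the EUC-JP-MS round trip, because it comes back as U+FF5E.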
I have built experimental packages of scim-anthy to fix this problem.
They are here:
/work/built/mbuild/magellan-mfabian-396/9.3-i386/scim-anthy-0.3.1-3.1.i586.rpm
/work/built/mbuild/magellan-mfabian-396/9.3-i386/scim-anthy-0.3.1-3.1.src.rpm
/work/built/mbuild/magellan-mfabian-396/9.3-x86_64/scim-anthy-0.3.1-3.1.src.rpm
/work/built/mbuild/magellan-mfabian-396/9.3-x86_64/scim-anthy-0.3.1-3.1.x86_64.rpm
Please try!
As the changelog says
- Bugzilla #78338:
use conversion from EUC-JP-MS to instead of from EUC-JP.
Fixes the problem reported in bug #78338 as long as SCIM is
*not* used via XIM (i.e. via GTK_IM_MODULE, QT_IM_MODULE, or
the SCIM support in mlterm (mlterm --im=scim)).
these packages fix the problem only when scim-anthy is *not* used via XIM!
The patch I applied to scim-anthy looks like this:
diff -ru scim-anthy-0.3.1.orig/src/scim_anthy_imengine.cpp scim-anthy-0.3.1/src/scim_anthy_imengine.cpp
--- scim-anthy-0.3.1.orig/src/scim_anthy_imengine.cpp 2005-01-27 02:53:09.000000000 +0100
+++ scim-anthy-0.3.1/src/scim_anthy_imengine.cpp 2005-04-19 19:45:01.000000000 +0200
@@ -157,7 +157,7 @@
     if (lang.length () >= 2)
         set_languages (lang);
-    if (!m_iconv.set_encoding ("EUC-JP"))
+    if (!m_iconv.set_encoding ("EUC-JP-MS"))
         return;
     /* config */
diff -ru scim-anthy-0.3.1.orig/src/scim_anthy_preedit.cpp scim-anthy-0.3.1/src/scim_anthy_preedit.cpp
--- scim-anthy-0.3.1.orig/src/scim_anthy_preedit.cpp 2004-12-19 11:39:25.000000000 +0100
+++ scim-anthy-0.3.1/src/scim_anthy_preedit.cpp 2005-04-19 19:44:54.000000000 +0200
@@ -96,7 +96,7 @@
     anthy_context_set_encoding (m_anthy_context, ANTHY_EUC_JP_ENCODING);
 #endif /* HAS_ANTHY_CONTEXT_SET_ENCODING */
-    if (!m_iconv.set_encoding ("EUC-JP"))
+    if (!m_iconv.set_encoding ("EUC-JP-MS"))
         return;
     set_table (m_typing_method, m_period_style, m_comma_style, m_space_type);
This has the effect that U+FF5E is inserted and not U+301C when "kara"
is typed in scim-anthy.
If you like that change, I think we could apply a similar hack to X11
which would fix the problem for all Japanese input methods using XIM
(Wnn7/8, Canna+kinput2, ...)
I think this would be possible if we change the EUC-JP -> UTF-8
conversion in X11 to map the EUC-JP "a1c1" codepoint to U+FF5E.
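The remapping idea described above can be illustrated outside X11 as a post-processing step on the converted text. The sketch below assumes Python's built-in `euc_jp` codec as a stand-in for the real converter and then substitutes U+FF5E for U+301C; the function name `decode_euc_jp_with_fullwidth_tilde` is hypothetical and this is not how X11 or any input method implements its conversion.

```python
WAVE_DASH = 0x301C        # what a JIS X 0208 based table yields for A1C1
FULLWIDTH_TILDE = 0xFF5E  # what the proposed remapping would yield instead


def decode_euc_jp_with_fullwidth_tilde(data: bytes) -> str:
    """Decode EUC-JP bytes, then replace WAVE DASH with FULLWIDTH TILDE."""
    text = data.decode("euc_jp")
    return text.translate({WAVE_DASH: FULLWIDTH_TILDE})


# "kara" in hiragana followed by the A1C1 code point from the dictionary entry.
sample = "から".encode("euc_jp") + b"\xa1\xc1"
result = decode_euc_jp_with_fullwidth_tilde(sample)
print([f"U+{ord(ch):04X}" for ch in result])
```

A blanket substitution like this changes every A1C1 that passes through the conversion, not only the ones typed via 'kara'.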
Mapping A1C1 in EUC-JP to U+301C: I think it is the correct and standard mapping. I tried input with the updated experimental scim-anthy package. The output from typing 'kara' with Anthy was U+FF5E. I think this result is natural, and it is nice for entering certain kinds of Japanese writing. But if the "a1c1" code point is mapped to U+FF5E in X11, it affects a lot of applications that users have created in the past, so the X11 mapping should stay as it is. However, getting U+301C from typing 'kara' is slightly strange. And if a U+301C glyph exists on the system and typing 'kara' outputs the U+301C glyph, that is also strange. So I think changing the dictionary file, so that 'kara' produces U+FF5E, is a nice solution for this issue.

syamazaki> So I think changing the dictionary file, so that 'kara'
syamazaki> produces U+FF5E, is a nice solution for this issue.

It is not possible to change it in the dictionary file because the dictionaries of all current Japanese input methods are EUC-JP encoded. I believe it would be nicer if the dictionary files were UTF-8 encoded because it would make things like this possible, but currently they are not and this is not so easy to change.

How about mapping the "a1c1" code point to U+FF5E only in Anthy from now on? If that is possible, I believe it might give better usability for users. I cannot think of a case where the U+301C glyph is required from Japanese typing.

Which Unicode character is inserted when typing kara in Microsoft Windows?

On Windows XP, the Unicode character inserted when typing 'kara' is U+FF5E.

Then we should probably try to make the Japanese input methods on Linux behave the same way. I don't yet know how best to do that.

The experimental patch to scim-anthy from comment #14 causes problems when running in the ja_JP.eucJP locale, i.e. this is not a perfect solution yet. I need to investigate more.
The holy book of Unicode, Version 4.0, page 682, has this to say on U+301C WAVE DASH:

"This character was encoded to match JIS C 6226-1978 1-33 'wave dash' [EUC 0xa1 0xc1]. Subsequent revisions of the JIS standard and industry practice have settled on JIS 1-33 as being the full-width tilde character [Unicode U+FF5E]."

In the light of this, I guess there is an out-of-date mapping table being used somewhere here. In particular the line

0x8160 0x2141 0x301C # WAVE DASH

in http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT may be the source of grief here, which is why the Unicode Consortium retired this mapping table back in August 2001. See also

http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/ReadMe.txt
http://www.debian.or.jp/~kubota/unicode-symbols.html
http://www.debian.or.jp/~kubota/unicode-symbols-map2.html

for details and email addresses of experts.

See also bug #233491 for another case where this buggy conversion in glibc caused a problem.

Moving forward a bit, but I assume this is rather a dishonest WONTFIX.

10.2 expired now, so let's close it as WONTFIX.
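The split between the older JIS-based mapping and the vendor mapping for JIS 1-33, as described in the Unicode quotation above, can also be observed with Python's standard codecs. This is a sketch under the assumption that Python's `shift_jis` codec follows the JIS X 0208 style table and `cp932` follows Microsoft's code page. It says nothing about the tables inside glibc, and the variable names are illustrative only.

```python
# Shift_JIS byte sequence 0x8160, i.e. JIS 1-33 (0x2141) from the mapping line above.
JIS_1_33_SJIS = b"\x81\x60"

legacy = JIS_1_33_SJIS.decode("shift_jis")  # JIS X 0208 style table
vendor = JIS_1_33_SJIS.decode("cp932")      # Microsoft code page 932

print(f"shift_jis -> U+{ord(legacy):04X}")
print(f"cp932     -> U+{ord(vendor):04X}")
```

The same two bytes therefore come out as WAVE DASH or FULLWIDTH TILDE depending only on which table the converter uses, which matches the Windows XP observation reported earlier in this thread.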
I will attach an image file to reproduce this bug. The upper image is from Konqueror, the lower one is from Firefox in a KDE session. The Japanese candidate window is opened with Anthy, and the selected Japanese letter is "～" U+FF5E in the Anthy candidate window in both the Konqueror and Firefox images.

1. The "～" U+FF5E should appear in the Anthy Japanese candidate window.
Note: This behavior appears only in a KDE session, not in a Gnome session.
The letter "～" U+FF5E selected in the Anthy candidate window should be displayed in the target text area after committing it from the candidate window.
2. In Firefox, the committed "～" U+FF5E is displayed in the textarea; this is the expected behavior. However, "～" U+FF5E is not displayed in the textarea in Konqueror.

I suspect the issue is in the font code mapping of Qt or in the font itself.