Bug 683857 - man: new Unicode characters in use
Summary: man: new Unicode characters in use
Status: VERIFIED FIXED
Alias: None
Product: openSUSE 11.4
Classification: openSUSE
Component: Basesystem
Version: Final
Hardware: All Linux
Priority: P3 - Medium, Severity: Minor
Target Milestone: ---
Assignee: Michal Vyskocil
QA Contact: E-mail List
URL:
Whiteboard: maint:released:11.4:41461
Keywords:
Depends on:
Blocks: 698290
Reported: 2011-03-30 17:50 UTC by Jan Engelhardt
Modified: 2011-06-16 07:36 UTC

See Also:
Found By: Beta-Customer
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Test manpage (2.95 KB, text/plain)
2011-05-02 15:02 UTC, Jan Engelhardt
Description Jan Engelhardt 2011-03-30 17:50:17 UTC
Starting with openSUSE 11.4, /usr/bin/man outputs the character U+2010 (HYPHEN) when it breaks a word, where it previously used U+002D (HYPHEN-MINUS). Because many fonts have no glyph for U+2010 (including terminus on xterm, and especially the text console), a replacement graphic such as a rectangle is displayed instead.

The soft hyphen U+00AD could be used instead, or man could switch back to plain ASCII hyphens.
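
To make the three code points mentioned above concrete, here is a minimal sketch that prints their UTF-8 byte sequences. The helper name hex_bytes is made up for illustration; only POSIX printf, od and tr are assumed.

```shell
# Hypothetical helper: print the bytes read from stdin as one hex string.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# U+2010 HYPHEN in UTF-8 (octal escapes are portable across printf implementations).
printf '\342\200\220' | hex_bytes; echo    # e28090
# U+002D HYPHEN-MINUS, the plain ASCII hyphen.
printf '%s' '-' | hex_bytes; echo          # 2d
# U+00AD SOFT HYPHEN in UTF-8.
printf '\302\255' | hex_bytes; echo        # c2ad
```

Only the second of these is a single byte that every console font can be expected to render.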
Comment 1 Dr. Werner Fink 2011-03-31 08:37:59 UTC
man uses groff for character mapping and less for output on the terminal
Comment 2 Michal Vyskocil 2011-04-22 08:53:09 UTC
That seems to be a regression caused by dropping bnc446710.patch; see bug 446710. However, it seems fonts/devutf8/R is no longer the right place for it. With

u2010   24      0       0x002D

in that file I get

echo "\[u2010]" |  nroff -mandoc -Tutf8 | head -n 1 | od -x
0000000 80e2 0a90
0000004

which is still the hyphen U+2010 encoded in UTF-8.
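
The od -x output above can be hard to read because od -x groups the input into 16-bit words in host byte order. As a small sketch, assuming a little-endian machine such as x86_64 for the word-wise form, the same bytes can be dumped one at a time; bytes_in_order is a hypothetical helper name.

```shell
# Hypothetical helper: dump stdin one byte at a time, in stream order.
# xargs only collapses od's padding into single spaces.
bytes_in_order() {
    od -An -tx1 | xargs
}

# U+2010 followed by a newline, i.e. the bytes the -Tutf8 run emits.
printf '\342\200\220\n' | bytes_in_order    # e2 80 90 0a
# A plain ASCII hyphen followed by a newline, as in the -Tascii run.
printf '%s\n' '-' | bytes_in_order          # 2d 0a

# Word-wise view; on a little-endian host the byte pairs appear swapped
# (80e2 0a90), which matches the od -x output quoted in this comment.
printf '\342\200\220\n' | od -An -x
```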

Only the ascii device seems to produce the proper replacement

echo "\[u2010]" |  nroff -mandoc -Tascii | head -n 1 | od -x
0000000 0a2d
0000004

although I was not able to find out which .tmac file contains this mapping. There is no big difference in the loaded tmac files between devascii and devutf8; only in the latter case are unicode.tmac and latin.tmac called after tty.tmac.

The only solution I'm aware of is to reverse the logic of unicode.tmac: instead of the current mapping of 0x2d to 0x2010 et al.

.\" unicode.tmac
.\"
.char - \[hy]
.char ` \[oq]
.char ' \[cq]
.\" EOF

use

.\" unicode.tmac
.\"
.char \[hy] -
.char \[oq] `
.char \[cq] '
.\" EOF

but that might cause unwanted side effects if someone else uses non-tty output. So maybe we can name it deunicode.tmac and call it in tty.tmac instead of the unicode one.

Werner: what do you think?
Comment 3 Michal Vyskocil 2011-04-28 12:21:30 UTC
Uh, forget that. I patched tty.tmac so that it no longer includes unicode.tmac, which is what changes 0x2d to 0x2010. I don't think we need to change it back. I'm going to send a fix to M17N soon.
Comment 4 Michal Vyskocil 2011-04-28 14:51:09 UTC
The problem has been fixed in the M17N [1] groff by commit 12 [2]. tty.tmac no longer includes unicode.tmac, so ASCII characters will not be replaced. Feel free to test it before I submit it to Factory from the M17N repository [1].

[1] http://download.opensuse.org/repositories/M17N/openSUSE_11.4/
[2] https://build.opensuse.org/package/rdiff?commit=12&linkrev=base&package=groff&project=M17N
Comment 5 Jan Engelhardt 2011-04-28 15:29:54 UTC
I have updated to that package, but still see U+2010 used for word breaks.
Comment 6 Michal Vyskocil 2011-05-02 14:35:54 UTC
Can you give me an example? Which man page, and under which conditions? Thanks.
Comment 7 Jan Engelhardt 2011-05-02 15:02:02 UTC
Created attachment 427556
Test manpage

groff-1.20.1-183.1.x86_64.rpm from M17N/openSUSE_11.4.

$ locale
LANG=en_US.UTF-8
LC_CTYPE=de_DE.UTF-8
LC_NUMERIC=POSIX
LC_TIME=POSIX
LC_COLLATE=POSIX
LC_MONETARY=POSIX
LC_MESSAGES=nb_NO.UTF-8
LC_PAPER=de_DE.UTF-8
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Running inside xterm-268:
$ man -l test.1 | pcregrep -o '[^\w]+' | sort -u
...
?

Appending | hexdump -C to that pipeline shows the bytes "e2 80 90", which is the UTF-8 encoding of U+2010.
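
For repeated testing, the same check can be reduced to a single number. The following is only a sketch: count_u2010_lines is a hypothetical helper, and it assumes a grep that matches raw bytes under LC_ALL=C.

```shell
# Hypothetical helper: count the lines on stdin that contain U+2010,
# i.e. the UTF-8 byte sequence e2 80 90.
count_u2010_lines() {
    # LC_ALL=C makes grep compare raw bytes regardless of the caller's locale.
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    LC_ALL=C grep -c "$(printf '\342\200\220')" || true
}

# Usage, assuming the attached test.1 is in the current directory:
#   man -l test.1 | count_u2010_lines
# A result of 0 would mean no U+2010 is left in the rendered page.
```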
Comment 8 Michal Vyskocil 2011-06-06 11:00:40 UTC
The updated patch adds deunicode.tmac, which turns this Unicode substitution off for tty output. hexdump -C then returns

00000000  2d 0a                                             |-.|
00000002

Committed as revision 13 to M17N/groff.
Comment 9 Michal Vyskocil 2011-06-06 11:10:17 UTC
Submitted to openSUSE:Factory as request 72760. I assume you can use the
version from M17N, so I am not requesting a maintenance update and am closing this bug.
Comment 10 Bernhard Wiedemann 2011-06-06 16:00:27 UTC
This is an autogenerated message for OBS integration:
This bug (683857) was mentioned in
https://build.opensuse.org/request/show/72760 Factory / groff
Comment 11 Dave Plater 2011-06-07 20:32:24 UTC
(In reply to comment #2)

I came upon this bug while googling deunicode.tmac because of a new rpmlint warning for the man pages of a few packages. This is from lilv, a package I'm preparing for Factory:
lilv.x86_64: W: manual-page-warning /usr/share/man/man1/lv2jack.1.gz 69: can't find macro file `deunicode.tmac'
lilv.x86_64: W: manual-page-warning /usr/share/man/man1/serdi.1.gz 69: can't find macro file `deunicode.tmac'
lilv.x86_64: W: manual-page-warning /usr/share/man/man3/lilv.3.gz 69: can't find macro file `deunicode.tmac'
lilv.x86_64: W: manual-page-warning /usr/share/man/man3/SerdURI.3.gz 69: can't find macro file `deunicode.tmac'
lilv.x86_64: W: manual-page-warning /usr/share/man/man3/SerdNode.3.gz 69: can't find macro file `deunicode.tmac'
lilv.x86_64: W: manual-page-warning /usr/share/man/man1/sordi.1.gz 69: can't find macro file `deunicode.tmac'
lilv.x86_64: W: manual-page-warning /usr/share/man/man3/serd.3.gz 69: can't find macro file `deunicode.tmac'
lilv.x86_64: W: manual-page-warning /usr/share/man/man3/SerdChunk.3.gz 69: can't find macro file `deunicode.tmac'
lilv.x86_64: W: manual-page-warning /usr/share/man/man3/sord.3.gz 69: can't find macro file `deunicode.tmac'
lilv.x86_64: W: manual-page-warning /usr/share/man/man1/lv2ls.1.gz 69: can't find macro file `deunicode.tmac'
lilv.x86_64: W: manual-page-warning /usr/share/man/man1/lv2info.1.gz 69: can't find macro file `deunicode.tmac'
This man page may contain problems that can cause it not to be formatted as
intended.

Is there a package that provides deunicode.tmac?
Comment 12 Jan Engelhardt 2011-06-07 20:41:28 UTC
As of 

* Mon Jun 06 2011 mvyskocil@suse.cz
-
- fix bnc#682913: device X100 is missing
  * create new groff-devx package containing all devX devices, as they
    need X for build
- fix bnc#683857: Unicode characters in use
  * groff-1.20.1-deunicode.patch adds deunicode.tmac to tty.tmac removes
    all unecessary unicode characters in tty output

I still get 0x2010 as a dash separator.
Comment 13 Jan Engelhardt 2011-06-07 20:41:48 UTC
-
Comment 14 Michal Vyskocil 2011-06-08 09:29:20 UTC
Sorry, I accidentally tested the groff from 11.3. In any case, deunicode.tmac is not the proper solution. The working one is simple: change the soft hyphenation character to "-".

That is what the new version does:

# To be sure I'm testing the right version!
$ rpm -q --changelog groff | head -n 4
* Wed Jun 08 2011 mvyskocil@suse.cz
- fix bnc#683857: Unicode characters in use properly
  * change the soft hyphenation char to - in tty.tmac
$ man -l test.1 | pcregrep -o '[^\w]+'  | sort -u | grep -- '-' | hexdump -C
00000000  2d 0a                                             |-.|
00000002

Committed as revision 17 to M17N/groff.
Comment 15 Jan Engelhardt 2011-06-08 14:01:48 UTC
Now does what was wanted.
Comment 16 Bernhard Wiedemann 2011-06-09 10:00:15 UTC
This is an autogenerated message for OBS integration:
This bug (683857) was mentioned in
https://build.opensuse.org/request/show/73067 11.4 / groff
https://build.opensuse.org/request/show/73070 Factory / groff
Comment 17 Swamp Workflow Management 2011-06-16 07:36:02 UTC
Update released for: groff, groff-debuginfo, groff-doc
Products:
openSUSE 11.4 (debug, i586, x86_64)