Bug 152778 - LC_COLLATE=es_ES ignores blanks
Summary: LC_COLLATE=es_ES ignores blanks
Status: RESOLVED FIXED
Alias: None
Product: openSUSE 10.2
Classification: openSUSE
Component: Basesystem
Version: Alpha 1
Hardware: All Linux
Priority: P5 - None  Severity: Normal
Target Milestone: ---
Assignee: Mike Fabian
QA Contact: E-mail List
Whiteboard: wasL3 -> 20060222430000177
Reported: 2006-02-22 13:45 UTC by Julius Stricker
Modified: 2008-04-01 16:53 UTC

Found By: Customer


Description Julius Stricker 2006-02-22 13:45:42 UTC
When LC_COLLATE=es_ES, the sort command ignores spaces in its sorting
algorithm, so it sorts
	MAS PUJADAS, FRANCESC
after
	MASOLIVER GARCIA, JAIME	
instead of before, even though the comments in
/usr/share/i18n/locales/es_ES indicate that the sorting algorithm for this
locale should take spaces into account (and sort them before punctuation
characters, numbers and letters).

This Spanish customer is not using LC_COLLATE="POSIX" because the sort command
then gives incorrect results for characters with Spanish accents, so he has to
use LC_COLLATE="es_ES.UTF-8", which ignores spaces.
Even /usr/share/i18n/locales/es_ES states:


LC_COLLATE

% Base collation scheme: 1994-03-22

% Ordering algorithm:
%  1. Spaces and hyphen (but not soft
%     hyphen) before punctuation
%     characters, punctuation characters
%     before numbers,
%     numbers before letters.

I also tested it with every other language setting and the results are always the same:

mortlach:~ # export LC_COLLATE="POSIX"
mortlach:~ # sort demo

AB CDESY
ABC DETZ
ABCD ETX

mortlach:~ # export LC_COLLATE="en_GB.UTF-8"
mortlach:~ # sort demo

AB CDESY
ABCD ETX
ABC DETZ

mortlach:~ # export LC_COLLATE="de_DE.UTF-8"
mortlach:~ # sort demo

AB CDESY
ABCD ETX
ABC DETZ

So the question is why LC_COLLATE="POSIX" behaves differently from every other language setting. If this is a feature, where is it documented, and why is it so? It doesn't make sense that LC_COLLATE="POSIX" behaves differently from the English settings (UK & US), which in turn behave exactly like every other language setting, so there must be a reason for this.
Comment 3 Holger Hetterich 2006-03-21 15:21:07 UTC
Reassigning. Mike, could you take a look?
Comment 4 Markus Kuhn 2006-03-21 18:24:17 UTC
Glibc implements a 4-pass sorting algorithm, something like the Unicode Collation Algorithm defined at

  http://www.unicode.org/reports/tr10/

or equivalently the International Standard Ordering defined in ISO 14651. The SPACE is not ignored; it affects the sorting order, but only with lower priority than

  - the base characters
  - accents
  - whether base characters are uppercase or lower case

At level 4, space is treated like punctuation.

The Unicode sorting algorithm has lots of options. If you look at

  http://www.unicode.org/reports/tr10/#Variable_Weighting

you will see that variable weighting options are available for characters such as SPACE. Perhaps the UTF-8 locales were configured to use something equivalent to the "blanked" option, whereas what the user expects here is the "non-ignorable" option?

It is up to the locale designer to choose these options, and I suspect the necessary discussion on which options are best here has never taken place.

The culprit is probably in the file

  /usr/share/i18n/locales/iso14651_t1

the line

  <U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>

which says that SPACE is sorted at level 4 only, i.e. with lowest priority. I don't think this is a particularly good choice.

File format spec:
http://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14652.pdf

People like Ulrich Drepper, Alain LaBonté, Keld J. Simonsen would know more on the origins of this.
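
A toy model of this level-by-level comparison, written as a Python sketch and not taken from glibc. The function name toy_collation_key and its parameter space_ignored are hypothetical, and the model assumes input made only of ASCII letters and spaces. With the space ignored at levels 1 to 3 it reproduces the en_GB/de_DE order from the description; with the space given a primary weight it reproduces the POSIX order.

```python
# Toy model of a four-level collation key (an illustration, not glibc's code).
# Assumption: the input contains only ASCII letters and the space character.

def toy_collation_key(text, space_ignored=True):
    """Return a key whose four parts are compared in order, level by level."""
    primary = []     # level 1: base characters
    secondary = []   # level 2: accents (none exist in this toy alphabet)
    tertiary = []    # level 3: upper case versus lower case
    quaternary = []  # level 4: characters that were ignored at levels 1-3
    for index, char in enumerate(text):
        if char == " " and space_ignored:
            # Mirrors "<U0020> IGNORE;IGNORE;IGNORE;<U0020>":
            # the space only contributes at the last level.
            quaternary.append((index, char))
            continue
        primary.append(char.lower())
        secondary.append(0)
        tertiary.append(0 if char.islower() else 1)
    return (primary, secondary, tertiary, quaternary)


demo = ["AB CDESY", "ABC DETZ", "ABCD ETX"]

# Space ignored until level 4: the letters alone decide the order.
print(sorted(demo, key=toy_collation_key))

# Space with a primary weight: it sorts before every letter.
print(sorted(demo, key=lambda s: toy_collation_key(s, space_ignored=False)))
```

In the first call the keys start with "abcdesy", "abcdetz" and "abcdetx", so "ABCD ETX" lands before "ABC DETZ"; in the second call the space takes part in the first-level comparison and the order matches the POSIX result.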
Comment 5 Mike Fabian 2006-03-23 10:37:08 UTC
If the customer needs this only for the "sort" command and space is
the only character which causes a problem, she can use the following
workaround:

mfabian@magellan:~/test-texts$ LANG=es_ES.UTF-8 sort -t ' ' -k 1,1 -k 2,2 -k 3,3 demo
AB CDESY
ABC DETZ
ABCD ETX
mfabian@magellan:~/test-texts$

i.e. specify the space character as a field separator and then list
all fields as sort keys.
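
The idea behind this workaround can be pictured with a small Python sketch. The name field_sort_key and its parameters are hypothetical. Unlike the sort invocation above, the sketch compares every field instead of only the first three, and by default it compares fields by code point rather than by locale rules.

```python
def field_sort_key(line, separator=" ", field_key=str):
    """Split the line at the separator and compare it field by field.

    field_key is applied to each field; the default leaves the field
    unchanged, so fields are compared by code point.
    """
    return tuple(field_key(field) for field in line.split(separator))


demo = ["ABCD ETX", "ABC DETZ", "AB CDESY"]
print(sorted(demo, key=field_sort_key))
```

Because the separator never takes part in the comparison, the first fields "AB", "ABC" and "ABCD" decide the order. Passing locale.strxfrm as field_key would compare each field under the current LC_COLLATE setting, assuming the desired locale is installed and has been selected with locale.setlocale.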

Comment 6 Mike Fabian 2006-03-28 15:17:15 UTC
Is the workaround described in comment #5 a sufficiently good
solution for the customer?

Or is it necessary to fix this problem in glibc *now*?
Comment 7 Holger Achtziger 2006-03-28 17:50:20 UTC
We will check with the customer.
Comment 8 Ray Dassen 2006-04-10 07:39:32 UTC
Customer response:

"Unfortunately, the workaround of specifying the space character as a field
separator is not valid for our developers because they have to use another
character as a field separator. Do you have another better solution? Is it
possible to obtain in the future an official patch to solve this sort
problem?"
Comment 9 Holger Achtziger 2006-04-10 08:56:15 UTC
What is wrong with:
LC_COLLATE="POSIX"; sort demo
Comment 10 Ray Dassen 2006-04-10 09:25:01 UTC
Quoting the customer's web update to the SR, 2/20/2006 11:03 AM:
`With LC_COLLATE="POSIX" the sort command gives incorrect results when
dealing with characters with spanish accents: Aacute ... for example'
Comment 11 Markus Kuhn 2006-04-10 09:49:37 UTC
Options:

a) Tell the customer that you are sorry that glibc does not at present offer what he expects, and that we are unable to fix this ourselves without breaking compatibility with every other glibc-based distribution. One possible customer-side workaround is to replace SP with NBSP (0xa0) before sorting. NBSP does already seem to get sorted in the way in which the customer expects SP to be sorted.

b) Patch in /usr/share/i18n/locales/iso14651_t1 the line

  <U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>

to something like

  <U0020> <U0020>;<BAS>;<MIN>;IGNORE

to make SP sort like NBSP does already

WARNING: This obviously breaks compatibility with other Linux distributions.

c) Fund a proper research project aimed at cleaning up the mess that the collation implementation and configuration is at present in glibc and POSIX, possibly also investigating user needs and developing a new API for customizing the sorting order at run-time via environment variables and/or new library calls.

The reported problem is only one symptom of the fact that the collating code (and perhaps even the underlying POSIX spec!) is not really finished and is at present not properly maintained.

Option c) is perhaps what really should be done, but needs far wider discussion (beyond Novell) and escalation to management, because someone will have to spend many weeks (if not months) on sorting this entire issue out properly.
Comment 12 Ray Dassen 2006-04-10 13:17:42 UTC
Thanks for your input, Markus.

I've tried the SP <-> NBSP substitution workaround as follows:

	perl -ne "use encoding 'utf-8'; s/ /\xa0/g; print" < input.txt | \
	sort | \
	perl -ne "use encoding 'utf-8'; s/\xa0/ /g; print"

but while this fixes the customer's example (sort "MAS PUJADAS" before
"MASOLIVER"), the behaviour with Mike's example is unexpected:
	ABCD ETX
	ABC DETZ
	AB CDESY
Comment 13 Mike Fabian 2006-04-10 13:58:08 UTC
It seems like the way you use perl to do the conversion is
not correct. Using your perl expression I get:

mfabian@magellan:/tmp$ perl -ne "use encoding 'utf-8'; s/ /\xa0/g; print" < input.txt | hex 
0000  41 42 43 44 ef bf bd 45  54 58 0a 41 42 43 ef bf  ABCD...E TX.ABC..
0010  bd 44 45 54 5a 0a 41 42  ef bf bd 43 44 45 53 59  .DETZ.AB ...CDESY
0020  0a                                                .
mfabian@magellan:/tmp$

i.e. the spaces are converted to "ef bf bd" in UTF-8 which
is U+FFFD (REPLACEMENT CHARACTER).
Comment 14 Markus Kuhn 2006-04-10 14:05:47 UTC
The appropriate Perl way of doing this is more like

  perl -C -pe "s/ /\xa0/g;" | sort | perl -C -pe "s/\xa0/ /g;"

however, the effect remains the same. (Option -C tells Perl to use the input and output encoding according to the locale, as it was the default in Perl 5.8.0 briefly.)
Comment 15 Mike Fabian 2006-04-10 14:18:04 UTC
Markus Kuhn> The appropriate Perl way of doing this is more like

This perl expression works:

mfabian@magellan:/tmp$ hex <input.txt
0000  41 42 43 44 20 45 54 58  0a 41 42 43 20 44 45 54  ABCD ETX .ABC DET
0010  5a 0a 41 42 20 43 44 45  53 59 0a                 Z.AB CDE SY.
mfabian@magellan:/tmp$ perl -C -pe "s/ /\xa0/g;" < input.txt | hex
0000  41 42 43 44 c2 a0 45 54  58 0a 41 42 43 c2 a0 44  ABCD..ET X.ABC..D
0010  45 54 5a 0a 41 42 c2 a0  43 44 45 53 59 0a        ETZ.AB.. CDESY.
mfabian@magellan:/tmp$ perl -C -pe "s/ /\xa0/g;" < input.txt | perl -C -pe "s/\xa0/ /g;" | hex
0000  41 42 43 44 20 45 54 58  0a 41 42 43 20 44 45 54  ABCD ETX .ABC DET
0010  5a 0a 41 42 20 43 44 45  53 59 0a                 Z.AB CDE SY.
mfabian@magellan:/tmp$

Markus Kuhn> however, the effect remains the same

?

For me it sorts as desired:

mfabian@magellan:/tmp$ perl -C -pe "s/ /\xa0/g;" < input.txt | sort | perl -C -pe "s/\xa0/ /g;" 
AB CDESY
ABC DETZ
ABCD ETX
mfabian@magellan:/tmp$

(Locale is es_ES.UTF-8 of course).
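
The same SP <-> NBSP round trip can be sketched in Python. The name sort_with_nbsp_swap and its key parameter are hypothetical, and the sketch assumes the input lines contain no NBSP of their own, since those would come back as plain spaces.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE


def sort_with_nbsp_swap(lines, key=None):
    """Replace SP by NBSP, sort, then turn every NBSP back into SP.

    key is handed to list.sort(); pass a locale-aware key such as
    locale.strxfrm to get the locale's ordering for NBSP.
    """
    swapped = [line.replace(" ", NBSP) for line in lines]
    swapped.sort(key=key)
    return [line.replace(NBSP, " ") for line in swapped]


demo = ["ABCD ETX", "ABC DETZ", "AB CDESY"]
print(sort_with_nbsp_swap(demo))
```

Without a key the lines are compared by code point, where U+00A0 sorts after all ASCII letters, so the swap only helps together with a locale-aware key: call locale.setlocale(locale.LC_COLLATE, "es_ES.UTF-8") first (assuming that locale is installed) and pass key=locale.strxfrm.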
Comment 16 Mike Fabian 2006-04-10 14:35:48 UTC
By the way, I think it is a bit weird that the following perl
expressions, which don't use "-C" but set the input and output
encoding explicitly don't work:

mfabian@magellan:/tmp$ perl -ne 'binmode STDIN, ":utf8"; binmode STDOUT, ":utf8"; s/ /\xa0/g; print' < input.txt | perl -ne 'binmode STDIN, ":utf8"; binmode STDOUT, ":utf8"; s/\xa0/ /g; print' | hex
0000  41 42 43 44 c3 82 20 45  54 58 0a 41 42 43 20 44  ABCD.. E TX.ABC D
0010  45 54 5a 0a 41 42 20 43  44 45 53 59 0a           ETZ.AB C DESY.
mfabian@magellan:/tmp$

Replacing ' ' by '\xa0' works correctly but the second perl expression
which should revert it doesn't do it right.

If one uses ' ' (U+00A0) directly in the second perl expression
instead of using the backslash escape sequence '\xa0', it works!:

mfabian@magellan:/tmp$ perl -ne 'binmode STDIN, ":utf8"; binmode STDOUT, ":utf8"; s/ /\xa0/g; print' < input.txt | perl -ne 'binmode STDIN, ":utf8"; binmode STDOUT, ":utf8"; s/ / /g; print' | hex
0000  41 42 43 44 20 45 54 58  0a 41 42 43 c2 a0 44 45  ABCD ETX .ABC..DE
0010  54 5a 0a 41 42 c2 a0 43  44 45 53 59 0a           TZ.AB..C DESY.
mfabian@magellan:/tmp$

I guess that's a bug in perl, isn't it?



Comment 17 Markus Kuhn 2006-04-10 15:22:55 UTC
Re: #15
It works for me in es_ES.UTF-8, but not in en_GB.UTF-8. Any idea?
Re: #16
Put binmode into a BEGIN block, such that it is executed before the <> under -n
Comment 18 Ray Dassen 2006-04-10 15:45:59 UTC
(In reply to comment #14)
> The appropriate Perl way of doing this is more like
> 
>   perl -C -pe "s/ /\xa0/g;" | sort | perl -C -pe "s/\xa0/ /g;"

Thanks; I haven't worked with Unicode in Perl before.

> however, the effect remains the same.

It does the trick in my reference SLES9 system, so I'm now asking the customer whether this is an acceptable workaround for him.
Comment 19 Mike Fabian 2006-04-10 16:01:56 UTC
Markus Kuhn> Re: #16
Markus Kuhn> Put binmode into a BEGIN block, such that it is executed before the <> under -n

Yes, that works:

mfabian@magellan:/tmp$ perl -ne 'BEGIN {binmode STDIN, ":utf8"; binmode STDOUT, ":utf8";} s/ /\xa0/g; print' < input.txt | perl -ne 'binmode STDIN, ":utf8"; binmode STDOUT, ":utf8"; s/\xa0/ /g; print' | hex
0000  41 42 43 44 c3 82 20 45  54 58 0a 41 42 43 20 44  ABCD.. E TX.ABC D
0010  45 54 5a 0a 41 42 20 43  44 45 53 59 0a           ETZ.AB C DESY.
mfabian@magellan:/tmp$

But why does it work without the BEGIN block if I use the character
U+00A0 directly instead of the backslash escape?
Comment 20 Markus Kuhn 2006-04-10 16:49:06 UTC
#19: Because in that case, Perl remains in binary mode and just passes the bytes through as they are, and your locale and source code happen to use the same encoding (here: UTF-8). Therefore, Perl does not have to know what encoding you use. Only when you ask for Unicode character U+00A0 will Perl have to know how to translate that into a byte sequence, and that's where it needs to know about the encoding.
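
The difference between the two modes can be reproduced with a short Python sketch. It is an analogy, not Perl's implementation: decoding as Latin-1 stands in for Perl treating each input byte as one character, and the variable names are made up for the illustration.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

# One line as it arrives on stdin: UTF-8 encodes U+00A0 as the bytes c2 a0.
utf8_line = ("ABCD" + NBSP + "ETX\n").encode("utf-8")
print(utf8_line.hex(" "))

# Character mode: decode first, then substitute U+00A0, then encode again.
characters = utf8_line.decode("utf-8")
good = characters.replace(NBSP, " ").encode("utf-8")
print(good.hex(" "))

# Byte mode on input, UTF-8 layer on output: every byte is one character,
# so only the a0 byte is replaced and the stray c2 byte is re-encoded.
byte_characters = utf8_line.decode("latin-1")
bad = byte_characters.replace("\xa0", " ").encode("utf-8")
print(bad.hex(" "))
```

The last line prints 41 42 43 44 c3 82 20 45 54 58 0a, the same byte sequence as the first line of the hex dump in comment #16.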

(BTW: I'd rather have this discussion on linux-utf8, such that more people can learn from it, than in a restricted bugzilla. None of these issues are restricted to just SuSE Linux.)
Comment 22 Mike Fabian 2006-04-21 16:50:43 UTC
Ray Dassen> It does the trick in my reference SLES9 system, so I'm now
Ray Dassen> asking the customer whether this is an acceptable
Ray Dassen> workaround for him.

Is the workaround acceptable for the customer? Can we close this bug?

Comment 23 Ray Dassen 2006-04-24 08:11:40 UTC
There has not yet been a customer response to the proposed workaround. I've sent him a "ping" message; hopefully he will respond to that.

I would prefer this bug be kept open until the underlying service request is closed (if the customer doesn't respond, that'll be in two weeks' time).
Comment 24 Mike Fabian 2006-04-24 09:11:42 UTC
OK, thank you.
Comment 25 Ray Dassen 2006-05-08 07:59:12 UTC
Update from the customer:

	5/8/2006 09:17:53 AM	Public	Web Update	Done	
	Sorry for I haven't responded to your service request before. The workaround you sent for the sorting behaviour is acceptable for us. Thanks.

So this bug can be de-L3-ed. As the underlying issue is still present, I
would prefer it if this bug were not closed though.
Comment 26 Ray Dassen 2006-05-09 08:47:54 UTC
I have published the workaround in TID 6646, "Spaces are being ignored when
sorting data",
	https://secure-support.novell.com/KanisaPlatform/Publishing/274/6646_f.SAL_Public.html
Comment 27 Mike Fabian 2006-05-09 15:20:18 UTC
Ulrich Drepper just fixed a similar bug in pl_PL.UTF-8 locale, see:

http://sourceware.org/bugzilla/show_bug.cgi?id=388

------- Additional Comments From drepper at redhat dot com  2006-05-01 17:26 -------
I made the change.

Next time if you reply, change the state back from WAITING.  Otherwise the bug
might not show up on lists.

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |RESOLVED
         Resolution|                            |FIXED
Comment 28 Mike Fabian 2006-05-09 15:31:36 UTC
Apparently this change in glibc was *only* done for pl_PL though, see

http://sourceware.org/cgi-bin/cvsweb.cgi/libc/localedata/locales/?cvsroot=glibc

Probably a similar change should be done for many/most other languages as well.
Comment 29 Ray Dassen 2006-05-09 15:41:58 UTC
(In reply to comment #28)
> Probably a similar change should be done for many/most other languages as
> well.

Looks like it, yes. Quite a few locales have the same specification for
space's weight as es_ES and it is likely wrong or undesirable in all of
them.

libc-cvs/localedata/locales# grep -l 'IGNORE;IGNORE;IGNORE;<U0020>' *
ca_ES
cs_CZ
en_CA
es_ES
es_US
et_EE
fi_FI
hr_HR
hsb_DE
is_IS
iso14651_t1
lt_LT
lv_LV
nb_NO
sl_SI
tr_TR
Comment 30 Mike Fabian 2006-05-09 16:08:10 UTC
Ray Dassen> So this bug can be de-L3-ed.
    
→ removing "L3:" from subject, moving bug to SUSE Linux 10.2.
Comment 31 Vance Baarda 2006-07-13 18:37:28 UTC
Deleting invalid NTS Priority value. (Value needs to be an integer between 1 and
1000, inclusive.)
Comment 32 Mike Fabian 2007-01-17 04:48:03 UTC
Ray Dassen> I would prefer this bug be kept open until the underlying
Ray Dassen> service request is closed (if the customer doesn't
Ray Dassen> respond, that'll be in two weeks time).

OK, how did the customer respond? Can we close this bug now?

→ NEEDINFO Ray Dassen.
Comment 33 Ray Dassen 2007-01-17 07:45:44 UTC
(In reply to comment #32)
> Ray Dassen> I would prefer this bug be kept open until the underlying
> Ray Dassen> service request is closed (if the customer doesn't
> Ray Dassen> respond, that'll be in two weeks time).
> 
> OK, how did the customer respond?

There were no further customer responses after the one in comment #25.

> Can we close this bug now?

Yes; the customer was happy and the issue is now documented for SLES as TID
3006646, "Spaces are being ignored when sorting data".

It would be nice of course if you can work with upstream on fixing this
issue upstream along the lines of the change made for pl_PL already 
(comment #27).
Comment 34 Mike Fabian 2007-02-23 15:31:25 UTC
Well, I have tried to discuss this upstream, see:

http://sourceware.org/bugzilla/show_bug.cgi?id=2648

The reply by Ulrich Drepper was only:

     ------- Additional Comment #9 From Ulrich Drepper  2006-05-10 15:18  [reply] -------

    It's complete BS to say that spaces are mishandled in most locales.  This was
    appropriately researched by the ISO 14651 working group and I trust those people
    more than any random user.

    It is further completely unacceptable to open one bug and complain about a
    million things.

    To get anything changed, you have to provide statements from the language
    authorities about the proposed change.  If you cannot provide this there
    obviously is at least room for discussion and no change is the right approach.

    And then:

     ------- Additional Comment #10 From Ulrich Drepper  2007-02-17 18:44  [reply] -------

    No reply in 9 months.  Closing.

I didn't think that I did complain about a million things, only about sorting
of spaces.

And I thought that Markus Kuhn's comments also show that the sorting of spaces
is most likely not done correctly in glibc.

And the fact that a change like the one requested for the Spanish locale
was recently applied to the Polish locale in glibc is yet another data
point that something seems to be wrong here.

I have no idea how to get the discussion started again.

Therefore I cannot do anything else but close this bug as fixed as
our customer has a usable workaround now.




Comment 35 Mike Fabian 2007-02-23 15:50:06 UTC
I have added Petr Baudis <pbaudis@novell.com> to CC:.

Petr is currently maintainer of our glibc package.

Petr, for Czech locale the sorting is most likely wrong as well:

mfabian@magellan:/tmp$ LC_COLLATE=POSIX sort demo
AB CDESY
ABC DETZ
ABCD ETX
mfabian@magellan:/tmp$ LC_COLLATE=cs_CZ.UTF-8  sort demo
AB CDESY
ABCD ETX
ABC DETZ
mfabian@magellan:/tmp$ cat demo
AB CDESY
ABC DETZ
ABCD ETX
mfabian@magellan:/tmp$

Whereas it seems to sort correctly for Polish locale, probably
because we already have the fix from:

http://sourceware.org/bugzilla/show_bug.cgi?id=388

mfabian@magellan:/tmp$ LC_COLLATE=pl_PL.UTF-8  sort demo
AB CDESY
ABC DETZ
ABCD ETX
mfabian@magellan:/tmp$

Petr, if you think you can discuss this upstream better than me,
please try.

Comment 36 Petr Baudis 2007-02-23 16:06:37 UTC
I agree that the "default" sort order seems silly, but obviously we need to prove for every locale we want to change that it's wrong or upstream won't accept fixes. That means referencing national standards for sorting orders, which are mostly proprietary and also probably mostly not in English...
Comment 37 Mike Fabian 2008-04-01 16:53:28 UTC
Arturo Aguilar <aaguilar@novell.com> and
Fernando Herradon <fherradon@novell.com> stumbled on the same
problem (on SLES8/United Linux SP4).

They need to sort Spanish text and need to do this in a Spanish locale
to get the ñ sorted correctly. But then the spaces are sorted the
wrong way as discussed in this bug. Using the POSIX locale instead,
the spaces are sorted correctly but the ñ is not sorted correctly.

I.e. there is no locale which sorts both the space and the ñ correctly
for Spanish.

Arturo and Fernando, can you point me to an official standard for how
Spanish text should be sorted? Is something like that available
online somewhere?

I would like to add such information to the upstream bug:

http://sourceware.org/bugzilla/show_bug.cgi?id=2648