Bugzilla – Bug 152778
LC_COLLATE=es_ES ignores blanks
Last modified: 2008-04-01 16:53:28 UTC
When LC_COLLATE=es_ES, the sort command ignores spaces in its sorting algorithm, so it sorts "MAS PUJADAS, FRANCESC" after "MASOLIVER GARCIA, JAIME" instead of before it, even though the comments in /usr/share/i18n/locales/es_ES indicate that the sorting algorithm for this locale should take spaces into account (and sort them before punctuation characters, numbers and letters).

This Spanish customer is not using LC_COLLATE="POSIX" because the sort command gives incorrect results when dealing with characters with Spanish accents, so he has to use LC_COLLATE="es_ES.UTF-8", which is ignoring spaces.

Even /usr/share/i18n/locales/es_ES states:

    LC_COLLATE
    % Base collation scheme: 1994-03-22
    % Ordering algorithm:
    % 1. Spaces and hyphen (but not soft
    % hyphen) before punctuation
    % characters, punctuation characters
    % before numbers,
    % numbers before letters.

I also tested it with every other language setting and the results are always the same:

    mortlach:~ # export LC_COLLATE="POSIX"
    mortlach:~ # sort demo
    AB CDESY
    ABC DETZ
    ABCD ETX
    mortlach:~ # export LC_COLLATE="en_GB.UTF-8"
    mortlach:~ # sort demo
    AB CDESY
    ABCD ETX
    ABC DETZ
    mortlach:~ # export LC_COLLATE="de_DE.UTF-8"
    mortlach:~ # sort demo
    AB CDESY
    ABCD ETX
    ABC DETZ

So the question is why LC_COLLATE="POSIX" behaves differently from any other language setting. If this is a feature, where is it documented, and why is it so? It doesn't make sense that LC_COLLATE="POSIX" behaves differently from the English settings (UK & US), which on the other hand behave exactly the same way as any other language setting, so there must be a reason why this is so.
Reassigning. Mike, could you take a look?
Glibc implements a 4-pass sorting algorithm, something like the Unicode Collation Algorithm defined at http://www.unicode.org/reports/tr10/ or equivalently the International Standard Ordering defined in ISO 14651.

The SPACE is not ignored; it affects the sorting order, but only with lower priority than

 - the base characters
 - accents
 - whether base characters are uppercase or lower case

At level 4, space is treated like punctuation.

The Unicode sorting algorithm has lots of options. If you look at http://www.unicode.org/reports/tr10/#Variable_Weighting you will see that variable weighting options are available for characters such as SPACE. Perhaps the UTF-8 locales were configured to use something equivalent to the "blanked" option, whereas what the user expects here is the "non-ignorable" option?

It is up to the locale designer to choose these options, and I suspect the necessary discussion on which options are best here has never taken place.

The culprit is probably in the file /usr/share/i18n/locales/iso14651_t1 the line

    <U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>

which says that SPACE is sorted at level 4 only, i.e. with lowest priority. I don't think this is a particularly good choice.

File format spec: http://www.cl.cam.ac.uk/~mgk25/volatile/ISO-14652.pdf

People like Ulrich Drepper, Alain LaBonté, Keld J. Simonsen would know more on the origins of this.
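The ordering described above, where SPACE only breaks ties after everything else has been compared, can be imitated with plain byte-wise sorting. This is a minimal sketch and not what glibc does internally: `blanked_sort` is a hypothetical helper name, it only models the treatment of spaces (not accents or case), and it assumes the input contains no tab characters.

```shell
# Sort stdin as if SPACE were ignored at the primary level and
# used only as a final tie-breaker (hypothetical helper).
blanked_sort() {
    tab=$(printf '\t')
    # Prefix every line with a copy of itself that has the spaces removed,
    # sort on that copy first and on the original line second,
    # then drop the prefix again.
    awk '{ key = $0; gsub(/ /, "", key); print key "\t" $0 }' |
        LC_ALL=C sort -t "$tab" -k1,1 -k2,2 |
        cut -f2-
}

printf '%s\n' 'AB CDESY' 'ABC DETZ' 'ABCD ETX' | blanked_sort
```

For the three-line demo file this gives the same order that the reporter saw with en_GB.UTF-8 and de_DE.UTF-8 (AB CDESY, ABCD ETX, ABC DETZ), which fits the "blanked" reading of the iso14651_t1 line.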
If the customer needs this only for the "sort" command and space is the only character which causes a problem, she can use the following workaround:

    mfabian@magellan:~/test-texts$ LANG=es_ES.UTF-8 sort -t ' ' -k 1,1 -k 2,2 -k 3,3 demo
    AB CDESY
    ABC DETZ
    ABCD ETX
    mfabian@magellan:~/test-texts$

i.e. specify the space character as a field separator and then list all fields as sort keys.
Is the workaround described in comment #5 a sufficiently good solution for the customer? Or is it necessary to fix this problem in glibc *now*?
We will check with the customer.
Customer response: "Unfortunately, the workaround of specifying the space character as a field separator is not valid for our developers because they have to use another character as a field separator. Do you have another better solution? Is it possible to obtain in the future an official patch to solve this sort problem?"
What is wrong with:

    LC_COLLATE="POSIX"; sort demo
Quoting the customer's web update to the SR, 2/20/2006 11:03 AM: `With LC_COLLATE="POSIX" the sort command gives incorrect results when dealing with characters with spanish accents: Aacute ... for example'
Options:

a) Tell the customer that you are sorry that glibc does not at present offer what he expects, and that we are unable to fix this ourselves without breaking compatibility with every other glibc-based distribution. One possible customer-side workaround is to replace SP with NBSP (0xa0) before sorting. NBSP does already seem to get sorted in the way in which the customer expects SP to be sorted.

b) Patch in /usr/share/i18n/locales/iso14651_t1 the line

    <U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>

to something like

    <U0020> <U0020>;<BAS>;<MIN>;IGNORE

to make SP sort like NBSP does already.

WARNING: This obviously breaks compatibility with other Linux distributions.

c) Fund a proper research project aimed at cleaning up the mess that the collation implementation and configuration is at present in glibc and POSIX, possibly also investigating user needs and developing a new API for customizing the sorting order at run-time via environment variables and/or new library calls. The reported problem is only one symptom of the fact that the collating code (and perhaps even the underlying POSIX spec!) is not really finished and is at present not properly maintained.

Option c) is perhaps what really should be done, but needs far wider discussion (beyond Novell) and escalation to management, because someone will have to spend many weeks (if not months) on sorting this entire issue out properly.
Thanks for your input, Markus. I've tried the SP <-> NBSP substitution workaround as follows:

    perl -ne "use encoding 'utf-8'; s/ /\xa0/g; print" < input.txt | \
    sort | \
    perl -ne "use encoding 'utf-8'; s/\xa0/ /g; print"

but while this fixes the customer's example (sort "MAS PUJADAS" before "MASOLIVER"), the behaviour with Mike's example is unexpected:

    ABCD ETX
    ABC DETZ
    AB CDESY
It seems like the way you use perl to do the conversion is not correct. Using your perl expression I get:

    mfabian@magellan:/tmp$ perl -ne "use encoding 'utf-8'; s/ /\xa0/g; print" < input.txt | hex
    0000 41 42 43 44 ef bf bd 45 54 58 0a 41 42 43 ef bf  ABCD...E TX.ABC..
    0010 bd 44 45 54 5a 0a 41 42 ef bf bd 43 44 45 53 59  .DETZ.AB ...CDESY
    0020 0a                                               .
    mfabian@magellan:/tmp$

i.e. the spaces are converted to "ef bf bd", which is the UTF-8 encoding of U+FFFD (REPLACEMENT CHARACTER).
The appropriate Perl way of doing this is more like

    perl -C -pe "s/ /\xa0/g;" | sort | perl -C -pe "s/\xa0/ /g;"

however, the effect remains the same.

(Option -C tells Perl to use the input and output encoding according to the locale, as it was the default in Perl 5.8.0 briefly.)
Markus Kuhn> The appropriate Perl way of doing this is more like

This perl expression works:

    mfabian@magellan:/tmp$ hex <input.txt
    0000 41 42 43 44 20 45 54 58 0a 41 42 43 20 44 45 54  ABCD ETX .ABC DET
    0010 5a 0a 41 42 20 43 44 45 53 59 0a                 Z.AB CDE SY.
    mfabian@magellan:/tmp$ perl -C -pe "s/ /\xa0/g;" < input.txt | hex
    0000 41 42 43 44 c2 a0 45 54 58 0a 41 42 43 c2 a0 44  ABCD..ET X.ABC..D
    0010 45 54 5a 0a 41 42 c2 a0 43 44 45 53 59 0a        ETZ.AB.. CDESY.
    mfabian@magellan:/tmp$ perl -C -pe "s/ /\xa0/g;" < input.txt | perl -C -pe "s/\xa0/ /g;" | hex
    0000 41 42 43 44 20 45 54 58 0a 41 42 43 20 44 45 54  ABCD ETX .ABC DET
    0010 5a 0a 41 42 20 43 44 45 53 59 0a                 Z.AB CDE SY.
    mfabian@magellan:/tmp$

Markus Kuhn> however, the effect remains the same

? For me it sorts as desired:

    mfabian@magellan:/tmp$ perl -C -pe "s/ /\xa0/g;" < input.txt | sort | perl -C -pe "s/\xa0/ /g;"
    AB CDESY
    ABC DETZ
    ABCD ETX
    mfabian@magellan:/tmp$

(Locale is es_ES.UTF-8 of course).
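The same SP <-> NBSP round trip can also be done without Perl, which sidesteps the encoding settings entirely. This is only a sketch under a few assumptions: the data is UTF-8, it does not already contain NBSP (those would come back as plain spaces), and the function names are made up for illustration.

```shell
# U+00A0 (NBSP) written out as its two UTF-8 bytes, c2 a0.
nbsp=$(printf '\302\240')

# Work on raw bytes (LC_ALL=C) so sed needs no knowledge of the encoding.
sp_to_nbsp() { LC_ALL=C sed "s/ /$nbsp/g"; }
nbsp_to_sp() { LC_ALL=C sed "s/$nbsp/ /g"; }

# Sort stdin with spaces made significant. The collation locale is
# taken from the caller's environment, e.g. LC_COLLATE=es_ES.UTF-8.
sort_spaces_significant() { sp_to_nbsp | sort | nbsp_to_sp; }
```

Usage would be something like `LC_ALL=es_ES.UTF-8 sort_spaces_significant < input.txt`. Note that this only helps in a locale whose collation table gives NBSP a primary weight; in the POSIX locale the bytes c2 a0 compare higher than any ASCII letter, so the order comes out reversed for the demo file.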
By the way, I think it is a bit weird that the following perl expressions, which don't use "-C" but set the input and output encoding explicitly, don't work:

    mfabian@magellan:/tmp$ perl -ne 'binmode STDIN, ":utf8"; binmode STDOUT, ":utf8"; s/ /\xa0/g; print' < input.txt |
        perl -ne 'binmode STDIN, ":utf8"; binmode STDOUT, ":utf8"; s/\xa0/ /g; print' | hex
    0000 41 42 43 44 c3 82 20 45 54 58 0a 41 42 43 20 44  ABCD.. E TX.ABC D
    0010 45 54 5a 0a 41 42 20 43 44 45 53 59 0a           ETZ.AB C DESY.
    mfabian@magellan:/tmp$

Replacing ' ' by '\xa0' works correctly, but the second perl expression, which should revert it, doesn't do it right.

If one uses ' ' (U+00A0) directly in the second perl expression instead of using the backslash escape sequence '\xa0', it works!:

    mfabian@magellan:/tmp$ perl -ne 'binmode STDIN, ":utf8"; binmode STDOUT, ":utf8"; s/ /\xa0/g; print' < input.txt |
        perl -ne 'binmode STDIN, ":utf8"; binmode STDOUT, ":utf8"; s/ / /g; print' | hex
    0000 41 42 43 44 20 45 54 58 0a 41 42 43 c2 a0 44 45  ABCD ETX .ABC..DE
    0010 54 5a 0a 41 42 c2 a0 43 44 45 53 59 0a           TZ.AB..C DESY.
    mfabian@magellan:/tmp$

I guess that's a bug in perl, isn't it?
Re: #15
It works for me in es_ES.UTF-8, but not in en_GB.UTF-8. Any idea?

Re: #16
Put binmode into a BEGIN block, such that it is executed before the <> under -n.
(In reply to comment #14)
> The appropriate Perl way of doing this is more like
>
> perl -C -pe "s/ /\xa0/g;" | sort | perl -C -pe "s/\xa0/ /g;"

Thanks; I haven't worked with Unicode in Perl before.

> however, the effect remains the same.

It does the trick in my reference SLES9 system, so I'm now asking the customer whether this is an acceptable workaround for him.
Markus Kuhn> Re: #16
Markus Kuhn> Put binmode into a BEGIN block, such that it is executed before the <> under -n

Yes, that works:

    mfabian@magellan:/tmp$ perl -ne 'BEGIN {binmode STDIN, ":utf8"; binmode STDOUT, ":utf8";} s/ /\xa0/g; print' < input.txt |
        perl -ne 'binmode STDIN, ":utf8"; binmode STDOUT, ":utf8"; s/\xa0/ /g; print' | hex
    0000 41 42 43 44 c3 82 20 45 54 58 0a 41 42 43 20 44  ABCD.. E TX.ABC D
    0010 45 54 5a 0a 41 42 20 43 44 45 53 59 0a           ETZ.AB C DESY.
    mfabian@magellan:/tmp$

But why does it work without the BEGIN block if I use the character U+00A0 directly instead of the backslash escape?
#19: Because in that case, Perl remains in binary mode and just passes the bytes through as they are, and your locale and source code happen to use the same encoding (here: UTF-8). Therefore, Perl does not have to know what encoding you use. Only when you ask for Unicode character U+00A0 will Perl have to know how to translate that into a byte sequence, and that's where it needs to know about the encoding.

(BTW: I'd rather have this discussion on linux-utf8, such that more people can learn from it, than in a restricted bugzilla. None of these issues are restricted to just SuSE Linux.)
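The byte-level effect can be reproduced outside Perl. The sketch below is one plausible reading of the garbled first line in the hex dumps above, not a trace of what Perl does internally: it assumes the first line was read as single bytes (before binmode took effect under -n), the NBSP character was then replaced, and the result was written out as UTF-8. `hexbytes` and `misread_then_fix` are hypothetical helper names, and iconv stands in for Perl's output layer.

```shell
# Print stdin as a compact hex string.
hexbytes() { od -An -tx1 | tr -d ' \n'; }

# U+00A0 as its two UTF-8 bytes.
nbsp=$(printf '\302\240')

# Treat each input byte as one Latin-1 character, encode the result
# as UTF-8, then replace the (re-encoded) NBSP by a plain space.
misread_then_fix() {
    iconv -f ISO-8859-1 -t UTF-8 | LC_ALL=C sed "s/$nbsp/ /g"
}

# A lone byte a0 is U+00A0 in Latin-1; in UTF-8 it becomes c2 a0.
printf '\240' | iconv -f ISO-8859-1 -t UTF-8 | hexbytes; echo

# The UTF-8 line "ABCD<NBSP>ETX" misread byte by byte: the lead byte c2
# survives as U+00C2 and is written back as c3 82, followed by the space.
printf 'ABCD\302\240ETX\n' | misread_then_fix | hexbytes; echo
```

The second command prints 41424344c382204554580a, the same "c3 82 20" pattern as in the first line of the dumps; later lines are unaffected there because by then binmode had already been applied.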
Ray Dassen> It does the trick in my reference SLES9 system, so I'm now
Ray Dassen> asking the customer whether this is an acceptable
Ray Dassen> workaround for him.

Is the workaround acceptable for the customer? Can we close this bug?
There has not yet been a customer response to the proposed workaround. I've sent him a "ping" message; hopefully he will respond to that. I would prefer this bug be kept open until the underlying service request is closed (if the customer doesn't respond, that'll be in two weeks time).
OK, thank you.
Update from the customer:

    5/8/2006 09:17:53 AM Public Web Update Done
    Sorry for I haven't responded to your service request before. The workaround you sent for the sorting behaviour is acceptable for us. Thanks.

So this bug can be de-L3-ed. As the underlying issue is still present, I would prefer it if this bug were not closed though.
I have published the workaround in TID 6646, "Spaces are being ignored when sorting data", https://secure-support.novell.com/KanisaPlatform/Publishing/274/6646_f.SAL_Public.html
Ulrich Drepper just fixed a similar bug in pl_PL.UTF-8 locale, see: http://sourceware.org/bugzilla/show_bug.cgi?id=388

    ------- Additional Comments From drepper at redhat dot com 2006-05-01 17:26 -------
    I made the change. Next time if you reply, change the state back from
    WAITING. Otherwise the bug might not show up on lists.

    --
               What    |Removed                     |Added
    ----------------------------------------------------------------------------
                 Status|WAITING                     |RESOLVED
             Resolution|                            |FIXED
Apparently this change in glibc was *only* done for pl_PL though, see http://sourceware.org/cgi-bin/cvsweb.cgi/libc/localedata/locales/?cvsroot=glibc

Probably a similar change should be done for many/most other languages as well.
(In reply to comment #28)
> Probably a similar change should be done for many/most other languages as
> well.

Looks like it, yes. Quite a few locales have the same specification for space's weight as es_ES and it is likely wrong or undesirable in all of them.

    libc-cvs/localedata/locales# grep -l 'IGNORE;IGNORE;IGNORE;<U0020>' *
    ca_ES
    cs_CZ
    en_CA
    es_ES
    es_US
    et_EE
    fi_FI
    hr_HR
    hsb_DE
    is_IS
    iso14651_t1
    lt_LT
    lv_LV
    nb_NO
    sl_SI
    tr_TR
Ray Dassen> So this bug can be de-L3-ed. → removing "L3:" from subject, moving bug to SUSE Linux 10.2.
Deleting invalid NTS Priority value. (Value needs to be an integer between 1 and 1000, inclusive.)
Ray Dassen> I would prefer this bug be kept open until the underlying
Ray Dassen> service request is closed (if the customer doesn't
Ray Dassen> respond, that'll be in two weeks time).

OK, how did the customer respond? Can we close this bug now?

→ NEEDINFO Ray Dassen.
(In reply to comment #32)
> Ray Dassen> I would prefer this bug be kept open until the underlying
> Ray Dassen> service request is closed (if the customer doesn't
> Ray Dassen> respond, that'll be in two weeks time).
>
> OK, how did the customer respond?

There were no further customer responses after the one in comment #25.

> Can we close this bug now?

Yes; the customer was happy and the issue is now documented for SLES as TID 3006646, "Spaces are being ignored when sorting data".

It would be nice of course if you can work with upstream on fixing this issue upstream along the lines of the change made for pl_PL already (comment #27).
Well, I have tried to discuss this upstream, see: http://sourceware.org/bugzilla/show_bug.cgi?id=2648

The reply by Ulrich Drepper was only:

    ------- Additional Comment #9 From Ulrich Drepper 2006-05-10 15:18 [reply] -------
    It's complete BS to say that spaces are mishandled in most locales. This was appropriately researched by the ISO 14651 working group and I trust those people more than any random user. It is further completely unacceptable to open one bug and complain about a million things. To get anything changed, you have to provide statemsnts from the language authorities about the proposed change. If you cannot provide this there obviously is at least room for discussion and no change is the right approach.

And then:

    ------- Additional Comment #10 From Ulrich Drepper 2007-02-17 18:44 [reply] -------
    No reply in 9 months. Closing.

I didn't think that I complained about a million things, only about the sorting of spaces. And I thought that Markus Kuhn's comments also show that the sorting of spaces is most likely not done correctly in glibc. And that a change like the one requested for the Spanish locale was applied to the Polish locale in glibc recently is yet another data point that something seems to be wrong here.

I have no idea how to get the discussion started again. Therefore I cannot do anything else but close this bug as fixed, as our customer has a usable workaround now.
I have added Petr Baudis <pbaudis@novell.com> to CC:. Petr is currently maintainer of our glibc package.

Petr, for Czech locale the sorting is most likely wrong as well:

    mfabian@magellan:/tmp$ LC_COLLATE=POSIX sort demo
    AB CDESY
    ABC DETZ
    ABCD ETX
    mfabian@magellan:/tmp$ LC_COLLATE=cs_CZ.UTF-8 sort demo
    AB CDESY
    ABCD ETX
    ABC DETZ
    mfabian@magellan:/tmp$ cat demo
    AB CDESY
    ABC DETZ
    ABCD ETX
    mfabian@magellan:/tmp$

Whereas it seems to sort correctly for Polish locale, probably because we already have the fix from http://sourceware.org/bugzilla/show_bug.cgi?id=388:

    mfabian@magellan:/tmp$ LC_COLLATE=pl_PL.UTF-8 sort demo
    AB CDESY
    ABC DETZ
    ABCD ETX
    mfabian@magellan:/tmp$

Petr, if you think you can discuss this upstream better than me, please try.
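To repeat this check across several locales, something like the following probe might be handy. It is a sketch only: `ignores_space` is a made-up name, and it relies on the two names from the original report. An uninstalled locale name will most likely make sort fall back to byte order and so report "significant", so it is worth checking `locale -a` first.

```shell
# Succeeds (exit 0) if the given locale sorts "MASOLIVER ..." before
# "MAS PUJADAS ...", i.e. if the space does not count at the first level.
ignores_space() {
    first=$(printf '%s\n' 'MASOLIVER GARCIA, JAIME' 'MAS PUJADAS, FRANCESC' |
            LC_ALL="$1" sort | head -n 1)
    [ "$first" = 'MASOLIVER GARCIA, JAIME' ]
}

for loc in POSIX es_ES.UTF-8 cs_CZ.UTF-8 pl_PL.UTF-8; do
    if ignores_space "$loc"; then
        echo "$loc: space ignored at the first levels"
    else
        echo "$loc: space is significant"
    fi
done
```

LC_ALL is used rather than LC_COLLATE so that a stray LC_ALL in the caller's environment cannot override the locale under test.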
I agree that the "default" sort order seems silly, but obviously we need to prove for every locale we want to change that it's wrong or upstream won't accept fixes. That means referencing national standards for sorting orders, which are mostly proprietary and also probably mostly not in English...
Arturo Aguilar <aaguilar@novell.com> and Fernando Herradon <fherradon@novell.com> stumbled on the same problem (on SLES8/United Linux SP4). They need to sort Spanish text and need to do this in a Spanish locale to get the ñ sorted correctly. But then the spaces are sorted the wrong way, as discussed in this bug. Using the POSIX locale instead, the spaces are sorted correctly but the ñ is not. I.e. there is no locale which sorts both the space and the ñ correctly for Spanish.

Arturo and Fernando, can you point me to an official standard for how Spanish text should be sorted? Is something like that available online somewhere? I would like to add such information to the upstream bug: http://sourceware.org/bugzilla/show_bug.cgi?id=2648