Conference thebay::joyoflex

Title:	The Joy of Lex
Notice:	A Notes File even your grammar could love
Moderator:	THEBAY::SYSTEM

Created:	Fri Feb 28 1986
Last Modified:	Mon Jun 02 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	1192
Total number of notes:	42769

1041.0. "Alphabetical ordering: marginal cases" by KETJE::HAENTJENS (Beware of Counterfeit) Mon Apr 19 1993 07:35

At Andrew Kenah's request, and after some inspiring mail exchanges 
with Norman Diamond, here's a separate topic on alphabetical ordering, 
an offshoot of note 1040.

With my apologies to joyoflexers who find this subject boring and/or 
misplaced in this notesfile. To those who don't: I welcome all 
examples, especially those accompanied by some reasons why.

René Haentjens.


ASCII sort is no longer accepted as 'good enough' when it comes to 
producing lists in alphabetical order for human reference. That is one 
of the reasons why the NCS utility was introduced in VMS a few years 
ago. With NCS, one can build collating definitions that are quite a 
good fit to specific ways of ordering used around the globe.

Since then, an even better method than the one implemented in NCS has 
been found. Multilevel ordering is still very efficient and it can do 
quite a good job. This method is being implemented in XPG4 and POSIX 
compliant systems.

There are now Work Items in an ISO committee and in a CEN committee 
(CEN is the European Standardization Committee) to find multilevel 
ordering rules that would be acceptable in an international and 
multicultural context.

Unfortunately, one cannot devise a method that would order as in Spain 
and as in England simultaneously. This is because the CH is treated as 
a single letter in Spain. There are more such incompatibilities, for 
example the a-ring in Scandinavia and elsewhere.

This is where, if people want a multicultural ordering, they will have 
to reach an agreement. Such a multicultural ordering does not claim to 
replace language- or country-specific ordering where it makes little or 
no sense to do so, but it might be useful in contexts where the target 
user community is multicultural.

Of course, everyone agrees that for Latin-written words, A is followed 
by B etc. It is more difficult to see what the 'best possible' 
ordering is in some marginal cases.

One of those marginal cases is where two words are identical except 
that one has a hyphen and the other has not (coop and co-op) or where 
two words with a hyphen are identical except for the place of the 
hyphen (o-ring and or-ing). (This is why I started note 1040.)

The hyphen is less significant than upper/lowercase distinctions in 
some recent standards on ordering, such as the Canadian and the Swedish 
ones. However, I have not yet found a convincing example of why it 
should be that way. In fact, I have found some examples that seem to 
suggest it should be the other way around. Here's one of them:

'Sanssouci' is a castle near Potsdam, Germany, and 'Sans-Souci' is a 
historical place in Haiti. Ordering in Canada would be:

	Sanssouci, Sans-Souci, SANSSOUCI, SANS-SOUCI.

In my opinion, it is better to keep upper/lowercase variants together, 
as in:
	Sanssouci, SANSSOUCI, Sans-Souci, SANS-SOUCI.

The same problem arises with words such as 'unionised' (see note 
1040). It also arises with other special characters such as 
apostrophe and space. For example, in South Carolina there is a city 
called 'Sans Souci' (in two words, with a space).
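
The two orderings above can be sketched as multilevel sort keys. This is a minimal Python sketch, not the text of any standard: the function names, the set of "special" characters, and the choice to rank lowercase before uppercase are assumptions made for illustration. Letters are compared first; the only difference between the two keys is whether the case level or the special-character level comes next.

```python
# Characters ignored at the letter level (an assumption for this sketch).
SPECIALS = "-' "

def levels(word):
    """Split a word into three comparison levels."""
    # Level 1: the letters only, case-folded, specials removed.
    letters = tuple(c.lower() for c in word if c not in SPECIALS)
    # Case level: False (lowercase) sorts before True (uppercase).
    case = tuple(c.isupper() for c in word if c not in SPECIALS)
    # Special-character level: position and identity of each special.
    # A word with no specials yields (), which sorts first.
    specials = tuple((i, c) for i, c in enumerate(word) if c in SPECIALS)
    return letters, case, specials

def key_case_first(word):
    """Hyphen less significant than case (the order attributed to Canada above)."""
    letters, case, specials = levels(word)
    return (letters, case, specials)

def key_specials_first(word):
    """Hyphen more significant than case (keeps case variants together)."""
    letters, case, specials = levels(word)
    return (letters, specials, case)

words = ["SANS-SOUCI", "Sans-Souci", "SANSSOUCI", "Sanssouci"]
print(sorted(words, key=key_case_first))
# ['Sanssouci', 'Sans-Souci', 'SANSSOUCI', 'SANS-SOUCI']
print(sorted(words, key=key_specials_first))
# ['Sanssouci', 'SANSSOUCI', 'Sans-Souci', 'SANS-SOUCI']
```

With either key, 'coop' sorts before 'co-op' and 'o-ring' before 'or-ing', because the special-character level records where the hyphen sits.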

If you're interested in knowing more about alphabetical ordering and 
about some more marginal cases, you might find my report on the 
subject entertaining to read. You can copy it (specify full filespec) 
from KETJE::DISK$USER_2:[HAENTJENS.PU]ALPHA3.PS

René.

T.R	Title	User	Personal Name	Date	Lines
1041.1		VMSMKT::KENAH	There are no mistakes in Love...	`Mon Apr 19 1993 08:40`	6
	Do any of the current standards deal with "special" letters (That is, those letters beyond the simple 26 used by British and American English)? For example, �, �, ll, accented vowels, ligatures, etc. Where do they fit into ordering sequences? andrew
1041.2	"1812" par l'duc de la terHorst...	TLE::JBISHOP		`Mon Apr 19 1993 08:52`	14
	It's worse than you think. I saw a library's ordering rules (in a C.S. article). There were rules about sorting books with titles containing numbers (spell it out), and books in foreign languages containing numbers (spell it out in the foreign language), and titles consisting only of numbers, and books in non-Latin scripts, and names with prefixes separated by a space (de, von), and names with prefixes not separated by a space (e.g. terHorst), and on and on. If I can remember the article, I'll post a reference. -John Bishop
1041.3	Was there any life before computers?	KETJE::HAENTJENS	Beware of Counterfeit	`Tue Apr 20 1993 04:58`	18
	To .1: Yes, many existing (country) standards on ordering have rules about symbols beyond the Latin letters A-Z. Just as an example, the Belgian standard on Alphabetical Ordering (1959!) has rules about the Dutch ij, the German �, about �, �, �, �, � (with my excuses to people who are not using Latin-1 for displaying this note), about the symbol &, about numbers and many other things. Don't forget that alphabetical ordering is older than computers!
	To .1 and .2: Many existing standards have been designed for explaining to people how they should put things in alphabetical order; very few take into account what computers can do with some level of efficiency. The standardization efforts in ISO and CEN concentrate on "lexical" or "mechanical" ordering. This is not only because it is easier for computers, but also because rules based on knowledge or on linguistics, when used in a multicultural context, make it more difficult for people to find entries in a list, instead of making it easier as originally meant.
	René.
1041.4	I'm a rational sort	RAGMOP::T_PARMENTER	Human. All too human.	`Tue Apr 20 1993 06:32`	21
	Isn't this discussion kind of culturally biased? The CH in Spanish isn't "treated" as a separate letter in the alphabet; it is a separate letter in the Spanish alphabet, as are LL, RR, and also Ñ. Spanish children's blocks come in sets of 30, not 26. Also, in Norwegian the �, �, �, � are not "special" characters, "A with a ring", "O with a slash"; they are full fledged members of the alphabet. Norwegian and Spanish, and Finnish and Hungarian, etc., are languages with alphabets. So is English. I once tried to explain the meaning of the word "tilde" to a Spanish friend, but he just couldn't see the ñ as an "N with tilde"; it was just an ñ to him. Incidentally, the Spanish sort their alphabet "rationally", with the CH following the C and the LL following the L, etc. On the other hand, the Norwegians sort their alphabet "rationally", with the A-Z in the first 26 places and the others following in "order" at the end.
1041.5		SMURF::BINDER	Deus tuus tibi sed deus meus mihi	`Tue Apr 20 1993 09:26`	29
	Some varieties of internationalization support in UNIX� software such as DEC OSF/1� use a localization environment variable (a logical for you VMS types, but it's not quite the same) called LC_COLLATE that controls how sorting is to be done. I quote from the DEC OSF/1 Guide to Programming Support Tools: A character range can include a multicharacter collating element enclosed within bracket-period delimiters ([. and .]). These "collating symbols" are necessary for languages that treat some strings as individual collating elements. For example, in Spanish, the strings ch and ll each are collating symbols (that is, the Spanish primary sort order is a, b, c, ch, d,..., k, l, ll, m, ...). The bracket-period delimiters in the RE syntax distinguish multicharacter collating elements from a list of the individual characters that make up the element. When using Spanish collation rules, [[.ch.]] is treated as an RE matching the sequence ch, while [ch] is treated as an RE matching c or h. In addition, [a-[.ch.]] matches a, b, c, and ch. So there is some sanity in the computer world. -dick ---- � UNIX is a registered trademark of UNIX Systems Laboratories, Inc. � Open Software Foundation, OSF, OSF/1, OSF/Motif, and Motif are trademarks of the Open Software Foundation, Inc.
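
The idea of a multicharacter collating element can be sketched in a few lines of Python. This is an illustration only, not how LC_COLLATE is implemented on any system: the names `ELEMENTS`, `collating_elements` and `spanish_key` are hypothetical, and the placement of ñ after n is an assumption (the quoted passage lists only ch and ll).

```python
# Traditional Spanish primary order as quoted above: a, b, c, ch, d, ..., l, ll, m, ...
# (the position of "ñ" after "n" is an assumption of this sketch).
ELEMENTS = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
            "v", "w", "x", "y", "z"]
RANK = {element: i for i, element in enumerate(ELEMENTS)}
DIGRAPHS = ("ch", "ll")

def collating_elements(word):
    """Split a word into collating elements, treating ch and ll as single units."""
    word = word.lower()
    out, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            out.append(word[i:i + 2])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

def spanish_key(word):
    """Sort key: the rank of each collating element; unknown characters sort last."""
    return [RANK.get(e, len(ELEMENTS)) for e in collating_elements(word)]

words = ["curva", "chico", "dama", "cosa", "luz", "llama", "lago"]
print(sorted(words))
# ['chico', 'cosa', 'curva', 'dama', 'lago', 'llama', 'luz']
print(sorted(words, key=spanish_key))
# ['cosa', 'curva', 'chico', 'dama', 'lago', 'luz', 'llama']
```

The second result places every ch word after all other c words and every ll word after all other l words, which is the behaviour the quoted passage describes.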
1041.6		VMSMKT::KENAH	There are no mistakes in Love...	`Tue Apr 20 1993 10:20`	10
	>I once tried to explain the meaning of the word "tilde" to a Spanish >friend, but he just couldn't see the ñ as an "N with tilde"; it was >just an ñ to him. Makes sense to me -- in English, it would be like trying to explain Q as "O with a squiggly thing on the bottom." Or "B" as "P with an extra bump on the side." Nope, they're just "B & Q." andrew
1041.7		CALS::DESELMS	Opera r�lz	`Tue Apr 20 1993 11:32`	5
	RE: -1 Great example... - Jim
1041.8	Me too	AUSSIE::WHORLOW	Bushies do it for FREE!	`Tue Apr 20 1993 15:32`	10
	G'day, Minor rathole... There is a 'Sans Souci' in Australia. It's a suburb of Sydney. derek PS Where would that fit in the Sanssouci / SANSSOUCI / Sans-Souci... scheme?
1041.9		JIT081::DIAMOND	Pardon me? Or must I be a criminal?	`Tue Apr 20 1993 17:56`	9
	Re .5 >>A character range can include a multicharacter collating element >>enclosed within bracket-period delimiters ([. and .]). [...] >>When using Spanish collation rules, [[.ch.]] is treated as an RE >>matching the sequence ch, while [ch] is treated as an RE matching >>c or h. In addition, [a-[.ch.]] matches a, b, c, and ch. How do they do it in a character set that doesn't have [ and ] ?
1041.10	Difference between tilde and squiggly thing	KETJE::HAENTJENS	Beware of Counterfeit	`Wed Apr 21 1993 04:10`	21
	Re .4 Of course I'm culturally biased: I have grown up in some specific culture, how could I be unbiased! But my words "CH is treated as a letter" were not meant to convey anything negative or a "looking down" attitude.
	Re .4 .6 The difference between ñ and q is that the first one is treated as n with tilde outside Spain, whereas no alphabet considers q as o with squiggly thing, as far as I know. I mean: 'doña' is in between 'don' and 'donate' in an English dictionary, not between 'donsie' and 'doodle', and similarly for other European language dictionaries. You can also look up 'cañon', 'señor' and 'señorita'. For those languages that use the Q, it is always a separate letter.
	You can also look at it from a historical perspective. The Q derives straight from the 3000 year old Semitic alphabet, whereas the tilde is only a few hundred years old and it was at some point in time added to the N to make a new letter.
	René.
1041.11	My rathole or yours?	FORTY2::KNOWLES	DECspell snot awl ewe kneed	`Wed Apr 21 1993 05:50`	20
	> You can also look at it from a historical perspective. The Q
	> derives straight from the 3000 year old Semitic alphabet, whereas the
	> tilde is only a few hundred years old and it was at some point in time
	> added to the N to make a new letter.
	Indeed. There is a jolly enticing rathole opportunity here: the ñ was a mediæval transcription shortcut where there were two NNs in the source word - cannon => cañón. I wonder if this introduced the ñ as a free-standing letter which was then used where there was no manuscript involved and the root had an -NI- (as in señor and many other cases). I have some early Spanish texts at home, and will check whether ñ co-existed with -ni- for a time. Stop me if I'm boring you...
	But whatever the history, the fact now is that for someone from Spain n and ñ are wholly discrete. Similarly (not a similar phenomenon, but a similar lack of historical awareness) a modern Italian will pronounce PREZZO with a -ts- and MEZZO with a -dz- because that's the right way, rather than because of Latin PRETIUM and MEDIUM.
	b
1041.12	A�other r�thol�	KETJE::HAENTJENS	Beware of Counterfeit	`Wed Apr 21 1993 07:52`	10
	... and to make the issue even more complicated: sometimes letters are considered to be different, but nevertheless ordered together, at least in the first ordering level. For example, many French speaking people will argue that � and � are different letters, but all French dictionaries consider them equivalent for the first ordering level. Similarly, u and ü are not quite the same in Germany and there are two ordering methods, one of which considers u and ü equivalent for the first level (the other method orders ü as if it were u+e). René.
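
The two German methods mentioned in .12 can be compared with a small Python sketch. The function names and the sample names are made up for illustration, and the tie-break (plain u before ü when the first level is equal) is an assumption, not something stated in the note.

```python
UMLAUT_U = "\u00fc"  # the letter ü, written as an escape to avoid encoding trouble

def key_equivalent(word):
    """First method: ü counts as u at the first level; the original spelling breaks ties."""
    w = word.lower()
    return (w.replace(UMLAUT_U, "u"), w)

def key_expanded(word):
    """Second method: ü is ordered as if it were written u+e."""
    w = word.lower()
    return (w.replace(UMLAUT_U, "ue"), w)

names = ["Muller", "Mufti", "M" + UMLAUT_U + "ller", "Mueller"]
print(sorted(names, key=key_equivalent))
# ['Mueller', 'Mufti', 'Muller', 'Müller']
print(sorted(names, key=key_expanded))
# ['Mueller', 'Müller', 'Mufti', 'Muller']
```

Under the first key Müller lands next to Muller; under the second it lands next to Mueller, so the same list can come out in two different orders depending on which convention is chosen.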
1041.13		VMSMKT::KENAH	blah blah blah GINGER	`Wed Apr 21 1993 10:43`	15
	Is this an accurate synopsis?
	1. Different European languages have developed ordering rules that are internally consistent.
	2. You are trying to develop more general ordering rules, rules that incorporate different languages' rules while maintaining internal consistency as well as consistency with each individual language. In addition, it sounds like you're trying to make sense of the ordering between similar but distinct words and word groupings.
	3. Finally, the ordering scheme you develop must be implemented on a computer, since computers are valuable tools for tasks like ordering.
	Do any of the existing standards (ISO, XPG) deal with this topic?
1041.14		VMSMKT::KENAH	blah blah blah GINGER	`Wed Apr 21 1993 13:11`	4
	I re-read .0 and see that it states POSIX compliant systems support Multilevel ordering -- which POSIX standard is it a part of? andrew
1041.15	9945-2.2	KETJE::HAENTJENS	Beware of Counterfeit	`Thu Apr 22 1993 02:34`	9
	Andrew, your summary in .13 is very good! The only thing which I will not reach is consistency with each individual language; the result will be only partially consistent with individual languages. POSIX is, I believe, ISO/IEC 9945-2.2 Shell and Utilities. The XPG counterpart can be found in 'X/Open CAE Specification, System Interface Definitions, Issue 4' ISBN:1-872630-46-4 or X/Open Doc. N° C204. René.
1041.16		VMSMKT::KENAH	blah blah blah GINGER	`Thu Apr 22 1993 06:21`	4
	Thanks for the POSIX and XPG references -- I think I'll check 'em out (I believe one of my colleagues has a copy of XPG4). andrew
1041.17		NOVA::FISHER	DEC Rdb/Dinosaur	`Thu Apr 22 1993 06:31`	33
	Q: Different European languages have developed ordering rules that are internally consistent.
	It is my understanding that there are some internal differences. I think I was told that there are 3 ways of sorting German: one was called a telephone book sort, another was a dictionary sort, I forget the third. Did .1 say that RR was a different letter in Spanish?
	While the ordering of Sans Souci, SANS SOUCI, Sanssouci, SANSSOUCI, Sans-souci, SANS-SOUCI relative to each other is important, it must also be noted whether SANSCRIT and sanserif are allowed to interrupt the sequence.
	Yet another aside occurs to me: When ordering words with letters containing diacriticals, most current algorithms -- and therefore those of VMS SORT and Rdb -- go left to right, for example: with ordering being (I think) e � � � �, one would order a doublet as ee e� e� �e but we received an inquiry from a salesman in Canada concerning doing it from right to left as in: ee �e e� e�. [these actual examples may never occur but they are the same as, say, bete b�te bet�.]
	Well, enough meandering... ed
1041.18	Ordering with Sanscrit	KETJE::HAENTJENS	Beware of Counterfeit	`Thu Apr 22 1993 08:11`	17
	Re .17 These examples do occur. See my report (filespec in .0). The backwards check is now part of a Canadian Standard, that's why you got the inquiry. It cannot be implemented with VMS NCS, but it can be implemented with POSIX LC_COLLATE. (See earlier reply.)
	I had heard about CH, LL and ñ in Spanish, but not about RR...
	The order, in my opinion, should be: SANSCRIT, sanserif, {all forms of Sans Souci}, santon, SAP.
	Re .16 I just read in the NOTED::WORLDWIDE notesfile that there is a document about XPG4 in I18N::ISE$PUBLIC:[INFO]XPG4_FINAL.PS.
	René.
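
The left-to-right versus right-to-left accent check discussed in .17 and .18 can be sketched in Python as follows. This is an illustration, not the Canadian standard itself: the accent weights (none, acute, grave, circumflex, diaeresis) and the sample words cote / côte / coté / côté are assumptions chosen for the sketch, not taken from the notes above.

```python
import unicodedata

# Assumed second-level weights for combining accents: acute, grave, circumflex, diaeresis.
ACCENT_WEIGHT = {"\u0301": 1, "\u0300": 2, "\u0302": 3, "\u0308": 4}

def split_levels(word):
    """Return (base letters, accent weight per letter) for a word."""
    base, accents = [], []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if unicodedata.combining(ch):
            if accents:
                # Attach the accent to the preceding base letter; unknown accents sort last.
                accents[-1] = ACCENT_WEIGHT.get(ch, 9)
        else:
            base.append(ch)
            accents.append(0)
    return tuple(base), tuple(accents)

def key_forward(word):
    """Accents compared left to right."""
    base, accents = split_levels(word)
    return (base, accents)

def key_backward(word):
    """Accents compared right to left (the 'backwards check')."""
    base, accents = split_levels(word)
    return (base, accents[::-1])

words = ["côté", "coté", "côte", "cote"]
print(sorted(words, key=key_forward))
# ['cote', 'coté', 'côte', 'côté']
print(sorted(words, key=key_backward))
# ['cote', 'côte', 'coté', 'côté']
```

In both cases the base letters decide first; the direction only matters when two words differ in their accents alone.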
1041.19	%^}	VMSMKT::KENAH	blah blah blah GINGER	`Thu Apr 22 1993 13:47`	4
	SANSCRIT would probably wind up somewhere else in American English - that's because the usual transliteration is SANSKRIT. andrew
1041.20	let those R's rip	RAGMOP::T_PARMENTER	Human. All too human.	`Tue Apr 27 1993 06:22`	14
	RR is a separate letter in Spanish, but, unlike all the other "letters not in the English alphabet", it never appears in the initial position, and therefore has no separate heading in the dictionary. Someone more knowledgeable will have to help me out here so far as what this means, but the R in an initial position is normally pronounced like the RR in an interior position, with a trill, while the R in an interior position gets one "tap", similar to the "dd" in English "ladder".
	Letter names:
	C = ce
	CH = che
	L = ele
	LL = elle
	N = ene
	Ñ = eñe
	R = ere
	RR = erre
1041.21	ARR, Matey!	CALS::DESELMS	Opera r�lz	`Tue Apr 27 1993 06:56`	6
	A "flipped R" is just like a trilled R, except that instead of the tongue tapping the roof of your mouth a bunch of times, it only taps the roof of the mouth once. It is indeed exactly the same as "dd" in "ladder". Pronounce Spanish with an American ARR and they'll laugh in your face. - Jim
1041.22		NOVA::FISHER	DEC Rdb/Dinosaur	`Thu Apr 29 1993 07:11`	6
	But rr in Spanish also has no special collation rule [that I have seen]. Is rr collated after rz? ed
1041.23		NOTIME::SACKS	Gerald Sacks ZKO2-3/N30 DTN:381-2085	`Thu Apr 29 1993 13:46`	3
	re .20: It's an alveolar flap.
1041.24	Knuth, of course	TLE::JBISHOP		`Fri Aug 06 1993 11:58`	7
	re .2 See Knuth's _Sorting_and_Searching_ (his volume 3), pp 7..9 for some library sorting rules, e.g. "Ignore initial articles, unless not in nominative case...". -John Bishop
1041.25		VMSMKT::KENAH	I��-I {��} {��^} {^�^} {��} {��}	`Fri Aug 06 1993 12:29`	5
	A question came up in another conference -- does Digital support Cyrillic alphabets? I'm embarrassed to ask this, because I don't know whether ISO Latin-1 includes Cyrillic alphabets. (We do support ISO Latin-1, don't we?)
1041.26	Nope.	SMURF::BINDER	Sapientia Nulla Sine Pecunia	`Fri Aug 06 1993 12:41`	17
	Re .25 > I'm embarrassed to ask this, because I don't know whether ISO Latin-1 > includes Cyrillic alphabets. It doesn't. Producing International Products -- Software handbook (Identification Number A-MN-ELEN467-00-0 Rev B) ...says this: The ISO Latin Alphabet No. 1 has been developed by the International Organization for Standards (ISO) as the standard character set for the Western European languages. It will eventually supersede the DEC Multinational Character Set. Further ISO character sets are being developed to cover European languages not based on the Latin Alphabet.
1041.27		VMSMKT::KENAH	I��-I {��} {��^} {^�^} {��} {��}	`Fri Aug 06 1993 13:10`	7
	Thanks. So: does Digital support Cyrillic alphabets? Also: Does Digital support ISO Latin-1? andrew
1041.28		REGENT::BROOMHEAD	Don't panic -- yet.	`Fri Aug 06 1993 13:21`	8
	ISO Latin-1 is Digital's default character set -- so, yes, we support it. ISO Latin-Cyrillic (ISO 8859-5 (which is not ISO Latin-5)) is provided on a few of our printers (dot matrix ones) and can be added via a cartridge on our ANSI laser printers. So, yes, we support it. Ann B.
1041.29		VMSMKT::KENAH	I��-) (��) {��^} {^�^} {��} /��\	`Fri Aug 06 1993 14:27`	9
	Thank you, Ann. I didn't realize ISO Latin-1 was our default, although (based on Dick's description) it's obvious. How about Cyrillic support at the user-interface level? andrew P.S. I'm tracking this question through another path within Digital; should I get an expanded answer, I'll post it here.
1041.30		NRSTA2::KALIKOW	Supplely Chained	`Fri Aug 06 1993 14:40`	5
	Hey andrew -- Keep us posted on whether you get the answer thru "official" or "other" channels faster than this employee-interest notesfile... It'd be great if we could get you out of the BOX faster... :-)
1041.31		ISTWI1::KINACI	Walk thru this world	`Mon Aug 09 1993 05:47`	16
	I think Cyrillic is ISO-Latin 2 is it not? I know there is some Cyrillic support out there and there is more to come once the Fonts acquired from Monotype go into distribution. I've been informed that we will have a wide scale test for the various fonts. I will be working on testing ISO-Latin 5 for Turkey, for example. I know that there is a Cyrillic version of DECterm. Hold on, I am not sure if we are talking full UI localization or if there is just character set support. But the latter definitely exists. I know there was work being done to get EPROMs which support Cyrillic for VT420 type terminals. I believe this has been completed. I also know that the Cyrillic version of ALL-IN-1 V3.0 should be shipping soon. Suz
1041.32		VMSMKT::KENAH	I��-) (��) {��^} {^�^} {��} /��\	`Mon Aug 09 1993 06:03`	6
	So far, the clear winner is through Employee-Interest conferences; Of course the informal channels have given me pointers to more formal channels, so the lines are getting blurred. Of course without the informal channels, I never would have found the formal channels...
1041.33	Who can answer Andrew's question?	REGENT::BROOMHEAD	Don't panic -- yet.	`Mon Aug 09 1993 09:49`	14
	Suz, Nope, it's ISO Latin-Cyrillic, with no number in sight.
	Andrew, "How about Cyrillic support at the user-interface level?" I can't answer that. All I can tell you is I have the Cyrillic fonts from Monotype that Suzan mentioned, but I don't know who is to pay to make them into cartridges or soft fonts, or even which fonts (typefaces) I should concentrate on.
	Ann B.
1041.34		ISTWI1::KINACI	Walk thru this world	`Mon Aug 09 1993 11:24`	20
	Hi Ann! Nice to run into you here. RE the fonts. You probably know that Israel is going to be running a Fonts Q.A. Project in early September, where we will all get to test our own fonts. I suspect that will be when we will get a broader picture of what is out there. As for who pays... well.. I am told by very reliable sources that corporate will pay for the internationalization of products deemed necessary by the involved subsidiaries, starting in FY '94. We've submitted a prioritized list of what we need, and as far as I know the funding discussions should be well under way at this time. Past experience indicates that it will be the beginning of Calendar year 1994 before we see much of anything. I hear all this will change come FY'95.. Keep your fingers crossed! Suz
1041.35		4GL::LASHER	Working...	`Tue Aug 10 1993 05:50`	4
	While y'all are looking into this, could you also check to see whether DECwindows supports Orthodox icons? Lew Lasher
1041.36	Spanish Alphabetical Order Simplified	REGENT::BROOMHEAD	Don't panic -- yet.	`Mon May 02 1994 10:26`	45
	<<< NOTED::DISK$NOTES7:[NOTES$LIBRARY_7OF4]WORLDWIDE.NOTE;2 >>>
	-< Worldwide -- International Product Issues >-
	================================================================================
	Note 525.0 Change in Spanish collating rules No replies
	R2ME2::HINXMAN "It's waiting for it that's so tryin" 39 lines 2-MAY-1994 07:58
	--------------------------------------------------------------------------------
	Days in dictionary numbered for two in Spanish alphabet
	=======================================================
	Associated Press (Boston Globe 1994-05-01)
	MADRID - The world's more than 300 million Spanish speakers now have two fewer letters in their alphabet to worry about, a mostly bookkeeping move that won almost unanimous support but disturbed some traditionalists.
	The Association of Spanish Language Academies, meeting in Madrid for its 10th annual congress, voted last week to eliminate the "Ch" and "Ll" from the Spanish alphabet.
	The two letters, which historically have had their own separate headings in dictionaries, now will be listed under other letters. Words beginning with "Ch", like "chico", will fall under the letter "C", and words beginning with "Ll", like "llama", will fall under the letter "L".
	The move does not change pronunciation, usage or spelling. It was made mainly to simplify dictionaries and make Spanish more computer-compatible with English.
	Pushing for the change was Spain, a member of the 12-nation European Union. The EU has urged its members to implement measures that aid translation and computer standardization.
	Cuban delegate Luisa Campuzano said he favored the change "because it means that dictionaries will be easier to use. But arguments related to the European Union shouldn't be brought up. Our talks are along scientific lines and nothing more."
	The vote Wednesday was 17 in favor, one opposed and three abstaining. Ecuador voted "no" and Panama, Nicaragua and Ecuador abstained.
	"It's not that the letters are disappearing, they're just being put in a different place in the dictionary," said a Madrid artist, Maria Gato. "I don't think most people are upset."
	Guatemala supported the change, but one Guatemalan delegate, Mario Alberto Carrera, referred to the simplification as "killing" part of the language.
	"The two letters have succumbed to the dictates of the market and the Anglo-Saxon world," Carrera said.
	Some dictionaries, including the highly respected Maria Moliner, had already made the change.
	The Spanish alphabet now has 27 letters - the 26 contained in the alphabet plus a stylized "n".
1041.37		NOVA::FISHER	Tay-unned, rey-usted, rey-ady	`Thu May 05 1994 06:46`	9
	aye, the contrariness of it all.... One of the fun parts of "internationalizing Rdb" was to assure that "c*" did not MATCH "chxyz" when Spanish was the collating sequence in use. Drat! ed
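
The matching behaviour described in .37 can be illustrated with a short Python sketch. It says nothing about how Rdb actually did it: `elements` and `matches` are hypothetical names, only a trailing `*` wildcard is handled, and the digraph list assumes the traditional Spanish rules in which ch and ll are single collating elements.

```python
DIGRAPHS = ("ch", "ll")  # single collating elements under the traditional Spanish rules

def elements(text):
    """Split text into collating elements, keeping ch and ll together."""
    text = text.lower()
    out, i = [], 0
    while i < len(text):
        if text[i:i + 2] in DIGRAPHS:
            out.append(text[i:i + 2])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return out

def matches(pattern, text):
    """Match 'prefix*' (or an exact string) element by element, not character by character."""
    if not pattern.endswith("*"):
        return elements(pattern) == elements(text)
    prefix = elements(pattern[:-1])
    return elements(text)[:len(prefix)] == prefix

print(matches("c*", "cosa"))    # True
print(matches("c*", "chxyz"))   # False: the first element is "ch", not "c"
print(matches("ch*", "chxyz"))  # True
```

A plain character-wise prefix test would report a match for "c*" against "chxyz"; comparing collating elements is what makes the two cases differ.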
1041.38		JIT081::DIAMOND	$ SET MIDNIGHT	`Mon May 16 1994 01:47`	10
	Re .36 > "The two letters have succumbed to the dictates of the market and the >Anglo-Saxon world," Carrera said. Cute opinion. Has the Library of Congress changed their lexicography to consider Mc as Mc instead of as Mac? If they did or will, they're succumbing to the dictates of the market and the Spanish world. -- Norman Diamond