
Conference tallis::celt

Title:                   Celt Notefile
Moderator:               TALLIS::DARCY
Created:                 Wed Feb 19 1986
Last Modified:           Tue Jun 03 1997
Last Successful Update:  Fri Jun 06 1997
Number of topics:        1632
Total number of notes:   20523

876.0. "ISO 10646 and Irish Gaelic" by SYSTEM::COCKBURN (Airson Alba Ur) Wed Mar 13 1991 10:10

(also posted in delni::worldwide)

Does anyone have any comments on the mail message below?
If so, could they reply direct to Kevin please. His email address is
decwrl::"[email protected]".

thanks
	Craig


 ------ Forwarded mail received on 13-MAR-1991 at 09:59:05 ------

From:	DECWRL::"GAELIC-L%[email protected]" "GAELIC Language Bulletin Board"
To:	Craig Cockburn <SYSTEM::cockburn> 
Subj:	Re: ISO 10646 and Irish Gaelic 

Rabhadh - teachtaireacht fhada - 230 li/nte
<Warning - long message - 230 lines>
 
Seo litir a chuir mise agus Ciara/n O/ Duibhi/n le che/ile chun i/ a chur
go dti/ an National Standards Association of Ireland.  An bhfuil tuairimi/
ag aon duine eile faoin cheist seo?
 
<This is a letter which Kieran Devine and myself prepared for sending to
 the National Standards Association of Ireland.  Comments very welcome.>
 
===============================================================================
     Proposal to add the letters  d p t  with dot-above to DIS 10646
 
It is proposed that the following six characters, used in the old Irish script
for writing the Irish Gaelic language, be added to DIS 10646 before it
becomes an international standard:
 
          LATIN CAPITAL LETTER D WITH DOT ABOVE
          LATIN CAPITAL LETTER P WITH DOT ABOVE
          LATIN CAPITAL LETTER T WITH DOT ABOVE
 
          LATIN SMALL LETTER D WITH DOT ABOVE
          LATIN SMALL LETTER P WITH DOT ABOVE
          LATIN SMALL LETTER T WITH DOT ABOVE
 
The justification of this proposal is as follows.
 
There have been two conventions for writing Irish. The modern standardised
method (Latin script) has been in general use since around 1960. It employs
the Latin alphabet, and is fully covered by IS 8859-1, and therefore also
by DIS 10646.
 
The older writing method, which was generally used in Irish Gaelic books,
newspapers, documents and handwriting until the 1950s, employed a different
script (Gaelic script). There is no coding problem with this, as the
Gaelic-script letters have a one-to-one correspondence to the Latin letters.
As with the German Fraktur font, it is a question of style rather than of
repertoire. There is however one significant difference between the two
writing methods for Irish from the viewpoint of coding standards and
computer processing.
 
Nine of the consonants of Irish exist also in a modified ("lenited") form.
These are:
                     b c d f g m p s t
In the modern writing method, the lenited forms of these consonants are
rendered as letter-pairs, with a "h" following the consonant, thus:
                   bh ch dh fh gh mh ph sh th
In the older writing method, a lenited consonant was indicated by a
"dot-above" accent. It is this difference between a diacritic representation
and a two-letter representation of lenition which leads to this proposal,
not the stylistic difference between the Gaelic and Latin scripts.
 
DIS 10646 already includes 14 letters with dot-above, namely
                  b c e f g h m n r s w x y z
From this it can be seen that three of the nine consonants needed with
dot-above in the old Irish script are currently missing from the DIS, namely,
                           d p t
It is proposed that these three consonants with dot-above, in both upper
and lower case, be added to the repertoire of DIS 10646.
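
The count can be checked mechanically. The following is only an illustrative
sketch in Python: the two letter lists are copied from the text above, and the
variable names are made up for the example.

```python
# Consonants that have a lenited form in Irish (list taken from the text).
lenitable = set("bcdfgmpst")

# Letters the proposal says DIS 10646 already provides with dot-above.
dotted_in_dis = set("bcefghmnrswxyz")

# Lenitable consonants with no dotted form in the DIS.
missing = sorted(lenitable - dotted_in_dis)
print(missing)  # ['d', 'p', 't']
```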
 
Again, the analogy with the Fraktur case is helpful. In Fraktur, umlaut
was indicated by a diacritic (diaeresis-above). In the Latin alphabet,
letter-pairs can be used, e.g. ue for u-umlaut. For good linguistic reasons,
however, German umlaut is recognised as an accent even in the Latin
alphabet, and international standards for character-sets include the umlaut-
bearing vowels rather than compelling the use of two-character sequences.
 
When Irish, like German, moved largely to using the Latin script, a similar
difficulty had to be faced. Irish chose to represent lenited consonants by
two-letter sequences, not through linguistic preference, but due to the
technical difficulty and expense of providing unusual accents in the Latin
alphabet. This difficulty is no longer a consideration.
 
The merits of diacritic (dot-above) over two-letter (h-after) coding of
Irish lenition are similar to those for German umlaut.
 
  . brevity
 
  Lenited consonants in Irish are much more frequent than umlaut in German.
  The use of an extra letter in each such case lengthens text appreciably. It
  is significant that the introduction of the "h-after" method of writing
  Irish was followed by a drastic overall spelling reform.
 
  . distinctness of coding
 
  The letter "h" in Irish, like "e" in German, is used for other purposes than
  indicating lenition (respectively, umlaut). As a consequence, automatic
  processing of Irish text using the "h-after" method is not straightforward,
  and automatic resolution of the functions of "h" can hardly ever aspire to
  being 100% reliable. On the other hand, text encoded with "dot-above" is
  readily converted to "h-after" for output.
 
  . lexical order
 
  The choice of "dot-above" or "h-after" to represent lenition affects lexical
  order. In the standard Gaelic-English dictionary in the first half of this
  century (Dinneen), the words 'cat', 'cath', 'catach' appear in this order
  (that is, lenition is treated diacritically), whereas in the more recent
  standard dictionary (O/ Do/naill) they appear in the order 'cat', 'catach',
  'cath' (it being infeasible with "h-after" material to do other than sort "h"
  everywhere as an ordinary letter).
 
  Initial lenition is removed from dictionary head-words, with the exception
  of several dozen common words (e.g. 'thar') in which it is a permanent
  feature.  The "h-after" method and its consequent lexical ordering pose a
  slight problem for learners in consulting dictionaries in such cases,
  whereby the head-word obtained by erroneously removing the 'h' will not be
  adjacent to the correct head-word, but more or less remote from it.
 
  In sorting surnames, using the "h-after" system results in the wide separation
  of male surnames such as
            O/ Donnai/le                Mac Donnacha
  from their lenited female counterparts
            Ni/ Dhonnai/le              Nic Dhonnacha
            Ui/ Dhonnai/le              Mhic Dhonnacha
  so that brothers and sisters, husbands and wives, become lexically separated.
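
The two orderings described above can be reproduced with a small sketch.
It is an illustration only, not the method of either dictionary. It assumes
the text is held with lenition as a dot-above (written here with the combining
character U+0307 of present-day Unicode), and the function name diacritic_key
is made up for the example. "Diarmada" is included purely as a filler name
that falls between "Dh" and "Do".

```python
import unicodedata

DOT = "\u0307"  # COMBINING DOT ABOVE


def diacritic_key(word):
    """Sort key that treats lenition as an accent, not as a letter.

    Primary key: the word with all dots removed.
    Secondary key: the dotted spelling, used only to break ties.
    """
    decomposed = unicodedata.normalize("NFD", word)
    base = decomposed.replace(DOT, "")
    return (base, decomposed)


# "h-after" spelling: h sorts as an ordinary letter.
h_after = sorted(["cath", "cat", "catach"])
print(h_after)  # ['cat', 'catach', 'cath']

# "dot-above" spelling: the lenited form stays next to its base word.
dotted = sorted(["cat" + DOT, "catach", "cat"], key=diacritic_key)
print([w.replace(DOT, "h") for w in dotted])  # ['cat', 'cath', 'catach']

# Surname elements: lenited and unlenited forms stay adjacent.
names = sorted(["D" + DOT + "onnacha", "Diarmada", "Donnacha"],
               key=diacritic_key)
names_h = sorted(["Dhonnacha", "Diarmada", "Donnacha"])
```

With the diacritic key, Donnacha and its lenited form sort side by side,
whereas the plain "h-after" sort places Diarmada between them.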
 
The addition of the requested symbols to DIS 10646 would provide support for
both writing methods, applicable to both scripts.
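
As a rough illustration of how text stored with dot-above can be turned into
"h-after" text for output, here is a minimal Python sketch. It assumes the
input is Irish text in present-day Unicode and that every dot-above marks
lenition (so it would mishandle, for example, a Turkish dotted capital I). It
also leaves the inserted "h" in lower case even in all-capitals text. The
function name dot_to_h is invented for the example.

```python
import unicodedata

DOT = "\u0307"  # COMBINING DOT ABOVE


def dot_to_h(text):
    """Replace each dot-above with a following 'h'."""
    # Decompose so that precomposed dotted letters become letter + DOT.
    decomposed = unicodedata.normalize("NFD", text)
    converted = decomposed.replace(DOT, "h")
    # Recompose so that other accents (e.g. the acute) are restored.
    return unicodedata.normalize("NFC", converted)


print(dot_to_h("b\u0307ean"))                 # bhean
print(dot_to_h("a m\u0307\u00e1t\u0307air"))  # a mháthair
```

The reverse conversion is the hard direction, since an "h" in the text is not
always a lenition marker.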
 
For the Gaelic script, it would enable the construction of computer fonts
within the standards. Such fonts have often been designed for laser printers,
and a Gaelic font for the Macintosh, complete with dotted consonants, has
recently been developed at University College Dublin and is likely to be
marketed by Apple Macintosh. Processing of Irish text in the Gaelic script
may be needed for many purposes, such as artistic use, reproduction of old
books and manuscripts, and optical character recognition of old books.
 
However, this proposal does not rest on any desire for a general regression to
the Gaelic script, any more than recognition of the diacritic nature of
German umlaut requires use of Fraktur. It may be noted that, in spite of
the technical difficulties, there has been a long history of Irish book
publishing in the Latin alphabet but using dotted consonants. Used within the
Latin script, diacritic representation of lenition would enable the recovery
of lost theoretical advantages, with respect to lexical order in particular.
_____________________________________________________________________________
 
                         Additional Notes  [To be sent as information to NSAI]
 
Development of DIS 10646
------------------------
The committee developing IS 10646 is ISO/IEC JTC1/SC2/WG2.
The technical editor is Masami Hasegawa, who works for Digital (in Japan?).
His electronic mail address is
               [email protected]
 
The deadline for voting to turn DIS 10646 into a full international standard
is 1991-06-06.  Normally only very minor editorial changes are made when
turning a DIS into an IS.  However, in the case of IS 10646 it is clear that
many changes will have to continue to be made, both now and in the future.
For example, the standard still only copes with very few of the African
and native American languages, and in many cases it is still not clear
what the standard alphabet is for these languages.
 
Unicode
-------
You may be aware that there is, in addition to DIS 10646, a "rival"
multibyte character set "industry" standard called "Unicode".  This is
being developed in a very open way by a consortium including, but not
limited to, Apple, IBM, NeXT, Xerox, Sun, Microsoft, and the Research
Libraries Group.
 
Unicode also attempts to encode all the characters used in all the languages
of the world, but in two bytes instead of up to four as in DIS 10646.  It
does this by:
 
  1. "Han unification" - uniting the ideographs used for writing Chinese,
     Japanese and Korean.  Historically these are closely related, but they
     have usually developed slightly different presentation forms and
     sometimes different meanings in different countries.  Han unification
     involves ignoring traditional orderings and existing national standards
     and has been rejected by many Japanese companies.  By contrast
     DIS 10646 just subsumes the existing Chinese, Japanese and Korean
     standards.
 
  2. Leaving only a limited number of positions for control characters.
     DIS 10646 avoids not only the traditional first 32 "ASCII" control
     characters, but also their 8-bit equivalents, and moreover it does
     this for each of the four bytes, wasting a huge number of code values.
     This is widely regarded as extremely wasteful and unnecessary, especially
     since escape sequences are normally used nowadays instead of control
     characters.  However, Unicode's use of control character positions is
     likely to cause problems with many existing computer networks and systems.
 
  3. Representing accented characters by two characters - an ordinary
     character followed by a "floating" diacritic.  This has caused some
     people to question Unicode's claim to encode all characters in two
     bytes.  It means that all possible dotted consonants are already supported
     in Unicode, since they can be represented using the floating diacritic
     "dot-above".  By contrast, DIS 10646 explicitly prohibits the use of
     floating diacritics or backspacing or anything of that nature to
     produce accented Latin-alphabet letters.
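
The difference between a "floating" diacritic and a single accented character
can be seen with a short Python sketch. It relies on the Unicode data shipped
with present-day Python (the standard unicodedata module), which is an
assumption of the example rather than something the 1991 drafts guarantee.
In that data, each of the six letters followed by the combining dot-above
(U+0307) composes to a single code point.

```python
import unicodedata

DOT = "\u0307"  # COMBINING DOT ABOVE, the "floating" diacritic

composed = {}
for base in "dptDPT":
    floating = base + DOT                             # two code points
    single = unicodedata.normalize("NFC", floating)   # composed form
    composed[base] = single
    print(f"{base} + U+0307 -> U+{ord(single):04X} {unicodedata.name(single)}")
```

The character names reported are the same six names listed at the head of the
proposal.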
 
There have been proposals to unify DIS 10646 and Unicode, bringing Unicode
within the ISO fold.
 
Whatever the relative technical merits of DIS 10646 and Unicode, it would
seem prudent to ensure that IS 10646 as well as Unicode is capable of
representing the characters of the old Irish script.  The European Computer
Manufacturers Association has come down firmly in favour of DIS 10646
rather than Unicode.  Even if DIS 10646 does not go forward this summer to
become a full international standard, but instead is completely reworked,
getting the dotted consonants d p t included in the DIS 10646 repertoire
makes it much less likely that they will be forgotten in any future multibyte
standard.
 
About Ourselves
---------------
Dr Ciara/n O/ Duibhi/n:    I have been a lecturer in Computational
Linguistics in the Department of Computer Science at Queen's University
Belfast since 1984.  I work on text processing and natural language
processing with special reference to Irish, including an on-line database
of Irish texts, and morphological and syntactic synthesis of Irish
sentences with a view to machine translation.  I have been a member of
An Foras Ri/omheolais since 1987.
 
Dr Kevin P. Donnelly:    I work on computing full-time at the Forestry
Commission Northern Research Station, Edinburgh, using Edinburgh University
computers.  Interest in Irish and Scottish Gaelic has led me to follow
closely the character set standards debate on the international computing
mail bulletin boards.  I am one of the joint founders and one of the joint
"owners" of the successful GAELIC-L electronic mail conference, which is
centred at University College Dublin and now has about 170 members.
 
==========================================================================
A point not covered in the above letter is that in association with the old
Irish script there is a distinctive form of ampersand, which looks very
like a slightly lowered digit 7.  Its use however is very much wider than
the ampersand & in English - in old books and manuscripts it is used for
every occurrence of the word "agus" (which means "and").  My inclination is
to treat it as a font change variant of '&', of no relevance to character
set standards.  A complicating factor, however, is that it is often used
even with the Roman typeface in modern editions of old books.
Any comments would be welcome.
==========================================================================
 
% ====== Internet headers and postmarks (see DECWRL::GATEWAY.DOC) ======
Received: by enet-gw.pa.dec.com; id AA02689; Wed, 13 Mar 91 01:57:35 -0800
Received: from vtvm2.cc.vt.edu by VTVM2.CC.VT.EDU (IBM VM SMTP V2R1)
   with BSMTP id 0836; Wed, 13 Mar 91 04:56:43 EST
Received: by VTVM2 (Mailer R2.07) id 8719; Wed, 13 Mar 91 04:56:29 EST
Date:         Tue, 12 Mar 91 19:24:54 gmt
Reply-To: GAELIC Language Bulletin Board <GAELIC-L%[email protected]>
Sender: GAELIC Language Bulletin Board <GAELIC-L%[email protected]>
From: C.P.ODonnaile%[email protected]
Subject:      Re:  ISO 10646 and Irish Gaelic
To: Craig Cockburn <SYSTEM::cockburn>
In-Reply-To:  Your message <[email protected]>