| Knuth uses an algorithm called "Liang's method", published in 1982.
It is supposedly extremely efficient and works in any language that
uses the Roman alphabet. It avoids such horror stories as "The-rapists
who pre-ached on wee-knights."
Here's a direct quote from "The TeXbook":
Here's how it works, using the word 'hyphenation' as an example...
The given word is first extended by special markers at either end;
in this case we obtain
.hyphenation.
if '.' denotes the special marker. The extended word has subwords
. h y p h e n a t i o n .
of length one,
.h hy yp ph he en na at ti io on n.
of length two,
.hy hyp yph phe hen ena nat ati tio ion on.
of length three, and so on. Each subword of length k is a pattern
that defines k+1 small integer values relating to the desirability
of hyphens in the positions between and adjacent to its letters.
We can show these values by attaching them as subscripts; for example,
'0h0e2n0' means that the values corresponding to the subword 'hen'
are 0, 0, 2, and 0, where the 2 relates to hyphens between the 'e'
and the 'n'. The interletter values are entirely zero for all subwords
[except for patterns provided in a specia 'pattern dictionary' to
the hyphenation utility]...and in this case, only the subwords
0h0y3p0h0 0h0e2n0 0h0e0n0a4
0h0e0n5a0t0 1n0a0 0n2a0t0 1t0i0o0 2i0o0
occur as special patterns. TeX now computes the maximum interletter
value that occurs at each subword touching each interletter position.
for example, between 'e' and 'n' there are four relevant values in this
case (2 from 0h0e2n0, 0 from 0h0e0n0a4, 0 from 0h0e0n5a0t0, and 1 from
1n0a0); the maximum of these is 2. The result of all the maximizations
is
.0h0y3p0h0e2n5a4t2i0o0n0.
Now comes the final step: a hyphen is considered to be acceptable
between two letters if the associated interletter value is ODD.
Thus, two potential breakpoints have been found: 'hy-phen-ation'.
Similarly, the word 'concatenation' contains the patterns
0o2n0 0o0n1c0 1c0a0 1n0a0 0n2a0t0 1t0i0o0 2i0o0
and this yields '0c0o2n1c0a0t0e1n2a1t2i0o0n0', i.e., 'con-cate-na-tion'.
...Plain TeX loads exactly 4447 patterns into TeX's memory, beginning
with '0.0a0c0h4' and ending with '4z1z2' and '0z4z0y0'. The
interletter values in these patterns are all between 0 and 5; a
large odd value like the 5 in '0h5e0l0o0' forces desirable hyphen
points in words like 'bach-e-lor' and 'ech-e-lon', while a large
even value like the 4 in '0h0a0c0h4' suppresses undesirable hyphens
in words like 'tooth-aches'. Liang derived these patterns by preparing
a special version of Webster's Pocket Dictionary (Merriam, 1966)
that contains about 50,000 words including derived forms. Then
he checked a a preliminary set of patterns obtained from this data
against an up-to-date hyphenation dictionary of about 115,000 words
obtained from a publisher; errors found in this run led to the addition
of about 1000 words like 'camp-fire', 'Af-ghan-i-stan', and
'bio-rhythm' to the pocket dictionary list.
|
|
The Government Printing Office Style Manual lists the rules for
most of the roman alphabet languages. _Words Into Type_ lists a
less complete set of these rules, and _The Chicago Manual of Style_
also has a listing of the rules.
These rules are in a natural language (English), not a computer
language. So it is left as an exercise for the reader to write
the software which actually does the line breaking.
David
|
| Picking this up after a few years:
A recent posting in an Internet group suggested word division rules
I have not seen before. Specifically, it suggested that the word
eighteen be split eight-teen when divided across lines, and that
na�ve be split na-ive (no diaeresis) when divided.
This second split makes sense (not that one should expect sensible
rules in natural language) because the division itself indicates
that the vowels are not joined.
Has anyone a pointer to authoritative rules describing such
division? Is this British, as contrasted to American, usage?
|