[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference thebay::joyoflex

Title:	The Joy of Lex
Notice:	A Notes File even your grammar could love
Moderator:	THEBAY::SYSTEM

Created:	Fri Feb 28 1986
Last Modified:	Mon Jun 02 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	1192
Total number of notes:	42769

333.0. "Hyphenation algorithms?" by DRCS::GODWIN () Wed Mar 11 1987 10:01

    Typesetting systems often include a hyphenation facility using
    hyphenation dictionaries, logical hyphenation or a combination
    of both.
    
    I am curious about "logical hyphenation". What are the typical rules
    that form the basis of such a system? 

    Peter

T.R	Title	User	Personal Name	Date	Lines
333.1	You'll be sorry you asked!	DRAGON::MCVAY	Pete McVay, VRO Telecom	`Wed Mar 11 1987 14:04`	68
	Knuth uses an algorithm called "Liang's method", published in 1982. It is supposedly extremely efficient and works in any language that uses the Roman alphabet. It avoids such horror stories as "The-rapists who pre-ached on wee-knights." Here's a direct quote from "The TeXbook": Here's how it works, using the word 'hyphenation' as an example... The given word is first extended by special markers at either end; in this case we obtain .hyphenation. if '.' denotes the special marker. The extended word has subwords . h y p h e n a t i o n . of length one, .h hy yp ph he en na at ti io on n. of length two, .hy hyp yph phe hen ena nat ati tio ion on. of length three, and so on. Each subword of length k is a pattern that defines k+1 small integer values relating to the desirability of hyphens in the positions between and adjacent to its letters. We can show these values by attaching them as subscripts; for example, '0h0e2n0' means that the values corresponding to the subword 'hen' are 0, 0, 2, and 0, where the 2 relates to hyphens between the 'e' and the 'n'. The interletter values are entirely zero for all subwords [except for patterns provided in a specia 'pattern dictionary' to the hyphenation utility]...and in this case, only the subwords 0h0y3p0h0 0h0e2n0 0h0e0n0a4 0h0e0n5a0t0 1n0a0 0n2a0t0 1t0i0o0 2i0o0 occur as special patterns. TeX now computes the maximum interletter value that occurs at each subword touching each interletter position. for example, between 'e' and 'n' there are four relevant values in this case (2 from 0h0e2n0, 0 from 0h0e0n0a4, 0 from 0h0e0n5a0t0, and 1 from 1n0a0); the maximum of these is 2. The result of all the maximizations is .0h0y3p0h0e2n5a4t2i0o0n0. Now comes the final step: a hyphen is considered to be acceptable between two letters if the associated interletter value is ODD. Thus, two potential breakpoints have been found: 'hy-phen-ation'. Similarly, the word 'concatenation' contains the patterns 0o2n0 0o0n1c0 1c0a0 1n0a0 0n2a0t0 1t0i0o0 2i0o0 and this yields '0c0o2n1c0a0t0e1n2a1t2i0o0n0', i.e., 'con-cate-na-tion'. ...Plain TeX loads exactly 4447 patterns into TeX's memory, beginning with '0.0a0c0h4' and ending with '4z1z2' and '0z4z0y0'. The interletter values in these patterns are all between 0 and 5; a large odd value like the 5 in '0h5e0l0o0' forces desirable hyphen points in words like 'bach-e-lor' and 'ech-e-lon', while a large even value like the 4 in '0h0a0c0h4' suppresses undesirable hyphens in words like 'tooth-aches'. Liang derived these patterns by preparing a special version of Webster's Pocket Dictionary (Merriam, 1966) that contains about 50,000 words including derived forms. Then he checked a a preliminary set of patterns obtained from this data against an up-to-date hyphenation dictionary of about 115,000 words obtained from a publisher; errors found in this run led to the addition of about 1000 words like 'camp-fire', 'Af-ghan-i-stan', and 'bio-rhythm' to the pocket dictionary list.
333.2	oh, THAT Knuth	DELNI::GOLDSTEIN	WAC-E Ideology & Planning	`Wed Mar 18 1987 15:09`	2
	If you understood all that at first reading, You may be able to use TeX!
333.3		WRONGO::PARMENTER	No CIA spooks in White House	`Tue Sep 20 1988 18:07`	11
	The Government Printing Office Style Manual lists the rules for most of the roman alphabet languages. _Words Into Type_ lists a less complete set of these rules, and _The Chicago Manual of Style_ also has a listing of the rules. These rules are in a natural language (English), not a computer language. So it is left as an exercise for the reader to write the software which actually does the line breaking. David
333.4	other division rules?	SUBWAY::KABEL	doryphore	`Thu Jan 30 1992 12:21`	15
	Picking this up after a few years: A recent posting in an Internet group suggested word division rules I have not seen before. Specifically, it suggested that the word eighteen be split eight-teen when divided across lines, and that na�ve be split na-ive (no diaeresis) when divided. This second split makes sense (not that one should expect sensible rules in natural language) because the division itself indicates that the vowels are not joined. Has anyone a pointer to authoritative rules describing such division? Is this British, as contrasted to American, usage?