Title:                   Mathematics at DEC
Moderator:               RUSURE::EDP
Created:                 Mon Feb 03 1986
Last Modified:           Fri Jun 06 1997
Last Successful Update:  Fri Jun 06 1997
Number of topics:        2083
Total number of notes:   14613
336.0. "Lempel-Ziv Compression" by REX::MINOW () Mon Sep 30 1985 15:36
This note is a response to a query in 334.4 about the Lempel-Ziv data
compression algorithm. It is abstracted from Terry Welch's article:
"A Technique for High Performance Data Compression."
Terry A. Welch. IEEE Computer Vol 17, No. 6 (June 1984), pp 8-19.
The algorithm builds a string translation table that maps substrings in
the input into fixed-length codes. The compress algorithm may be
described as follows:
1. Initialize a table to contain single-character strings.
2. Read the first character. Set <w> (the prefix string) to that character.
3. (step): Read next input character, K.
4. If at end of file, output code(<w>); exit.
5. If <w>K is in the string table: Set <w> to <w>K; goto step 3.
6. Else <w>K is not in the string table.
Output code(<w>);
Put <w>K into the string table;
Set <w> to K; Goto step 3.
"At each execution of the basic step an acceptable input string <w> has
been parsed off. The next character K is read and the extended string
<w>K is tested to see if it exists in the string table. If it is there,
then the extended string becomes the parsed string <w> and the step is
repeated. If <w>K is not in the string table, then it is entered, the
code for the successfully parsed string <w> is put out as compressed
data, the character K becomes the beginning of the next string, and the
step is repeated."
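As a concrete illustration, here is a minimal sketch of this compression
loop in C. The linear table search and decimal code output are
simplifications for clarity, and all names are illustrative; the actual
program hashes the table and packs variable-width codes, as discussed
further below.

/*
 * Minimal sketch of the LZW compression loop, assuming 8-bit input
 * characters and at most 12-bit codes.
 */
#include <stdio.h>

#define TABLE_SIZE 4096                 /* room for 12-bit codes */

static struct { int prefix; int ch; } table[TABLE_SIZE];
static int next_code = 256;             /* 0..255 are single characters */

/* Return the code for the string <w>K, or -1 if it is not in the table. */
static int find(int w, int k)
{
    int i;
    for (i = 256; i < next_code; i++)
        if (table[i].prefix == w && table[i].ch == k)
            return i;
    return -1;
}

static void output_code(int code)
{
    printf("%d\n", code);               /* stand-in for bit packing */
}

void compress(FILE *in)
{
    int w, k, code;

    if ((w = getc(in)) == EOF)          /* step 2: <w> = first character */
        return;
    while ((k = getc(in)) != EOF) {     /* step 3: read next character K */
        if ((code = find(w, k)) != -1) {
            w = code;                   /* step 5: <w>K is in the table */
        } else {                        /* step 6: <w>K not in the table */
            output_code(w);
            if (next_code < TABLE_SIZE) {
                table[next_code].prefix = w;
                table[next_code].ch = k;
                next_code++;
            }
            w = k;
        }
    }
    output_code(w);                     /* step 4: end of file */
}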
The decompression algorithm translates each received code into a prefix
string and extension [suffix] character. The extension character is
stored (in a push-down stack), and the prefix translated again, until
the prefix is a single character, which completes decompression of this
code. The entire string is then output by popping the stack.
"An update to the string table is made for each code received (except
the first one). When a code has been translated, its final character is
used as the extension character, combined with the prior string, to add
a new string to the string table. This new string is assigned a unique
code value, which is the same code that the compressor assigned to that
string. In this way, the decompressor incrementally reconstructs the
same string table that the compressor used.... Unfortunately ... [the
algorithm] does not work for an abnormal case.
The abnormal case occurs whenever an input character string contains the
sequence K<w>K<w>K, where K<w> already appears in the compressor string
table."
The decompression algorithm, augmented to handle the abnormal case, is
as follows:
1. Read first input code;
Store in CODE and OLDcode;
With CODE = code(K), output(K); FINchar = K;
2. Read next code to CODE; INcode = CODE;
If at end of file, exit;
3. If CODE not in string table (special case) then
Push FINchar onto the stack;
CODE = OLDcode;
INcode = code(OLDcode, FINchar);
4. If CODE == code(<w>K) then
Push K onto the stack;
CODE = code(<w>);
Goto 4.
5. If CODE == code(K) then
Output K;
FINchar = K;
6. While stack not empty
Output top of stack;
Pop stack;
7. Put OLDcode,K into the string table.
OLDcode = INcode;
Goto 2.
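A matching sketch of the augmented decompressor, under the same
simplifications as the compressor sketch above; read_code() is an
assumed input routine returning the next code, or -1 at end of file.

/*
 * Minimal sketch of the augmented LZW decompressor.  Codes below 256
 * are single characters, so "CODE = code(K)" becomes "code < 256".
 */
#include <stdio.h>

#define TABLE_SIZE 4096

static struct { int prefix; int ch; } table[TABLE_SIZE];
static int next_code = 256;

extern int read_code(void);             /* assumption: -1 at end of file */

void decompress(FILE *out)
{
    int stack[TABLE_SIZE];
    int sp = 0;
    int code, incode, oldcode, finchar;

    if ((code = read_code()) < 0)       /* step 1: first code is a */
        return;                         /*   single character      */
    oldcode = finchar = code;
    putc(finchar, out);

    while ((code = read_code()) >= 0) { /* step 2 */
        incode = code;
        if (code >= next_code) {        /* step 3: K<w>K<w>K case */
            stack[sp++] = finchar;      /*   last char of the string */
            code = oldcode;
        }
        while (code >= 256) {           /* step 4: unwind prefixes */
            stack[sp++] = table[code].ch;
            code = table[code].prefix;
        }
        finchar = code;                 /* step 5: first character */
        putc(finchar, out);
        while (sp > 0)                  /* step 6: pop the stack */
            putc(stack[--sp], out);
        if (next_code < TABLE_SIZE) {   /* step 7: extend the table */
            table[next_code].prefix = oldcode;
            table[next_code].ch = finchar;
            next_code++;
        }
        oldcode = incode;
    }
}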
The algorithm as implemented introduces two additional complications.
The actual codes are transmitted using a variable-length encoding. The
lowest-level routines increase the number of bits in the code once the
largest code representable at the current width has been assigned.
Periodically, the algorithm checks that compression is still increasing.
If the ratio of input bytes to output bytes decreases, the entire
process is reset. This can happen if the characteristics of the input
file change.
The software performs acceptably well -- on VMS it typically compresses
at 3,000-4,000 bytes/second and decompresses at 12,000 bytes/second.
In addition to obvious system and I/O performance issues, compression
requires a hash-table search per character (with appropriate updates),
followed by the variable-length encoding. Decompression uses the
variable-length decoding and a few additional memory accesses. On Unix,
the variable-length encoding is done by inline expansion of INSV or
EXTV instructions. On VMS, these are emulated by C code so I didn't
have to learn assembler. Welch's article discusses its implementation
in hardware (inside a disk channel controller).
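For reference, a sketch of the kind of per-character table probe
involved, using simple open addressing; the table size and probe
sequence here are illustrative, not taken from the actual sources.

/*
 * Look up the pair (prefix code <w>, character K) by hashing.
 * hkey[] is assumed to be initialized to -1 (empty) before
 * compression starts.
 */
#define HSIZE 5003                      /* a prime above 4096 */

static int hkey[HSIZE];                 /* packed (<w> << 8) | K */
static int hcode[HSIZE];                /* code assigned to that string */

/* Return the code for <w>K, or -1 if the string is not in the table. */
static int probe(int w, int k)
{
    int key = (w << 8) | k;
    int i = key % HSIZE;

    while (hkey[i] != -1) {
        if (hkey[i] == key)
            return hcode[i];
        if (++i == HSIZE)               /* linear probe, wrap around */
            i = 0;
    }
    return -1;
}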
Jerry Leichter (RANI::LEICHTERJ) is maintaining a mailing list on data
compression. He has a memo available on LZW performance. People have
noted that it can occasionally wring out a few percent from files
which have already been Huffman encoded.
The software is available from the Toolshed.
Martin.
336.1  LATOUR::AMARTIN  Mon Sep 30 1985 22:46  (2 lines)
Thanks for typing that in. I'll have to go absorb it . . .
/AHM/THX
336.2  RANI::LEICHTERJ  Sun Oct 06 1985 01:22  (5 lines)
Before seeing Martin's note here, I entered a reply to 334 describing the
Lempel-Ziv algorithm. I've decided to leave that reply there, since it looks at
the algorithm in a somewhat different way - which may help clarify what the
whole thing is about.
-- Jerry
336.3  REX::MINOW  Tue Oct 15 1985 09:46  (25 lines)
This response to Jerry's explanation in 334 should be read after that one.
(Hope you're not too confused.)
LZW has one other interesting property that gives it a slight increase
in compression (in a real-world algorithm). As Jerry noted, the output
codes (values representing input characters and substrings) are fixed
length. Actually, this isn't *quite* the case: there is a maximum
output code which is determined by the amount of memory the compressor
has allocated to substring/code pairs. However, as the algorithm
grinds through the input text, it knows the *maximum* code that
can be produced at any particular time. Thus, an implementation
can follow the LZW algorithm itself with a bit-packing "pipeline"
optimizer that generates 9-bit bytes, then 10-bit, etc. to the
maximum possible. The Unix implementation uses an inline expansion
of the INSV instruction to do this efficiently. The VMS (and non-Vax)
implementations use an emulation of the instruction.
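A minimal sketch of such a bit-packing stage, assuming little-endian
packing into 8-bit output bytes; the real code inlines INSV (or emulates
it in C) as noted above, and these names are illustrative.

/*
 * Emit codes at the current width, widening by one bit whenever the
 * code space at that width fills up.
 */
#include <stdio.h>

static unsigned long bitbuf = 0;        /* bits not yet written out */
static int bitcount = 0;                /* valid bits in bitbuf */
static int codewidth = 9;               /* 256 chars need 9 bits plus room */

void put_code(FILE *out, int code)
{
    bitbuf |= (unsigned long)code << bitcount;
    bitcount += codewidth;
    while (bitcount >= 8) {             /* flush whole bytes */
        putc((int)(bitbuf & 0xff), out);
        bitbuf >>= 8;
        bitcount -= 8;
    }
}

/* Both sides call this after assigning a new code, so the widths
 * stay in lockstep without any extra signalling. */
void maybe_widen(int next_code)
{
    if (codewidth < 12 && next_code >= (1 << codewidth))
        codewidth++;
}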
One other optimization possible with LZW is that the implementation
maintains a running compression ratio and reinitializes itself
if it stops improving compression. It does this by reserving
a code that, when transmitted to the decompressor, causes it
to reinitialize itself. (Fairly clearly described in the source
code.)
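As a rough sketch of that mechanism, with a hypothetical reserved code
and ratio bookkeeping; the constant name and the check are illustrative,
not taken from the actual sources.

/*
 * When the running ratio stops improving, the compressor emits
 * CLEAR_CODE and both sides rebuild their string tables from scratch.
 */
#define CLEAR_CODE 256                  /* reserved; strings start at 257 */

static long bytes_in, bytes_out;
static double best_ratio;

/* Called periodically by the compressor: 1 means "emit CLEAR_CODE
 * and reinitialize the string table". */
static int should_reset(void)
{
    double ratio;

    if (bytes_out == 0)
        return 0;
    ratio = (double)bytes_in / (double)bytes_out;
    if (ratio > best_ratio) {
        best_ratio = ratio;             /* still improving: keep going */
        return 0;
    }
    best_ratio = 0.0;                   /* ratio fell: reset everything */
    return 1;
}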
Martin.