Title:                   Mathematics at DEC
Moderator:               RUSURE::EDP
Created:                 Mon Feb 03 1986
Last Modified:           Fri Jun 06 1997
Last Successful Update:  Fri Jun 06 1997
Number of topics:        2083
Total number of notes:   14613
336.0. "Lempel-Ziv Compression" by REX::MINOW () Mon Sep 30 1985 15:36
This note is a response to a query in 334.4 about the Lempel-Ziv data
compression algorithm. It is abstracted from Terry Welch's article:
"A Technique for High Performance Data Compression."
Terry A. Welch. IEEE Computer Vol 17, No. 6 (June 1984), pp 8-19.
The algorithm builds a string translation table that maps substrings in
the input into fixed-length codes. The compress algorithm may be
described as follows:
1. Initialize a table to contain single-character strings.
2. Read the first character. Set <w> (the prefix string) to that character.
3. (step): Read next input character, K.
4. If at end of file, output code(<w>); exit.
5. If <w>K is in the string table: Set <w> to <w>K; goto step 3.
6. Else <w>K is not in the string table.
Output code(<w>);
Put <w>K into the string table;
Set <w> to K; Goto step 3.
"At each execution of the basic step an acceptable input string <w> has
been parsed off. The next character K is read and the extended string
<w>K is tested to see if it exists in the string table. If it is there,
then the extended string becomes the parsed string <w> and the step is
repeated. If <w>K is not in the string table, then it is entered, the
code for the successfully parsed string <w> is put out as compressed
data, the character K becomes the beginning of the next string, and the
step is repeated."
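As a concrete illustration, here is a minimal sketch of this compression
loop in C. The linear table search and decimal code output are
simplifications for clarity, and all names are illustrative; the actual
program hashes the table and packs variable-width codes, as discussed
further below.

/*
 * Minimal sketch of the LZW compression loop, assuming 8-bit input
 * characters and at most 12-bit codes.
 */
#include <stdio.h>

#define TABLE_SIZE 4096                 /* room for 12-bit codes */

static struct { int prefix; int ch; } table[TABLE_SIZE];
static int next_code = 256;             /* 0..255 are single characters */

/* Return the code for the string <w>K, or -1 if it is not in the table. */
static int find(int w, int k)
{
    int i;
    for (i = 256; i < next_code; i++)
        if (table[i].prefix == w && table[i].ch == k)
            return i;
    return -1;
}

static void output_code(int code)
{
    printf("%d\n", code);               /* stand-in for bit packing */
}

void compress(FILE *in)
{
    int w, k, code;

    if ((w = getc(in)) == EOF)          /* step 2: <w> = first character */
        return;
    while ((k = getc(in)) != EOF) {     /* step 3: read next character K */
        if ((code = find(w, k)) != -1) {
            w = code;                   /* step 5: <w>K is in the table */
        } else {                        /* step 6: <w>K not in the table */
            output_code(w);
            if (next_code < TABLE_SIZE) {
                table[next_code].prefix = w;
                table[next_code].ch = k;
                next_code++;
            }
            w = k;
        }
    }
    output_code(w);                     /* step 4: end of file */
}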
The decompression algorithm translates each received code into a prefix
string and extension [suffix] character. The extension character is
stored (in a push-down stack), and the prefix translated again, until
the prefix is a single character, which completes decompression of this
code. The entire string is then output by popping the stack.
"An update to the string table is made for each code received (except
the first one). When a code has been translated, its final character is
used as the extension character, combined with the prior string, to add
a new string to the string table. This new string is assigned a unique
code value, which is the same code that the compressor assigned to that
string. In this way, the decompressor incrementally reconstructs the
same string table that the compressor used.... Unfortunately ... [the
algorithm] does not work for an abnormal case.
The abnormal case occurs whenever an input character string contains the
sequence K<w>K<w>K, where K<w> already appears in the compressor string
table."
The decompression algorithm, augmented to handle the abnormal case, is
as follows:
1. Read first input code;
Store in CODE and OLDcode;
With CODE = code(K), output(K); FINchar = K;
2. Read next code to CODE; INcode = CODE;
If at end of file, exit;
3. If CODE not in string table (special case) then
Push FINchar onto the stack;
CODE = OLDcode;
INcode = code(OLDcode, FINchar);
4. If CODE == code(<w>K) then
Push K onto the stack;
CODE = code(<w>);
Goto 4.
5. If CODE == code(K) then
Output K;
FINchar = K;
6. While stack not empty
Output top of stack;
Pop stack;
7. Put OLDcode,K into the string table.
OLDcode = INcode;
Goto 2.
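A matching sketch of the augmented decompressor, under the same
simplifications as the compressor sketch above; read_code() is an
assumed input routine returning the next code, or -1 at end of file.

/*
 * Minimal sketch of the augmented LZW decompressor.  Codes below 256
 * are single characters, so "CODE = code(K)" becomes "code < 256".
 */
#include <stdio.h>

#define TABLE_SIZE 4096

static struct { int prefix; int ch; } table[TABLE_SIZE];
static int next_code = 256;

extern int read_code(void);             /* assumption: -1 at end of file */

void decompress(FILE *out)
{
    int stack[TABLE_SIZE];
    int sp = 0;
    int code, incode, oldcode, finchar;

    if ((code = read_code()) < 0)       /* step 1: first code is a */
        return;                         /*   single character      */
    oldcode = finchar = code;
    putc(finchar, out);

    while ((code = read_code()) >= 0) { /* step 2 */
        incode = code;
        if (code >= next_code) {        /* step 3: K<w>K<w>K case */
            stack[sp++] = finchar;      /*   last char of the string */
            code = oldcode;
        }
        while (code >= 256) {           /* step 4: unwind prefixes */
            stack[sp++] = table[code].ch;
            code = table[code].prefix;
        }
        finchar = code;                 /* step 5: first character */
        putc(finchar, out);
        while (sp > 0)                  /* step 6: pop the stack */
            putc(stack[--sp], out);
        if (next_code < TABLE_SIZE) {   /* step 7: extend the table */
            table[next_code].prefix = oldcode;
            table[next_code].ch = finchar;
            next_code++;
        }
        oldcode = incode;
    }
}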
The algorithm as implemented introduces two additional complications.
The actual codes are transmitted using a variable-length encoding. The
lowest-level routines increase the number of bits in the code once the
largest code representable at the current width has been assigned.
Periodically, the algorithm checks that compression is still increasing.
If the ratio of input bytes to output bytes decreases, the entire
process is reset. This can happen if the characteristics of the input
file change.
The software performs acceptably well -- on VMS it typically compresses
at 3,000-4,000 bytes/second and decompresses at 12,000 bytes/second.
In addition to obvious system and I/O performance issues, compression
requires a hash-table search per character (with appropriate updates),
followed by the variable-length encoding. Decompression uses the
variable-length decoding and a few additional memory accesses. On Unix,
the variable-length encoding is done by inline expansion of INSV or
EXTV instructions. On VMS, these are emulated by C code so I didn't
have to learn assembler. Welch's article discusses its implementation
in hardware (inside a disk channel controller).
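For reference, a sketch of the kind of per-character table probe
involved, using simple open addressing; the table size and probe
sequence here are illustrative, not taken from the actual sources.

/*
 * Look up the pair (prefix code <w>, character K) by hashing.
 * hkey[] is assumed to be initialized to -1 (empty) before
 * compression starts.
 */
#define HSIZE 5003                      /* a prime above 4096 */

static int hkey[HSIZE];                 /* packed (<w> << 8) | K */
static int hcode[HSIZE];                /* code assigned to that string */

/* Return the code for <w>K, or -1 if the string is not in the table. */
static int probe(int w, int k)
{
    int key = (w << 8) | k;
    int i = key % HSIZE;

    while (hkey[i] != -1) {
        if (hkey[i] == key)
            return hcode[i];
        if (++i == HSIZE)               /* linear probe, wrap around */
            i = 0;
    }
    return -1;
}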
Jerry Leichter (RANI::LEICHTERJ) is maintaining a mailing list on data
compression. He has a memo available on LZW performance. People have
noted that it can occasionally wring out a few percent from files
which have already been Huffman encoded.
The software is available from the Toolshed.
Martin.
336.1  LATOUR::AMARTIN  Mon Sep 30 1985 22:46  (2 lines)
Thanks for typing that in. I'll have to go absorb it . . .
/AHM/THX
336.2  RANI::LEICHTERJ  Sun Oct 06 1985 01:22  (5 lines)
Before seeing Martin's note here, I entered a reply to 334 describing the
Lempel-Ziv algorithm. I've decided to leave that reply there, since it looks at
the algorithm in a somewhat different way - which may help clarify what the
whole thing is about.
-- Jerry
336.3  REX::MINOW  Tue Oct 15 1985 09:46  (25 lines)
This response to Jerry's explanation in 334 should be read after that one.
(Hope you're not too confused.)
LZW has one other interesting property that gives it a slight increase
in compression (in a real-world algorithm). As Jerry noted, the output
codes (values representing input characters and substrings) are fixed
length. Actually, this isn't *quite* the case: there is a maximum
output code which is determined by the amount of memory the compressor
has allocated to substring/code pairs. However, as the algorithm
grinds through the input text, it knows the *maximum* code that
can be produced at any particular time. Thus, an implementation
can follow the LZW algorithm itself with a bit-packing "pipeline"
optimizer that generates 9-bit bytes, then 10-bit, etc. to the
maximum possible. The Unix implementation uses an inline expansion
of the INSV instruction to do this efficiently. The VMS (and non-Vax)
implementations use an emulation of the instruction.
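A minimal sketch of such a bit-packing stage, assuming little-endian
packing into 8-bit output bytes; the real code inlines INSV (or emulates
it in C) as noted above, and these names are illustrative.

/*
 * Emit codes at the current width, widening by one bit whenever the
 * code space at that width fills up.
 */
#include <stdio.h>

static unsigned long bitbuf = 0;        /* bits not yet written out */
static int bitcount = 0;                /* valid bits in bitbuf */
static int codewidth = 9;               /* 256 chars need 9 bits plus room */

void put_code(FILE *out, int code)
{
    bitbuf |= (unsigned long)code << bitcount;
    bitcount += codewidth;
    while (bitcount >= 8) {             /* flush whole bytes */
        putc((int)(bitbuf & 0xff), out);
        bitbuf >>= 8;
        bitcount -= 8;
    }
}

/* Both sides call this after assigning a new code, so the widths
 * stay in lockstep without any extra signalling. */
void maybe_widen(int next_code)
{
    if (codewidth < 12 && next_code >= (1 << codewidth))
        codewidth++;
}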
One other optimization possible with LZW is that the implementation
maintains a running compression ratio and reinitializes itself
if it stops improving compression. It does this by reserving
a code that, when transmitted to the decompressor, causes it
to reinitialize itself. (Fairly clearly described in the source
code.)
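As a rough sketch of that mechanism, with a hypothetical reserved code
and ratio bookkeeping; the constant name and the check are illustrative,
not taken from the actual sources.

/*
 * When the running ratio stops improving, the compressor emits
 * CLEAR_CODE and both sides rebuild their string tables from scratch.
 */
#define CLEAR_CODE 256                  /* reserved; strings start at 257 */

static long bytes_in, bytes_out;
static double best_ratio;

/* Called periodically by the compressor: 1 means "emit CLEAR_CODE
 * and reinitialize the string table". */
static int should_reset(void)
{
    double ratio;

    if (bytes_out == 0)
        return 0;
    ratio = (double)bytes_in / (double)bytes_out;
    if (ratio > best_ratio) {
        best_ratio = ratio;             /* still improving: keep going */
        return 0;
    }
    best_ratio = 0.0;                   /* ratio fell: reset everything */
    return 1;
}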
Martin.