T.R | Title | User | Personal Name | Date | Lines |
---|
828.1 | | CADM::ROTH | If you plant ice you'll harvest wind | Mon Feb 22 1988 07:14 | 14 |
| If you could encode the data in such a way that the average code length
approached the entropy, that would be optimal. Since you know the
probabilities, simple Huffman encoding would give the best bit stream, I'd
think (assuming that you are using bit streams and not more general symbols).
Early in the sequence the tree would branch all to one side, since
for example zero should have a leaf all by itself right off of the
topmost branch. But far into the series the probability of choosing
one branch or the other would average out (since (n+1)/(n+2) -> 1),
and the subtrees should be asymptotically balanced.
I'm too lazy to work out the details right now though.
- Jim
|
828.2 | Is this Huffman? | IOSG::CARLIN | you bet he doesn't drink Black Label | Tue Feb 23 1988 16:38 | 37 |
| Try this as a marker to judge other solutions by.
0 => 0
1 => 10
2,3 => 1100,1101
4,5,6,7 => 111000,111001,111010,111011
8,9,10,11,12,13,14,15 => 11110000,.........
so mean length =
     (1/1 - 1/2).1
   + (1/2 - 1/3).2
   + (1/3 - 1/4).4 + (1/4 - 1/5).4
   + (1/5 - 1/6).6 + ... + (1/8 - 1/9).6
   + ...

   = (1/1 - 1/2).1
   + (1/2 - 1/3).2
   + (1/3 - 1/5).4
   + (1/5 - 1/9).6
   + ...

   = 1/2 + 2 * SUM(r >= 0) 1/(2^r + 1)

Taking 1/2^r as an upper bound for 1/(2^r + 1) after the first few terms
gives an upper bound for the mean length of

   1/2 + 2(1/2 + 1/3 + 1/5 + 1/9) + 2(1/8)

which I make 3.04 by hurried calculation. Seems quite good, in fact
too good. Where's the mistake?
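For concreteness, here is a rough Python sketch of the encoder and decoder
for this code, with a numerical check of the mean length (the function names
are just for illustration):

    def encode(n):
        # 0 -> "0"; for n >= 1 emit (r+1) ones, then n - 2^r in (r+1) bits,
        # where 2^r <= n < 2^(r+1).  The field's top bit is always 0, so it
        # doubles as the stopper that ends the run of ones.
        if n == 0:
            return "0"
        r = n.bit_length() - 1
        return "1" * (r + 1) + format(n - 2**r, "0%db" % (r + 1))

    def decode_one(bits, pos=0):
        # count the leading ones, then read that many bits as the offset field
        c = 0
        while bits[pos] == "1":
            c += 1
            pos += 1
        if c == 0:
            return 0, pos + 1                  # the single bit "0" encodes 0
        field = bits[pos:pos + c]              # starts with the "0" stopper
        return 2**(c - 1) + int(field, 2), pos + c

    # mean length under p(n) = 1/((n+1)(n+2)); should print something near 3.03
    print(sum(len(encode(n)) / ((n + 1) * (n + 2)) for n in range(200000)))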
Dick
|
828.3 | | CLT::GILBERT | Builder | Tue Feb 23 1988 17:06 | 12 |
| > I wish to store, as a series of bits, an ordered sequence of 'random'
> non-negative (zero to infinity) integers. The distribution [...]
Are the numbers in the sequence distinct? If not, then it seems
worthwhile to include the 'multiplicity' of the "0" in the sequence.
If the numbers *are* distinct, then you can easily improve on the
Huffman encoding (of the previous reply) by making use of the fact
that the numbers are in ascending order. The first bit indicates
whether "0" is in the sequence, and use Huffman encoding on the
remaining numbers. This decreases the encoding by about one bit
per number.
|
828.4 | | SSDEVO::LARY | | Tue Feb 23 1988 19:31 | 4 |
| If the numbers are in ascending order, then encode the difference between
numbers rather than the numbers themselves; since encoded length increases
with magnitude, this can only win!
|
828.5 | It's just a 'sequence'. | CHOVAX::YOUNG | Back from the Shadows Again, | Wed Feb 24 1988 01:02 | 22 |
| Re .3, .4:
Sorry, I guess "ordered series" is a little redundant. I only mean
that <0,3,0,1> is distinct from <0,0,1,3>. As opposed to say, sets
where order is unimportant. I did NOT mean that the integers were
sorted or necessarily ascending. It's just a sequence of integers.
Re .1:
I am far from an expert with Huffman(sp?) encoding algorithms,
and I must confess that you have lost me after the word "..tree..".
Guess I could look it up though. However what little I do know
leads me to 2 additional questions: 1) H-man encoding as I have
heard it described seems to be intended only for encoding
data from a finite range of values, NOT from an infinite range as
this problem requires. Perhaps I just need a more explicit description
of your approach? And 2) papers that I have read seem to indicate
that arithmetic encoding is in general more efficient than H-man
encoding.
-- Barry
|
828.6 | will this work? | CADM::ROTH | If you plant ice you'll harvest wind | Wed Feb 24 1988 08:23 | 31 |
| It should be possible to search for an optimal encoding pretty
easily, though it's not clear if a simple closed form expression
exists.
Suppose you encode the numbers in sets of words that look like this:
1111111xxxx
where the i'th set of words has m[i] leading ones followed by a k[i]
bit binary integer in the range [0,2^k[i]-2]. This set encodes 2^k[i]-1
consecutive numbers in the range n[i] to n[i+1]-1.
This encoding would be completely characterized by a nondecreasing
sequence of k's, and we would have the relations
n[i+1] = n[i]+2^k[i]-1     First number encoded in [i+1]'th word set
m[i+1] = m[i]+k[i] Number of leading 1's in [i+1]'th word set
The probability of the i'th word set being used would be
p[i] = (2^k[i]-1)/((n[i]+1)*(n[i+1]+1))
and you want to minimize the expected length of the code words
E(m+k) = SUM(i > 0) p[i]*(m[i]+k[i])
Just search for the sequence of k's: successively increase the
trailing values of k (from the current index on) until an increase in E(m+k)
is seen, then go on to the next k in the sequence.
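Here's a small Python sketch of the expected-length computation for a given
k sequence, which the search could call repeatedly (assuming the first word
set has no leading ones, i.e. m[0] = 0, and that the list of k's is long
enough that the leftover tail probability is negligible; the name is just
for illustration):

    def expected_length(ks):
        # E(m+k) for the word-set scheme above, given k[0], k[1], ...
        n, m, total = 0, 0, 0.0
        for k in ks:
            p = (2**k - 1) / ((n + 1) * (n + 2**k))   # p[i], since n[i+1]+1 = n[i]+2^k[i]
            total += p * (m + k)
            n += 2**k - 1                             # n[i+1] = n[i] + 2^k[i] - 1
            m += k                                    # m[i+1] = m[i] + k[i]
        return total

    print(expected_length(list(range(1, 30))))        # one candidate k sequence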
- Jim
|
828.7 | Where do the one's end? | IOSG::CARLIN | Perrin here, on green | Sat Feb 27 1988 13:41 | 36 |
| re .6
> Suppose you encode the numbers in sets of words that look like this:
>
> 1111111xxxx
>
> where the i'th set of words has m[i] leading ones followed by a k[i]
>bit binary integer in the range [0,2^k[i]-2]. This set encodes 2^k[i]-1
>consecutive numbers in the range n[i] to n[i+1]-1.
>
1. Don't you need a zero "stopper" to show where the sequences of
ones end?
2. Why not let the xxxx's run from 0 to 2^k[i]-1?
3. Isn't the optimal m[i] simply i?
So the i'th set consists of numbers of the form:
111110xxxxxx
(i 1's / 0 / k-bit number)
As you say the optimal k sequence would need an infinite look up
table to encode the numbers. I chose:
k[0] = 0
k[i] = i-1 (i>0) in reply .2
Jim, can you elaborate on that method you gave to find the optimal
k sequence?
Cheers
Dick
|
828.8 | | CLT::GILBERT | Builder | Sat Feb 27 1988 14:58 | 48 |
| The Huffman (optimal) encoding can be achieved by the following
algorithm (when there are only a finite number of characters).
Let S = { (c0,p0), (c1,p1), ..., (cn,pn) } be the set of 'character',
probability pairs. If there is only one pair, encode that character
with the null string, and we are done (except for backsubstitutions,
below).
Otherwise, there are at least two pairs in S, so choose two having the
lowest probabilities, (cx,px) and (cy,py). Create a new character cz.
The encoding for cx will be the encoding for cz followed by "0", and
cy will be encoded by the encoding for cz followed by "1". Remove
(cx,px), (cy,py) from S, and insert (cz,px+py) into S. And repeat.
For example, suppose we have:
{ (a,1/2), (b,1/6), (c,1/12), (d,1/20), (e,1/30),
(f,1/30), (g,1/30), (h,1/30), (i,1/30), (j,1/30) }
A tree shows the effect of running the above algorithm:

                                 1
                               /   \
                            1/2     1/2
                            (a)    /   \
                                  /     \
                              17/60     13/60
                              /   \     /   \
                           1/6   7/60 1/12   2/15
                           (b)   /  \  (c)   /  \
                             1/20  1/15   1/15  1/15
                              (d)  /  \   /  \   /  \
                               1/30 1/30 1/30 1/30 1/30 1/30
                               (e)  (f)  (g)  (h)  (i)  (j)
Where each leaf represents one of the characters in the original set.
We get the encoding:
a (1/2) => "0"
b (1/6) => "100"
d (1/20) => "1010"
e (1/30) => "10110"
f (1/30) => "10111"
c (1/12) => "110"
g (1/30) => "11100"
h (1/30) => "11101"
i (1/30) => "11110"
j (1/30) => "11111"
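For what it's worth, a Python sketch of this construction (a heap picks the
two lowest-probability groups; exact fractions avoid rounding, and ties may
yield a different, equally optimal code than the tree above):

    import heapq, itertools
    from fractions import Fraction

    def huffman(probs):
        # probs: {character: probability}; returns {character: bit string}
        tag = itertools.count()                        # tie-breaker for the heap
        heap = [(p, next(tag), [c]) for c, p in probs.items()]
        heapq.heapify(heap)
        code = {c: "" for c in probs}
        while len(heap) > 1:
            px, _, xs = heapq.heappop(heap)            # (cx,px): lowest probability
            py, _, ys = heapq.heappop(heap)            # (cy,py): next lowest
            for c in xs:
                code[c] = "0" + code[c]                # cz followed by "0"
            for c in ys:
                code[c] = "1" + code[c]                # cz followed by "1"
            heapq.heappush(heap, (px + py, next(tag), xs + ys))
        return code

    probs = {'a': Fraction(1, 2), 'b': Fraction(1, 6),
             'c': Fraction(1, 12), 'd': Fraction(1, 20)}
    probs.update({ch: Fraction(1, 30) for ch in 'efghij'})
    print(huffman(probs))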
|
828.9 | | SSDEVO::LARY | | Sun Feb 28 1988 21:35 | 33 |
| In an infinite set like this, a most-probable-first scheme might work
better:
- 0 occurs 1/2 of the time so we should allocate 1/2 the "space"
to it, so it gets the string "0".
- 1 and 2 occur 1/2 of the remainder of the time, so we should
allocate 1/2 the remaining "space" to them, so 1 = 100 and 2 = 101
- 3, 4, 5 and 6 occur 1/2 of the remainder of the time, so we should
allocate 1/2 of the remaining "space" to them, so:
3 = 11000, 4 = 11001, 5 = 11010, 6 = 11011
- In general, we map the interval [2^n-1,2^(n+1)-2] into:
<n ones>0<n bit index within interval>
since sigma(p(i)) over i in [2^n-1,2^(n+1)-2] equals 1/2^(n+1), i.e.
half of the 1/2^n of the "space" remaining at that point
Now, what's the average length of this encoding?
The average length of the encoding is:
(1/2)*1 + (1/4)*3 + (1/8)*5 + (1/16)*7 + ... + (1/2^n)*(2n-1) + ....
= 3
So you can encode these (unbounded) numbers into an average of 3 bits apiece!
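A quick Python sketch of the encoder and a numerical check of that average
(the name is illustrative):

    def encode(n):
        g = (n + 1).bit_length() - 1          # group: 2^g - 1 <= n <= 2^(g+1) - 2
        offset = n - (2**g - 1)               # g-bit index within the group
        return "1" * g + "0" + (format(offset, "0%db" % g) if g else "")

    # average length under p(n) = 1/((n+1)(n+2)); should creep up toward 3
    print(sum(len(encode(n)) / ((n + 1) * (n + 2)) for n in range(200000)))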
Now - is this minimal? I thought so, when I started thinking about it,
but I don't like the fact that there are no encodings of length 2, 4, etc.
|
828.10 | a 1% improvement | CLT::GILBERT | | Sun Mar 20 1988 18:32 | 16 |
| The following encoding uses an average of 657/220 bits per number.
0 "0"
1 "100" 11 "1100101" 21 "110000010"
2 "1010" 12 "1100111" 22 "110000011"
3 "1101" 13 "1111010" 23 "110011000"
4 "10110" 14 "11000000" 24 "110011001"
5 "11100" 15 "11001000" 25 "111011000"
6 "11111" 16 "11001001" 26 "111011001"
7 "110001" 17 "11001101" 27 "111011100"
8 "111010" 18 "11101101" 28 "111011101"
9 "111100" 19 "11101111" 29 "111101100"
10 "1100001" 20 "11110111" 30 "111101101"
31..62 "10111000000".."10111011111"
63..126 "1011110000000".."1011110111111"
...
|
828.11 | 2.98137 bits per number | CLT::GILBERT | | Mon Mar 21 1988 14:08 | 76 |
| Let the number n be encoded with a string of d(n) bits. Let p(n) be
the probability of the number n, and let sp(n) = sum(p(i),i>=n).
Let's consider a level-by-level walk down the encoding tree.

                          *
                         / \
                        0   *
                           / \
                          /   \
                         *     *
                        / \   / \
                       1   2 *   *
                             ....
At the top-most level, every number must have at least one bit in it,
and so we add 1 to the average cost per number. At the next level, we
let d(0) = 1, and all the other numbers must have another bit in their
representations; this increases the average cost by sp(1) = 1/2.
At the next level, we have two 2-bit encodings available. We use neither,
increase the average cost by sp(1) again, and at the next level we have
four 3-bit encodings available.
We use two of these to encode 1 and 2 (so that d(1) = d(2) = 3). All the
following numbers are at a deeper layer, so we increase the cost by sp(3)
= 1/4, and we have (4-2)*2 4-bit encodings available for use on the numbers
3 and up.
In general, we have m different b-bit encodings available to encode the
numbers i and greater. Let f(i,m,b) be the average cost of this. If we
use j of these b-bit encodings (to encode i thru i+j-1), we have:
f(i,m,b) = sp(i+j) + f(i+j,(m-j)*2,b+1)
This function is independent of the third parameter, so we write:
f(i,m) = min sp(i+j) + f(i+j,2*(m-j))
0 <= j < m
This equation holds for encoding any infinite set of numbers. For a finite
set, we would allow j = m if sp(i+j) = 0; i.e., f(i,m) = 0 if sp(i+m)=0.
In terms of the top-most level, we have one 0-bit string, and want to
encode the numbers 0 and greater; we want f(0,1).
The tree drawn above has:
f(0,1) = sp(0) + f(0,2) = 1 + f(0,2) (i.e., j=0)
= sp(0) + sp(1) + f(1,2) (j=1)
= sp(0) + sp(1) + sp(1) + f(1,4) (j=0)
= sp(0) + sp(1) + sp(1) + sp(3) + f(3,4) (j=2)
In the problem, sp(n) = 1/(n+1), so this is 1 + 1/2 + 1/2 + 1/4 + f(3,4).
The following seems to give a near-optimal value of f(0,1) for our problem.
d(0) = 1
d(1) = d(2) = 3
d(3) = 4
d(4) = 5
d(5) = ... = d(7) = 6
d(8) = ... = d(11) = 7
d(12) = ... = d(17) = 8
...
f(0,1) = 1 + 1/2 + 1/2 + 1/4 + 1/5 + 1/6 + f(5,10).
f(5,10) = 1/9 + f(8,14). (j = 3)
f(8,14) = 1/13 + f(12,20). (j = 4)
f(12,20) = 1/19 + f(18,28). (j = 6)
f(18,28) = 1/27 + f(26,40). (j = 8)
...
f(a,b) = sp(b-2) + f(b-2,2a+4). (j = b-a-2)
...
f(5*2^n-2, 7*2^n) = sp( 7*2^n-2) + f( 7*2^n-2, 10*2^n), j = 2*2^n
f(7*2^n-2, 10*2^n) = sp(10*2^n-2) + f(10*2^n-2, 14*2^n), j = 3*2^n
...
This gives an average encoding of about 2.98137 bits per number.
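A quick numerical check of that figure, using my reading of the block
pattern implied by the expansion above (the small-n lengths are the d(n)
listed; the boundaries 7*2^t-2 and 10*2^t-2 come from the two recurrences;
the function name is just for illustration):

    def d(n):
        # code length assigned to n by the near-optimal scheme above
        small = {0: 1, 1: 3, 2: 3, 3: 4, 4: 5, 5: 6, 6: 6, 7: 6}
        if n in small:
            return small[n]
        t = 1
        while n >= 10 * 2**t - 2:             # find the doubling block containing n
            t += 1
        return 2*t + 5 if n < 7 * 2**t - 2 else 2*t + 6

    # expected length, sum of d(n)/((n+1)(n+2)); should print close to 2.981
    print(sum(d(n) / ((n + 1) * (n + 2)) for n in range(300000)))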
|
828.12 | Bounding the optimum cost from below | SSDEVO::LARY | | Wed Mar 23 1988 16:13 | 32 |
| > f(i,m) = min sp(i+j) + f(i+j,2*(m-j))
> 0 <= j < m
Finding the value of f(0,1) is difficult when sp(n) = 1/(n+1), as this is a
forward recurrence; however, if rigor is thrown to the winds you can claim
that:
	f(0,1) = lim(k -> inf) c(k),

	where c(k) is the value of f(0,1) when

		sp(n) = 1/(n+1)   for n <= k
		sp(n) = 0         for n >  k
and that the limit is reached from below.
Some values of c(k) are:
c(100) = 2.925034
c(200) = 2.952711
c(300) = 2.962501
c(400) = 2.966932
c(500) = 2.969769
c(600) = 2.971856
c(700) = 2.973080
c(800) = 2.974121
c(900) = 2.975006
c(1000)= 2.975558
c(1500)= 2.977481
c(2000)= 2.978456
So, 2.98137 looks pretty close to the optimum!
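For the record, a Python sketch of how a c(k) can be computed (memoized
recursion on the truncated distribution; m is capped at the number of values
still to be encoded, which shouldn't change the result; names are my own):

    from functools import lru_cache

    def c(k):
        # average bits/number when sp(n) = 1/(n+1) for n <= k, and 0 above
        @lru_cache(maxsize=None)
        def f(i, m):
            if i + m > k:                     # the remaining values i..k all fit here
                return 0.0
            # use j of the m codewords at this length; the rest go one level deeper
            return min(1.0 / (i + j + 1) +
                       f(i + j, min(2 * (m - j), k + 1 - (i + j)))
                       for j in range(m))
        return f(0, 1)

    print(c(100))                             # should land near 2.925, as above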
Richie
|
828.13 | | CLT::GILBERT | | Tue Apr 05 1988 12:50 | 18 |
| Some computer runs done by Richie Lary seem to indicate that, for
large values, j = m-i-2 doesn't always minimize f(i,m)
in the recurrence:
f(i,m) = min sp(i+j) + f(i+j,2*(m-j))
0 <= j < m
However, the j value always seems to be within 2 of m-i. If we
can prove that sp(i+j) + f(i+j,2*(m-j)) is concave, then we can
optimize the calculations of f(i,m) by trying j=m-i and values
near it. The goal is to develop a closed-form expression for
the optimal average encoding (e.g., the 2.981... value).
By concave, I mean that for g(j) = sp(i+j) + f(i+j,2*(m-j)), we have:
p < q < J < r < s implies g(p) >= g(q) >= g(J) <= g(r) <= g(s)
where J is a (or the) j value that minimizes g(j).
|
828.14 | Perfect coding. | PBSVAX::COOPER | Topher Cooper | Tue Apr 12 1988 14:31 | 48 |
| A "perfect" code would encode the i'th message with:
-lg(p ) = lg(1/p )
i i
bits, where lg(x) is the logrithm to base 2 and p is the probability
i
of the i'th message occurring. The average code-length per message
(the "information" in the code) is then:
sum p lg(1/p )
i i i
where the summation i is over all possible messages.
Any practical code will have an average message length greater than
or equal to this sum.
For this particular problem, a bit of algebra gives us:
	2 * SUM(i=2..inf) lg(i) / (i^2 - 1)
Unfortunately this converges very slowly, if at all. I have tried
to determine if it converges by various means. The simplest test
(the limit of the ratio of adjacent terms) is ambiguous (the ratio
equals 1, which means it may or may not converge).
The good news is, if it does diverge (which would mean that no
practical code could actually exist for the complete series) it
does so very slowly. This means that practical codes can be created
for truncated versions of the series with the truncation occurring
for high i. This would explain the seeming successes so far.
If we truncate at i=100 the average code length will be 2.791; at
i=200, 2.861; at i=300, 2.888 (it *looks* like it's converging, but
that may be deceiving).
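Those partial sums are easy to reproduce numerically (a sketch; the cutoffs
are just the ones quoted above):

    import math

    def entropy_partial(k):
        # partial sum of  2 * SUM(i=2..k) lg(i)/(i^2 - 1)
        return 2.0 * sum(math.log2(i) / (i * i - 1.0) for i in range(2, k + 1))

    print(entropy_partial(100), entropy_partial(200), entropy_partial(300))
    # should land near the 2.791, 2.861, 2.888 figures above;
    # entropy_partial(10**6) comes out around 2.952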
Anyone want to tackle proving whether or not the series converges?
Setting a reasonable upper bound on what it converges to?
(NOTE: The ideal can generally be approached more closely by coding
n-tuples rather than single "messages". As n increases the coding
approaches the ideal as a limit).
Topher
|
828.15 | | SSDEVO::LARY | | Tue Apr 12 1988 20:25 | 32 |
| >	2 * SUM(i=2..inf) lg(i) / (i^2 - 1)

This can be shown to converge by comparison with another series:

	For i > 16,  i^2 - 1 > i^1.9
	For i > 16,  lg i < sqrt(i) = i^0.5

	Therefore, for i > 16,  lg(i)/(i^2 - 1) < i^-1.4,  and SUM(i) i^-x
	converges for x > 1.

The sum of the series through i=1000000 is 2.9521+

	For i > 1000000, each term is less than i^-1.75, and the integral of
	2x^-1.75 dx from 1000000 to infinity is about .000036, so 2.9522 is a
	safe upper bound.
Is there a theorem that says you can get arbitrarily close to the minimum
entropy with a discrete encoding? If so, we need to backtrack.....
Richie
|
828.16 | Think so, but not sure. | PBSVAX::COOPER | Topher Cooper | Wed Apr 13 1988 11:53 | 8 |
| I think that there is such a theorem, but don't know for sure (I
moved in January but my books are all still in boxes due to repairs
on the house). I'll check the library. In any case it is certainly
the "popular wisdom". I would be very surprised if better encodings
could not be accomplished by looking at n-tuples rather than
singletons.
Topher
|
828.17 | It's a theorem. | PBSVAX::COOPER | Topher Cooper | Wed Apr 13 1988 15:26 | 19 |
| Checked "Principles of Digital Communication" by Das, Mullick and
Chatterjee in the HLO library this lunch hour. It is indeed a theorem
and is known as Shannon's First Fundamental Theorem. It is stated
as:
Suppose that there are a zero-memory source S with the entropy
H(X) per symbol and a noiseless channel with a capacity C bits/
message. Then it is possible to encode the output of S (with
the encoder delivering messages to the channel) so that all
information generated in S is transmitted through the channel
without loss if, and only if C >= H(X).
(page 537). The case where the transmission can occur when C =
H(X) is the case we've been talking about.
The proof is essentially based on encoding n-tuples, with the limit
as n -> infinity.
Topher
|
828.18 | Intro to arithmetic coding | SSDEVO::LARY | | Mon May 02 1988 07:16 | 72 |
| The reason why all of our proposed encodings are a bit worse than the
theoretical minimum seems to lie in the nature of binary encoding schemes.
Whenever we "group" a series of consecutive numbers into a set of codes of
the same length, we ignore the small differences in probability of
occurrence among the numbers in the set. This destroys information and
increases the encoded length.
What we need is "fractional bit" encodings. Do they exist? Of course they do!
Here's a neat one known as "Arithmetic coding", cribbed from some published
papers by Ian Witten and John Cleary of the University of Calgary, and extended
from finite alphabets to denumerable ones (a la the base note):
Given: A denumerable set of characters X with P(n) = the occurrence probability
of the nth char, and a sequence x(i) of characters from X
	Set S(n) = SUM(i=1..n-1) P(i)
Set L=0, H=1
	For each x(i) in the sequence, find its index n in the character set
	and simultaneously set L = L + (H-L)*S(n) and H = L + (H-L)*S(n+1)
	(both right-hand sides using the old values of L and H).
When you've processed all the x(i), ANY number in the semi-open
interval [L,H) is an encoding of the sequence.
An example helps here. Lets say the set X is {a,b,c},
P(1) = 0.5 S(1) = 0
P(2) = 0.2 S(2) = 0.5
P(3) = 0.3 S(3) = 0.7
S(4) = 1.0
To encode the string "abcba"
Start: [L,H) = [0,1)
Process a, n=1: [L,H) = [0,0.5)
Process b, n=2: [L,H) = [0.25,0.35)
Process c, n=3: [L,H) = [0.32,0.35)
Process b, n=2: [L,H) = [0.335,0.341)
Process a, n=1: [L,H) = [0.335,0.338)
What we are doing here is giving each character a "chunk" of the interval
[0,1); processing a character consists of extracting that character's
prorated "chunk" of the current interval and making that the new interval.
At first glance, it looks like the encoded result is infinite in size,
but it isn't; there is a number in the interval that can be represented
in at most [1-log2(H-L)] bits. In the example above the encoded
string in binary is 0101011, representing the number .0101011 which is
43/128 = .3359375. The decoding algorithm is left as an exercise for
the reader (I'm lazy).
A Huffman code based on the occurrence probabilities of a, b, and c would
assign a=0, b=10, c=11 and come up with the encoding 01011100, one bit longer.
Arithmetic coding always naturally produces minimum-entropy encodings;
that is, the limit (as m->infinity) of the encoding size of m-element
sequences generated according to the occurrence probabilities P(n), divided
by m, is sum(-P(n)*log2(P(n))). Huffman codings can approximate minimum-entropy
encodings arbitrarily closely, but the tables and encoding algorithms become
arbitrarily large and complex in the process.
An approximation to the above arithmetic coding algorithm can be implemented
on computers with almost no loss of encoding efficiency, by accumulating
L and H in registers with at least twice the precision of the S(n) plus
a few guard bits. Whenever the values of L and H have any high-order bits
in common these bits are output (since all values in [L,H) will have identical
values in these bits) and L and H are scaled to remove these high-order bits.
(It's a little more complicated than this, but not much).
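Here is a Python sketch of the encoder, using exact fractions so the example
above can be reproduced (a real coder would use the incremental
renormalization just described; the names are illustrative):

    from fractions import Fraction

    def arithmetic_encode(seq, probs):
        # seq: 1-based symbol indices; probs: their occurrence probabilities
        S = [Fraction(0)]                      # S(n) = P(1) + ... + P(n-1)
        for p in probs:
            S.append(S[-1] + Fraction(p))
        L, H = Fraction(0), Fraction(1)
        for n in seq:                          # narrow [L,H) to the symbol's chunk
            L, H = L + (H - L) * S[n - 1], L + (H - L) * S[n]
        bits, x = "", Fraction(0)              # emit a binary fraction inside [L,H)
        while not (L <= x < H):
            if x + Fraction(1, 2**(len(bits) + 1)) < H:
                bits += "1"
                x += Fraction(1, 2**len(bits))
            else:
                bits += "0"
        return bits

    print(arithmetic_encode([1, 2, 3, 2, 1], ["0.5", "0.2", "0.3"]))   # -> 0101011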
|