
Conference rusure::math

Title: Mathematics at DEC
Moderator: RUSURE::EDP
Created: Mon Feb 03 1986
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 2083
Total number of notes: 14613

828.0. "Integer data compression:" by CHOVAX::YOUNG (Back from the Shadows Again,) Sun Feb 21 1988 15:05

    OK, here's an interesting question:
    
    I wish to store, as a series of bits, an ordered sequence of 'random'
    non-negative (zero to infinity) integers.  The distribution of these
    integers is defined by:
    
    	P(x > n) = 1/(n+2)
    or:
    	The probability that any integer 'x' in the sequence is greater
    than some arbitrary integer 'n' is equal to '1/(n+2)'.
    
    Thus:
    	P(0)	=	      1/2	Half of the numbers are 0.
    	P(1)	= 1/2 - 1/3 = 1/6	1/6th are 1.
    	P(2)	= 1/3 - 1/4 = 1/12
    	P(3)	= 1/4 - 1/5 = 1/20
    	...
    or:
    	P(n)	= 1/(n+1) - 1/(n+2) = 1/(n^2 + 3n + 2)
    
    Now the question is: what is the most space-efficient way to represent
    such a sequence?  I should perhaps call it a challenge to see who can
    come up with the best solution, because I am not certain that there is
    a known solution to this problem.

    --  Barry
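
    A small Python check (not part of the original note) that the point
    probabilities P(n) = 1/((n+1)(n+2)) implied by P(x > n) = 1/(n+2)
    telescope to 1:

        # Sanity check: the point probabilities plus the tail P(x > N)
        # account for all of the probability mass.
        from fractions import Fraction

        def p(n):
            # P(x = n) = 1/(n+1) - 1/(n+2) = 1/((n+1)(n+2))
            return Fraction(1, (n + 1) * (n + 2))

        N = 1000
        total = sum(p(n) for n in range(N + 1)) + Fraction(1, N + 2)
        assert total == 1
        print([float(p(n)) for n in range(4)])   # 0.5, 0.1667, 0.0833, 0.05
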
828.1 by CADM::ROTH (If you plant ice you'll harvest wind) Mon Feb 22 1988 07:14, 14 lines
    If you could encode the data in such a way that entropy was minimized
    then that would be optimal.  Since you know the probabilities, simple
    Huffman encoding would give the best bit stream, I'd think (assuming that
    you are using bit streams and not more general symbols).

    Early in the sequence the tree would branch all to one side, since
    for example zero should have a leaf all by itself right off of the
    topmost branch.  But far into the series the probability of choosing
    one branch or another would average out (since (n+1)/(n+2) -> 1),
    and the subtrees should be asymptotically balanced.

    I'm too lazy to work out the details right now though.

    - Jim
828.2 "Is this Huffman?" by IOSG::CARLIN (you bet he doesn't drink Black Label) Tue Feb 23 1988 16:38, 37 lines
    Try this as a marker to judge other solutions by.
    
    0				=>	0
    1				=>	10
    2,3				=>	1100,1101
    4,5,6,7			=>	111000,111001,111010,111011
    8,9,10,11,12,13,14,15	=>	11110000,.........
    
    so mean length =
    
    	(1/1-1/2).1
       +(1/2-1/3).2
       +(1/3-1/4).4 +(1/4-1/5).4
       +(1/5-1/6).6 +.................+(1/8-1/9).6
       +...
    
    =	(1/1-1/2).1
       +(1/2-1/3).2
       +(1/3-1/5).4
       +(1/5-1/9).6
       +...
    =	1/2 + 2 * SUM(r>=0) 1/(2^r + 1)
    
    Taking 1/2^r as an upper bound for 1/(2^r + 1) after a few terms gives
    an upper bound for the mean length of:
    
    	1/2 + 2(1/2+1/3+1/5+1/9) + 2(1/8)
    
    Which I make 3.04 by hurried calculation. Seems quite good, in fact
    too good. Where's the mistake?
    
    							Dick
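
    A Python sketch of this code as I read the table above (not Dick's
    own program), plus a numerical value for the mean length:

        def encode(n):
            # 0 -> "0"; n in [2^(r-1), 2^r - 1] -> r ones, a zero, then the
            # (r-1)-bit offset of n within its group.
            if n == 0:
                return "0"
            r = n.bit_length()
            offset = n - (1 << (r - 1))
            suffix = format(offset, "b").zfill(r - 1) if r > 1 else ""
            return "1" * r + "0" + suffix

        assert [encode(n) for n in range(5)] == [
            "0", "10", "1100", "1101", "111000"]

        # Group r has probability 1/(2^(r-1)+1) - 1/(2^r+1) and length 2r;
        # the number 0 has probability 1/2 and length 1.
        mean = 0.5 + sum(2 * r * (1 / (2 ** (r - 1) + 1) - 1 / (2 ** r + 1))
                         for r in range(1, 60))
        print(round(mean, 3))    # about 3.03, in line with the estimate above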
    
828.3 by CLT::GILBERT (Builder) Tue Feb 23 1988 17:06, 12 lines
>   I wish to store, as a series of bits, an ordered sequence of 'random'
>   non-negative (zero to infinity) integers.  The distribution [...]

    Are the numbers in the sequence distinct?  If not, then it seems
    worthwhile to include the 'multiplicity' of the "0" in the sequence.

    If the numbers *are* distinct, then you can easily improve on the
    Huffman encoding (of the previous reply) by making use of the fact
    that the numbers are in ascending order.  The first bit indicates
    whether "0" is in the sequence, and use Huffman encoding on the
    remaining numbers.  This decreases the encoding by about one bit
    per number.
828.4 by SSDEVO::LARY Tue Feb 23 1988 19:31, 4 lines
If the numbers are in ascending order, then encode the difference between
numbers rather than the numbers themselves; since encoded length increases
with magnitude, this can only win!

828.5 "Its just a 'sequence'." by CHOVAX::YOUNG (Back from the Shadows Again,) Wed Feb 24 1988 01:02, 22 lines
    Re .3, .4:
    
    Sorry, I guess "ordered sequence" is a little redundant.  I only mean
    that <0,3,0,1> is distinct from <0,0,1,3>.  As opposed to, say, sets
    where order is unimportant.  I did NOT mean that the integers were
    sorted or necessarily ascending.  It's just a sequence of integers.
    
    Re .1:
    
    I am far from an expert with Huffman(sp?) encoding algorithms,
    and I must confess that you have lost me after the word "..tree..".
    Guess I could look it up though.  However, what little I do know
    leads me to 2 additional questions: 1) Huffman encoding as I have
    heard it described seems to be intended only to address encoding
    data from a finite range of values, NOT from an infinite range as
    this problem outlines.  Perhaps I just need a more explicit description
    of your approach?  And 2) papers that I have read seem to indicate
    that arithmetic encoding is in general more efficient than Huffman
    encoding.
    
    
    --  Barry
828.6 "will this work?" by CADM::ROTH (If you plant ice you'll harvest wind) Wed Feb 24 1988 08:23, 31 lines
    It should be possible to search for an optimal encoding pretty
    easily, though it's not clear if a simple closed form expression
    exists.

    Suppose you encode the numbers in sets of words that look like this:

	1111111xxxx

    where the i'th set of words has m[i] leading ones followed by a k[i]-
    bit binary integer in the range [0,2^k[i]-2].  This set encodes 2^k[i]-1
    consecutive numbers in the range n[i] to n[i+1]-1.

    This encoding would be completely characterized by a nondecreasing
    sequence of k's, and we would have the relations

	n[i+1] = n[i]+2^k[i]-1	First number encoded in [i+1]'th word set
	m[i+1] = m[i]+k[i]	Number of leading 1's in [i+1]'th word set

    The probability of the i'th word set being used would be

	p[i] = (2^k[i]-1)/((n[i]+1)*(n[i+1]+1))

    and you want to minimize the expected length of the code words

	E(m+k) = SUM(i > 0) p[i]*(m[i]+k[i])

    Just search for the sequence of k's by successively increasing the
    trailing values of k past the current index until an increase in E(m+k)
    is seen, then go on to the next k in the sequence.

    - Jim
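
    A Python sketch of evaluating E(m+k) for a candidate sequence of k's,
    assuming m[0] = 0 and n[0] = 0 (the reply does not spell these out):

        def expected_length(ks, m0=0):
            # Word set i: m[i] leading ones plus a k[i]-bit tail, covering
            # the 2^k[i]-1 numbers n[i] .. n[i+1]-1.  Returns the accumulated
            # E(m+k) and the probability mass not covered by any word set.
            n, m, total = 0, m0, 0.0
            for k in ks:
                n_next = n + (1 << k) - 1
                if n_next > n:
                    p = ((1 << k) - 1) / ((n + 1) * (n_next + 1))
                    total += p * (m + k)
                n, m = n_next, m + k
            return total, 1.0 / (n + 1)

        # Example only: k[i] = i + 1, an arbitrary choice, not the optimum.
        e, uncovered = expected_length([i + 1 for i in range(40)])
        print(round(e, 4), uncovered)
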
828.7 "Where do the one's end?" by IOSG::CARLIN (Perrin here, on green) Sat Feb 27 1988 13:41, 36 lines
re .6
    >    Suppose you encode the numbers in sets of words that look like this:
    >
    >	1111111xxxx
    >
    > where the i'th set of words has m[i] leading ones followed by a k[i]-
    >bit binary integer in the range [0,2^k[i]-2].  This set encodes 2^k[i]-1
    >consecutive numbers in the range n[i] to n[i+1]-1.
    >
    
    1. Don't you need a zero "stopper" to show where the sequences of
    ones end?
    
    2. Why not let the xxxx's run from 0 to 2^k[i]-1?
    
    3. Isn't the optimal m[i] simply i?
    
    So the i'th set consists of numbers of the form:
    
    	111110xxxxxx
    
    	(i 1's / 0 / k-bit number)
    
    As you say, the optimal k sequence would need an infinite look-up
    table to encode the numbers.  I chose:
    
    		k[0] = 0
    		k[i] = i-1 (i>0) in reply .2
    
    Jim, can you elaborate on that method you gave to find the optimal
    k sequence?
    
    Cheers
    
    Dick
    
828.8 by CLT::GILBERT (Builder) Sat Feb 27 1988 14:58, 48 lines
    The Huffman (optimal) encoding can be achieved by the following
    algorithm (when there are only a finite number of characters).

    Let S = { (c0,p0), (c1,p1), ..., (cn,pn) } be the set of 'character',
    probability pairs.  If there is only one pair, encode that character
    with the null string, and we are done (except for backsubstitutions,
    below).

    Otherwise, there are at least two pairs in S, so choose two having the
    lowest probabilities, (cx,px) and (cy,py).  Create a new character cz.
    The encoding for cx will be the encoding for cz followed by "0", and 
    cy will be encoded by the encoding for cz followed by "1".  Remove
    (cx,px), (cy,py) from S, and insert (cz,px+py) into S.  And repeat.


    For example, suppose we have:

	{ (a,1/2), (b,1/6), (c,1/12), (d,1/20), (e,1/30),
	  (f,1/30), (g,1/30), (h,1/30), (i,1/30), (j,1/30) }

    A tree shows the effect of running the above algorithm:

	          1
	        /   \
	     1/2     1/2
	           /      \
	        /            \
	  17/60                13/60
	  /   \                /   \
	1/6   7/60         1/12     2/15
	      /  \                 /    \
	     /    \               /      \
	  1/20    1/15        1/15        1/15
	       1/30  1/30  1/30  1/30  1/30  1/30

    Where each leaf represents one of the characters in the original set.
    We get the encoding:

	a (1/2)  =>  "0"
	b (1/6)  =>  "100"
	d (1/20) =>  "1010"
	e (1/30) =>  "10110"
	f (1/30) =>  "10111"
	c (1/12) =>  "110"
	g (1/30) =>  "11100"
	h (1/30) =>  "11101"
	i (1/30) =>  "11110"
	j (1/30) =>  "11111"
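
    A Python sketch of the construction just described (standard Huffman
    coding via a heap, not Gilbert's own code), run on the example
    probabilities.  Exact codewords depend on tie-breaking, but the code
    lengths match the table above, and the average works out to 2.45 bits:

        import heapq
        from fractions import Fraction
        from itertools import count

        def huffman(probs):
            # Repeatedly merge the two lowest-probability nodes; each merge
            # prepends one more bit to the codes of the symbols underneath.
            tie = count()
            heap = [(p, next(tie), {c: ""}) for c, p in probs.items()]
            heapq.heapify(heap)
            while len(heap) > 1:
                px, _, cx = heapq.heappop(heap)
                py, _, cy = heapq.heappop(heap)
                merged = {c: "0" + w for c, w in cx.items()}
                merged.update({c: "1" + w for c, w in cy.items()})
                heapq.heappush(heap, (px + py, next(tie), merged))
            return heap[0][2]

        probs = {"a": Fraction(1, 2), "b": Fraction(1, 6),
                 "c": Fraction(1, 12), "d": Fraction(1, 20)}
        probs.update({ch: Fraction(1, 30) for ch in "efghij"})
        codes = huffman(probs)
        print(sorted(codes.items()))
        print(float(sum(p * len(codes[c]) for c, p in probs.items())))  # 2.45
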
828.9 by SSDEVO::LARY Sun Feb 28 1988 21:35, 33 lines
In an infinite set like this, a most-probable-first scheme might work
better:

-	0 occurs 1/2 of the time so we should allocate 1/2 the "space"
	to it, so it gets the string "0".

-	1 and 2 occur 1/2 of the remainder of the time, so we should
	allocate 1/2 the remaining "space" to them, so 1 = 100 and 2 = 101

-	3, 4, 5 and 6 occur 1/2 of the remainder of the time, so we should
	allocate 1/2 of the remaining "space" to them, so:

		3 = 11000, 4 = 11001, 5 = 11010, 6 = 11011

-	In general, we map the interval [2^n-1,2^(n+1)-2] into:

	<n ones>0<n bit index within interval>

	since sigma(p(i)) over i in [2^n-1,2^(n+1)-2] equals 1/2^(n+1)

Now, what's the average length of this encoding?

The average length of encoding is:

(1/2)*1 + (1/4)*3 + (1/8)*5 + (1/16)*7 + ... + (1/2^n)*(2n-1) + ....

= 2*SUM(n>=1) n/2^n  -  SUM(n>=1) 1/2^n  =  2*2 - 1  =  3

So you can encode these (unbounded) numbers into an average of 3 bits apiece!

Now - is this minimal? I thought so, when I started thinking about it,
but I don't like the fact that there are no encodings of length 2, 4, etc.
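
A Python sketch of this most-probable-first code, plus a check of the 3-bit
average; group n (the numbers [2^n-1, 2^(n+1)-2]) has total probability
1/2^(n+1) and code length 2n+1:

    def encode(x):
        # x in [2^n - 1, 2^(n+1) - 2]  ->  n ones, a zero, n-bit offset
        n = (x + 1).bit_length() - 1
        if n == 0:
            return "0"
        offset = x - ((1 << n) - 1)
        return "1" * n + "0" + format(offset, "b").zfill(n)

    assert [encode(x) for x in range(7)] == [
        "0", "100", "101", "11000", "11001", "11010", "11011"]

    avg = sum((2 * n + 1) / 2 ** (n + 1) for n in range(200))
    print(round(avg, 6))    # 3.0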

828.10 "a 1% improvement" by CLT::GILBERT Sun Mar 20 1988 18:32, 16 lines
    The following encoding uses an average of 657/220 bits per number.

	 0 "0"
	 1 "100"	11 "1100101"	21 "110000010"
	 2 "1010"	12 "1100111"	22 "110000011"
	 3 "1101"	13 "1111010"	23 "110011000"
	 4 "10110"	14 "11000000"	24 "110011001"
	 5 "11100"	15 "11001000"	25 "111011000"
	 6 "11111"	16 "11001001"	26 "111011001"
	 7 "110001"	17 "11001101"	27 "111011100"
	 8 "111010"	18 "11101101"	28 "111011101"
	 9 "111100"	19 "11101111"	29 "111101100"
	10 "1100001"	20 "11110111"	30 "111101101"
	31..62	"10111000000".."10111011111"
	63..126	"1011110000000".."1011110111111"
	...
828.11 "2.98137 bits per number" by CLT::GILBERT Mon Mar 21 1988 14:08, 76 lines
    Let the number n be encoded with a string of d(n) bits.  Let p(n) be
    the probability of the number n, and let sp(n) = sum(p(i),i>=n).
    Let's consider a depth-order traversal of the encoding tree.

			/  \
    		       0  /  \
			 /\  /\
			1 2 /\/\
			    ....

    At the top-most level, every number must have at least one bit in it,
    and so we add 1 to the average cost per number.  At the next level, we
    let d(0) = 1, and all the other numbers must have another bit in their
    representations; this increases the average cost by sp(1) = 1/2.

    At the next level, we have two 2-bit encodings available.  We use neither,
    increase the average cost by sp(1) again, and at the next level we have
    four 3-bit encodings available.

    We use two of these to encode 1 and 2 (so that d(1) = d(2) = 3).  All the
    following numbers are at a deeper layer, so we increase the cost by sp(3)
    = 1/4, and we have (4-2)*2 4-bit encodings available for use on the numbers
    3 and up.

    In general, we have m different b-bit encodings available to encode the
    numbers i and greater.  Let f(i,m,b) be the average cost of this.  If we
    use j of these b-bit encodings (to encode i thru i+j-1), we have:

	f(i,m,b) = sp(i+j) + f(i+j,(m-j)*2,b+1)

    This function is independent of the third parameter, so we write:

	f(i,m) =    min    sp(i+j) + f(i+j,2*(m-j))
		0 <= j < m

    This equation holds for encoding any infinite set of numbers.  For a finite
    set, we would allow j = m if sp(i+j) = 0; i.e., f(i,m) = 0 if sp(i+m)=0.

    In terms of the top-most level, we have one 0-bit string, and want to
    encode the numbers 0 and greater; we want f(0,1).

    The tree drawn above has:

	 f(0,1)	= sp(0) + f(0,2) = 1 + f(0,2)		(i.e., j=0)
		= sp(0) + sp(1) + f(1,2)		      (j=1)
		= sp(0) + sp(1) + sp(1) + f(1,4)	      (j=0)
		= sp(0) + sp(1) + sp(1) + sp(3) + f(3,4)      (j=2)

    In the problem, sp(n) = 1/(n+1), so this is 1 + 1/2 + 1/2 + 1/4 + f(3,4).


    The following seems to give a near-optimal value of f(0,1) for our problem.

	d(0) = 1
	d(1) = d(2) = 3
	d(3) = 4
	d(4) = 5
	d(5) = ... = d(7) = 6
	d(8) = ... = d(11) = 7
	d(12) = ... = d(17) = 8
	...

    f(0,1) = 1 + 1/2 + 1/2 + 1/4 + 1/5 + 1/6 + f(5,10).
    f(5,10) = 1/9 + f(8,14).   (j = 3)
    f(8,14) = 1/13 + f(12,20).  (j = 4)
    f(12,20) = 1/19 + f(18,28).  (j = 6)
    f(18,28) = 1/27 + f(26,40).  (j = 8)
    ...
    f(a,b) = sp(b-2) + f(b-2,2a+4).  (j = b-a-2)
    ...
    f(5*2^n-2,  7*2^n) = sp( 7*2^n-2) + f( 7*2^n-2, 10*2^n), j = 2*2^n
    f(7*2^n-2, 10*2^n) = sp(10*2^n-2) + f(10*2^n-2, 14*2^n), j = 3*2^n
    ...


    This gives an average encoding of about 2.98137 bits per number.
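
    A quick numerical check of this figure, summing the sp() terms from
    the trace above with the 5*2^n-2 and 7*2^n-2 boundary pattern (my own
    arithmetic, not from the reply):

        # sp(0) + 2*sp(1) + sp(3) + sp(4) + sp(5), then the repeating pattern
        avg = 1 + 1/2 + 1/2 + 1/4 + 1/5 + 1/6
        avg += sum(1 / (5 * 2 ** n - 1) + 1 / (7 * 2 ** n - 1)
                   for n in range(1, 60))
        print(round(avg, 5))    # about 2.98137
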
828.12 "Bounding the optimum cost from below" by SSDEVO::LARY Wed Mar 23 1988 16:13, 32 lines
>	f(i,m) =    min    sp(i+j) + f(i+j,2*(m-j))
>		0 <= j < m

Finding the value of f(0,1) is difficult when sp(n) = 1/(n+1), as this is a
forward recurrence; however, if rigor is thrown to the winds you can claim
that:
	f(0,1) = lim     c(k),
		   k->inf			 /
						| 1/(n+1), n <= k
where c(k) is the value of f(0,1) when	sp(n) = |
						| 0, n > k
						 \
and that the limit is reached from below.

Some values of c(k) are:

	c(100) = 2.925034
	c(200) = 2.952711
	c(300) = 2.962501
	c(400) = 2.966932
	c(500) = 2.969769
	c(600) = 2.971856
	c(700) = 2.973080
	c(800) = 2.974121
	c(900) = 2.975006
	c(1000)= 2.975558
	c(1500)= 2.977481	
	c(2000)= 2.978456

So, 2.98137 looks pretty close to the optimum!

							Richie
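
A Python sketch of this truncated computation, implementing the recurrence
from .11 with sp(n) = 1/(n+1) for n <= k and 0 beyond (not Richie's program);
it should reproduce the table above:

    from functools import lru_cache

    def c(k):
        def sp(n):
            return 1.0 / (n + 1) if n <= k else 0.0

        @lru_cache(maxsize=None)
        def f(i, m):
            if i + m > k:         # sp(i+m) = 0, so taking j = m costs nothing
                return 0.0
            # f(i,m) = min over 0 <= j < m of  sp(i+j) + f(i+j, 2*(m-j))
            return min(sp(i + j) + f(i + j, 2 * (m - j)) for j in range(m))

        return f(0, 1)

    print(round(c(100), 6))       # compare with c(100) = 2.925034 above
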
828.13 by CLT::GILBERT Tue Apr 05 1988 12:50, 18 lines
    Some computer runs done by Richie Lary indicate that for large
    values, j = m-i-2 doesn't always minimize f(i,m) in the
    recurrence:

	f(i,m) =    min    sp(i+j) + f(i+j,2*(m-j))
		0 <= j < m

    However, the j value always seems to be within 2 of m-i.  If we
    can prove that sp(i+j) + f(i+j,2*(m-j)) is concave, then we can
    optimize the calculations of f(i,m) by trying j=m-i and value
    near it.  The goal is to develop a closed form expression for
    the optimal average encoding (e.g., the 2.981... value).

    By concave, I mean that for g(j) = sp(i+j) + f(i+j,2*(m-j)), we have:

	p < q < J < r < s  implies  g(p) >= g(q) >= g(J) <= g(r) <= g(s)

    where J is a (or the) j value that minimizes g(j).
828.14 "Perfect coding." by PBSVAX::COOPER (Topher Cooper) Tue Apr 12 1988 14:31, 48 lines
    A "perfect" code would encode the i'th message with
    
    		-lg(p_i) = lg(1/p_i)
    
    bits, where lg(x) is the logarithm to base 2 and p_i is the probability
    of the i'th message occurring.  The average code-length per message
    (the "information" in the code) is then
    
    		SUM(i) p_i lg(1/p_i)
    
    where the summation i is over all possible messages.
    
    Any practical code will have an average message length greater than
    or equal to this sum.
    
    For this particular problem, a bit of algebra (substituting
    p_i = 1/((i+1)(i+2)) and collecting the coefficient of each lg term)
    gives us:
    
    		2 SUM(i=2..inf) lg(i) / (i^2 - 1)
    
    Unfortunately this converges very slowly, if at all.  I have tried
    to determine if it converges by various means.  The simplest test
    (the limit of the ratio of adjacent terms) is ambiguous (the ratio
    equals 1, which means it may or may not converge).
    
    The good news is, if it does diverge (which would mean that no
    practical code could actually exist for the complete series) it
    does so very slowly.  This means that practical codes can be created
    for truncated versions of the series with the truncation occurring
    for high i.  This would explain the seeming successes so far.
    
    If we truncate at i=100 the average code length will be 2.791; at
    i=200 2.861; at i=300 2.888 (it *looks* like it's converging, but
    that may be deceiving).
    
    Anyone want to tackle proving whether or not the series converges?
    Setting a reasonable upper bound on what it converges to?
    
    (NOTE: The ideal can generally be approached more closely by coding
    n-tuples rather than single "messages".  As n increases the coding
    approaches the ideal as a limit).
    
    					Topher
828.15 by SSDEVO::LARY Tue Apr 12 1988 20:25, 32 lines
>    		2 SUM(i=2..inf) lg(i) / (i^2 - 1)
>

This can be shown to converge by comparison with another series:

	For i > 16,  i^2 - 1  >  i^1.9

	For i > 16,  lg i  <  sqrt(i)  =  i^0.5

Therefore, for i > 16,  lg(i)/(i^2 - 1)  <  i^(-1.4),  and SUM i^(-x)
converges for x > 1.

The sum of the series through i=1000000 is 2.9521+

For i > 1000000, each term is less than i^(-1.75), and the integral of
2x^(-1.75) dx from 1000000 to infinity is about .000036, so 2.9522 is a
safe upper bound.

Is there a theorem that says you can get arbitrarily close to the minimum
entropy with a discrete encoding? If so, we need to backtrack.....

							Richie
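
A quick numerical check (not part of the original reply) of the partial sums
of this series:

    from math import log2

    def partial(N):
        # 2 * SUM(i=2..N) lg(i) / (i^2 - 1)
        return 2 * sum(log2(i) / (i * i - 1) for i in range(2, N + 1))

    print(round(partial(10**6), 4))    # about 2.9521, as stated above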

828.16 "Think so, but not sure." by PBSVAX::COOPER (Topher Cooper) Wed Apr 13 1988 11:53, 8 lines
    I think that there is such a theorem, but don't know for sure (I
    moved in January but my books are all still in boxes due to repairs
    on the house).  I'll check the library.  In any case it is certainly
    the "popular wisdom".  I would be very surprised if better encodings
    could not be accomplished by looking at n-tuples rather than
    singletons.
    
    						Topher
828.17 "It's a theorem." by PBSVAX::COOPER (Topher Cooper) Wed Apr 13 1988 15:26, 19 lines
    Checked "Principles of Digital Communication" by Das, Mullick and
    Chatterjee in the HLO library this lunch hour.  It is indeed a theorem
    and is known as Shannon's First Fundamental Theorem.  It is stated
    as:
    
    	Suppose that there are a zero-memory source S with the entropy
    	H(X) per symbol and a noiseless channel with a capacity C bits/
    	message.  Then it is possible to encode the output of S (with
    	the encoder delivering messages to the channel) so that all
    	information generated in S is transmitted through the channel
    	without loss if, and only if C >= H(X).
    
    (page 537).  The case where the transmission can occur when C =
    H(X) is the case we've been talking about.
    
    The proof is essentially based on encoding n-tuples, with the limit
    as n -> infinity.
    
    					Topher
828.18 "Intro to arithmetic coding" by SSDEVO::LARY Mon May 02 1988 07:16, 72 lines
The reason why all of our proposed encodings are a bit worse than the
theoretical minimum seems to lie in the nature of binary encoding schemes.
Whenever we "group" a series of consecutive numbers into a set of codes of
the same length, we ignore the small differences in probability of
occurrence among the numbers in the set. This destroys information and
increases the encoded length.

What we need is "fractional bit" encodings. Do they exist? Of course they do!

Here's a neat one known as "Arithmetic coding", cribbed from some published
papers by Ian Witten and John Cleary of the University of Calgary, and extended
from finite alphabets to denumerable ones (a la the base note):

Given:	A denumerable set of characters X with P(n) = the occurrence probability
	of the nth char, and a sequence x(i) of characters from X

	Set S(n) = SUM(m=1..n-1) P(m)

	Set L=0, H=1

	For each x(i) in the sequence, find its index n in the character set
	and set L = L + (H-L)*S(n), H = L + (H-L)*S(n+1), where both new
	values are computed from the old L and H.

	When you've processed all the x(i), ANY number in the semi-open
	interval [L,H) is an encoding of the sequence.

An example helps here. Lets say the set X is {a,b,c},

	P(1) = 0.5	S(1) = 0
	P(2) = 0.2	S(2) = 0.5
	P(3) = 0.3	S(3) = 0.7
			S(4) = 1.0

To encode the string "abcba"

Start:			[L,H) = [0,1)
Process a, n=1:		[L,H) = [0,0.5)
Process b, n=2:		[L,H) = [0.25,0.35)
Process c, n=3:		[L,H) = [0.32,0.35)
Process b, n=2:		[L,H) = [0.335,0.341)
Process a, n=1:		[L,H) = [0.335,0.338)

What we are doing here is giving each character a "chunk" of the interval
[0,1); processing a character consists of extracting that character's
prorated "chunk" of the current interval and making that the new interval.

At first glance, it looks like the encoded result is infinite in size,
but it isn't; there is a number in the interval that can be represented
in at most [1-log2(H-L)] bits. In the example above the encoded
string in binary is 0101011, representing the number .0101011 which is
43/128 = .3359375. The decoding algorithm is left as an exercise for
the reader (I'm lazy).

A Huffman code based on the occurrence probabilities of a, b, and c would
assign a=0, b=10, c=11 and come up with the encoding 01011100, one bit longer.

Arithmetic coding always naturally produces minimum-entropy encodings;
that is, the limit (as m->infinity) of the encoding size of m-element
sequences generated according to the occurrence probabilities P(n), divided
by m, is sum(-P(n)*log2(P(n))). Huffman codings can approximate minimum-entropy
encodings arbitrarily closely, but the tables and encoding algorithms become
arbitrarily large and complex in the process.

An approximation to the above arithmetic coding algorithm can be represented
in computers with almost no loss of encoding efficiency, by accumulating
L and H in registers with at least twice the precision of the S(n) plus
a few guard bits. Whenever the values of L and H have any high-order bits
in common these bits are output (since all values in [L,H) will have identical
values in these bits) and L and H are scaled to remove these high-order bits.
(It's a little more complicated than this, but not much.)
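
A Python sketch of the interval-narrowing step described above (an
illustration, not code from the cited papers), reproducing the "abcba"
example and recovering the 7-bit codeword:

    from math import ceil

    P = {"a": 0.5, "b": 0.2, "c": 0.3}     # occurrence probabilities
    S = {"a": 0.0, "b": 0.5, "c": 0.7}     # cumulative sums S(n), as above

    def encode(seq):
        low, high = 0.0, 1.0
        for ch in seq:
            width = high - low
            # both bounds are computed from the old [low, high)
            low, high = low + width * S[ch], low + width * (S[ch] + P[ch])
        return low, high

    def shortest_codeword(low, high):
        # shortest binary fraction .xxx... that lies inside [low, high)
        b = 1
        while True:
            k = ceil(low * (1 << b))       # least multiple of 2^-b >= low
            if k / (1 << b) < high:
                return format(k, "b").zfill(b)
            b += 1

    low, high = encode("abcba")
    print(low, high)                       # ~[0.335, 0.338), up to float error
    print(shortest_codeword(low, high))    # 0101011, i.e. 43/128 = 0.3359375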