
Conference rusure::math

Title: Mathematics at DEC
Moderator: RUSURE::EDP
Created: Mon Feb 03 1986
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 2083
Total number of notes: 14613

828.0. "Integer data compression:" by CHOVAX::YOUNG (Back from the Shadows Again,) Sun Feb 21 1988 15:05

    OK, here's an interesting question:
    
    I wish to store, as a series of bits, an ordered sequence of 'random'
    non-negative (zero to infinity) integers.  The distribution of these
    integers is defined by:
    
    	P(x > n) = 1/(n+2)
    or:
    	The probability that any integer 'x' in the sequence is greater
    than some arbitrary integer 'n' is equal to '1/(n+2)'.
    
    Thus:
    	P(0)	=	      1/2	Half of the numbers are 0.
    	P(1)	= 1/2 - 1/3 = 1/6	1/6th are 1.
    	P(2)	= 1/3 - 1/4 = 1/12
    	P(3)	= 1/4 - 1/5 = 1/20
    	...
    or:
    	P(n)	= 1/(n+1) - 1/(n+2) = 1/(n^2 + 3n + 2)
    
    Now the question is: what is the most space-efficient way to represent
    such a sequence?  I should perhaps call it a challenge to see who can
    come up with the best solution, because I am not certain that there is
    a known solution to this problem.

    --  Barry
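
    A small Python check (not part of the original note) that the point
    probabilities P(n) = 1/((n+1)(n+2)) implied by P(x > n) = 1/(n+2)
    telescope to 1:

        # Sanity check: the point probabilities plus the tail P(x > N)
        # account for all of the probability mass.
        from fractions import Fraction

        def p(n):
            # P(x = n) = 1/(n+1) - 1/(n+2) = 1/((n+1)(n+2))
            return Fraction(1, (n + 1) * (n + 2))

        N = 1000
        total = sum(p(n) for n in range(N + 1)) + Fraction(1, N + 2)
        assert total == 1
        print([float(p(n)) for n in range(4)])   # 0.5, 0.1667, 0.0833, 0.05
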
828.1 by CADM::ROTH (If you plant ice you'll harvest wind) Mon Feb 22 1988 07:14, 14 lines
    If you could encode the data in such a way that entropy was minimized
    then that would be optimal.  Since you know the probabilities, simple
    Huffman encoding would give the best bit stream, I'd think (assuming that
    you are using bit streams and not more general symbols).

    Early in the sequence the tree would branch all to one side, since
    for example zero should have a leaf all by itself right off of the
    topmost branch.  But far into the series the probability of choosing
    one branch or another would average out (since (n+1)/(n+2) -> 1),
    and the subtrees should be asymptotically balanced.

    I'm too lazy to work out the details right now though.

    - Jim
828.2 "Is this Huffman?" by IOSG::CARLIN (you bet he doesn't drink Black Label) Tue Feb 23 1988 16:38, 37 lines
    Try this as a marker to judge other solutions by.
    
    0				=>	0
    1				=>	10
    2,3				=>	1100,1101
    4,5,6,7			=>	111000,111001,111010,111011
    8,9,10,11,12,13,14,15	=>	11110000,.........
    
    so mean length =
    
    	(1/1-1/2).1
       +(1/2-1/3).2
       +(1/3-1/4).4 +(1/4-1/5).4
       +(1/5-1/6).6 +.................+(1/8-1/9).6
       +...
    
    =	(1/1-1/2).1
       +(1/2-1/3).2
       +(1/3-1/5).4
       +(1/5-1/9).6
       +...
    =	1/2 + 2 * SUM(r>=0) 1/(2^r + 1)
    
    Taking 1/2^r as an upper bound for 1/(2^r + 1) after a few terms gives
    an upper bound for the mean length of:
    
    	1/2 + 2(1/2+1/3+1/5+1/9) + 2(1/8)
    
    Which I make 3.04 by hurried calculation. Seems quite good, in fact
    too good. Where's the mistake?
    
    							Dick
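
    A Python sketch of this code as I read the table above (not Dick's
    own program), plus a numerical value for the mean length:

        def encode(n):
            # 0 -> "0"; n in [2^(r-1), 2^r - 1] -> r ones, a zero, then the
            # (r-1)-bit offset of n within its group.
            if n == 0:
                return "0"
            r = n.bit_length()
            offset = n - (1 << (r - 1))
            suffix = format(offset, "b").zfill(r - 1) if r > 1 else ""
            return "1" * r + "0" + suffix

        assert [encode(n) for n in range(5)] == [
            "0", "10", "1100", "1101", "111000"]

        # Group r has probability 1/(2^(r-1)+1) - 1/(2^r+1) and length 2r;
        # the number 0 has probability 1/2 and length 1.
        mean = 0.5 + sum(2 * r * (1 / (2 ** (r - 1) + 1) - 1 / (2 ** r + 1))
                         for r in range(1, 60))
        print(round(mean, 3))    # about 3.03, in line with the estimate above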
    
828.3 by CLT::GILBERT (Builder) Tue Feb 23 1988 17:06, 12 lines
>   I wish to store, as a series of bits, an ordered sequence of 'random'
>   non-negative (zero to infinity) integers.  The distribution [...]

    Are the numbers in the sequence distinct?  If not, then it seems
    worthwhile to include the 'multiplicity' of the "0" in the sequence.

    If the numbers *are* distinct, then you can easily improve on the
    Huffman encoding (of the previous reply) by making use of the fact
    that the numbers are in ascending order.  The first bit indicates
    whether "0" is in the sequence, and use Huffman encoding on the
    remaining numbers.  This decreases the encoding by about one bit
    per number.
828.4 by SSDEVO::LARY Tue Feb 23 1988 19:31, 4 lines
If the numbers are in ascending order, then encode the difference between
numbers rather than the numbers themselves; since encoded length increases
with magnitude, this can only win!

828.5 "Its just a 'sequence'." by CHOVAX::YOUNG (Back from the Shadows Again,) Wed Feb 24 1988 01:02, 22 lines
    Re .3, .4:
    
    Sorry, I guess "ordered sequence" is a little redundant.  I only mean
    that <0,3,0,1> is distinct from <0,0,1,3>.  As opposed to, say, sets
    where order is unimportant.  I did NOT mean that the integers were
    sorted or necessarily ascending.  It's just a sequence of integers.
    
    Re .1:
    
    I am far from an expert with Huffman(sp?) encoding algorithms,
    and I must confess that you have lost me after the word "..tree..".
    Guess I could look it up though.  However, what little I do know
    leads me to 2 additional questions: 1) Huffman encoding as I have
    heard it described seems to be intended only to address encoding
    data from a finite range of values, NOT from an infinite range as
    this problem outlines.  Perhaps I just need a more explicit description
    of your approach?  And 2) papers that I have read seem to indicate
    that arithmetic encoding is in general more efficient than Huffman
    encoding.
    
    
    --  Barry
828.6 "will this work?" by CADM::ROTH (If you plant ice you'll harvest wind) Wed Feb 24 1988 08:23, 31 lines
    It should be possible to search for an optimal encoding pretty
    easily, though it's not clear if a simple closed form expression
    exists.

    Suppose you encode the numbers in sets of words that look like this:

	1111111xxxx

    where the i'th set of words has m[i] leading ones followed by a k[i]-
    bit binary integer in the range [0,2^k[i]-2].  This set encodes 2^k[i]-1
    consecutive numbers in the range n[i] to n[i+1]-1.

    This encoding would be completely characterized by a nondecreasing
    sequence of k's, and we would have the relations

	n[i+1] = n[i]+2^k[i]-1	First number encoded in [i+1]'th word set
	m[i+1] = m[i]+k[i]	Number of leading 1's in [i+1]'th word set

    The probability of the i'th word set being used would be

	p[i] = (2^k[i]-1)/((n[i]+1)*(n[i+1]+1))

    and you want to minimize the expected length of the code words

	E(m+k) = SUM(i > 0) p[i]*(m[i]+k[i])

    Just search for the sequence of k's by successively increasing the
    trailing values of k past the current index until an increase in E(m+k)
    is seen, then go on to the next k in the sequence.

    - Jim
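
    A Python sketch of evaluating E(m+k) for a candidate sequence of k's,
    assuming m[0] = 0 and n[0] = 0 (the reply does not spell these out):

        def expected_length(ks, m0=0):
            # Word set i: m[i] leading ones plus a k[i]-bit tail, covering
            # the 2^k[i]-1 numbers n[i] .. n[i+1]-1.  Returns the accumulated
            # E(m+k) and the probability mass not covered by any word set.
            n, m, total = 0, m0, 0.0
            for k in ks:
                n_next = n + (1 << k) - 1
                if n_next > n:
                    p = ((1 << k) - 1) / ((n + 1) * (n_next + 1))
                    total += p * (m + k)
                n, m = n_next, m + k
            return total, 1.0 / (n + 1)

        # Example only: k[i] = i + 1, an arbitrary choice, not the optimum.
        e, uncovered = expected_length([i + 1 for i in range(40)])
        print(round(e, 4), uncovered)
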
828.7 "Where do the one's end?" by IOSG::CARLIN (Perrin here, on green) Sat Feb 27 1988 13:41, 36 lines
re .6
    >    Suppose you encode the numbers in sets of words that look like this:
    >
    >	1111111xxxx
    >
    > where the i'th set of words has m[i] leading ones followed by a k[i]-
    >bit binary integer in the range [0,2^k[i]-2].  This set encodes 2^k[i]-1
    >consecutive numbers in the range n[i] to n[i+1]-1.
    >
    
    1. Don't you need a zero "stopper" to show where the sequences of
    ones end?
    
    2. Why not let the xxxx's run from 0 to 2^k[i]-1?
    
    3. Isn't the optimal m[i] simply i?
    
    So the i'th set consists of numbers of the form:
    
    	111110xxxxxx
    
    	(i 1's / 0 / k-bit number)
    
    As you say, the optimal k sequence would need an infinite look-up
    table to encode the numbers.  I chose:
    
    		k[0] = 0
    		k[i] = i-1 (i>0) in reply .2
    
    Jim, can you elaborate on that method you gave to find the optimal
    k sequence?
    
    Cheers
    
    Dick
    
828.8 by CLT::GILBERT (Builder) Sat Feb 27 1988 14:58, 48 lines
    The Huffman (optimal) encoding can be achieved by the following
    algorithm (when there are only a finite number of characters).

    Let S = { (c0,p0), (c1,p1), ..., (cn,pn) } be the set of 'character',
    probability pairs.  If there is only one pair, encode that character
    with the null string, and we are done (except for backsubstitutions,
    below).

    Otherwise, there are at least two pairs in S, so choose two having the
    lowest probabilities, (cx,px) and (cy,py).  Create a new character cz.
    The encoding for cx will be the encoding for cz followed by "0", and 
    cy will be encoded by the encoding for cz followed by "1".  Remove
    (cx,px), (cy,py) from S, and insert (cz,px+py) into S.  And repeat.


    For example, suppose we have:

	{ (a,1/2), (b,1/6), (c,1/12), (d,1/20), (e,1/30),
	  (f,1/30), (g,1/30), (h,1/30), (i,1/30), (j,1/30) }

    A tree shows the effect of running the above algorithm:

	          1
	        /   \
	     1/2     1/2
	           /      \
	        /            \
	  17/60                13/60
	  /   \                /   \
	1/6   7/60         1/12     2/15
	      /  \                 /    \
	     /    \               /      \
	  1/20    1/15        1/15        1/15
	       1/30  1/30  1/30  1/30  1/30  1/30

    Where each leaf represents one of the characters in the original set.
    We get the encoding:

	a (1/2)  =>  "0"
	b (1/6)  =>  "100"
	d (1/20) =>  "1010"
	e (1/30) =>  "10110"
	f (1/30) =>  "10111"
	c (1/12) =>  "110"
	g (1/30) =>  "11100"
	h (1/30) =>  "11101"
	i (1/30) =>  "11110"
	j (1/30) =>  "11111"
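
    A Python sketch of the construction just described (standard Huffman
    coding via a heap, not Gilbert's own code), run on the example
    probabilities.  Exact codewords depend on tie-breaking, but the code
    lengths match the table above, and the average works out to 2.45 bits:

        import heapq
        from fractions import Fraction
        from itertools import count

        def huffman(probs):
            # Repeatedly merge the two lowest-probability nodes; each merge
            # prepends one more bit to the codes of the symbols underneath.
            tie = count()
            heap = [(p, next(tie), {c: ""}) for c, p in probs.items()]
            heapq.heapify(heap)
            while len(heap) > 1:
                px, _, cx = heapq.heappop(heap)
                py, _, cy = heapq.heappop(heap)
                merged = {c: "0" + w for c, w in cx.items()}
                merged.update({c: "1" + w for c, w in cy.items()})
                heapq.heappush(heap, (px + py, next(tie), merged))
            return heap[0][2]

        probs = {"a": Fraction(1, 2), "b": Fraction(1, 6),
                 "c": Fraction(1, 12), "d": Fraction(1, 20)}
        probs.update({ch: Fraction(1, 30) for ch in "efghij"})
        codes = huffman(probs)
        print(sorted(codes.items()))
        print(float(sum(p * len(codes[c]) for c, p in probs.items())))  # 2.45
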
828.9 by SSDEVO::LARY Sun Feb 28 1988 21:35, 33 lines
In an infinite set like this, a most-probable-first scheme might work
better:

-	0 occurs 1/2 of the time so we should allocate 1/2 the "space"
	to it, so it gets the string "0".

-	1 and 2 occur 1/2 of the remainder of the time, so we should
	allocate 1/2 the remaining "space" to them, so 1 = 100 and 2 = 101

-	3, 4, 5 and 6 occur 1/2 of the remainder of the time, so we should
	allocate 1/2 of the remaining "space" to them, so:

		3 = 11000, 4 = 11001, 5 = 11010, 6 = 11011

-	In general, we map the interval [2^n-1,2^(n+1)-2] into:

	<n ones>0<n bit index within interval>

	since sigma(p(i)) over i in [2^n-1,2^(n+1)-2] equals 1/2^(n+1)

Now, what's the average length of this encoding?

The average length of encoding is:

(1/2)*1 + (1/4)*3 + (1/8)*5 + (1/16)*7 + ... + (1/2^n)*(2n-1) + ....

= 2*SUM(n>=1) n/2^n  -  SUM(n>=1) 1/2^n  =  2*2 - 1  =  3

So you can encode these (unbounded) numbers into an average of 3 bits apiece!

Now - is this minimal? I thought so, when I started thinking about it,
but I don't like the fact that there are no encodings of length 2, 4, etc.
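
A Python sketch of this most-probable-first code, plus a check of the 3-bit
average; group n (the numbers [2^n-1, 2^(n+1)-2]) has total probability
1/2^(n+1) and code length 2n+1:

    def encode(x):
        # x in [2^n - 1, 2^(n+1) - 2]  ->  n ones, a zero, n-bit offset
        n = (x + 1).bit_length() - 1
        if n == 0:
            return "0"
        offset = x - ((1 << n) - 1)
        return "1" * n + "0" + format(offset, "b").zfill(n)

    assert [encode(x) for x in range(7)] == [
        "0", "100", "101", "11000", "11001", "11010", "11011"]

    avg = sum((2 * n + 1) / 2 ** (n + 1) for n in range(200))
    print(round(avg, 6))    # 3.0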

828.10 "a 1% improvement" by CLT::GILBERT Sun Mar 20 1988 18:32, 16 lines
    The following encoding uses an average of 657/220 bits per number.

	 0 "0"
	 1 "100"	11 "1100101"	21 "110000010"
	 2 "1010"	12 "1100111"	22 "110000011"
	 3 "1101"	13 "1111010"	23 "110011000"
	 4 "10110"	14 "11000000"	24 "110011001"
	 5 "11100"	15 "11001000"	25 "111011000"
	 6 "11111"	16 "11001001"	26 "111011001"
	 7 "110001"	17 "11001101"	27 "111011100"
	 8 "111010"	18 "11101101"	28 "111011101"
	 9 "111100"	19 "11101111"	29 "111101100"
	10 "1100001"	20 "11110111"	30 "111101101"
	31..62	"10111000000".."10111011111"
	63..126	"1011110000000".."1011110111111"
	...
828.11 "2.98137 bits per number" by CLT::GILBERT Mon Mar 21 1988 14:08, 76 lines
    Let the number n be encoded with a string of d(n) bits.  Let p(n) be
    the probability of the number n, and let sp(n) = sum(p(i),i>=n).
    Let's consider a depth-order traversal of the encoding tree.

			/  \
    		       0  /  \
			 /\  /\
			1 2 /\/\
			    ....

    At the top-most level, every number must have at least one bit in it,
    and so we add 1 to the average cost per number.  At the next level, we
    let d(0) = 1, and all the other numbers must have another bit in their
    representations; this increases the average cost by sp(1) = 1/2.

    At the next level, we have two 2-bit encodings available.  We use neither,
    increase the average cost by sp(1) again, and at the next level we have
    four 3-bit encodings available.

    We use two of these to encode 1 and 2 (so that d(1) = d(2) = 3).  All the
    following numbers are at a deeper layer, so we increase the cost by sp(3)
    = 1/4, and we have (4-2)*2 4-bit encodings available for use on the numbers
    3 and up.

    In general, we have m different b-bit encodings available to encode the
    numbers i and greater.  Let f(i,m,b) be the average cost of this.  If we
    use j of these b-bit encodings (to encode i thru i+j-1), we have:

	f(i,m,b) = sp(i+j) + f(i+j,(m-j)*2,b+1)

    This function is independent of the third parameter, so we write:

	f(i,m) =    min    sp(i+j) + f(i+j,2*(m-j))
		0 <= j < m

    This equation holds for encoding any infinite set of numbers.  For a finite
    set, we would allow j = m if sp(i+j) = 0; i.e., f(i,m) = 0 if sp(i+m)=0.

    In terms of the top-most level, we have one 0-bit string, and want to
    encode the numbers 0 and greater; we want f(0,1).

    The tree drawn above has:

	 f(0,1)	= sp(0) + f(0,2) = 1 + f(0,2)		(i.e., j=0)
		= sp(0) + sp(1) + f(1,2)		      (j=1)
		= sp(0) + sp(1) + sp(1) + f(1,4)	      (j=0)
		= sp(0) + sp(1) + sp(1) + sp(3) + f(3,4)      (j=2)

    In the problem, sp(n) = 1/(n+1), so this is 1 + 1/2 + 1/2 + 1/4 + f(3,4).


    The following seems to give a near-optimal value of f(0,1) for our problem.

	d(0) = 1
	d(1) = d(2) = 3
	d(3) = 4
	d(4) = 5
	d(5) = ... = d(7) = 6
	d(8) = ... = d(11) = 7
	d(12) = ... = d(17) = 8
	...

    f(0,1) = 1 + 1/2 + 1/2 + 1/4 + 1/5 + 1/6 + f(5,10).
    f(5,10) = 1/9 + f(8,14).   (j = 3)
    f(8,14) = 1/13 + f(12,20).  (j = 4)
    f(12,20) = 1/19 + f(18,28).  (j = 6)
    f(18,28) = 1/27 + f(26,40).  (j = 8)
    ...
    f(a,b) = sp(b-2) + f(b-2,2a+4).  (j = b-a-2)
    ...
    f(5*2^n-2,  7*2^n) = sp( 7*2^n-2) + f( 7*2^n-2, 10*2^n), j = 2*2^n
    f(7*2^n-2, 10*2^n) = sp(10*2^n-2) + f(10*2^n-2, 14*2^n), j = 3*2^n
    ...


    This gives an average encoding of about 2.98137 bits per number.
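
    A quick numerical check of this figure, summing the sp() terms from
    the trace above with the 5*2^n-2 and 7*2^n-2 boundary pattern (my own
    arithmetic, not from the reply):

        # sp(0) + 2*sp(1) + sp(3) + sp(4) + sp(5), then the repeating pattern
        avg = 1 + 1/2 + 1/2 + 1/4 + 1/5 + 1/6
        avg += sum(1 / (5 * 2 ** n - 1) + 1 / (7 * 2 ** n - 1)
                   for n in range(1, 60))
        print(round(avg, 5))    # about 2.98137
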
828.12 "Bounding the optimum cost from below" by SSDEVO::LARY Wed Mar 23 1988 16:13, 32 lines
>	f(i,m) =    min    sp(i+j) + f(i+j,2*(m-j))
>		0 <= j < m

Finding the value of f(0,1) is difficult when sp(n) = 1/(n+1), as this is a
forward recurrence; however, if rigor is thrown to the winds you can claim
that:
	f(0,1) = lim     c(k),
		   k->inf			 /
						| 1/(n+1), n <= k
where c(k) is the value of f(0,1) when	sp(n) = |
						| 0, n > k
						 \
and that the limit is reached from below.

Some values of c(k) are:

	c(100) = 2.925034
	c(200) = 2.952711
	c(300) = 2.962501
	c(400) = 2.966932
	c(500) = 2.969769
	c(600) = 2.971856
	c(700) = 2.973080
	c(800) = 2.974121
	c(900) = 2.975006
	c(1000)= 2.975558
	c(1500)= 2.977481	
	c(2000)= 2.978456

So, 2.98137 looks pretty close to the optimum!

							Richie
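
A Python sketch of this truncated computation, implementing the recurrence
from .11 with sp(n) = 1/(n+1) for n <= k and 0 beyond (not Richie's program);
it should reproduce the table above:

    from functools import lru_cache

    def c(k):
        def sp(n):
            return 1.0 / (n + 1) if n <= k else 0.0

        @lru_cache(maxsize=None)
        def f(i, m):
            if i + m > k:         # sp(i+m) = 0, so taking j = m costs nothing
                return 0.0
            # f(i,m) = min over 0 <= j < m of  sp(i+j) + f(i+j, 2*(m-j))
            return min(sp(i + j) + f(i + j, 2 * (m - j)) for j in range(m))

        return f(0, 1)

    print(round(c(100), 6))       # compare with c(100) = 2.925034 above
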
828.13 by CLT::GILBERT Tue Apr 05 1988 12:50, 18 lines
    Some computer runs done by Richie Lary indicate that for large
    values, j = m-i-2 doesn't always minimize f(i,m) in the
    recurrence:

	f(i,m) =    min    sp(i+j) + f(i+j,2*(m-j))
		0 <= j < m

    However, the j value always seems to be within 2 of m-i.  If we
    can prove that sp(i+j) + f(i+j,2*(m-j)) is concave, then we can
    optimize the calculations of f(i,m) by trying j=m-i and value
    near it.  The goal is to develop a closed form expression for
    the optimal average encoding (e.g., the 2.981... value).

    By concave, I mean that for g(j) = sp(i+j) + f(i+j,2*(m-j)), we have:

	p < q < J < r < s  implies  g(p) >= g(q) >= g(J) <= g(r) <= g(s)

    where J is a (or the) j value that minimizes g(j).
828.14 "Perfect coding." by PBSVAX::COOPER (Topher Cooper) Tue Apr 12 1988 14:31, 48 lines
    A "perfect" code would encode the i'th message with
    
    		-lg(p_i) = lg(1/p_i)
    
    bits, where lg(x) is the logarithm to base 2 and p_i is the probability
    of the i'th message occurring.  The average code-length per message
    (the "information" in the code) is then
    
    		SUM(i) p_i lg(1/p_i)
    
    where the summation i is over all possible messages.
    
    Any practical code will have an average message length greater than
    or equal to this sum.
    
    For this particular problem, a bit of algebra (substituting
    p_i = 1/((i+1)(i+2)) and collecting the coefficient of each lg term)
    gives us:
    
    		2 SUM(i=2..inf) lg(i) / (i^2 - 1)
    
    Unfortunately this converges very slowly, if at all.  I have tried
    to determine if it converges by various means.  The simplest test
    (the limit of the ratio of adjacent terms) is ambiguous (the ratio
    equals 1, which means it may or may not converge).
    
    The good news is, if it does diverge (which would mean that no
    practical code could actually exist for the complete series) it
    does so very slowly.  This means that practical codes can be created
    for truncated versions of the series with the truncation occurring
    for high i.  This would explain the seeming successes so far.
    
    If we truncate at i=100 the average code length will be 2.791; at
    i=200 2.861; at i=300 2.888 (it *looks* like it's converging, but
    that may be deceiving).
    
    Anyone want to tackle proving whether or not the series converges?
    Setting a reasonable upper bound on what it converges to?
    
    (NOTE: The ideal can generally be approached more closely by coding
    n-tuples rather than single "messages".  As n increases the coding
    approaches the ideal as a limit).
    
    					Topher
828.15 by SSDEVO::LARY Tue Apr 12 1988 20:25, 32 lines
>    		2 SUM(i=2..inf) lg(i) / (i^2 - 1)
>

This can be shown to converge by comparison with another series:

	For i > 16,  i^2 - 1  >  i^1.9

	For i > 16,  lg i  <  sqrt(i)  =  i^0.5

Therefore, for i > 16,  lg(i)/(i^2 - 1)  <  i^(-1.4),  and SUM i^(-x)
converges for x > 1.

The sum of the series through i=1000000 is 2.9521+

For i > 1000000, each term is less than i^(-1.75), and the integral of
2x^(-1.75) dx from 1000000 to infinity is about .000036, so 2.9522 is a
safe upper bound.

Is there a theorem that says you can get arbitrarily close to the minimum
entropy with a discrete encoding? If so, we need to backtrack.....

							Richie
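
A quick numerical check (not part of the original reply) of the partial sums
of this series:

    from math import log2

    def partial(N):
        # 2 * SUM(i=2..N) lg(i) / (i^2 - 1)
        return 2 * sum(log2(i) / (i * i - 1) for i in range(2, N + 1))

    print(round(partial(10**6), 4))    # about 2.9521, as stated above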

828.16 "Think so, but not sure." by PBSVAX::COOPER (Topher Cooper) Wed Apr 13 1988 11:53, 8 lines
    I think that there is such a theorem, but don't know for sure (I
    moved in January but my books are all still in boxes due to repairs
    on the house).  I'll check the library.  In any case it is certainly
    the "popular wisdom".  I would be very surprised if better encodings
    could not be accomplished by looking at n-tuples rather than
    singletons.
    
    						Topher
828.17 "It's a theorem." by PBSVAX::COOPER (Topher Cooper) Wed Apr 13 1988 15:26, 19 lines
    Checked "Principles of Digital Communication" by Das, Mullick and
    Chatterjee in the HLO library this lunch hour.  It is indeed a theorem
    and is known as Shannon's First Fundamental Theorem.  It is stated
    as:
    
    	Suppose that there are a zero-memory source S with the entropy
    	H(X) per symbol and a noiseless channel with a capacity C bits/
    	message.  Then it is possible to encode the output of S (with
    	the encoder delivering messages to the channel) so that all
    	information generated in S is transmitted through the channel
    	without loss if, and only if C >= H(X).
    
    (page 537).  The case where the transmission can occur when C =
    H(X) is the case we've been talking about.
    
    The proof is essentially based on encoding n-tuples, with the limit
    as n -> infinity.
    
    					Topher
828.18 "Intro to arithmetic coding" by SSDEVO::LARY Mon May 02 1988 07:16, 72 lines
The reason why all of our proposed encodings are a bit worse than the
theoretical minimum seems to lie in the nature of binary encoding schemes.
Whenever we "group" a series of consecutive numbers into a set of codes of
the same length, we ignore the small differences in probability of
occurrence among the numbers in the set. This destroys information and
increases the encoded length.

What we need is "fractional bit" encodings. Do they exist? Of course they do!

Here's a neat one known as "Arithmetic coding", cribbed from some published
papers by Ian Witten and John Cleary of the University of Calgary, and extended
from finite alphabets to denumerable ones (a la the base note):

Given:	A denumerable set of characters X with P(n) = the occurrence probability
	of the nth char, and a sequence x(i) of characters from X

	Set S(n) = SUM(m=1..n-1) P(m)

	Set L=0, H=1

	For each x(i) in the sequence, find its index n in the character set
	and set L = L + (H-L)*S(n), H = L + (H-L)*S(n+1), where both new
	values are computed from the old L and H.

	When you've processed all the x(i), ANY number in the semi-open
	interval [L,H) is an encoding of the sequence.

An example helps here. Lets say the set X is {a,b,c},

	P(1) = 0.5	S(1) = 0
	P(2) = 0.2	S(2) = 0.5
	P(3) = 0.3	S(3) = 0.7
			S(4) = 1.0

To encode the string "abcba"

Start:			[L,H) = [0,1)
Process a, n=1:		[L,H) = [0,0.5)
Process b, n=2:		[L,H) = [0.25,0.35)
Process c, n=3:		[L,H) = [0.32,0.35)
Process b, n=2:		[L,H) = [0.335,0.341)
Process a, n=1:		[L,H) = [0.335,0.338)

What we are doing here is giving each character a "chunk" of the interval
[0,1); processing a character consists of extracting that character's
prorated "chunk" of the current interval and making that the new interval.

At first glance, it looks like the encoded result is infinite in size,
but it isn't; there is a number in the interval that can be represented
in at most [1-log2(H-L)] bits. In the example above the encoded
string in binary is 0101011, representing the number .0101011 which is
43/128 = .3359375. The decoding algorithm is left as an exercise for
the reader (I'm lazy).

A Huffman code based on the occurrence probabilities of a, b, and c would
assign a=0, b=10, c=11 and come up with the encoding 01011100, one bit longer.

Arithmetic coding always naturally produces minimum-entropy encodings;
that is, the limit (as m->infinity) of the encoding size of m-element
sequences generated according to the occurrence probabilities P(n), divided
by m, is sum(-P(n)*log2(P(n))). Huffman codings can approximate minimum-entropy
encodings arbitrarily closely, but the tables and encoding algorithms become
arbitrarily large and complex in the process.

An approximation to the above arithmetic coding algorithm can be represented
in computers with almost no loss of encoding efficiency, by accumulating
L and H in registers with at least twice the precision of the S(n) plus
a few guard bits. Whenever the values of L and H have any high-order bits
in common these bits are output (since all values in [L,H) will have identical
values in these bits) and L and H are scaled to remove these high-order bits.
(It's a little more complicated than this, but not much.)
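
A Python sketch of the interval-narrowing step described above (an
illustration, not code from the cited papers), reproducing the "abcba"
example and recovering the 7-bit codeword:

    from math import ceil

    P = {"a": 0.5, "b": 0.2, "c": 0.3}     # occurrence probabilities
    S = {"a": 0.0, "b": 0.5, "c": 0.7}     # cumulative sums S(n), as above

    def encode(seq):
        low, high = 0.0, 1.0
        for ch in seq:
            width = high - low
            # both bounds are computed from the old [low, high)
            low, high = low + width * S[ch], low + width * (S[ch] + P[ch])
        return low, high

    def shortest_codeword(low, high):
        # shortest binary fraction .xxx... that lies inside [low, high)
        b = 1
        while True:
            k = ceil(low * (1 << b))       # least multiple of 2^-b >= low
            if k / (1 << b) < high:
                return format(k, "b").zfill(b)
            b += 1

    low, high = encode("abcba")
    print(low, high)                       # ~[0.335, 0.338), up to float error
    print(shortest_codeword(low, high))    # 0101011, i.e. 43/128 = 0.3359375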