[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference rusure::math

Title:	Mathematics at DEC

Moderator:	RUSURE::EDP

Created:	Mon Feb 03 1986
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	2083
Total number of notes:	14613

203.0. "A,B,C wth no dup. substrings" by SPRITE::OSMAN () Thu Jan 03 1985 17:51

Given the three characters A, B, and C, how long a string can you make
with no duplicate consecutive substrings ?  Prove your answer.

With only TWO characters A and B, we can start with A, we're not allowed
to grow to AA because A is duplicated, so we grow to AB.  Then we're forced
to grow to ABA.  At this point we're stuck because ABAA is a duplicated A and
ABAB is a duplicated AB.

With three, we can start with ABACBC . . . but how far can we go ?  Forever ?
Note that it is possible to get stuck but that doesn't mean you're done.
For instance, ABACABA looks complete since all three possible next letters
respectively cause a duplicated A, AB, and ABAC.  However, nobody said you
have to take this path !  (nobody said you have to waste time with this
problem either; reminds me of my brother to whom I once said [as I so often
do to people, being a lover of puzzles] "Willy, here's a problem for you.
Suppose . . ." at the end of which he replied "Eric that may be a problem
for you, it's no problem for me !")

T.R	Title	User	Personal Name	Date	Lines
203.1		TURTLE::GILBERT		`Thu Jan 03 1985 23:31`	23
	The answer is "forever". Here is the beginning of the lexigraphically first such sequence. It's broken up into pieces lines for readability. ABACABCACBABC ABACABCACBAC ABACBABC ABACABCACBABC ABACBABCACBAC ABACBABC ABACABCACBABCBAC ABACBABC ABACABCACBAC ABACBABC ABACABCBABC ABACBABCACBAC ABACBABC ABACABCACBABC ABACBABCACBAC ABACBABC ABACABCBABC ABACBABCACBAC ABACBABC ABACBCABCBABC ABACABCACBABC ABACABCBABC ABACBABC Note that the same pieces reoccur again and again (this, of course, must happen in an infinite sequence). Here is a constructive proof showing that such a sequence can be infinite. Consider the three pieces: ABACABCACBABC, ABACABCACBAC, and ABACBABC. Note that none of these pieces can be found in a concatenation of the pieces, except as an exact match with itself. Furthermore, any two different pieces can be placed next to each other. Thus, these pieces behave like the original pieces, A, B, AND C. So, take any sequence satisfying the conditions, and replace each occurence of A, B and C by ABACABCACBABC, ABACABCACBAC, and ABACBABC, respectively. This sequence also satisfies the conditions of the problem. QED.
203.2		SPRITE::OSMAN		`Thu Sep 12 1985 16:29`	16
	I'm not convinced. Using your construction, it seems possible to me for some duplicate repeating substring to appear which starts in the MIDDLE of one of your sequences and ends in the MIDDLE of another. For instance, consider [----------------] [-------------] [-----------------] [***********][*************] The dashed sections represent portions generated by your method. The starred lines located parts of your string that seem to me could be a duplicated substring. So it's not clear you've proved it yet. /Eric
203.3		TURTLE::GILBERT		`Fri Sep 13 1985 09:27`	6
	No, I think that that can't happen. However, the following can and does: [-----------------][--------------] [****][****] Thanks for pointing this out. Any ideas?
203.4		BEING::POSTPISCHIL		`Fri Sep 13 1985 10:04`	9
	Re .3: Why do you say the situation posed in .2 cannot happen? The situation given in .3 does occur: ABACABCACBAC ABACABCACBABC contains ACAB ACAB. (Start with the last A of the first string.) -- edp
203.5		TURTLE::GILBERT		`Sat Sep 14 1985 19:13`	84
	Here is an outline of a proof. Suppose we have strings s , s ,..., s , with each string composed of characters 1 2 n in the alphabet {A,B,C}. Furthermore, suppose each s can be written as i s = t t , where the t are able to uniquely identify the s , in the i i,1 i,2 i i following sense. A substring t of s uniquely identifies the s , if, in any i i concatenation of the s, (s \|s \|...\|s ), containing exactly one occurrence 1 2 n of s , there is exactly one occurence of t. i Let us suppose a string z of (s \|s \|...\|s ), contains a repeated, contiguous 1 2 n subsequence of characters: z = z y y z , where the length of y is greater than 1 2 the length of any of the s . Now we can show that z must contain a repeated, i contiguous subsequence of characters: z = z x x z , where x is a concatenation 3 4 of the s, and is longer than any of the s. Suppose that the `split' between the ys occurs in s , splitting it as follows: i <--t(i,1)--><--t(i,2)--> <--------------y--------------><--------------y--------------> Now because the y are equal, we know that the right-most part of y contains an occurrence of t(i,1), and so we are able to infer the next few characters following the second y, and can find the desired x: <--t(i,1)--><--t(i,2)--> ... <--t(i,1)--><--t(i,2)--> <--------------y--------------><--------------y--------------> <--------------x--------------><--------------x--------------> If the `split' occurs in a t(i,1), the demonstration is similar, and we extend the first y leftward. Now suppose we can find 3 such s , with uniquely identifying substrings t , i i,j and suppose further that we are able to show that no string of (s \|s \|...!s ), 1 2 n with no s being adjacent to itself, contains a repeated, contiguous substring i of length less than the longest s . Then, we will be able to show that there i exists an abitrarily long string of {A,B,C}, with no repeated, contiguous substring, since, given any such string of {A,B,C}, we can produce a longer one by replacing each A, B, and C with s , s , and s , respectively (this 1 2 3 longer string is shown to contain to repeated contiguous substring, by the above results, and a proof by contradiction). By the infinity lemma, this shows that there exists an infinitely long string of {A,B,C} containing no repeated, contiguous substrings. The problem, then, is to find such s . Suppose we are presented with strings, i and asked to verify that they satisfy the conditions. This can be done in a finite amount of time by only considering all strings of (s \|s \|...\|s ) with 1 2 n length less than 4 times the longest s , since this will suffice to find all i ways that the s could be combined to form contiguous duplications of t , or i,j any other repeated, contiguous substrings shorter than the longest s: <---s---> ... <---s---> <---t---><---t---> Now, to find such s.
203.6		BEING::POSTPISCHIL		`Mon Sep 16 1985 09:24`	19
	Re .5: I think "the length of y is greater than the length of any of the s[i]" should be changed to ". . . greater than or equal to . . .". To help with finding appropriate strings, I have a program which prints out non-null strings of (A+B+C)* which contain no repeated substrings. It may be found in BEING""::[POSTPISCHIL.PUB]NR.EXE. It will print strings in order of increasing length, but not in lexical order. There is currently no provision for stopping it if it does not run out of memory and there are an infinite number of such strings, so either run it on a hardcopy terminal or "DEFINE/USER SYS$OUTPUT filename" and stop it after awhile. If anybody wants to work with the source to have it do other things, it is in NR.C, although there are no comments. If anybody wishes, I will add comments shortly. -- edp
203.7		TURTLE::GILBERT		`Mon Sep 16 1985 10:51`	286
	Here are the first 22528 characters of the 'lexically first' string. 22528: abacabcacbabcabacabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcacbabc bacabacbabcabacabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabac babcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabaca bcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabc acbacabacbabcabacbcabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbab cabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabca cbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcba bcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbab cabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcac babcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcac babcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacb abcbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabc abacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcac babcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacb abcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabc abacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabca cbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcab acbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacb acabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbab cabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabc abacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacb acabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabc abacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabc abacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabca bacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacb abcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacba bcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabc acbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacb abcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacba bcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbab cacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabca cbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcba bcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbab cabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcac babcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcac babcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacb abcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabc abacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabca cbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcab acbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacb acabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacab acbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacbabcacbaca bacbabcabacabcacbabcabacbabcacbacabcacbabcabacabcacbabcbacabacbabcabacabcacbabca bacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcab acabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcaba cabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbaca bacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcaba cabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcaba cbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabac abcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbab cacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcaba cbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabac babcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacb abcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacab cacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabc babcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacb abcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbab cabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabc abacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcaba cabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbab cabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabc abacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcaba cbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacba cabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacb cabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbaca bacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcab acabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcaba cabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbaca bacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcaba cabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcaba cbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabac abcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabc abacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabca cbacabacbabcabacbcabcbabcabacabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabc acbacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcac bacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabc abacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbaca bacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcac bacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcaba cbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacba cabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcbabcabacbabcacb acabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabac bcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbac abacbabcabacabcbabcabacbabcacbacabacbabcabacbcacbabcabacabcacbabcabacbabcacbacab acbabcabacabcacbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabca bacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabac babcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabaca bcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcab acabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcab acbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcaba cabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacba bcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacab acbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcab cbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabac babcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabc abacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabca cbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcabacabcacb abcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabc abacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabca bacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbc abacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacb acabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabca bacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacab acbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacb acabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabac bcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbac abacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabca bacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabac babcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabaca bcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabacabc babcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabca cbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabcacba bcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbab cacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcb abcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabcac babcabacabcbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabca cbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabc acbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcba bcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacba cabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabc acbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabca bacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcac bacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacba bcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabca bacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcab acabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabac abcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabaca bcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabca cbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabac babcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabaca bcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabc acbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabac babcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcb abcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacba bcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabc acbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcb abcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacba bcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabca cbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabc acbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcba bcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacb cabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbaca bacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcab acbabcbacabacbabcabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbac abacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbc abcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabaca bcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabaca bcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacab cacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabc babcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcaba cabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacba bcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabc acbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacab cbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabac babcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbab cacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabca bacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcab acbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbac abacbabcabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabca bacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcab acabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabcacbabcabacabcacbabc bacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcaba cabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbab cabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabc abacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcaba cbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacba cabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacb cabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbaca bacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcab acabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcaba cabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbaca bacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcaba cabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcaba cbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabac abcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabc abacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabca cbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacb abcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabc acbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabca bacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcac bacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacba bcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabca bacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcab acabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbab cabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbab cabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabc abacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcaba cbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacba cabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacb cabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbaca bacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacb abcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacbabcacbacabac babcabacabcacbabcabacbabcacbacabcacbabcabacabcacbabcbacabacbabcabacabcacbabcabac babcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcabaca bcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacab cbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabac babcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacab cacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacba bcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabc babcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcac bacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacba bcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbab cabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabc acbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcac babcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbab cabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabc abacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcab acabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcaba cabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacab cacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcab acbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcaba cabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacba bcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacab acbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcab cbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabac babcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabaca bcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacab cbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabac babcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacab cacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacba bcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabc babcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcaba cbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcab acabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcab acabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcaba cabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabac abcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabc abacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcaba cbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabac abcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcaba cabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbaca bacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabac babcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcba bcabacabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabcacbabcabacabcac babcbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbab cabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabca cbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcac babcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbab cabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabc acbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabca bacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcac bacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcba bcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbab cabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcac bacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbab cabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbab cabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabc abacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabac babcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacb abcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacab cacbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabac babcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacb abcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacba bcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabc acbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcb abcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacba bcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabca cbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabca cbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcac babcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbab cabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabc acbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabca bacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcac bacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbaca bacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacbabcacbac abacbabcabacabcacbabcabacbabcacbacabcacbabcabacabcacbabcbacabacbabcabacabcacbabc abacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabca bacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcab acabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbac abacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcab acabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcab acbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcaba cabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacba bcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcab acbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacaba cbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabac babcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabaca bcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacab cbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabac babcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacba bcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbab cabacabcacbabcabacabcbabcabacbabcacbacabacbabcab
203.8		TURTLE::GILBERT		`Mon Sep 16 1985 18:13`	48
	There can be an infinitely long string of As, Bs, and Cs, containing no duplicate consecutive substrings. Proof (by construction): Take any string of As, Bs, and Cs, with no duplicate consecutive substrings, and make the following simultaneous substitutions (the spaces are added for readability, and also split the strings into uniquely identifying substrings): A -> ABACABCACBABCB ACABACBABC B -> ABACABCACBABCABACBAB CACBC C -> ABACABCACBABCABACBCA BCBABC This produces a longer string containing no duplicate consecutive substrings, by the following theorems. Theorem 1: The above construction (which we'll rewrite as: A -> A1 A2, B -> B1 B2, C -> C1 C2), when applied to any string (x) of As, Bs, and Cs containing no duplicate consecutive substrings will yield a longer string (y) containing no duplicate consecutive substrings of length greater than 26. Proof (by contradiction): Given a counter-example (y), we can use the technique in 203.5 to find a counter-example where the repeated substring is a concatenation of the three strings: A1 A2, B1 B2, C1 C2. Because of the uniquely identifying substrings, the string (x) must have contained a repeated consecutive substring (consider the reverse transformation). Theorem 2: The above construction, when applied to any string (x) of As, Bs, and Cs containing no duplicate consecutive substrings will yield a longer string (y) containing no duplicate consecutive substrings of length less than or equal to 26. Proof (by exhaustion): Checking that the construction process, when applied to any input string of 5 or fewer As, Bs, and Cs containing no repeated consecutive substrings (note that 5 occurences are used to assure that the length of the constructed string is at least 25+26+26+25). - Gilbert
203.9		SPRITE::OSMAN		`Wed Sep 18 1985 16:02`	14
	sigh. Often, I really WANT to understand some gritty proof, but then some simple thing near the beginning throws me, and I'm lost to all the rest, and I get frustrated. In .5, the following confuses me: >following sense. A substring t of s uniquely identifies the s , if, in any > i i >concatenation of the s, (s \|s \|...\|s ), containing exactly one occurrence > 1 2 n >of s , there is exactly one occurence of t. What is that thing with the "" after it ??? /Eric
203.10		BEING::POSTPISCHIL		`Wed Sep 18 1985 16:52`	35
	Re .9: If r and s are sets containing strings, rs is the set containing every concatenation of a string in r with a string in s. r\|s or r+s is the union of r and s. r* (usually written as a superscript and called the Kleene star [I am uncertain of the spelling]) is the set containing the empty set and all the strings of r, rr, rrr, and so on. There is often a mixing of notation, so that a string x represents the singleton x (the set containing only the string x). Examples: Let S = { "abc" }. Informally, S = abc. a+bc = { "a", "bc" }. (a+b)c = { "ac", "bc" }. * (a+b) c = { "c", "ac", "bc", "aac", "abc", "bac", "bbc", "aaac", . . . }. 2 S = { "aa", "ab", "ac", "ba", "bb", "bc", "ca", "cb", "cc" }. -- edp
203.11		TURTLE::GILBERT		`Wed Sep 18 1985 20:00`	29
	re .9: Sorry about the confusion. Basically, if there's a t(i) in the constructed string, you know there's a s(i) there. I had trouble finding some way to make this precise, while handling all the 'silly' special cases -- even so, I think that part lacks something. re .6: On what device is that? re .0, .2: Thanks for an interesting problem! The lexically first string is fairly interesting... I know the readers of this file aren't interested in algorithm design, but is there a fast way of generating the first n characters of this string (like order(n log n), for large n)? If you take the first 22526 characters (in 203.7), and split them into pieces just before an occurence of "abacabcacb", you'll get 498 pieces, and except for one that's 100 characters long, they range from 13 to 79 characters. Of these 498 pieces, there are ONLY 21 different kinds! Consider this: If you split the string into regular 4-character chunks (i.e., small), you'd find more than just 21 kinds of pieces! Might the entire lexically first string simply be a concatenation of these 21 different pieces? It's hard to believe this possible. The patterns (easier to recognize when broken into chunks) are very regular, yet not too regular, else there'd be a repeated substring. Might there be some way to turn this into a nice display?
203.12		BEING::POSTPISCHIL		`Thu Sep 19 1985 08:11`	8
	Re .11: I don't know how I forgot the device, especially since I remember checking to make sure I had the right one. The full specification is BEING""::USER1:[POSTPISCHIL.PUB]NR.EXE or .C. -- edp
203.13		SPRITE::OSMAN		`Thu Sep 19 1985 18:25`	207
	Well, I'm still working on understanding that proof. In the mean time, consider a totally different algorithm of generating strings without duplicate repeating substrings. Namely, start with an initial "good" string; the null string is as good as any. Then, attempt to append an "A". If this is invalid, append a "B" instead, unless that's invalid too, in which case append "C". If "C" is invalid, shrink the string by one character, and assume the last letter of the shrunken string is invalid and increment it etc. Hence we proceed like this: A AA no AB ABA ABAA no ABAB no ABAC ABACA ABACAA no ABACAB ABACABA ABACABAA no ABACABAB no ABACABAC no Here's the first place we back up ABACABB no ABACABC etc. Now, consider a program thinking all of the above to itself. The program desires to print a continuous stream of letters such that treated as one string, it contains no duplicate repeating substrings. The first thought might be to perform the above algorithm, and every time the total string is longer than what's been printed already, print some more letters. The problem, though, is that the program would end up needing to back up and change letters that have already been printed ! So, to avoid this problem, introduce a SAFETY_MARGIN, which is the number of letters beyond what's been printed that the string must grow to, before the program dares "commit" to printing more letters. An interesting question is, what SAFETY_MARGIN is sufficient ? My experimental evidence is that once "things get going", 6 seems sufficient. Can someone prove this ? Another way of putting it is that experimentally, we never need to back up more than 5 positions. I've included a C program in this note that implements the algorithm. The program has a safety_margin of 20 built in to it. However, once things "get going", you can decrease the margin to 6 like this: ^Y $ DEBUG DBG> DEPOSIT safety_margin=6 DBG> GO What the program does is print a continuously growing "valid" string. However, the program interrupts the printout by announcing new "close calls" which are measured by a new minimum number of characters away from failure due to needing to modify an already-printed character. Every time the program prints more characters, it resets its tally of close calls. Hence you see more interruptions than you might want. Of course, you can comment out the close call announcement if you'd like a real continuous string printed out. Another interesting question is, what would be a less expensive algorithm for printing a continuous valid string ? Gilbert's algorithm doesn't allow printing a continuous string at all, since we must keep going back and replace already-printed letters by other letters. My algorithm is very expensive, since more and more STRNCMP's (string comparison's) are necessary as the string gets longer. Is there an algorithm that perhaps doesn't require string compare's at all ? Perhaps something like "xor all the letters so far and . . .". As a subquestion for my algorithm, must ALL substrings actually be checked ? For instance, once we "get going", would adding a new letter or incrementing the last ever produce an ALMOST valid string, only invalidated by being of the form YY, where Y is half the entire string ? (If not, then we need never compare Y with Y and we can ask the next question about Y-1 Y-1 etc.) Here's the program: /* Define size of largest string we can handle. / #define max_i 50000 / String to contain the characters. / char the_string[max_i]; / Define margin. We print letters to within this close to the end of our /* string. We fear not get closer, lest we end up needing to change a letter /* we've already printed (a danger we detect during the program). / safety_margin = 20; / Define close_call, which indicates how close we've ever gotten to having /* to give up due to having already printed a letter we need to change. / close_call = 100; / Define suspence value, which is number of letters beyond what we've /* already printed, that we compute before even considering printing more. / suspence = 0; / Define which index we've printed up to. / largest_i_printed = -1; main () { int i; / Initialize string to all one less than first letter. This allows main loop /* to always think it is "modifying" a letter. / for (i = 0; i <= max_i; (the_string[i] = 'A' - 1, i++)); / Start loop which establishes i, which indicates which letter is the next /* one to modify. 0 means the first letter. / for (i = 0; i <= max_i \|\| i < 0; i++) { int j; if (the_string[i] == 'C') { / We've already tried A, B, and C in current position, so we back up such /* that a new letter is tried in previous position. /* We make sure we're not backing up into a letter that's already been printed. /* If so, we quit. / the_string[i] = 'A' - 1; i = i - 2; if (i - 1 <= largest_i_printed) { printf ( "\nWhoops. In order to proceed, I'd need to change an already-printed letter!" ); return; } / Announce if we're closer than we've ever been to having to give up. / if (i - 1 - largest_i_printed < close_call) { close_call = i - 1 - largest_i_printed; printf ("\n New closest call = %d\n", close_call); } continue; } / end of check for all letters tried in this position / / Whatever letter we modify, we make it one larger than it was (which sets /* it to 'A' if it was "never" modified) / the_string[i]++; / Start loop to make sure there are no duplicate repeating substrings. / for (j = 1; j <= (i+1)/2; j++) { if (strncmp (&the_string[i-2j+1], &the_string[i-j+1], j) == 0) { /* There is a duplicate repeating substring, so tell main loop to increment /* current character instead of moving on to next position. / i--; break; }; } / We'll now print some more letters if we've computed sufficiently ahead /* so as to believe we won't have to back up into stuff already printed, /* AND if we've computed sufficiently more than has already been printed. / if (i - safety_margin > largest_i_printed && i - suspence > largest_i_printed) { / Mark where we're starting to print from. / int start_here = largest_i_printed + 1; / Compute what will be the new largest index printed. / largest_i_printed = i - safety_margin; { / Print the chunk. We do this by planting a null character and using /* the function that prints through next null. Then we restore the character /* that we temporarily nulled. (Too bad a "precision" value in PRINTF can't /* be a variable, or at least a symbolic constant) / int save_beyond = the_string[largest_i_printed + 1]; the_string[largest_i_printed + 1] = 0; printf ("%s", &the_string[start_here]); close_call = safety_margin; the_string[largest_i_printed + 1] = save_beyond; } / end of block for variable save_beyond / } / end of print block / } / end of main loop / / We get out of main loop for one of TWO reasons. Either we backed all /* the way back to the beginning, which means we've proven that there is /* a largest string not containing any duplicated substring, OR we filled /* our entire array. / if (i <= 0) printf ( "\nThe longest string containing no duplicate substrings was found !"); else printf ( "\nNo conclusion reached. You may want to increase 'max_i' and try again."); } / end of procedure main */
203.14		ALIEN::POSTPISCHIL		`Fri Sep 20 1985 08:24`	8
	Re .8: I have a nit: There are no "infinitely long strings" satisfying the stated requirements, as .8 claims. However, as the proof shows, there length of such strings is unlimited, although each string is finite. -- edp
203.15		TURTLE::GILBERT		`Fri Sep 20 1985 13:14`	49
	re 14: Apply the infinity lemma (I think it applies, and Knuth has a proof). re 13: Try applying the substitution, and supposing that a repeated substring is generated. This should illustrate the gist of the proof. Or, if you're around ZK2 sometime, stop by, and I'll try to go explain it. re 13: You're generating the 'lexically first' string (good). Up through a length of 11,000, there's only one place where you need to back up 5 positions (near length 81). There are only 23 places where you need to back up 3 or more places (one 5, 13 fours, and 9 threes). The string is very regular (pretty much). Another question is how far back you look to discover a duplicated substring. Usually, not much, but when you do.... I displayed a line whenever the algorithm needed to search back more than 63 places to discover a duplicated substring, and there were only 54 of these (again, in the first 11,000 characters). The amount it needed to search back (when more than 63) ranged from 92 (at 2174) to 2532 (at 5928). Many of these 'back-search' amounts occurred frequently; 92 - 6 times 100 - 1 time (at 437) 110 - 7 times 124 - 19 times (wow, is this the same substring each time?) 261 - 7 times 274 - 9 times 1225 - 3 times (at 3351, 5883, and 9183) 2532 - 1 time (at 5928) If we could prove that these are the only 'search back' amounts (this is doubtful), or even restrict 'large' amounts to a fairly small set, then the algorithm for producing the 'lexically first' string could be enormously improved, probably even to linear time! Might it may happen that later we need to back up thousands of characters!? I don't think so; consider the string "CBCACBACABC", which is the start of the 'lexically last' string, gotten by taking the start of the lexically first string, and changing A <-> C. It's unlikely to occur in the 'lexically first' string (and, in fact, it doesn't occur in the first 22,000 characters). Also, it doesn't occur in the string generated by 203.8. So we can take what's been generated, append "CBCACBACABC" and the infinite string generated by 203.8 to produce an infinitely long string. Certainly, the true lexically first string must be less than this, and must be greater than what we've generated so far. Now if we could only guarantee that "CBCACBACABC" never occurs in the lexically first string....
203.16		BEING::POSTPISCHIL		`Fri Sep 20 1985 14:33`	6
	Re .15: Can you supply a reference for Knuth's proof of the infinity lemma? -- edp
203.17		METOO::YARBROUGH		`Fri Sep 20 1985 14:48`	3
	The Art of Computer Programming, Vol. 1, p. 381 et seq. Lynn Yarbrough
203.18		LATOUR::AMARTIN		`Sat Sep 21 1985 11:24`	50
	Re .10: r* (usually written as a superscript and called the Kleene star [I am uncertain of the spelling]) is the set containing the empty set and all the strings of r, rr, rrr, and so on. Uh, actually, it is: r* (usually written as a superscript and called the Kleene star [I am uncertain of the spelling]) is the set containing the empty string and all the strings of r, rr, rrr, and so on. ^^^^^^ Re .13: I think you'll find that the algorithmic paradigm you have used is called backtracking, just in case you want to look it up, or you want to know when a paper that uses the term might be worth reading. Someone (Wirth?) wrote an excellent walk-through of the design of a structured, efficient Pascal program to solve the 8-queens problem that uses backtracking. I'll try and find the reference and post it here. Surprise! Your wishes about printf have been answered. You may want to change this code: { /* Print the chunk. We do this by planting a null character and using /* the function that prints through next null. Then we restore the character /* that we temporarily nulled. (Too bad a "precision" value in PRINTF can't /* be a variable, or at least a symbolic constant) / int save_beyond = the_string[largest_i_printed + 1]; the_string[largest_i_printed + 1] = 0; printf ("%s", &the_string[start_here]); close_call = safety_margin; the_string[largest_i_printed + 1] = save_beyond; } / end of block for variable save_beyond / to this: { / Print the chunk. (Luckily a "precision" value in PRINTF can /* be a variable, or any runtime expression) / printf ("%.s", largest_i_printed + 1, &the_string[start_here]); close_call = safety_margin; } /* end of block that prints the chunk */ I just added one of these in my project's code. /AHM
203.19		BEING::POSTPISCHIL		`Sat Sep 21 1985 17:53`	8
	Re .18: Oops, you're right, that should have been the empty set. Is the ".*" standard in any way? -- edp
203.20		LATOUR::AMARTIN		`Sun Sep 22 1985 22:59`	6
	I don't have K&R around, but I seem to remember that it is mentioned in Harbison and Steele. Note that you can't use * for variable formats in scanf because it means "don't assign the value to a variable". Sigh. /AHM
203.21		SPRITE::OSMAN		`Mon Sep 23 1985 11:27`	21
	Although the stream algorithm required checking a substring of length 2532 in order to invalidate a candidate at L=5928, this still isn't quite the entire string. It might be interesting to modify the program to announce what larger percents were required, rather than raw lengths. Then the question becomes what percentage of the string must we check to guarantee validity. Does anyone have a proof that backing up more than 5 positions is never required ? It must have something to do with the fact that 5 is one less than twice 3, and 3 is the number of symbols we are dealing with. Mr. Gilbert, you mentioned that the stream is very "regular". Can you be more specific ? What patterns do you notice ? Perhaps if we can identify a pattern, we can ask (and answer) the question "prove that the infinite string generated by the following pattern is a valid string" /Eric
203.22		R2ME2::GILBERT		`Mon Sep 23 1985 14:26`	9
	re 203.13: "Must all substrings actually be checked?" No. In generating the lexically first substring (which you'll recall starts "abacabcacbabcabacabcacbabacbabc"), we need never check for a string of the form YY (once we've gotten more than 25 characters). To see this, note that the beginning of the string can never reappear -- certainly, the immediately preceeding character can't be an "a", nor can it follow a "b" (since then we'd have "baba"). And it can't follow a "c", since that would cause the string "cabacabcacbab" to be duplicated.
203.23		SPRITE::OSMAN		`Mon Sep 23 1985 17:32`	24
	In the published first 22528 characters of the lexically first string, the numbers of a's, b's, and c's are: a - 8373 b - 7529 c - 6626 The fact that there are more a's is what I'd expect. This is because the algorithm always tries "a" first, and only replaces the position with "b" if it gets into trouble. Hence anywhere that "a" OR "b" would work, the "a" is published. Limit question: As we produce more and more charactersof this lexically first string, is there a convergent limit as to the ratio of numbers of a's to b's to c's ? What is that limit ? Counting question: How many valid strings are there of length n ? /Eric
203.24		ALIEN::POSTPISCHIL		`Mon Sep 23 1985 18:21`	36
	Re .23: Here are the numbers of strings by length: Length Number 0 1 1 3 2 6 3 12 4 18 5 30 6 42 7 60 8 78 9 108 10 144 11 204 12 264 13 342 14 456 15 618 16 798 17 1044 18 1392 19 1830 20 2388 21 3180 22 4146 23 5418 24 7032 25 9198 26 11892 27 15486 -- edp
203.25		ALIEN::POSTPISCHIL		`Tue Sep 24 1985 08:40`	8
	.22 is incomplete because we have not yet proven it will not be necessary to back up as far as the twenty-fifth character, although it seems unlikely. For that matter, we cannot yet be certain that Gilbert's string actually is the lexically-first string. -- edp
203.26		ALIEN::POSTPISCHIL		`Tue Sep 24 1985 08:43`	6
	Re .19: I can't even get my corrections right; that should have been "empty string". -- edp
203.27		SPRITE::OSMAN		`Tue Sep 24 1985 09:05`	12
	Regarding the comment about "We can't be sure Gilbert's is really the lexically first string", if it's the same as generated by my program, then it is the first string, by definition, since the program tries to lay down "AAAA..." and when it can't, it lays down the closest thing to it. Regarding counting strings for each length n, we might as well look at f(n)/6, since f(3...) are necessarily all divisible by 6. This is because every string is related to five others by changing A,B,C's to A,C,B or B,A,C or C,B,A or B,C,A. There will be duplicates, though. /Eric
203.28		BEING::POSTPISCHIL		`Tue Sep 24 1985 09:16`	9
	Re .27: The reason Gilbert's string might not be the first string is that we might need to back up over part of what he has already said is part of the string. In fact, if there is no infinitely-long string, there will be no lexically- first string. -- edp
203.29		R2ME2::GILBERT		`Tue Sep 24 1985 13:22`	22
	To recap the important notes in this discussion.... 203.0 Poses the problem; Is there an infinite string? .1-.4 A rough proof that is flawed. Noticing the flaw. .5 Outline of a proof. .6,.12 EDP's program to find strings. .7 First 22528 characters of lexically first string. .8 Proof that arbitrarily long strings exist. .9-.10 Kleene notation. .11 Comments on lexically first string. .13 Osman's program for lexically first string; how far might it back up? .14 Arbitrarily long is not infinite. .15 Comments on backing up. .16-.17 Reference for the infinity lemma. .18-.20 Notation, terminology, and printf. .21 How far might we back up? How far back to search? .22 Needn't search back all the way (also see last paragraph in .15). .23-.24 Character distribution, and counting question. .25-... Here I get confused. There are several interesting, open questions. Would someone care to gather these in another note?
203.30		R2ME2::GILBERT		`Tue Sep 24 1985 14:37`	54
	re .14, .28 (and .16-.17): Does the infinity lemmma apply? I'll continue to assume it does, or can be made to apply, or that an infinitely long string exists for other reasons. Consider this. The substitution algorithm can be applied again and again to yield a countably infinite number of strings (that is, in a one-to-one correspondence with the counting numbers). Suppose an infinitely long string (containing no consecutive duplicated substrings) doesn't exist; then some such string must be a longest -- let the length of a longest such string be M. This implies that there are fewer than 3M different such strings (actually, a bound of 2M could be used). This is a contradiction, since we know there's an infinite number of such strings. re .25, .22, .15: The last paragraph in .15 can be used to show that we need never back up as far as the 25th character, since it shows a construction that gives an infinitely long string that starts off the same as the lexically first string (for 22,000 characters). -> Note that the algorithm in .13 approaches the lexically first string from below. Any particular infinitely long string places an upper bound on the lexically first string. If the lower bound and the upper bound start off the same way, then so must the true lexically first string. Note .22 has a typo -- the lexically first string starts "abacabcacbabcabacabcacbacabacba". To correct the argument on why this string cannot reappear later in the string.... If it were to follow "ac", then "acab" would be repeated. Were it to follow "bc", then "bcabacabcacba" would be repeated. Since the first 24 characters of the lexically first string are never repeated, after we've gone far enough (past 50 characters), we need never worry about searching all the way to the beginning to discover a string of the form YY. Of course, this knowledge offers practically nothing for improving the performance of an algorithm to generate the lexically first string. re .11: The '21 different pieces' mentioned there. I've changed my mind -- it looks like it should be possible to prove that the lexically first string is simply a concatenation of these 21 pieces! Stay tuned. re "Gilbert's string": I'm unsure what this means. The characters in .7 were generated by an algorithm like that in .13. A string generated by the substitutions in .8 does not produce the lexically first string. Notes .7 and .8 are largely unrelated.
203.31		R2ME2::GILBERT		`Sat Sep 28 1985 10:04`	87
	Do you want to see the first 84992 characters of the first string? Let: d = abacabcacbabcabacabcacbacabacbabc e = abacabcacbabcabacabcacbcabacabcbabc f = abacabcacbabcabacabcbabcabacbabcacbacabacbabc g = abacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbc h = abacabcacbabcabacbabcacbacabacbabc i = abacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabc j = abacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacbabcacbacabacbabc k = abacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcacbabc l = abacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbc m = abacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbc n = abacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabc o = abacabcacbabcabacbabcacbacabacbabcabacbcabcbabc p = abacabcacbabcabacbabcacbacabcacbabc q = abacabcacbabcabacbabcacbacabcbabc r = abacabcacbabcabacbabcacbc s = abacabcacbabcabacbabcbacabacbabc t = abacabcacbabcabacbcabcbabc u = abacabcacbabcbacabacbabc v = abacabcacbabcbacabacbabcabacabcacbacabacbabcabacabcbabcabacbabcacbacabacbabc w = abacabcacbabcabacabcacbcabcbabc Then we can write the string as follows (numbers represent character offsets in the string): 84992: 0 dhvifjlefiflefiuhstfifmefiflefifmefifnfofstfifmefiflefifmefifnfohpuhstfi 3642 fmefiflefifmefifnfofstfifmefiflefifmefifnfohqfiflefiflfifkhrefiflefifmef 7366 ifnfofstfifmefiflefifmefifnfofstfifogefiflefifmefifnfofstfifmefiflefifme 11054 fifnfofstfigefiflefifmefifnfofstfihpuhstfifmefiflefifmefifnfofstfifmefif 14638 lefifmefifnfohpuhstfifmefiflefifmefifnfofstfifmefiflefifmefifngefiflefif 18316 mefifnhpuhstfifmefiflefifmefifnfofstfifmefiflefifmefifnfohpuhstfifmefifl 21859 efifmefifnfofstfifmefiflefifmefifnfstfifmefiflefifmefifogefiflefifmefifn 25601 fofstfifmefiflefifmefifnfofstfigefiflefifmefifnfofstfihpuhstfifmefiflefi 29180 fmefifnfofstfifmefiflefifmefifnfohpuhstfifmefiflefifmefifnfofstfifmefifl 32778 efifmefifngefiflefifmefifnhpuhstfifmefiflefifmefifnfofstfifmefiflefifmef 36431 ifnfohpuhstfifmefiflefifmefifnfofstfifmefiflefifmefifnfstfifmefiflefifm 40017 wfiflefiflwfi 40715 flfifkhrefiflefifmefifnfofstfifmefiflefifmefifnfofstfifogefiflefifmefifn 44448 fofstfifmefiflefifmefifnfofstfigefiflefifmefifnfofstfihpuhstfifmefiflefi 48027 fmefifnfofstfifmefiflefifmefifnfohpuhstfifmefiflefifmefifnfofstfifmefifl 51625 efifmefifngefiflefifmefifnhpuhstfifmefiflefifmefifnfofstfifmefiflefifmef 55278 ifnfohpuhstfifmefiflefifmefifnfofstfifmefiflefifmefifnfstfifmefiflefifme 58898 fifogefiflefifmefifnfofstfifmefiflefifmefifnfofstfigefiflefifmefifnfofst 62597 fihpuhstfifmefiflefifmefifnfofstfifmefiflefifmefifnfohpuhstfifmefiflefif 66142 mefifnfofstfifmefiflefifmefifngefiflefifmefifnhpuhstfifmefiflefifmefifnf 69825 ofstfifmefiflefifmefifnfohpuhstfifmefiflefifmefifnfofstfifmefiflefifmefi 73427 fnfstfifmefiflefifmwfiflefiflfifkhrefiflef 75605 ifmefifnfofstfifmefiflefifmefifnfofstfifogefiflefifmefifnfofstfifmefifle 79293 fifmefifnfofstfigefiflefifmefifnfofstfihpuhstfifmefiflefifmefifnfofstfif 82877 mefiflefifmefifnfohpuhstfifmefiflefifmefifabacabcacbab But if we also use the following substitutions (here, I've used upper case): A = dhvifjle B = fiflefiflfifkhre C = fiflefiflwfiflfifkhre D = fiflefifmefifabacabcacbab E = fiflefifmefifnfofstfifme F = fiflefifmefifnfofstfifoge G = fiflefifmefifnfofstfige H = fiflefifmefifnfofstfihpuhstfifme I = fiflefifmefifnfohpuhstfifme J = fiflefifmefifnfohq K = fiflefifmefifnfstfifme L = fiflefifmefifnge M = fiflefifmefifnhpuhstfifme N = fiflefifmefifoge O = fiflefifmw P = fiflefiuhstfifme Then we can write the first 84992 characters as: APEIEJBEFEGHEIELMEIEKNEGHEIELMEIEKOCEFEGHEIELMEIEKNEGHEIELMEIEKOBEFEGHEID During the course of generating this, several very long 'search back amounts' were found, at the following character positions: search: 15538 40042 search: 15538 74427 search: 34385 74967 No 'backup amounts' greater than 4 were found, besides the backup of 5 at 81. - Gilbert
203.32		LATOUR::AMARTIN		`Sun Sep 29 1985 16:14`	7
	Re .19: While "%.*s" is not mentioned in K&R, it is mentioned, with no portibility qualification, in Harbison & Steele, and is supported by both Vax-11 C and Ultrix. It is also mentioned in Bourne's "The UNIX System" and the 30-Apr-85 draft ANSI C standard. /AHM
203.33		SPRITE::OSMAN		`Mon Sep 30 1985 16:53`	29
	I'm still working on trying to prove that the straight-forward infinite string generation program will never have to back up more than 5 places, or that indeed it eventually will have to. I don't have a proof yet. However, I'm trying a contradiction approach. I've been thinking, suppose our string so far ends with: ...CBCACBC This has the property that if we need to back up, we'll have to back up at least seven places (because the last "C" means we've already tried "A" and "B", the last "B" can't be changed to a "C", the second to last "C" means "A" and "B" tried already, the last "A" can't be changed to "B" or "C" etc.). So, if we can show that we'll encounter ...CBCACBC at some point AND have to back up, then we've shown that yes, sometimes you have to back up more than 5 places, so we're done with the proof. On the other hand, if we can show that we'll never encounter ...CBCACBC at all, or if we can show that if we do encounter ...CBCACBC, we'll never have need to back over it, then we might be closer to proving that you never need to back up seven places. For instance, if we can show this, then if we iterate on all possible seven-letter endings, we'll have a proof by exhaustion. /Eric
203.34		R2ME2::GILBERT		`Mon Sep 30 1985 19:03`	12
	Well, "...CBCACBC" cannot occur, since the 'lead-in' must be either an "A" or a "B", and either of these would prevent the algorithm from looking that far (consider "..ACBCACBC" or "..BCBCACBC"). However, this is a nit, depending on where you start 'backing up'. Note that the substring "ACBCACB" does occur in the string. If there were something that prevented an "A" from following this, we'd backup 7 or more. I'd expected that the substring "CBCACB" wouldn't occur (it's so 'large') -- but it does. Now I wouldn't be surprised if "CBCACBAC" occurred somewhere (BTW -- it doesn't occur in the first 131000 characters). Does anyone have a faster way of generating the first string?
203.35		SPRITE::OSMAN		`Mon Sep 30 1985 19:50`	14
	Formally, Peter's printout of the first 84992 characters might not be correct. Until we prove that we'll never need to back up more than 5, or some other small number, there's still the slight fear that some future backup will be required that crosses back into what Peter's already printed. The conjecture that we'll never have to back up more than 4 is probably harder to prove than the one stating we'll never have to back up more than 5. That's because the one about 4 will have to include the singularity at index=81, where we needed to back up 5. Fascinating ! /Eric
203.36		GOLLY::GILBERT		`Tue Oct 01 1985 14:27`	2
	Ah, but it's fairly simple to show that the printout of the first 84992 characters is valid. The last paragraph of 203.15 shows how to do this.
203.37		SPRITE::OSMAN		`Wed Oct 02 1985 15:32`	10
	That paragraph referred to ends with: > than what we've generated so far. Now if we could only guarantee > that "CBCACBACABC" never occurs in the lexically first string.... Since we can't yet guarantee blah blah, I still claim we might have to back up into what you've already printed. (although I'm still working on proving we'll never have to back up more than 5 characters) /Eric
203.38		SPRITE::OSMAN		`Thu Oct 03 1985 15:53`	20
	Define a "valid substring" as one that is not an "invalid substring". "invalid substrings" are ones that either contain a duplicate repeating substring OR ones that can't possibly occur as part of a longer string due to not being able to be preceded or followed by anything. (i.e. "abacaba" is invalid because neither a,b, or c may follow it). So, listing all valid substrings by order of size, we have a, b, c, ab, ac, ba, bc, ca, cb, aba, abc, aca, acb, bab, bac, bca, bcb, cab, cac, cba, cbc, abac, abca, abcb, acab, acab, acba, acbc, babc, . . . Considering valid substrings in this alphabetical ordering, what's the first one that doesn't appear in the lexically-first string ? What's the first one that doesn't appear in Gilbert's printout of 22000-or-so characters, for that matter ? /Eric
203.39		GOLLY::GILBERT		`Fri Oct 04 1985 14:05`	16
	re 203.38: A fairly easy way to check for the occurence of these strings is to consult the 'macroes' in 203.81. new question: Can somebody prove that the string "abaca" occurs an infinite number of times in any infinitely long string in NR? Put another way, can you show that there's some maximum distance that can separate two occurrences of "abaca"? If that's too hard, how about "ab"? Or how about finding some string whose occurences can be arbitrarily far apart? news: Another large backup (of 5) was found near offset 190286. Also, I'd like an independent verification of the lexically first string -- have we any mathematicians/programmers up for this difficult task?
203.40		R2ME2::GILBERT		`Mon Oct 07 1985 12:54`	42
	Noticing that large strings tend to repeat in the lexically first string, perhaps there's some way to extend the string by large amounts, instead of effectively duplicating previous work. Consider the string that's been generated so far (less 10 characters, in case we back up that much -- this we'll assume, pending Eric's proof). Now find the shortest suffix string that doesn't occur elsewhere in the string. Let this smallest suffix string be tUV, where t is either a, b, or c, and U and V are strings, and the string tU is the shortest prefix of tUV that doesn't occur previously in the string. Thus, the string so far must look like: "...sUV...tUV", where s and t differ. note that the string UV must occur previously, else tUV would would not be the shortest unique suffix string. Now, the next 'several' characters should depend only on the string UV, and because the string UV has occurred before, we know what these characters should be. How much is 'several'? When searching back for a repeated string, we need look at t or earlier only if we are about to produce a duplicated string of one of the forms: V ...[...tUV...][...tUV...] ...[�V...t�][�V...t�] ...[...tUV...][...tUV...] ^ The arrows indicate where we 'are', and U has been broken into two parts: ��. Note that the first cannot occur, since we're assured there are no occurences of tUV to the left of where we are. Thus, we need only worry about the last two forms. We can get an idea of how far we may proceed by searching back for a previous occurence of V (the second form), or checking when we produce another V (the third form). The same considerations applied to the first occurence of the string sUV, which must also have been a unique suffix string. We can find how far sUV proceeded before it almost caused a duplication (again, look for a V, less 10 characters), and proceed the same way after tUV (after checking that this would cause neither the second or third forms above). Thus, we should be able to duplicate a large chunk of previous effort, and a few more characters (generated the simple way) should yield a shorter unique suffix string, whence we can repeat this extension technique. - Gilbert
203.41		SPRITE::OSMAN		`Mon Oct 07 1985 15:56`	34
	I don't really think the macros in 203.31 help that much in deciding if a given valid substring appears in s(84992) or not. Sure, you can see if the substring is completely contained in a macro, but checking for all possible double-macro overlaps is tedious. I have started a C program that has s(22800) compiled into it as a string. The program will use a procedure called BUMP, which when called with a valid substring, will produce the next possibly-valid substring. (I say "possibly-valid" because BUMP doesn't really check if A,B or C can precede or follow the string) The program will then make sure that substring is contained in s(22800), and if not, the program will print it out. I'll go up through substrings of eight characters, I suppose. I'm having a bit of trouble coding BUMP. Perhaps some hotshot C programmer can do it ? To review, here are some sample inputs and what BUMP should produce as output: input BUMP should output ----- ------------------ a b b c c ab cbc abac abac abca Can anyone write BUMP ? /Eric
203.42		R2ME2::GILBERT		`Mon Oct 07 1985 19:32`	143
	Actually, I think the macroes in 203.31 are pretty useful for answering such questions. The macroes 'start off' with the same 10 characters, so to determine whether "cbcabacbcab" appears, lop off the "ab" (which might begin another macro), and search for "cbcabacbc" in the macroes. The reason this is fairly useful is that it sure beats search though a 85K string. Of course there are even better ways. Oh well. The following 133 16-character strings are effectively what you've requested in 203.41, except for the six simple substitutions of the characters. abacabcacbabcaba abacabcacbabcbac abacabcacbacabac abacabcacbacabca abacabcacbacabcb abacabcacbcabaca abacabcacbcabacb abacabcacbcabcba abacabcbabcabaca abacabcbabcabacb abacabcbabcacbab abacabcbabcacbac abacabcbabcacbca abacabcbacabacba abacabcbacabacbc abacabcbacbcabac abacabcbacbcabcb abacabcbacbcacba abacbabcabacabca abacbabcabacabcb abacbabcabacbcab abacbabcabacbcac abacbabcacbacaba abacbabcacbacabc abacbabcacbcabac abacbabcacbcabcb abacbabcbacabacb abacbabcbacabcac abacbabcbacabcba abacbabcbacbcaba abacbabcbacbcabc abacbabcbacbcacb abacbcabacabcacb abacbcabacabcbab abacbcabacabcbac abacbcabacbabcab abacbcabacbabcac abacbcabacbabcba abacbcabcbabcaba abacbcabcbabcacb abacbcabcbacabac abacbcabcbacabca abacbcabcbacabcb abacbcabcbacbcab abacbcabcbacbcac abacbcacbabcabac abacbcacbabcacbc abacbcacbabcbaca abacbcacbabcbacb abacbcacbacabacb abacbcacbacabcac abacbcacbacabcba abcabacabcacbabc abcabacabcacbaca abcabacabcacbcab abcabacabcbabcab abcabacabcbabcac abcabacabcbacaba abcabacabcbacbca abcabacbabcabaca abcabacbabcacbab abcabacbabcacbac abcabacbabcacbca abcabacbabcbacab abcabacbabcbacbc abcabacbcabcbabc abcabacbcabcbaca abcabacbcabcbacb abcabacbcacbabca abcabacbcacbabcb abcabacbcacbacab abcacbabcabacabc abcacbabcabacbab abcacbabcabacbca abcacbabcbacabac abcacbabcbacabca abcacbabcbacabcb abcacbabcbacbcab abcacbabcbacbcac abcacbacabacbabc abcacbacabacbcab abcacbacabacbcac abcacbacabcacbab abcacbacabcacbca abcacbacabcbabca abcacbacabcbacbc abcacbcabacabcac abcacbcabacabcba abcacbcabacbabca abcacbcabacbabcb abcacbcabacbcacb abcacbcabcbabcab abcacbcabcbabcac abcacbcabcbacaba abcacbcabcbacabc abcacbcabcbacbca abcbabcabacabcac abcbabcabacabcba abcbabcabacbabca abcbabcabacbabcb abcbabcabacbcaba abcbabcabacbcabc abcbabcabacbcacb abcbabcacbabcbac abcbabcacbacabac abcbabcacbacabca abcbabcacbacabcb abcbabcacbcabaca abcbabcacbcabacb abcbabcacbcabcba abcbacabacbabcab abcbacabacbabcac abcbacabacbabcba abcbacabacbcabac abcbacabacbcabcb abcbacabacbcacba abcbacabcacbabca abcbacabcacbabcb abcbacabcacbacab abcbacabcacbcaba abcbacabcacbcabc abcbacabcbabcaba abcbacabcbabcacb abcbacbcabacabca abcbacbcabacabcb abcbacbcabacbabc abcbacbcabcbabca abcbacbcabcbacab abcbacbcacbabcab abcbacbcacbabcac abcbacbcacbabcba abcbacbcacbacaba abcbacbcacbacabc
203.43		BEING::POSTPISCHIL		`Tue Oct 08 1985 08:45`	13
	I've been trying to figure out how to generate the lexically first string (or any string for that matter), and I have a question about how Gilbert is doing it. I have thought of a few schemes to permit adding on large chunks, but I can't seem to make anything workable. The best I have thought of so far is to select a character and test to see if it causes a repeated substring. If the string has a length l, this test requires something less than l/2 comparisons. Gilbert, is character-by-character essentially how you have generated strings? How many comparisons do you make for each character? -- edp
203.44		LATOUR::AMARTIN		`Tue Oct 08 1985 09:41`	11
	I wonder what kind of loop-the-loop patterns develop as you build an Aho-Corasick pattern matcher to recognize longer and longer prefixes of the lexically first `string'? How about building a DFA to recognize all members of NR of length less than k? (Or an NFA for that matter). How does the number of states (for any of the above problems) grow as a function of the length recognized? What about context-free recognizers for that grammar that describes such large leaps of the lexically first string. /AHM
203.45		BEING::POSTPISCHIL		`Tue Oct 08 1985 11:40`	13
	Re .44: You're not using "DFA" for "deterministic finite-state automaton", are you? If so, there exists a single automaton capable of recognizing all finite members of NR, although it requires some sort of infinite memory (such as the tape for a Turing Machine). On the other hand, things such as finite-state parsers will get larger as the members of NR to be recognized increase. The increase will be at least exponential: At a state in the middle of parsing a string, the parser state will need to "encode" at least the last l/2 characters (where the length of the string being parsed is l), requiring at least 3**(l/2) states. -- edp
203.46		R2ME2::GILBERT		`Tue Oct 08 1985 13:08`	18
	If we find a unique suffix string (i.e., the string so far is: "...tUV", where tU doesn't occur previously in the string), then, when searching back looking for a repeated string, we need only search back to the "t" (or twice that far, depending on how you look at it), since we're assured that there are no previous occurences of tU. Another performance improvement is to create an auxiliary string, as follows. wherever the string "abaca" occurs in the original, we put a zero byte in the auxiliary string, the non-zero bytes in the auxiliary string are simply the lengths between the occurences of "abaca". This auxiliary string can be used to greatly speed the part of the single-character algorithm that searches back looking for a match. BTW - I think I've found a place where the string begins "abacaXcbacaX", where X is several tens of thousands of characters in length. That is, the whole damned string very nearly repeats! If anyone wants to discuss this problem 'interactively', let me know....
203.47		SPRITE::OSMAN		`Tue Oct 08 1985 15:31`	23
	A vax-specific performance idea in the standard case: By standard case, I mean making sure the new character differs from previous, then make sure the last pair differs from next to last pair, then make sure last triplet differs from previous triplet, etc. through making sure last half of string differs from first half. From the vax point of view, perhaps we should represent the string as 8-bit bytes, one per character, and let the string grow "backwards", i.e. towards smaller-valued addresses. That way, we can start with newest character, and use a single vax instruction to search for previous occurances of that character in string. Only substrings ending at such spots need be checked. For that matter, there's most likely a vax instruction that actually searches for a substring. So we can search for previous occurance of last "small piece". Only such occurances need be checked more carefully, and for sufficiently large "small pieces", very few occurances are likely to be found. Is this useful ? /Eric
203.48		BEING::POSTPISCHIL		`Tue Oct 08 1985 17:20`	15
	Re .46: Doesn't finding a unique suffix string take a long time? Without any auxiliary strings, do you know how many comparisons you would have to make to determine a character being added causes no repititions? Do you use more than one auxiliary string or replacements for more than one string in the auxiliary string? How long (in CPU-time and real time) did it take to find however many characters you have found so far? -- edp
203.49		RAINBO::GRANT		`Tue Oct 08 1985 21:29`	74
	Good work so far, everybody! I've been away and just catching up. I have discovered and tested one procedure for generating a long NR string that after 20000 characters has only had single and double backups. It is certainly not the lexically first string. The procedure goes something like this: Start with a letter, and a direction. (a direction is either A->B->C->A... or C->B->A->C..) Generate the string by "travelling" in one direction as far as you can without duplicating a substring, then switch directions. If you get stuck, you need to backup, find what the old direction was, and switch that. I have it coded in C (the code is carved out from the other C program in a previous response). I will put the code as a response if I get a mail request, or the printout of the first N characters. It seemed like the procedure would be interesting to the people who are trying to prove that some procedure generates an infinitely long member of NR. That the procedure so far has only single and double backups suggests that it might be more easily proven to be never-ending. There were 69 double backups in 20000 characters. Here are the first few steps of the procedure: Start with A, and travel "up". A AB ABC ABCA ABCAB ABCABC nope. switch direction. ABCAB A ABCAB AC ABCAB ACB ABCAB ACBA ABCAB ACBAC nope. switch. ABCAB ACBA B ABCAB ACBA BC ... ABCAB ACBA BCABC nope. switch. ABCAB ACBA BCAB A ABCAB ACBA BCAB AC ... ABCAB ACBA BCAB AC ABCA CBAC ABCA ABCAB ACBA BCAB AC ABCA CBAC ABCAB nope. switch. ABCAB ACBA BCAB AC ABCA CBAC ABCA C oops. This doesn't work either. backup. ABCAB ACBA BCAB AC ABCA CBAC ABCA ABCAB ACBA BCAB AC ABCA CBAC ABC B We had been going "up" back here. switch. ABCAB ACBA BCAB AC ABCA CBAC ABC BA ... ABCAB ACBA BCAB AC ABCA CBAC ABC BACB CABC BAC ABCA CBAC ABC BACB CAB ACBA .... The next few backups are around characters 99, 156, 228, 289, 361, 400, and 472. The first double-backup happens at around character 2112. Double backups usually occur in clumps of 4 or 5, each 101 apart. There are many other near-regularities like this that are intriguing. I am next seeing if there is a simple way to modify the procedure to have either no double-backups or no backups at all. I am looking at more complicated versions of "direction" that would somehow indicate usefully the recent state of the string. Looking at the print out and thinking about this puzzle has led to lots of new thoughts. Thanks to the originator. I can see that this problem has only begun! -Jim Grant, Littleton MA.
203.50		R2ME2::GILBERT		`Tue Oct 08 1985 22:58`	64
	If you want a fast program to generate an arbitrarily long string in NR (and you aren't choosy about whether it's lexically first), the following program will generate N characters in O(NlogN) time and O(logN) space. Within a constant of proportionality, this is best possible. Can we now return to matters of importance, like whether the regular back-tracking algorithm will ever back up more than 5 characters, or how to quickly generate the first N characters of the lexically first string? module fast_nr (main=main, addressing_mode(external=general)) = begin external routine lib$get_vm, lib$put_output; bind convert = uplit( uplit(%asciz'abacabcacbabcbacabacbabc'), uplit(%asciz'abacabcacbabcabacbabcacbc'), uplit(%asciz'abacabcacbabcabacbcabcbabc')): vector; own stack; routine getch(ps) = begin bind s = .ps : ref vector[2]; local x; if s[0] eql 0 then begin lib$get_vm(%ref(2*%upval), s); s[0] = 0; s[1] = 0; return 'a'; end else if .s[1] eql 0 then begin s[1] = .convert[getch(s[0])-'a']; ch$rchar_a(s[1]); ! Skip the 'a' end; x = ch$rchar_a(s[1]); if .x neq 0 then return .x; s[1] = .convert[getch(s[0])-'a']; x = ch$rchar_a(s[1]); return .x; end; routine main = begin literal k_line = 80; local buff: vector[k_line,byte], desc: vector[2] initial(k_line, buff[0]); stack = 0; while 1 do begin incr i from 0 to k_line-1 do buff[.i] = getch(stack); lib$put_output(desc[0]); end; end; end eludom
203.51		R2ME2::GILBERT		`Tue Oct 08 1985 23:09`	6
	re 203.48: Finding a unique suffix string takes about as long as the search-back to check one character. The algorithm I'm using looks for a unique suffix every 1024 characters. The only use made of this is to limit the search-back amounts, and this is the only automated optimization that's working so far.
203.52		LATOUR::AMARTIN		`Tue Oct 08 1985 23:48`	39
	Re .45: >You're not using "DFA" for "deterministic finite-state automaton", are you? Yes, I am. >If so, there exists a single automaton capable of recognizing all finite >members of NR, Can you prove that the set of all finite strings in NR is regular? I haven't yet seen a proof that the language is context-sensitive or context-free, let alone regular. >although it requires some sort of infinite memory (such as the >tape for a Turing Machine). If so, it isn't a DFA. The only memory in a DFA is encoded in the current state number, and the set of states is finite. >On the other hand, things such as finite-state parsers will get larger >as the members of NR to be recognized increase. The increase will be >at least exponential: Indeed, it can't be any worse than exponential. It only takes 3k states to recognize any subset of strings of length <= k from a 3 symbol alphabet. >At a state in the middle of parsing a string, the parser >state will need to "encode" at least the last l/2 characters (where the length >of the string being parsed is l), requiring at least 3(l/2) states. Exactly the last l/2 characters, since the parser has only seen l/2 characters. The proof claims that the DFA for finite members of NR of length <= k has O(3*k) states. But the same proof can be used to prove that the DFA for all members of (a+b+c) with length <= k has O(3**k) states. (The proof makes no distinctions about the set of strings being recognized). Yet we know that the DFA has k+1 (O(k)) states. I don't believe the proof. /AHM
203.54		R2ME2::GILBERT		`Wed Oct 09 1985 00:59`	14
	re 203.48: Here are the accumulated CPU times for the algorithm, at 10K intervals: 10K - 1:32 20K - 4:53 30K - 10:04 40K - 25:43 50K - 32:25 60K - 53:24 From about 40K to 60K, the unique suffix stayed at 40K, so the going's slow..., it's practically duplicating the entire string! The algorithm will speed up for a little while after this hurdle.
203.55		BEING::POSTPISCHIL		`Wed Oct 09 1985 09:01`	48
	Re .51, .54: After I entered my response, I came to the same conclusion you did about the unique suffix string: It takes as long to find as the process of validating one new character. In fact, the two processes can be performed simultaneously, except that the search for the unique suffix apparently needs to continue back to the exact middle of the string, rather than stopping at the previously used unique suffix string. If we could find some way to show that the search could stop at that unique suffix string, we would be finding unique suffixes with almost no extra work. Thanks for the CPU-times, at least the task seems to be moderately reasonable. Re .52: >> At a state in the middle of parsing a string, the parser >> state will need to "encode" at least the last l/2 characters (where the length >> of the string being parsed is l), requiring at least 3(l/2) states. > > Exactly the last l/2 characters, since the parser has only seen l/2 > characters. The "middle" was not intended to mean exactly the middle. Once the parser gets beyond the exact middle, it must record more than the last l/2 characters. Some extra information must be known about what characters have matched previous characters. > The proof claims that the DFA for finite members of NR of length <= k > has O(3k) states. But the same proof can be used to prove that the DFA > for all members of (a+b+c)* with length <= k has O(3k) states. (The proof > makes no distinctions about the set of strings being recognized). Yet we know > that the DFA has k+1 (O(k)) states. I don't believe the proof. By "the proof", are you referring to the short remarks I made showing at least O(3k) states are required? If so, it certainly does make a distinction about the set of strings being recognized, although it is implicit (when I present a proof, you'll know it). When I stated 3(l/2) states were required, I did not say why. Consider the parser after at least l/2 characters have been read. For each of the 3(l/2) possible strings of l/2 characters that might have been just read, there exists a string of l/2 characters that could come next to invalidate the string. Each of these future-strings is different for each of the past-strings, so the parser must know, as it starts to examine each of the future-strings, _exactly_ which past-string was read. Hence, at least 3**(l/2) states are required. -- edp
203.56		GOLLY::GILBERT		`Wed Oct 09 1985 12:40`	17
	Re .51, .54, .55 (unique suffix strings): The starting position of the unique suffix string never moves leftward, and tends to stay in the same place during the generation of thousands of characters. Thus, looking for a new unique suffix once every 1000 generated characters is relatively cheap. It could be improved by checking whether the previous unique suffix is still the shortest unique suffix, by checking whether we're still generating the same stream of characters as were generated at the occurence of an "sUV" in the string. But improving this aspect doesn't seem particularly fruitful. Better would be to discover why the string is repeating, and copy the entire repetition in one fell swoop. By the way, I have some long equivalents of Bliss' ch$compare, ch$find_sub, and ch$move, if anyone is interested. These work with lengths up to 4 billion, instead of only working up to lengths of 65K. Send mail if you want 'em. - Gilbert
203.57		TOOLS::STAN		`Wed Oct 09 1985 16:15`	4
	Here's a reference to the original problem: Yaglom and Yaglom, Challenging Mathematical Problems with Elementary Solutions, volume 2, Holden-Day, problems 124 and 125.
203.58		SPRITE::OSMAN		`Thu Oct 10 1985 10:54`	16
	My source was MIT Articifical Intelligence Lab Memo #239 entitled "Hakmem". It's a collection of technical and mathematical problems, discoveries, observations etc. I have finally gotten BUMP working. The result is: Searching s(22840) or whatever that long string was that Mike Gilbert published, ALL VALID SUBSTRINGS through CBACB are found in it. CBACB is the first one not found ! Question: Will CBACB appear in lexically-first string ? Exactly where does it appear. If not, can someone prove why it will never appear ? /Eric
203.59		BEING::POSTPISCHIL		`Mon Oct 14 1985 18:11`	28
	In a procedure that is looking for the lexically-first infinite string of NR, suppose there is some sort of list of potential repetitions and that the current length of the string is l. I think all the procedure needs to do is check the (hopefully small) list of potential repetitions to see if each one continues, completes, or fails at this point and then check to see if any new potential repetitions have cropped up. To guarantee that any potential repetition is found at or before the time it is completed, there are only a few comparisons that need to be made for each position: Compare the character in position l-1 (with numbering starting at 0) to the character at position l-1-f, for each f less than l that divides l. If no match is found, there can be no potential repetition of length f ending anywhere in the next f characters, after which time the procedure will again check for possible repetitions of length f, because f divides l+f. If a match is found, it will be necessary to search back some ways, since a repetition may have started quite some time ago, although if it has completed, it must have completed only with the addition of the most recent character. There are some "minor" problems with this routine. For example, when a member of the list of potential repetitions is disqualified because it does not match a character, it cannot always be thrown away -- a conflict in the future might cause the procedure to backup beyond the disqualifying point, so they must be retained until they are relatively safe to discard (five positions?). Other than similar things, it looks quite promising, and I am working on implementing it. -- edp
203.60		KOBAL::GILBERT		`Tue Oct 15 1985 11:30`	66
	Re 203.59: I'm not sure I understand the comment about f being a divisor of l. Is this because the algorithm only checks for repeated substrings of length f every f characters? Re displaying strings: A Lempel-Ziv style algorithm gives another convenient way to 'display' the string. For example, if we let "(offset,length)" indicate the string offset and length, then the first 1095657 characters can be written: abacabcacbabcabacabcacbacab(7,17)bcabacb(17,29)(22,22)(24,26)(42,38) (46,28)(42,29)bcab(28,22)(115,111)(126,84)c(220,22)ac(109,17)(226,124) (226,108)(434,175)(22,58)ab(877,35)(464,233)(436,447)(1123,336)(200,105) (213,75)(118,222)(3328,63)(1211,2139)(2125,293)(213,61)(6735,2416) (1196,1156)(3350,2578)(3396,2429)(1196,837)(12815,5007)(935,1054) (9182,15322)(5927,538)(40036,270)(6465,34115)(6465,33847)(981,2369) (2125,73)(198,152)(226,94)(883,5045)(935,43)(912,2490)(883,2508) (6750,68215)(40580,485)bacab(1208,538)(190287,268)(436,255)(2008,166) (18697,4983)(196,33)b(874,5054)(877,2511)(195368,8684)(196534,6303) (12815,3839)(3350,83)(118,222884)(190290,32709)(190294,32700) (190293,32696)(24,543949)(949,1153)(6687,6557) Please forgive any typos. Also, an independent verification of the first million characters is requested. Re commonly occuring substrings: Can two occurences of the string "aba" be arbitrarily far apart in an infinitely long string of NR? I think so. However, it appears that there must be an infinite number of occurences of the string "abc" (that is, there's a small bound on how far apart they can be). Re backing up: How far might we need to back up? Note that, to produce a 'bad' pattern that causes a large backup, there must be several repeated substrings of diffferent lengths (and with lengths greater than, say 6), all ending at about the same place, e.g.: [-----------][-----------] [---------------][---------------] Now we can equate parts of these strings, and we see: [defghij----][defghij----] [--------defghij][----defghij----] -or- [----defghij----][----defghij----] Thus, we see that we must've already produced a repeated substring (defgdefg). That is, the longer of these two repeated substrings is 'too short'. At the smallest it must be about twice as long as the shorter repeated substring: [defghijklm-][defghijklm-] [fghijklm---defghijklm-][fghijklm---defghijklm-] With a little precision, we should be able to limit the number of repeated substrings that can end within a small range of r characters to something like log2(l)-log2(r), where l is the length of the string generated so far. Because a backup of length greater than r requires 3r repeated substrings, and a significant number of these repeated substrings (how many?) must have length greater than 2r, we should be able to limit the maximum backup amount to something like log2(l). - Gilbert
203.61		BEING::POSTPISCHIL		`Tue Oct 15 1985 12:02`	10
	Re .60: > I'm not sure I understand the comment about f being a divisor of l. Is this > because the algorithm only checks for repeated substrings of length f every > f characters? Yes. -- edp
203.62		R2ME2::GILBERT		`Wed Oct 16 1985 12:28`	15
	abacabcacbabcabacabcacbacab(7,17)bcabacb(17,29)(22,22)(24,16)(42,38)(46,28) (42,29)bcab(28,22)(115,111)(126,84)c(220,22)ac(109,17)(226,124)(226,108) (434,175)(22,58)ab(877,35)(464,233)(436,447)(1123,336)(200,105)(213,75) (949,1176)(126,45)(6,18)(881,2515)(1243,503)(423,34)(118,222)(3328,63) (1211,2139)(2125,293)(213,61)(6735,2416)(1196,1156)(3350,2578)(3396,2429) (1196,837)(12815,5007)(935,1054)(9182,15322)(5927,538)(40036,270)(6465,34115) (6465,33847)(981,2369)(2125,73)(198,152)(226,94)(883,5045)(935,43)(912,2490) (883,2508)(6750,68215)(40580,485)bacab(1208,538)(190287,268)(436,255) (2008,166)(18697,4983)(196,33)b(874,5054)(877,2511)(195368,8684)(196534,6303) (12815,3839)(3350,83)(118,222871)(24,445831)(949,1153)(121523,68735) (9182,30860)(892863,68736)(190258,255622)(222989,121424)(892864,544175) (219087,120355)(201595,11189)(204052,1216)(111502,5053)(115,229) (2115176,5272)(434,310)(2115176,10851)(6465,439415)(222989,5832) (190292,38522)(992523,443260)(195368,14948)(2620660,453155)(126,224) (874,445006)(222989,639)(6201,22108)
203.63		R2ME2::GILBERT		`Wed Oct 16 1985 14:05`	50
	The preceding represents the first 4,000,000 characters of the lexically first infinitely-long string in NR, and is displayed using the technique described in 203.60 (routines for decrypting and displaying this format will be available shortly). Excluding time spent debugging and tuning, generating these 4M characters took under 4 CPU hours. The biggest performance gain was by using the extension technique in 203.40. Most of the extensions were of several thousand characters, and some were of hundreds of thousands of characters. At some places in the string, the search-back was taking 5 or 6 seconds to check a single character! (and it only searched back to (twice) the length of the shortest unique suffix string). There were many 'backup' amounts of 5 (one at offset 2582138 was interesting), but no 'backups' of 6 or more. I need to re-run the program to collect information about 'search-back' amounts, as I think they may be useful in understanding how this string behaves, and in displaying the string. The display used in 203.62 is not particularly useful, although it is terse (compare with 203.7, which is only 0.5% the length). Because it searches for longest matches, part of a long match may obscure the fact that the next chunk starts at a very small offset; for example, I suspect (24,445831) really represents a repetition that started at offset 2. The macro display technique in 203.31 is better (patterns are readily visible), but the particular choice of macroes is poor, because some macroes are prefixes of others. Can a better choice of macroes be found? It seems reasonable that there might be a small set of macroes that could be used, such that the set has the 'uniquely identifying substring' property, as described in 203.8. That is, any 'long' repeated substring must 'split' between the two macroes. This should give a better set of macroes (so that the string can be displayed better), and might mean that the the algorithms could be applied to the macro names, instead of the characters a, b and c. Looking at the 'search-back' amounts, and the strings that nearly repeat should help find such a set of macroes (if they can indeed be found). It is still interesting to wonder why there are so few different 'search-back' amounts. There are only 10 different 'search-back' amounts greater than 64 in the first 85K characters (see notes 203.15 and 203.31). This means that, within the first 85K characters, at most 74 (64+10) string comparisons are needed to check whether a character can be appended to the string. Notes 203.23 and .24 ask about counting the number of 'valid' substrings of a particular length. Can any upper or lower bounds be given? How does the function seem to grow? Also, could Eric's BUMP program be used to determine whether there's a small bound on how far apart occurences of "aba" may be? - Gilbert
203.64		SPRITE::OSMAN		`Wed Oct 16 1985 16:18`	32
	Here's an algorithm that I believe will test to see if a new last letter is valid fairly efficiently. -------------------------- beginning of algorithm --------------------- Start with test substring equal to last letter. Start with position to search backwards from as end of string. [A:] Search backwards for second occurrence of substring. If substring not found, entire string is valid. If substring is found in position just to left of first occurence, entire string is invalid. We found second occurrence farther back than immediately abuting first occurrence, so . . . Set test substring to everything from just to right of second occurrence through last letter. Set position to search from to be just to right of that second occurence. Go back to [A:]. --------------------------- end of algorithm ------------------------- If we arrange string BACKWARDS in memory, then the vax MATCHC instruction can be used for the searching. I'm experimenting with an implementation of this. /Eric
203.65		BEING::POSTPISCHIL		`Wed Oct 16 1985 18:27`	18
	The procedure I described previously (search for partial repetitions of length f every f characters and observe them as more characters are added to make sure they do not complete) requires, when finding the first 1000 characters, an average of 10.5 comparisons of one character to another to determine whether a new character makes the string invalid or not. However, it is still slower than Gilbert's program. Either Gilbert makes up for comparisons by occasionally being able to copy large stretches of the string or I lose by the overhead involved in factoring numbers and handling data structures. I will be working on the program some more, since there are lots of places I know improvements can be made, but if anybody wishes to look at it, everything is in BEING""::USER1:[POSTPISCHIL.PUB]. The sources are in .H and .C. Ignore the X.C file, it is a version of NR.C that collects some extra statistics as it works. The program is in NR.EXE. Run it and type in the number of characters you would like it to find. -- edp
203.66		LATOUR::AMARTIN		`Wed Oct 16 1985 22:48`	26
	Re .63: There is a paper titled "Analyzing and Compressing Assembly Code" by Christopher W. Fraser, Eugene W. Meyers and Alan L. Wendt of U of Arizona on pp 117-121 of the proceedings of the Sigplan '84 Symposium on Compiler Construction (which was published as Vol 19, #6 (Jun-85) issue of SIGPLAN Notices. The authors describe an algorithm to compress straight-line assembly code into subroutine calls to save space (the algorithm looks for patterns in the code, and creates subroutines out of repeated code sequences). The paper talks about using a Patricia tree to find the patterns. References for this point to: 3. D. E. Knuth, "The Art of Computer Programming, Volume 3: Sorting and Searching" (for Patricia trees). 4. E. M. McCreight, "A Space-Economical Suffix Tree Construction Algorithm", JACM V23, #2 (Apr-76), pp 262-272 (also for Patricia trees). 5. M. Rodeh, V. R. Pratt and S. Even, "Linear Algorithm for Data Compression via String Matching", JACM, V28, #1 (Jan-81), pp 16-24. (A different approach?). Perhaps there is something worthwhile in these references. I haven't looked at the 3 latter papers, so I don't know if any of this is redundant. /AHM
203.67		ALIEN::POSTPISCHIL		`Thu Oct 17 1985 13:37`	20
	I've got my program into something resembling a decent shape. It takes 34 CPU-seconds (on a VAX-11/780) to find the first 10,000 characters, as opposed to Gilbert's 92 seconds. Better than that, it appears to be increasing only a little more than linearly: number actual (number/2000)* found time (time for 2000) 1000 -- 2.9 2.9 2000 -- 5.9 5.8 3000 -- 9.0 8.7 4000 -- 12.1 11.6 5000 -- 15.6 14.5 10000 -- 34.2 29.0 Gilbert, what type of CPU were your times for? What language did you work in? If everything goes well, I will be ready to compare our first million characters soon. -- edp
203.68		ALIEN::POSTPISCHIL		`Thu Oct 17 1985 13:57`	5
	Gilbert's first 22,528 characters agree with mine. It took my program 86.6 CPU-seconds to find them (86.6/34.2 = 2.53, 22528/10000 = 2.25). -- edp
203.69		R2ME2::GILBERT		`Sat Oct 19 1985 23:28`	8
	Two occurences of "abc", can be separated by no more than 25 characters. But two occurences of "aba" can be 1000 characters apart (by construction). Can we prove that they can be arbitrarily far apart? Note that the following simultaneous substitutions could almost (but not quite) be used to prove this: a -> abc b -> ac c -> b
203.70		TOOLS::STAN		`Sun Oct 20 1985 13:06`	28
	Some more references: Richard K. Guy, Unsolved Problems in Number Theory. Springer-Verlag. New York: 1981. pp 124-125. Problem E21. F. M. Dekking, On repetitions of blocks in binary sequences. J. Combin. Theory Ser. A, 20(1976)292-299. F. M. Dekking, Strongly non-repetitive sequences and progression-free sets. J. Combin. Theory Ser. A 27(1979)181-185. R. C. Etringer, D. E. Jackson and J. A. Schatz, On non-repetitive sequences. J. Combin. Theory Ser. A 16(1974)159-164. D. Hawkins and W. E. Mientka, On sequences which contain no repetitions. Math. Student 24(1956)185-187. G. A. Hedlung, Remarks on the work of Axel Thue on sequences. Nordisk Mat. Tidskr. 15(1967)147-150. MR 37 #4454. Helmut Prodinger and Friedrich J. Urbanek, Infinite 0-1 sequences without long adjacent identical blocks. Discrete Math. 28(1979)277-289. A. Thue, Uber unendliche Zeichenreihen, Norse Vid. Selsk. Skr. I Mat.-Nat. Kl. Christiania (1906), No. 7, 1-22. A. Thue, Uber die gegenseitige Lage gleicher Teile gewisser Zeichenreihen, ibid. (1912), No. 1, 1-67.
203.71		DEMON::OSMAN		`Mon Oct 21 1985 15:46`	6
	Unfortunately, the algorithm in 203.64 seems to be awful. Apparently, the search matches in more places than I expected. Hence we end up doing to much work. /Eric
203.72		DEMON::OSMAN		`Mon Oct 21 1985 15:48`	9
	I'm curious about that problem E21 in "Unsolved Problems in Number Theory" (see 203.70). Is the "unsolved problem" that of proving the existence of arbitrarily long NR strings ? In which case Gilbert is now famous ? If not, can someone type in the problem ? /Eric
203.73		TOOLS::STAN		`Mon Oct 21 1985 16:06`	3
	Problem E21 is a hodge-podge of known results and open problems. A typical open problem is the one I entered here in MATH.NOT as note 360.
203.74		ALIEN::POSTPISCHIL		`Mon Oct 21 1985 17:45`	23
	Re .71: I'm not sure I understand the algorithm in .64. I thought I did until I saw .71. There is a simple way to check to see if the most recently-added character caused a repetition, and it requires a number of character-to-character comparisons less than or equal to half the length of the string. Set i to the last position in the string. Set j to the previous position. [Label A] Compare character i to character j. If they are equal and the length of the string minus j equals j minus i, the string has a repetition, so quit. If they are equal, decrement i. Otherwise, set i to the last position in the string. Decrement j. If j is non-negative, go to A. The string has no repetition, quit. Gilbert has improved upon this by limiting the search, so that it is not necessary for j to go all the way to zero. He also is able to avoid the check sometimes by copying many characters from previous places in the string. I have improved on this by coordinating my checks for substrings based on their lengths. -- edp
203.75		SPRITE::OSMAN		`Tue Oct 22 1985 14:08`	28
	There's something wrong with .74. Here's the beginning of it, reprinted: >Set i to the last position in the string. Set j to the previous position. >[Label A] Compare character i to character j. If they are equal and the >length of the string minus j equals j minus i, the string has a repetition, >so quit. Let's try it. Suppose we've got ...AA and we're using .74 to check it. Let's assume we've now got 100 characters. "Set i to the last position in the string." i = 100 "Set j to the previous position." j = 99 "Compare character i to character j. Compare "A" with "A". "If they are equal" yup! "and the length the string minus j" 100-99 is 1 "equals j minus i" 99-100 is -1 Not equal! Algorithm hence doesn't quit here, and I would have expected it to. /Eric
203.76		BEING::POSTPISCHIL		`Tue Oct 22 1985 15:20`	9
	Re .75: Oops, that should read "and the length of the string minus i equals i minus j, the string has a repetition . . .". Also, start counting positions at zero (so the last position in a string of length 100 is 99) or insert a "plus one" after "length of the string". -- edp
203.77		R2ME2::GILBERT		`Tue Oct 29 1985 21:19`	52
	I happened to run across this problem in "Jewels of Formal Language Theory", by Arno Salomaa (Computer Science Press). I just happened to be looking for a nearby book in the ZK library, happened to pick this up, and this was the first problem! It's known as Thue's Problem, after Axel Thue, a formal language theorist of the early 1900s. There are actually three problems; we've been looking for square-free w-words (omega-words are 'infinitely long' strings) over an alphabet of 3 letters. Similarly, there's the problem of finding cube-free w-strings. Here's an outline of how the problem was solved there: Define the sequence of words w(i+1) = w(i)w'(i), w(0) = "a", and x' denotes the string obtained from x, by interchanging "a" and "b". Thus, w(5) is: a b ba baab baababba baababbaabbabaab (spaces for readability). Let y be the w-word generated by this (note the similarity between this string and the 'parity' of the integers). Then, it's shown that y is cube-free. Now, we define another w-word, z, based on adjacent characters in y (not really pairs though -- this is a length-preserving homomorphism), thus: [aa] -> 1, [ab] -> 2, [ba] -> 3, [bb] -> 4. In this notation, z begins: 2432312431232432312324312432312... Note that 1 is always preceded by a 3 and followed by a 2; a 4 is always preceded by a 2 and followed by a 3. The string z is square-free. By replacing each occurence of 4 in z with a 1, we get a square-free string over an alphabet of 3 letters. A stronger notion that square-freeness is also considered. A w-word over an alphabet of n letters is called "irreducible" if, whenever it contains a subword xyx (where x is non-empty), then \|y\| >= n-2. For n=3, this is equivalent to square-freeness. Also, the following theorem is mentioned: There is a w-word w over an alphabet of n letters such that, whenever xyx (where x is non-empty) is a subword of w, then \|y\| >= \|x\|/3. This result is proved by starting with "a" and using the substitutions (such a deterministic generation is called a D0L system): a -> abc acb cab c bac bca cba b -> bca bac abc a cba cab acb c -> cab cba bca b acb abc bac The 1/3 is best possible, and the above substitutions cannot be 'shortened'. The string y above (2432...) can be generated by the D0L system: 1 -> 2431, 2 -> 2432, 3 -> 3123, 4 -> 3124, starting with 2. The following morphism is what Thue had originally used: a -> abcab, b -> acabcb, c -> acbcacb. The 'permutation-square-free' problem isn't mentioned.
203.78		R2ME2::GILBERT		`Sat Nov 02 1985 12:17`	13
	While it lasts, the file HARE::SYS$PUBLIC:ABC.A is a backup save-set containing a variety of routines I've used in working on the NR and NP problems (including routines to compress/decompress the strings in the LZ-style format). The save- set also includes a DESCRIP.MMS file for building the various executables. The programs for finding the (beginning of the) lexically first w-string in NR and NP initialize the string with what's known so far (see NRSTRING.REQ and NPSTRING.REQ), and continue from there. If any errors are found (particularly in the UTIL$ or CH$ routines), please let me know (preferably by MAIL). - Gilbert
203.79		R2ME2::GILBERT		`Sat Nov 02 1985 13:08`	42
	Since we've the first 6 million characters (I'm still puzzled about how Eric Postpischil's program got so far so fast extending only a single character at a time), it's about time we started to seriously analyze the string, trying to find some patterns. For example, assuming the string can be generated by a D0L system (i.e., just a bunch of substitutions on successively longer strings), how long must the substitution strings be? What if each step of the substitutions takes the string (x), makes substitutions for each of the characters, and then prefixes the result with some short string? (this will solve the problem with the first 20 characters never reappearing). What if, instead of doing substitutions of 4 different characters in each step, we have (say) 10 different characters, and then to get the string, we do a simple substitution of these to get just 4 characters? With the decompress subroutines (see 203.78), Eric Osman should be able to find the shortest substring that doesn't occur in the first 6M characters (but which could occur). Could someone give a plausible argument to explain the character frequencies (note that "b"s occur significantly more than 1/3 of the time); also, what are the frequencies for letter pairs (do "ac" and "ca" occur with roughly equal frequency?). The ratios of "a"s/"b"s is not equal to the ratio of "b"s/"c"s. I'd expect these to be about the same (since the probability that an "a" can be appended should be about the same as that for appending a "b" or "c" -- or should it, given the different frequencies of the characters?). Maybe the probabilities start out about the same, but backing up causes "a"s to be more likely? This idea may help explain the letter frequencies. I'm unsure whether I was correctly figuring the backup amounts -- it now seems that there were backups of 9 characters (within the first 1000 generated). could someone independently check this? Asymptotically, how many strings are there of length n? When appending a character, there's a 1/3 chance (roughly) it's disallowed by a repetition of length 1, and a 1/4 (?) chance it's disallowed by a repetition of length 2, and a 1/8 (?) chance it's disallowed by a repetition of length 3. In all the strings of length n, do "abca" and "abac" occur equally often? - Gilbert
203.80		SPRITE::OSMAN		`Mon Nov 04 1985 15:44`	73
	I have a respectable time now for a program that computes the LFS (lexically-first string). I never did understand you guy's "extension technique", so my program doesn't use it. This, I believe adds to the respectability of my time. Here's a tally of the times I know about: Who Number of characters time note, date --- -------------------- ---- ----------------- SPRITE::OSMAN 10K 00:00:49 203.80 11/4/85 40K 00:09:49 * 203.80 BEING::POSTPISCHIL 10K 00:00:34 * 203.67 10/17/85 40K ??????? GOLLY::GILBERT 10K 00:01:32 203.54 10/9/85 40K 00:25:43 203.54 * indicates currently known record Feel free to reproduce the table in other replies with updates as necessary. I've been working on my program gradually. Although it's not very long, it has virtually no comments in it (yeah, I know, booo, hisss). It's a combination of BLISS and C, plus Gilbert's CH$ module. The only reason I use BLISS instead of all C, is C's refusal to allow me to get at in-line use of INSQUE and CMPC3 and MATCHC instructions. The main features of my program: o The program works with a "group size"(gs), typically 4. This is the number of characters appended at once. o The program initializes a "groups" table which contains all possible strings of gs characters. For gs=4, the table contains ABAC, ABCA, ABCB, ACAB, ACBA, ACBC, BABC, etc. Then program initializes "next_abut" table which indicates which group of four characters to try next if a group fails. For instance, the table indicates that if we've just decided that "...ABAC ABCB" is invalid, the table tells us what to replace the "ABCB" with. Even though "ACAB" is next in the group table, the next_abut table automatically skims us over that "obvious" invalidity of "...ABAC ACAB". o The program works with a "longsize", typically 8. When checking for validity, substring lengths 1 through longsize are checked the standard way, namely by using CH$EQL. For longer substrings, CH$COMPARE_L (Gilbert's generalized CH$COMPARE) is called only for substrings ending with the same 8 characters as the current last 8. The locations of these are remembered in linked lists as I go, rather than having to be scanned or searched for. The times I reported up above for my program were computed from: $ show process/accounting run program $ show process/accounting The "elapsed CPU time" is subtracted. My time includes the time it took the program to print out the string, which was done in batch, so we're dealing with PRINTF to SYS$OUTPUT which is a .LOG file. I haven't measured how much time is spent doing just the output, but I suspect it's not much. I use a loop, printing 80 characters per PRINTF. /Eric
203.81		BEING::POSTPISCHIL		`Mon Nov 04 1985 17:02`	25
	Who Number of characters time note, date --- -------------------- ---- ----------------- SPRITE::OSMAN 10K 00:00:49 203.80 11/4/85 40K 00:09:49 203.80 BEING::POSTPISCHIL 10K 00:00:34 * 203.67 10/17/85 40K 00:02:48 * 203.81 11/4/85 1M 01:48:30 203.81 11/4/85 4M 08:52:42 203.81 11/4/85 GOLLY::GILBERT 10K 00:01:32 203.54 10/9/85 40K 00:25:43 203.54 4M 03:??:?? * ? ? * indicates currently known record My times are figured in the same way as Eric's, although I do not know that we are using the same type of CPU. They are all for the same program; the amount of time seems to be between n log n and n log**2 n. I think Gilbert's times are for different programs, and it may be hard to predict how they increase. -- edp
203.82		BEING::POSTPISCHIL		`Mon Nov 04 1985 19:16`	27
	Re .62, .63: My results differ from Gilbert's. The compression of my result, using a routine provided by Gilbert, follows the form-feed. Strangely, our compressions differ in the second line but are then identical until the eleventh line. -- edp ABACABCACBABCABACABCACBACAB(7,17)BCABACB(17,29)(22,22)(24,16)(42,38)(46,28) (42,29)B(114,12)(4,16)(118,108)(126,84)C(220,22)AC(109,17)(226,124)(226,108) (434,175)(22,58)AB(877,35)(464,233)(436,447)(1123,336)(200,105)(213,75) (949,1176)(126,45)(6,18)(881,2515)(1243,503)(423,34)(118,222)(3328,63) (1211,2139)(2125,293)(213,61)(6735,2416)(1196,1156)(3350,2578)(3396,2429) (1196,837)(12815,5007)(935,1054)(9182,15322)(5927,538)(40036,270)(6465,34115) (6465,33847)(981,2369)(2125,73)(198,152)(226,94)(883,5045)(935,43)(912,2490) (883,2508)(6750,68215)(40580,485)BACAB(1208,538)(190287,268)(436,255) (2008,166)(18697,4983)(196,33)B(874,5054)(877,2511)(195368,8684)(196534,6303) (12815,3839)(3350,83)(118,222871)(24,445831)(949,1153)(121523,68735) (9182,30860)(892863,68736)(190258,255622)(222989,121424)(892864,743265) (878,660)(191480,5035)(1209,519)(9182,30820)(7533,1649)(2218000,1243) (154851,35435)(6680,33061)(40015,33837)(2218540,2189)(2221419,67893) (9182,13273)(949,1084)(971076,21385)(8968,8819)(2391451,15669)(2377177,63411) (2325994,189397)(118,445656)(2407120,110368)(2440588,39592)(9182,837) (967895,24566)(8968,6462)(118,445762)(222989,9018)(965232,27229)(8968,2656) (12815,13157)(1196,1098)(213,78)(121586,68700)(6680,19)(342,532)(350,234) (434,51)(883,97100)
203.83		GOLLY::GILBERT		`Mon Nov 04 1985 22:12`	12
	The difference on the second line is due to a minor change to the display routine. Namely, the routine's been changed to use shorter matched strings. If you check, you'll see that these are really the same. Regarding the problem on the eleventh line, it appears that you missed the 545377-character repetition, ending at offset 1982418. Although this isn't the first repetition that's greater than 65535 bytes in length (which is the longest length comparable with a single VAX CMPC instruction), I suspect you got lucky with the previous long repetitions, by deciding that you needed to back up to those characters and replace them with the next larger character. - Gilbert
203.84		BEING::POSTPISCHIL		`Tue Nov 05 1985 10:43`	15
	Re .83: I looked at the machine code produced by the C compiler, and, although I am not familiar with the VAX instruction set, it does not seem to be using the CMPC instruction or any other multiple-byte instruction. I guess that means there is a bug in my program, although it is strange for it to show up so far into the string. The length of the string at the point where our results diverge is 1982417, a prime number. I wonder if that has anything to do with it (my method relies heavily on factoring numbers). Also, the longest backup my program encountered was six (counting changing an "A" to a "B" as a backup of one). -- edp
203.85		R2ME2::GILBERT		`Tue Nov 05 1985 11:30`	4
	Oh. Now I realize why I'm figuring longer backup amounts. This is because of the way I've coded the extension technique, it sometimes shortens the string (by up to 10 characters), and that's figured into the maximum backup amount. I'll need to adjust the program. Sorry about the false alarm.
203.86		R2ME2::GILBERT		`Tue Nov 05 1985 20:04`	19
	I changed my program so that it doesn't display the LZ-style string every 256 characters, and timed it. Who Number of characters time note, date --- -------------------- ---- ----------------- SPRITE::OSMAN 10K 00:00:49 203.80 11/4/85 40K 00:09:49 203.80 BEING::POSTPISCHIL 10K 00:00:34 203.67 10/17/85 40K 00:02:48 203.81 11/4/85 1M 01:48:30 203.81 11/4/85 4M 08:52:42 203.81 11/4/85 CLT::GILBERT 10K 00:00:11 * 203.86 11/5/85 40K 00:00:28 * 203.86 11/5/85 1M 00:13:39 * 203.86 11/5/85 4M 01:15:00 * 203.86 11/5/85 * indicates currently known record
203.87		R2ME2::GILBERT		`Tue Nov 19 1985 11:46`	36
	The following gives the first 6000100 characters of the lexically first omega-word in NR, where NR is the set of all square-free strings over the alphabet {a,b,c}, and an omega-word is an arbitrarily long (or infinitely long, if you will) string. See also Thue's problem. Except for the parenthesized numbers, the notation is self-evident. The parenthesized expressions (offset,length) represent a string of 'length' characters at that point in the string, that are the same as the 'length' characters in the string, starting at offset 'offset'. Note that the (offset,length) is only used to reference a previously occuring substring. For a short example, the string: "abacabcacbabcabacabcacbacab" Could be so represented by: "abac(0,2)cacbabca(3,8)cab" Since characters 0 thru 1 are "ab", and characters 3 thru 10 are "cabcacba". abacabcacbabcabacabcacbacab(7,17)bcabacb(17,29)(22,22)(24,16)(42,38)(46,28) (42,29)bcab(28,22)(115,111)(126,84)c(220,22)ac(109,17)(226,124)(226,108) (434,175)(22,58)ab(877,35)(464,233)(436,447)(1123,336)(200,105)(213,75) (949,1176)(126,45)(6,18)(881,2515)(1243,503)(423,34)(118,222)(3328,63) (1211,2139)(2125,293)(213,61)(6735,2416)(1196,1156)(3350,2578)(3396,2429) (1196,837)(12815,5007)(935,1054)(9182,15322)(5927,538)(40036,270)(6465,34115) (6465,33847)(981,2369)(2125,73)(198,152)(226,94)(883,5045)(935,43)(912,2490) (883,2508)(6750,68215)(40580,485)bacab(1208,538)(190287,268)(436,255) (2008,166)(18697,4983)(196,33)b(874,5054)(877,2511)(195368,8684)(196534,6303) (12815,3839)(3350,83)(118,222871)(24,445831)(949,1153)(121523,68735) (9182,30860)(892863,68736)(190258,255622)(222989,121424)(892864,544175) (219087,120355)(201595,11189)(204052,1216)(111502,5053)(115,229) (2115176,5272)(434,310)(2115176,10851)(6465,439415)(222989,5832) (190292,38522)(992523,443260)(195368,14948)(2620660,453155)(126,224) (874,445006)(222989,639)(6201,68368)(3532256,513853)(126,48)(1061231,377010) (121522,68761)(190294,222870)(111500,78547)(4560153,445543)(190286,222406) (6465,24802)
203.88		SPRITE::OSMAN		`Tue Nov 19 1985 14:12`	5
	Does "CBACB" occur in that string ? If not, it's the shortest valid substring that seems to never occur. It's also the only valid substring of length 5 that seems to never occur. /Eric
203.89		SPRITE::OSMAN		`Tue Nov 19 1985 14:13`	4
	Also, have you still never seen a backup of more than 6 places ? (or is that no longer easily computable, given the extension technique) /Eric
203.90		FUTBAL::GILBERT		`Tue Nov 19 1985 22:01`	7
	Yes, "CBACB" occurs. The first occurence is at offset 2582135 in the string. Yes, I had a backup of 7. I'd originally had doubts, since after this occured, I changed the program (for other reasons) so that it misreported backup amounts, and I wasn't sure whether it was misreporting them, or all the previous stuff about backup amounts was wrong. Unfortunately, I've forgotten where in the string I found this long backup amount, although I suspect it's near "CBACB".
203.91		SPRITE::OSMAN		`Wed Nov 27 1985 12:13`	20
	Although I haven't broken any more records, here's an updated table showing my current bests, still avoiding using th extension technique. Who Number of characters time note, date --- -------------------- ---- ----------------- SPRITE::OSMAN 10K 00:00:21 203.91 11/27/85 40K 00:04:56 203.91 BEING::POSTPISCHIL 10K 00:00:34 203.67 10/17/85 40K 00:02:48 203.81 11/4/85 1M 01:48:30 203.81 11/4/85 4M 08:52:42 203.81 11/4/85 CLT::GILBERT 10K 00:00:11 * 203.86 11/5/85 40K 00:00:28 * 203.86 11/5/85 1M 00:13:39 * 203.86 11/5/85 4M 01:15:00 * 203.86 11/5/85 * best time so far /Eric
203.92	what's the simplest substitution recipe that produces arbitrarily long NR strings ?	HANNAH::OSMAN	see HANNAH::IGLOO$:[OSMAN]ERIC.VT240	`Mon Dec 16 1991 10:25`	46
	I conjecture that if you start with a string of a's, b's, and c's in which no substring adjacently repeats, the following substitutions will produce a longer one: a -> acb b -> acab c -> acbacab I haven't "proved" this, however I have a simple program that starts with "a" and makes and verifies the substitutions over and over until running out of memory: Good string, length = 1 Good string, length = 3 Good string, length = 14 Good string, length = 62 Good string, length = 276 Good string, length = 1228 Good string, length = 5464 Good string, length = 24312 Good string, length = 108176 Good string, length = 481328 Good string, length = 2141664 Good string, length = 9529312 Failed to malloc 42400576 bytes for new string. Questions: o Can someone provide a simple proof, using Gilbert's general proof in .8, that my substrings indeed are valid generators, or else find a counterexample of the form "start with X, and perform the substitutions n times, and you'll have an invalid string". o My substitutions are alot simpler than the ones Gilbert gave us in .8. Can anyone find even simpler ones ? How simple can we be ? A substitution must at least make the string longer in order to be "interesting". o <insert meaningful question here that relates the lexically-first string (see previous replies for definition> with strings generated by substitutions> Thanks. /Eric
203.93		ALIEN::EDP	Always mount a scratch monkey.	`Mon Dec 16 1991 13:34`	9
	Re .92: Using your substitutions, "abc", which has no adajacent repeating substring, goes to "acbacab acbacab". Also, doesn't "a" go to "acb" which goes to "acb acbacab acab", which contains both "acb acb" and "acab acab"? Hmm, was the string for "c" typed incorrectly? -- edp
203.94		HANNAH::OSMAN	see HANNAH::IGLOO$:[OSMAN]ERIC.VT240	`Mon Dec 16 1991 14:54`	11
	Whoops, embarrassment, bug in my program. My routine called "invalid" was coded to return 0 if string is valid, and i if string is invalid, with i pointing to start of the substring that has an adjacent duplicate. Unfortunately, if i was 0... I'll attempt to fix it by returning the i of the second string instead of the first, which can never be 0, which will hence disambiguate. Hang on for more news (if I indeed find some simple generators with the fixed program). /Eric
203.95	simplest possible generators ?	HANNAH::OSMAN	see HANNAH::IGLOO$:[OSMAN]ERIC.VT240	`Tue Dec 17 1991 16:34`	50
	I wrote a program to first fill a table with the simplest valid strings that start with "a" and end with "b", and then see which combinations of three of them can be repetively simultaneously substituted for a,b,c and produce a valid expanded string. The simplest valid strings the program found are: Filled strs[0] = "ab" Filled strs[1] = "acb" Filled strs[2] = "abcb" Filled strs[3] = "acab" Filled strs[4] = "abacb" Filled strs[5] = "abcab" Filled strs[6] = "acbab" Filled strs[7] = "abacab" Filled strs[8] = "abcacb" Filled strs[9] = "abcbab" Filled strs[10] = "acabcb" Filled strs[11] = "acbcab" Filled strs[12] = "abacbab" Filled strs[13] = "abcbacb" Filled strs[14] = "acabacb" Filled strs[15] = "acbabcb" Filled strs[16] = "acbacab" Filled strs[17] = "acbcacb" Filled strs[18] = "abacabcb" Filled strs[19] = "abacbcab" The program tried these in triplets, always starting with a seed of "a", and used the members of the triplet to substitute for a,b,c. Whenever the program could expand 6 times and still have a valid string, it printed out the triplet. Only one triplet passed the test, so presumably this is the simplest possible generator. That triplet is: Suspected good recipe: a->abcab b->acabcb c->acbcacb The program claims that all six possible triplets obtained by permuting the assignments pass the 6-level test. Of course, without more rigourous proof, it's still possible that this generator shows a duplication at some level deeper than 6, so the same sort of questions as before: o Can someone apply Gilbert's proof and hence determine if this generator is in fact valid ? Thanks. /Eric
203.96	"Great minds ..."	CLT::TRACE::GILBERT	Ownership Obligates	`Tue Dec 17 1991 18:36`	10
	Nice work, Eric! Could you apply this approach to the 'permutation-square-free' problem? (i.e., adjacent words are not permutations of each other). I checked note .77 to see what 'Gilbert's proof' might be. I found the following at the bottom: >The following morphism is what Thue had originally used: > a -> abcab, b -> acabcb, c -> acbcacb.
203.97	but proof is farther back, right ?	HANNAH::OSMAN	see HANNAH::IGLOO$:[OSMAN]ERIC.VT240	`Wed Dec 18 1991 09:11`	16
	>I checked note .77 to see what 'Gilbert's proof' might be. I found >the following at the bottom: But (your) Gilbert's proof isn't in .77, it's way back in .8, right? I'd still like to see the proof explicitly applied to the shorter generator I (and .77) specified, to make the proof a bit more clear. Thanks. /Eric p.s. If anyone wants my program in order to explore the permutation problem, I'll make you a copy for free.