T.R | Title | User | Personal Name | Date | Lines |
---|
203.1 | | TURTLE::GILBERT | | Thu Jan 03 1985 23:31 | 23 |
| The answer is "forever". Here is the beginning of the lexigraphically
first such sequence. It's broken up into pieces lines for readability.
ABACABCACBABC ABACABCACBAC ABACBABC ABACABCACBABC ABACBABCACBAC
ABACBABC ABACABCACBABCBAC ABACBABC ABACABCACBAC ABACBABC
ABACABCBABC ABACBABCACBAC ABACBABC ABACABCACBABC ABACBABCACBAC
ABACBABC ABACABCBABC ABACBABCACBAC ABACBABC ABACBCABCBABC
ABACABCACBABC ABACABCBABC ABACBABC
Note that the same pieces reoccur again and again (this, of course, must happen
in an infinite sequence).
Here is a constructive proof showing that such a sequence can be infinite.
Consider the three pieces: ABACABCACBABC, ABACABCACBAC, and ABACBABC.
Note that none of these pieces can be found in a concatenation of the pieces,
except as an exact match with itself. Furthermore, any two different pieces
can be placed next to each other. Thus, these pieces behave like the original
pieces, A, B, AND C.
So, take any sequence satisfying the conditions, and replace each occurence
of A, B and C by ABACABCACBABC, ABACABCACBAC, and ABACBABC, respectively.
This sequence also satisfies the conditions of the problem. QED.
|
203.2 | | SPRITE::OSMAN | | Thu Sep 12 1985 17:29 | 16 |
| I'm not convinced. Using your construction, it seems possible to me for
some duplicate repeating substring to appear which starts in the MIDDLE
of one of your sequences and ends in the MIDDLE of another.
For instance, consider
[----------------] [-------------] [-----------------]
[*************][***************]
The dashed sections represent portions generated by your method. The
starred lines located parts of your string that seem to me could be
a duplicated substring.
So it's not clear you've proved it yet.
/Eric
|
203.3 | | TURTLE::GILBERT | | Fri Sep 13 1985 10:27 | 6 |
| No, I think that that can't happen. However, the following can and does:
[-----------------][--------------]
[******][******]
Thanks for pointing this out. Any ideas?
|
203.4 | | BEING::POSTPISCHIL | | Fri Sep 13 1985 11:04 | 9 |
| Re .3:
Why do you say the situation posed in .2 cannot happen?
The situation given in .3 does occur: ABACABCACBAC ABACABCACBABC contains
ACAB ACAB. (Start with the last A of the first string.)
-- edp
|
203.5 | | TURTLE::GILBERT | | Sat Sep 14 1985 20:13 | 84 |
| Here is an outline of a proof.
Suppose we have strings s , s ,..., s , with each string composed of characters
1 2 n
in the alphabet {A,B,C}. Furthermore, suppose each s can be written as
i
s = t t , where the t are able to uniquely identify the s , in the
i i,1 i,2 i i
following sense. A substring t of s uniquely identifies the s , if, in any
i i
concatenation of the s, (s |s |...|s )*, containing exactly one occurrence
1 2 n
of s , there is exactly one occurence of t.
i
Let us suppose a string z of (s |s |...|s )*, contains a repeated, contiguous
1 2 n
subsequence of characters: z = z y y z , where the length of y is greater than
1 2
the length of any of the s . Now we can show that z must contain a repeated,
i
contiguous subsequence of characters: z = z x x z , where x is a concatenation
3 4
of the s, and is longer than any of the s. Suppose that the `split' between
the ys occurs in s , splitting it as follows:
i
<--t(i,1)--><--t(i,2)-->
<--------------y--------------><--------------y-------------->
Now because the y are equal, we know that the right-most part of y contains
an occurrence of t(i,1), and so we are able to infer the next few characters
following the second y, and can find the desired x:
<--t(i,1)--><--t(i,2)--> ... <--t(i,1)--><--t(i,2)-->
<--------------y--------------><--------------y-------------->
<--------------x--------------><--------------x-------------->
If the `split' occurs in a t(i,1), the demonstration is similar, and we extend
the first y leftward.
Now suppose we can find 3 such s , with uniquely identifying substrings t ,
i i,j
and suppose further that we are able to show that no string of (s |s |...!s )*,
1 2 n
with no s being adjacent to itself, contains a repeated, contiguous substring
i
of length less than the longest s . Then, we will be able to show that there
i
exists an abitrarily long string of {A,B,C}, with no repeated, contiguous
substring, since, given any such string of {A,B,C}, we can produce a longer
one by replacing each A, B, and C with s , s , and s , respectively (this
1 2 3
longer string is shown to contain to repeated contiguous substring, by the
above results, and a proof by contradiction). By the infinity lemma, this
shows that there exists an infinitely long string of {A,B,C} containing no
repeated, contiguous substrings.
The problem, then, is to find such s . Suppose we are presented with strings,
i
and asked to verify that they satisfy the conditions. This can be done in a
finite amount of time by only considering all strings of (s |s |...|s )* with
1 2 n
length less than 4 times the longest s , since this will suffice to find all
i
ways that the s could be combined to form contiguous duplications of t , or
i,j
any other repeated, contiguous substrings shorter than the longest s:
<---s---> ... <---s--->
<---t---><---t--->
Now, to find such s.
|
203.6 | | BEING::POSTPISCHIL | | Mon Sep 16 1985 10:24 | 19 |
| Re .5:
I think "the length of y is greater than the length of any of the s[i]" should
be changed to ". . . greater than or equal to . . .".
To help with finding appropriate strings, I have a program which prints out
non-null strings of (A+B+C)* which contain no repeated substrings. It may
be found in BEING""::[POSTPISCHIL.PUB]NR.EXE. It will print strings in order
of increasing length, but not in lexical order. There is currently no
provision for stopping it if it does not run out of memory and there are
an infinite number of such strings, so either run it on a hardcopy terminal
or "DEFINE/USER SYS$OUTPUT filename" and stop it after awhile.
If anybody wants to work with the source to have it do other things, it is
in NR.C, although there are no comments. If anybody wishes, I will add comments
shortly.
-- edp
|
203.7 | | TURTLE::GILBERT | | Mon Sep 16 1985 11:51 | 286 |
| Here are the first 22528 characters of the 'lexically first' string.
22528:
abacabcacbabcabacabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcacbabc
bacabacbabcabacabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabac
babcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabaca
bcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabc
acbacabacbabcabacbcabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbab
cabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabca
cbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcba
bcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbab
cabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcac
babcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcac
babcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacb
abcbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabc
abacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcac
babcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacb
abcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabc
abacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabca
cbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcab
acbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacb
acabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbab
cabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabc
abacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacb
acabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabc
abacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabc
abacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabca
bacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacb
abcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacba
bcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabc
acbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacb
abcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacba
bcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbab
cacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabca
cbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcba
bcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbab
cabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcac
babcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcac
babcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacb
abcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabc
abacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabca
cbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcab
acbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacb
acabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacab
acbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacbabcacbaca
bacbabcabacabcacbabcabacbabcacbacabcacbabcabacabcacbabcbacabacbabcabacabcacbabca
bacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcab
acabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcaba
cabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbaca
bacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcaba
cabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcaba
cbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabac
abcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbab
cacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcaba
cbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabac
babcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacb
abcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacab
cacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabc
babcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacb
abcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbab
cabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabc
abacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcaba
cabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbab
cabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabc
abacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcaba
cbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacba
cabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacb
cabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbaca
bacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcab
acabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcaba
cabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbaca
bacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcaba
cabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcaba
cbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabac
abcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabc
abacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabca
cbacabacbabcabacbcabcbabcabacabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabc
acbacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcac
bacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabc
abacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbaca
bacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcac
bacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcaba
cbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacba
cabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcbabcabacbabcacb
acabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabac
bcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbac
abacbabcabacabcbabcabacbabcacbacabacbabcabacbcacbabcabacabcacbabcabacbabcacbacab
acbabcabacabcacbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabca
bacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabac
babcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabaca
bcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcab
acabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcab
acbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcaba
cabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacba
bcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacab
acbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcab
cbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabac
babcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabc
abacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabca
cbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcabacabcacb
abcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabc
abacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabca
bacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbc
abacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacb
acabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabca
bacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacab
acbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacb
acabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabac
bcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbac
abacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabca
bacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabac
babcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabaca
bcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabacabc
babcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabca
cbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabcacba
bcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbab
cacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcb
abcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabcac
babcabacabcbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabca
cbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabc
acbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcba
bcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacba
cabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabc
acbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabca
bacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcac
bacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacba
bcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabca
bacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcab
acabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabac
abcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabaca
bcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabca
cbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabac
babcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabaca
bcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabc
acbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabac
babcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcb
abcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacba
bcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabc
acbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcb
abcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacba
bcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabca
cbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabc
acbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcba
bcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacb
cabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbaca
bacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcab
acbabcbacabacbabcabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbac
abacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbc
abcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabaca
bcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabaca
bcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacab
cacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabc
babcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcaba
cabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacba
bcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabc
acbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacab
cbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabac
babcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbab
cacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabca
bacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcab
acbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbac
abacbabcabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabca
bacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcab
acabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabcacbabcabacabcacbabc
bacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcaba
cabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbab
cabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabc
abacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcaba
cbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacba
cabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacb
cabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbaca
bacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcab
acabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcaba
cabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbaca
bacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcaba
cabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcaba
cbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabac
abcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabc
abacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabca
cbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacb
abcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabc
acbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabca
bacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcac
bacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacba
bcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabca
bacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcab
acabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbab
cabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbab
cabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabc
abacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcaba
cbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacba
cabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacb
cabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbaca
bacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacb
abcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacbabcacbacabac
babcabacabcacbabcabacbabcacbacabcacbabcabacabcacbabcbacabacbabcabacabcacbabcabac
babcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcabaca
bcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacab
cbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabac
babcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacab
cacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacba
bcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabc
babcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcac
bacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacba
bcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbab
cabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabc
acbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcac
babcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbab
cabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabc
abacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcab
acabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcaba
cabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacab
cacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcab
acbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcaba
cabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacba
bcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacab
acbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcab
cbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabac
babcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabaca
bcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacab
cbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabac
babcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacab
cacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacba
bcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabc
babcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcaba
cbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcab
acabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcab
acabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcaba
cabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabac
abcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabc
abacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcaba
cbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabac
abcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcaba
cabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbaca
bacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabac
babcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcba
bcabacabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabcacbabcabacabcac
babcbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbab
cabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabca
cbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcac
babcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbab
cabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabc
acbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabca
bacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcac
bacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcba
bcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbab
cabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcac
bacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbab
cabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbab
cabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabc
abacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbacabac
babcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacb
abcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacab
cacbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabcabacabcacbabcabacabcbabcabac
babcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacb
abcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacba
bcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabc
acbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcb
abcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacba
bcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbcabacabca
cbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabca
cbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcac
babcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbab
cabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabc
acbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabca
bacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcac
bacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbaca
bacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacbabcacbac
abacbabcabacabcacbabcabacbabcacbacabcacbabcabacabcacbabcbacabacbabcabacabcacbabc
abacbabcacbacabacbabcabacabcacbabcabacbabcbacabacbabcabacabcacbabcabacbcabcbabca
bacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcab
acabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbac
abacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcab
acabcacbcabacabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcab
acbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcaba
cabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacba
bcacbacabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabacabcacbabcabacabcbabcab
acbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacaba
cbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabac
babcacbacabacbabcabacabcbabcabacbabcacbcabacabcacbabcabacabcacbcabacabcbabcabaca
bcacbabcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacab
cbabcabacbabcacbacabacbabcabacbcabcbabcabacabcacbabcabacabcbabcabacbabcacbacabac
babcabacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabcabacabcacba
bcabacabcbabcabacbabcacbacabacbabcabacabcacbabcabacbabcacbacabacbabcabacbcabcbab
cabacabcacbabcabacabcbabcabacbabcacbacabacbabcab
|
203.8 | | TURTLE::GILBERT | | Mon Sep 16 1985 19:13 | 48 |
| There can be an infinitely long string of As, Bs, and Cs, containing
no duplicate consecutive substrings.
Proof (by construction):
Take any string of As, Bs, and Cs, with no duplicate consecutive substrings,
and make the following simultaneous substitutions (the spaces are added for
readability, and also split the strings into uniquely identifying substrings):
A -> ABACABCACBABCB ACABACBABC
B -> ABACABCACBABCABACBAB CACBC
C -> ABACABCACBABCABACBCA BCBABC
This produces a longer string containing no duplicate consecutive substrings,
by the following theorems.
Theorem 1:
The above construction (which we'll rewrite as: A -> A1 A2, B -> B1 B2,
C -> C1 C2), when applied to any string (x) of As, Bs, and Cs containing no
duplicate consecutive substrings will yield a longer string (y) containing
no duplicate consecutive substrings of length greater than 26.
Proof (by contradiction):
Given a counter-example (y), we can use the technique in 203.5 to find a
counter-example where the repeated substring is a concatenation of the three
strings: A1 A2, B1 B2, C1 C2. Because of the uniquely identifying substrings,
the string (x) must have contained a repeated consecutive substring (consider
the reverse transformation).
Theorem 2:
The above construction, when applied to any string (x) of As, Bs, and Cs
containing no duplicate consecutive substrings will yield a longer string
(y) containing no duplicate consecutive substrings of length less than or
equal to 26.
Proof (by exhaustion):
Checking that the construction process, when applied to any input string of
5 or fewer As, Bs, and Cs containing no repeated consecutive substrings (note
that 5 occurences are used to assure that the length of the constructed string
is at least 25+26+26+25).
- Gilbert
|
203.9 | | SPRITE::OSMAN | | Wed Sep 18 1985 17:02 | 14 |
| *sigh*. Often, I really WANT to understand some gritty proof, but then
some simple thing near the beginning throws me, and I'm lost to all the
rest, and I get frustrated. In .5, the following confuses me:
>following sense. A substring t of s uniquely identifies the s , if, in any
> i i
>concatenation of the s, (s |s |...|s )*, containing exactly one occurrence
> 1 2 n
>of s , there is exactly one occurence of t.
What is that thing with the "*" after it ???
/Eric
|
203.10 | | BEING::POSTPISCHIL | | Wed Sep 18 1985 17:52 | 35 |
| Re .9:
If r and s are sets containing strings,
rs is the set containing every concatenation of a string in r with
a string in s.
r|s or r+s is the union of r and s.
r* (usually written as a superscript and called the Kleene star
[I am uncertain of the spelling]) is the set containing the
empty set and all the strings of r, rr, rrr, and so on.
There is often a mixing of notation, so that a string x represents the
singleton x (the set containing only the string x).
Examples:
Let S = { "abc" }.
Informally, S = abc.
a+bc = { "a", "bc" }.
(a+b)c = { "ac", "bc" }.
*
(a+b) c = { "c", "ac", "bc", "aac", "abc", "bac", "bbc", "aaac",
. . . }.
2
S = { "aa", "ab", "ac", "ba", "bb", "bc", "ca", "cb", "cc" }.
-- edp
|
203.11 | | TURTLE::GILBERT | | Wed Sep 18 1985 21:00 | 29 |
| re .9:
Sorry about the confusion. Basically, if there's a t(i) in the
constructed string, you know there's a s(i) there. I had trouble
finding some way to make this precise, while handling all the
'silly' special cases -- even so, I think that part lacks something.
re .6:
On what device is that?
re .0, .2:
Thanks for an interesting problem!
The lexically first string is fairly interesting...
I know the readers of this file aren't interested in algorithm design,
but is there a fast way of generating the first n characters of this
string (like order(n log n), for large n)?
If you take the first 22526 characters (in 203.7), and split them into
pieces just before an occurence of "abacabcacb", you'll get 498 pieces,
and except for one that's 100 characters long, they range from 13 to 79
characters. Of these 498 pieces, there are *ONLY* 21 different kinds!
Consider this: If you split the string into regular 4-character chunks
(i.e., small), you'd find more than just 21 kinds of pieces!
Might the entire lexically first string simply be a concatenation of these
21 different pieces? It's hard to believe this possible.
The patterns (easier to recognize when broken into chunks) are very regular,
yet not too regular, else there'd be a repeated substring. Might there be
some way to turn this into a nice display?
|
203.12 | | BEING::POSTPISCHIL | | Thu Sep 19 1985 09:11 | 8 |
| Re .11:
I don't know how I forgot the device, especially since I remember checking
to make sure I had the right one. The full specification is
BEING""::USER1:[POSTPISCHIL.PUB]NR.EXE or .C.
-- edp
|
203.13 | | SPRITE::OSMAN | | Thu Sep 19 1985 19:25 | 207 |
| Well, I'm still working on understanding that proof.
In the mean time, consider a totally different algorithm of generating strings
without duplicate repeating substrings.
Namely, start with an initial "good" string; the null string is as good as
any.
Then, attempt to append an "A". If this is invalid, append a "B" instead,
unless that's invalid too, in which case append "C". If "C" is invalid,
shrink the string by one character, and assume the last letter of the
shrunken string is invalid and increment it etc.
Hence we proceed like this:
A
AA no AB
ABA
ABAA no ABAB no ABAC
ABACA
ABACAA no ABACAB
ABACABA
ABACABAA no ABACABAB no ABACABAC no Here's the first place we back up
ABACABB no ABACABC
etc.
Now, consider a program thinking all of the above to itself. The program
desires to print a continuous stream of letters such that treated as one
string, it contains no duplicate repeating substrings.
The first thought might be to perform the above algorithm, and every time
the total string is longer than what's been printed already, print some
more letters.
The problem, though, is that the program would end up needing to back up
and change letters that have already been printed !
So, to avoid this problem, introduce a SAFETY_MARGIN, which is the number
of letters beyond what's been printed that the string must grow to, before
the program dares "commit" to printing more letters.
An interesting question is, what SAFETY_MARGIN is sufficient ?
My experimental evidence is that once "things get going", 6 seems
sufficient. Can someone prove this ? Another way of putting it is that
experimentally, we never need to back up more than 5 positions.
I've included a C program in this note that implements the algorithm.
The program has a safety_margin of 20 built in to it. However, once
things "get going", you can decrease the margin to 6 like this:
^Y
$ DEBUG
DBG> DEPOSIT safety_margin=6
DBG> GO
What the program does is print a continuously growing "valid" string.
However, the program interrupts the printout by announcing new "close calls"
which are measured by a new minimum number of characters away from failure
due to needing to modify an already-printed character.
Every time the program prints more characters, it resets its tally of
close calls. Hence you see more interruptions than you might want.
Of course, you can comment out the close call announcement if you'd
like a real continuous string printed out.
Another interesting question is, what would be a less expensive algorithm
for printing a continuous valid string ? Gilbert's algorithm doesn't
allow printing a continuous string at all, since we must keep going back
and replace already-printed letters by other letters.
My algorithm is very expensive, since more and more STRNCMP's (string
comparison's) are necessary as the string gets longer. Is there an
algorithm that perhaps doesn't require string compare's at all ? Perhaps
something like "xor all the letters so far and . . .".
As a subquestion for my algorithm, must ALL substrings actually be
checked ? For instance, once we "get going", would adding a new letter
or incrementing the last ever produce an ALMOST valid string, only invalidated
by being of the form YY, where Y is half the entire string ? (If not, then
we need never compare Y with Y and we can ask the next question about Y-1
Y-1 etc.)
Here's the program:
/* Define size of largest string we can handle.
*/
#define max_i 50000
/* String to contain the characters. */
char the_string[max_i];
/* Define margin. We print letters to within this close to the end of our
/* string. We fear not get closer, lest we end up needing to change a letter
/* we've already printed (a danger we detect during the program).
*/
safety_margin = 20;
/* Define close_call, which indicates how close we've ever gotten to having
/* to give up due to having already printed a letter we need to change.
*/
close_call = 100;
/* Define suspence value, which is number of letters beyond what we've
/* already printed, that we compute before even considering printing more.
*/
suspence = 0;
/* Define which index we've printed up to.
*/
largest_i_printed = -1;
main () {
int i;
/* Initialize string to all one less than first letter. This allows main loop
/* to always think it is "modifying" a letter.
*/
for (i = 0; i <= max_i; (the_string[i] = 'A' - 1, i++));
/* Start loop which establishes i, which indicates which letter is the next
/* one to modify. 0 means the first letter.
*/
for (i = 0; i <= max_i || i < 0; i++)
{
int j;
if (the_string[i] == 'C')
{
/* We've already tried A, B, and C in current position, so we back up such
/* that a new letter is tried in previous position.
/* We make sure we're not backing up into a letter that's already been printed.
/* If so, we quit.
*/
the_string[i] = 'A' - 1;
i = i - 2;
if (i - 1 <= largest_i_printed)
{
printf (
"\nWhoops. In order to proceed, I'd need to change an already-printed letter!"
);
return;
}
/* Announce if we're closer than we've ever been to having to give up.
*/
if (i - 1 - largest_i_printed < close_call)
{
close_call = i - 1 - largest_i_printed;
printf ("\n New closest call = %d\n", close_call);
}
continue;
} /* end of check for all letters tried in this position */
/* Whatever letter we modify, we make it one larger than it was (which sets
/* it to 'A' if it was "never" modified)
*/
the_string[i]++;
/* Start loop to make sure there are no duplicate repeating substrings.
*/
for (j = 1; j <= (i+1)/2; j++)
{
if (strncmp (&the_string[i-2*j+1], &the_string[i-j+1], j) == 0)
{
/* There is a duplicate repeating substring, so tell main loop to increment
/* current character instead of moving on to next position.
*/
i--;
break;
};
}
/* We'll now print some more letters if we've computed sufficiently ahead
/* so as to believe we won't have to back up into stuff already printed,
/* AND if we've computed sufficiently more than has already been printed.
*/
if (i - safety_margin > largest_i_printed &&
i - suspence > largest_i_printed)
{
/* Mark where we're starting to print from.
*/
int start_here = largest_i_printed + 1;
/* Compute what will be the new largest index printed.
*/
largest_i_printed = i - safety_margin;
{
/* Print the chunk. We do this by planting a null character and using
/* the function that prints through next null. Then we restore the character
/* that we temporarily nulled. (Too bad a "precision" value in PRINTF can't
/* be a variable, or at least a symbolic constant)
*/
int save_beyond = the_string[largest_i_printed + 1];
the_string[largest_i_printed + 1] = 0;
printf ("%s", &the_string[start_here]);
close_call = safety_margin;
the_string[largest_i_printed + 1] = save_beyond;
} /* end of block for variable save_beyond */
} /* end of print block */
} /* end of main loop */
/* We get out of main loop for one of TWO reasons. Either we backed all
/* the way back to the beginning, which means we've proven that there is
/* a largest string not containing any duplicated substring, OR we filled
/* our entire array.
*/
if (i <= 0)
printf (
"\nThe longest string containing no duplicate substrings was found !");
else
printf (
"\nNo conclusion reached. You may want to increase 'max_i' and try again.");
} /* end of procedure main */
|
203.14 | | ALIEN::POSTPISCHIL | | Fri Sep 20 1985 09:24 | 8 |
| Re .8:
I have a nit: There are no "infinitely long strings" satisfying the stated
requirements, as .8 claims. However, as the proof shows, there length of
such strings is unlimited, although each string is finite.
-- edp
|
203.15 | | TURTLE::GILBERT | | Fri Sep 20 1985 14:14 | 49 |
| re 14:
Apply the infinity lemma (I think it applies, and Knuth has a proof).
re 13:
Try applying the substitution, and supposing that a repeated substring
is generated. This should illustrate the gist of the proof. Or, if
you're around ZK2 sometime, stop by, and I'll try to go explain it.
re 13:
You're generating the 'lexically first' string (good).
Up through a length of 11,000, there's only one place where
you need to back up 5 positions (near length 81). There are
only 23 places where you need to back up 3 or more places
(one 5, 13 fours, and 9 threes). The string is *very* regular
(pretty much).
Another question is how far back you look to discover a duplicated
substring. Usually, not much, but when you do.... I displayed a
line whenever the algorithm needed to search back more than 63 places
to discover a duplicated substring, and there were only 54 of these
(again, in the first 11,000 characters). The amount it needed to
search back (when more than 63) ranged from 92 (at 2174) to 2532
(at 5928). Many of these 'back-search' amounts occurred frequently;
92 - 6 times
100 - 1 time (at 437)
110 - 7 times
124 - 19 times (wow, is this the same substring each time?)
261 - 7 times
274 - 9 times
1225 - 3 times (at 3351, 5883, and 9183)
2532 - 1 time (at 5928)
If we could prove that these are the only 'search back' amounts
(this is doubtful), or even restrict 'large' amounts to a fairly
small set, then the algorithm for producing the 'lexically first'
string could be enormously improved, probably even to linear time!
Might it may happen that later we need to back up *thousands* of
characters!? I don't think so; consider the string "CBCACBACABC",
which is the start of the 'lexically last' string, gotten by taking
the start of the lexically first string, and changing A <-> C.
It's unlikely to occur in the 'lexically first' string (and, in fact,
it doesn't occur in the first 22,000 characters). Also, it doesn't
occur in the string generated by 203.8. So we can take what's been
generated, append "CBCACBACABC" and the infinite string generated
by 203.8 to produce an infinitely long string. Certainly, the true
lexically first string must be less than this, and must be greater
than what we've generated so far. Now if we could only *guarantee*
that "CBCACBACABC" never occurs in the lexically first string....
|
203.16 | | BEING::POSTPISCHIL | | Fri Sep 20 1985 15:33 | 6 |
| Re .15:
Can you supply a reference for Knuth's proof of the infinity lemma?
-- edp
|
203.17 | | METOO::YARBROUGH | | Fri Sep 20 1985 15:48 | 3 |
| The Art of Computer Programming, Vol. 1, p. 381 et seq.
Lynn Yarbrough
|
203.18 | | LATOUR::AMARTIN | | Sat Sep 21 1985 12:24 | 50 |
| Re .10:
r* (usually written as a superscript and called the Kleene star
[I am uncertain of the spelling]) is the set containing the
empty set and all the strings of r, rr, rrr, and so on.
Uh, actually, it is:
r* (usually written as a superscript and called the Kleene star
[I am uncertain of the spelling]) is the set containing the
empty string and all the strings of r, rr, rrr, and so on.
^^^^^^
Re .13:
I think you'll find that the algorithmic paradigm you have used is called
backtracking, just in case you want to look it up, or you want to
know when a paper that uses the term might be worth reading. Someone
(Wirth?) wrote an excellent walk-through of the design of a structured,
efficient Pascal program to solve the 8-queens problem that uses
backtracking. I'll try and find the reference and post it here.
Surprise! Your wishes about printf have been answered. You may want to
change this code:
{
/* Print the chunk. We do this by planting a null character and using
/* the function that prints through next null. Then we restore the character
/* that we temporarily nulled. (Too bad a "precision" value in PRINTF can't
/* be a variable, or at least a symbolic constant)
*/
int save_beyond = the_string[largest_i_printed + 1];
the_string[largest_i_printed + 1] = 0;
printf ("%s", &the_string[start_here]);
close_call = safety_margin;
the_string[largest_i_printed + 1] = save_beyond;
} /* end of block for variable save_beyond */
to this:
{
/* Print the chunk. (Luckily a "precision" value in PRINTF can
/* be a variable, or any runtime expression)
*/
printf ("%.*s", largest_i_printed + 1, &the_string[start_here]);
close_call = safety_margin;
} /* end of block that prints the chunk */
I just added one of these in my project's code.
/AHM
|
203.19 | | BEING::POSTPISCHIL | | Sat Sep 21 1985 18:53 | 8 |
| Re .18:
Oops, you're right, that should have been the empty set.
Is the ".*" standard in any way?
-- edp
|
203.20 | | LATOUR::AMARTIN | | Sun Sep 22 1985 23:59 | 6 |
| I don't have K&R around, but I seem to remember that it is mentioned
in Harbison and Steele.
Note that you can't use * for variable formats in scanf because it
means "don't assign the value to a variable". Sigh.
/AHM
|
203.21 | | SPRITE::OSMAN | | Mon Sep 23 1985 12:27 | 21 |
| Although the stream algorithm required checking a substring of length
2532 in order to invalidate a candidate at L=5928, this still isn't
quite the entire string.
It might be interesting to modify the program to announce what
larger percents were required, rather than raw lengths. Then the
question becomes what percentage of the string must we check to
guarantee validity.
Does anyone have a proof that backing up more than 5 positions is
never required ? It must have something to do with the fact that 5
is one less than twice 3, and 3 is the number of symbols we are dealing
with.
Mr. Gilbert, you mentioned that the stream is very "regular". Can
you be more specific ? What patterns do you notice ? Perhaps if
we can identify a pattern, we can ask (and answer) the question
"prove that the infinite string generated by the following pattern
is a valid string"
/Eric
|
203.22 | | R2ME2::GILBERT | | Mon Sep 23 1985 15:26 | 9 |
| re 203.13:
"Must all substrings actually be checked?" No. In generating the lexically
first substring (which you'll recall starts "abacabcacbabcabacabcacbabacbabc"),
we need never check for a string of the form YY (once we've gotten more than 25
characters). To see this, note that the beginning of the string can never
reappear -- certainly, the immediately preceeding character can't be an "a",
nor can it follow a "b" (since then we'd have "baba"). And it can't follow
a "c", since that would cause the string "cabacabcacbab" to be duplicated.
|
203.23 | | SPRITE::OSMAN | | Mon Sep 23 1985 18:32 | 24 |
| In the published first 22528 characters of the lexically first string, the
numbers of a's, b's, and c's are:
a - 8373
b - 7529
c - 6626
The fact that there are more a's is what I'd expect. This is because
the algorithm always tries "a" first, and only replaces the position with
"b" if it gets into trouble. Hence anywhere that "a" OR "b" would work,
the "a" is published.
Limit question:
As we produce more and more characters of this lexically
first string, is there a convergent limit as to the ratio of
numbers of a's to b's to c's ? What is that limit ?
Counting question:
How many valid strings are there of length n ?
/Eric
|
203.24 | | ALIEN::POSTPISCHIL | | Mon Sep 23 1985 19:21 | 36 |
| Re .23:
Here are the numbers of strings by length:
Length Number
0 1
1 3
2 6
3 12
4 18
5 30
6 42
7 60
8 78
9 108
10 144
11 204
12 264
13 342
14 456
15 618
16 798
17 1044
18 1392
19 1830
20 2388
21 3180
22 4146
23 5418
24 7032
25 9198
26 11892
27 15486
-- edp
|
203.25 | | ALIEN::POSTPISCHIL | | Tue Sep 24 1985 09:40 | 8 |
| .22 is incomplete because we have not yet proven it will not be necessary
to back up as far as the twenty-fifth character, although it seems unlikely.
For that matter, we cannot yet be certain that Gilbert's string actually
is the lexically-first string.
-- edp
|
203.26 | | ALIEN::POSTPISCHIL | | Tue Sep 24 1985 09:43 | 6 |
| Re .19:
I can't even get my corrections right; that should have been "empty string".
-- edp
|
203.27 | | SPRITE::OSMAN | | Tue Sep 24 1985 10:05 | 12 |
| Regarding the comment about "We can't be sure Gilbert's is really the
lexically first string", if it's the same as generated by my program,
then it is the first string, by definition, since the program tries to
lay down "AAAA..." and when it can't, it lays down the closest thing
to it.
Regarding counting strings for each length n, we might as well look at
f(n)/6, since f(3...) are necessarily all divisible by 6. This is
because every string is related to five others by changing A,B,C's to
A,C,B or B,A,C or C,B,A or B,C,A. There will be duplicates, though.
/Eric
|
203.28 | | BEING::POSTPISCHIL | | Tue Sep 24 1985 10:16 | 9 |
| Re .27:
The reason Gilbert's string might not be the first string is that we might
need to back up over part of what he has already said is part of the string.
In fact, if there is no infinitely-long string, there will be no lexically-
first string.
-- edp
|
203.29 | | R2ME2::GILBERT | | Tue Sep 24 1985 14:22 | 22 |
| To recap the important notes in this discussion....
203.0 Poses the problem; Is there an infinite string?
.1-.4 A rough proof that is flawed. Noticing the flaw.
.5 Outline of a proof.
.6,.12 EDP's program to find strings.
.7 First 22528 characters of lexically first string.
.8 Proof that arbitrarily long strings exist.
.9-.10 Kleene notation.
.11 Comments on lexically first string.
.13 Osman's program for lexically first string; how far might it back up?
.14 Arbitrarily long is not infinite.
.15 Comments on backing up.
.16-.17 Reference for the infinity lemma.
.18-.20 Notation, terminology, and printf.
.21 How far might we back up? How far back to search?
.22 Needn't search back all the way (also see last paragraph in .15).
.23-.24 Character distribution, and counting question.
.25-... Here I get confused.
There are several interesting, open questions. Would someone care to gather
these in another note?
|
203.30 | | R2ME2::GILBERT | | Tue Sep 24 1985 15:37 | 54 |
| re .14, .28 (and .16-.17):
Does the infinity lemmma apply? I'll continue to assume it does,
or can be made to apply, or that an infinitely long string exists
for other reasons.
Consider this. The substitution algorithm can be applied again
and again to yield a countably infinite number of strings (that
is, in a one-to-one correspondence with the counting numbers).
Suppose an infinitely long string (containing no consecutive
duplicated substrings) doesn't exist; then some such string must
be a longest -- let the length of a longest such string be M.
This implies that there are fewer than 3**M different such strings
(actually, a bound of 2**M could be used). This is a contradiction,
since we know there's an infinite number of such strings.
re .25, .22, .15:
The last paragraph in .15 can be used to show that we need never
back up as far as the 25th character, since it shows a construction
that gives an infinitely long string that starts off the same as the
lexically first string (for 22,000 characters).
-> Note that the algorithm in .13 approaches the lexically first string
from below. Any particular infinitely long string places an upper
bound on the lexically first string. If the lower bound and the
upper bound start off the same way, then so must the true lexically
first string.
Note .22 has a typo -- the lexically first string starts
"abacabcacbabcabacabcacbacabacba". To correct the argument on why
this string cannot reappear later in the string.... If it were to
follow "ac", then "acab" would be repeated. Were it to follow "bc",
then "bcabacabcacba" would be repeated.
Since the first 24 characters of the lexically first string are
never repeated, after we've gone far enough (past 50 characters),
we need never worry about searching all the way to the beginning
to discover a string of the form YY. Of course, this knowledge
offers practically nothing for improving the performance of an
algorithm to generate the lexically first string.
re .11:
The '21 different pieces' mentioned there. I've changed my mind --
it looks like it should be possible to prove that the lexically first
string *is* simply a concatenation of these 21 pieces! Stay tuned.
re "Gilbert's string":
I'm unsure what this means. The characters in .7 were generated by an
algorithm like that in .13. A string generated by the substitutions
in .8 does not produce the lexically first string. Notes .7 and .8
are largely unrelated.
|
203.31 | | R2ME2::GILBERT | | Sat Sep 28 1985 11:04 | 87 |
| Do you want to see the first 84992 characters of the first string? Let:
d = abacabcacbabcabacabcacbacabacbabc
e = abacabcacbabcabacabcacbcabacabcbabc
f = abacabcacbabcabacabcbabcabacbabcacbacabacbabc
g = abacabcacbabcabacabcbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbc
h = abacabcacbabcabacbabcacbacabacbabc
i = abacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabc
j = abacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcabcbabcabacbabcacbacabacbabc
k = abacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcabacbcacbabc
l = abacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbacabacbabcacbc
m = abacabcacbabcabacbabcacbacabacbabcabacabcbabcabacbabcacbc
n = abacabcacbabcabacbabcacbacabacbabcabacabcbacabacbabcabacbcabcbabc
o = abacabcacbabcabacbabcacbacabacbabcabacbcabcbabc
p = abacabcacbabcabacbabcacbacabcacbabc
q = abacabcacbabcabacbabcacbacabcbabc
r = abacabcacbabcabacbabcacbc
s = abacabcacbabcabacbabcbacabacbabc
t = abacabcacbabcabacbcabcbabc
u = abacabcacbabcbacabacbabc
v = abacabcacbabcbacabacbabcabacabcacbacabacbabcabacabcbabcabacbabcacbacabacbabc
w = abacabcacbabcabacabcacbcabcbabc
Then we can write the string as follows (numbers represent character offsets
in the string):
84992:
0 dhvifjlefiflefiuhstfifmefiflefifmefifnfofstfifmefiflefifmefifnfohpuhstfi
3642 fmefiflefifmefifnfofstfifmefiflefifmefifnfohqfiflefiflfifkhrefiflefifmef
7366 ifnfofstfifmefiflefifmefifnfofstfifogefiflefifmefifnfofstfifmefiflefifme
11054 fifnfofstfigefiflefifmefifnfofstfihpuhstfifmefiflefifmefifnfofstfifmefif
14638 lefifmefifnfohpuhstfifmefiflefifmefifnfofstfifmefiflefifmefifngefiflefif
18316 mefifnhpuhstfifmefiflefifmefifnfofstfifmefiflefifmefifnfohpuhstfifmefifl
21859 efifmefifnfofstfifmefiflefifmefifnfstfifmefiflefifmefifogefiflefifmefifn
25601 fofstfifmefiflefifmefifnfofstfigefiflefifmefifnfofstfihpuhstfifmefiflefi
29180 fmefifnfofstfifmefiflefifmefifnfohpuhstfifmefiflefifmefifnfofstfifmefifl
32778 efifmefifngefiflefifmefifnhpuhstfifmefiflefifmefifnfofstfifmefiflefifmef
36431 ifnfohpuhstfifmefiflefifmefifnfofstfifmefiflefifmefifnfstfifmefiflefifm
40017 wfiflefiflwfi
40715 flfifkhrefiflefifmefifnfofstfifmefiflefifmefifnfofstfifogefiflefifmefifn
44448 fofstfifmefiflefifmefifnfofstfigefiflefifmefifnfofstfihpuhstfifmefiflefi
48027 fmefifnfofstfifmefiflefifmefifnfohpuhstfifmefiflefifmefifnfofstfifmefifl
51625 efifmefifngefiflefifmefifnhpuhstfifmefiflefifmefifnfofstfifmefiflefifmef
55278 ifnfohpuhstfifmefiflefifmefifnfofstfifmefiflefifmefifnfstfifmefiflefifme
58898 fifogefiflefifmefifnfofstfifmefiflefifmefifnfofstfigefiflefifmefifnfofst
62597 fihpuhstfifmefiflefifmefifnfofstfifmefiflefifmefifnfohpuhstfifmefiflefif
66142 mefifnfofstfifmefiflefifmefifngefiflefifmefifnhpuhstfifmefiflefifmefifnf
69825 ofstfifmefiflefifmefifnfohpuhstfifmefiflefifmefifnfofstfifmefiflefifmefi
73427 fnfstfifmefiflefifmwfiflefiflfifkhrefiflef
75605 ifmefifnfofstfifmefiflefifmefifnfofstfifogefiflefifmefifnfofstfifmefifle
79293 fifmefifnfofstfigefiflefifmefifnfofstfihpuhstfifmefiflefifmefifnfofstfif
82877 mefiflefifmefifnfohpuhstfifmefiflefifmefifabacabcacbab
But if we also use the following substitutions (here, I've used upper case):
A = dhvifjle
B = fiflefiflfifkhre
C = fiflefiflwfiflfifkhre
D = fiflefifmefifabacabcacbab
E = fiflefifmefifnfofstfifme
F = fiflefifmefifnfofstfifoge
G = fiflefifmefifnfofstfige
H = fiflefifmefifnfofstfihpuhstfifme
I = fiflefifmefifnfohpuhstfifme
J = fiflefifmefifnfohq
K = fiflefifmefifnfstfifme
L = fiflefifmefifnge
M = fiflefifmefifnhpuhstfifme
N = fiflefifmefifoge
O = fiflefifmw
P = fiflefiuhstfifme
Then we can write the first 84992 characters as:
APEIEJBEFEGHEIELMEIEKNEGHEIELMEIEKOCEFEGHEIELMEIEKNEGHEIELMEIEKOBEFEGHEID
During the course of generating this, several very long 'search back amounts'
were found, at the following character positions:
search: 15538 40042
search: 15538 74427
search: 34385 74967
No 'backup amounts' greater than 4 were found, besides the backup of 5 at 81.
- Gilbert
|
203.32 | | LATOUR::AMARTIN | | Sun Sep 29 1985 17:14 | 7 |
| Re .19:
While "%.*s" is not mentioned in K&R, it is mentioned, with no portibility
qualification, in Harbison & Steele, and is supported by both Vax-11 C and
Ultrix. It is also mentioned in Bourne's "The UNIX System" and the
30-Apr-85 draft ANSI C standard.
/AHM
|
203.33 | | SPRITE::OSMAN | | Mon Sep 30 1985 17:53 | 29 |
| I'm still working on trying to prove that the straight-forward infinite
string generation program will never have to back up more than 5 places,
or that indeed it eventually will have to.
I don't have a proof yet. However, I'm trying a contradiction approach.
I've been thinking, suppose our string so far ends with:
...CBCACBC
This has the property that if we need to back up, we'll have to back
up at least seven places (because the last "C" means we've already tried
"A" and "B", the last "B" can't be changed to a "C", the second to last
"C" means "A" and "B" tried already, the last "A" can't be changed to "B"
or "C" etc.).
So, if we can show that we'll encounter ...CBCACBC at some point AND
have to back up, then we've shown that yes, sometimes you have to back
up more than 5 places, so we're done with the proof.
On the other hand, if we can show that we'll never encounter
...CBCACBC at all, or if we can show that if we do encounter ...CBCACBC,
we'll never have need to back over it, then we might be closer to proving
that you never need to back up seven places.
For instance, if we can show this, then if we iterate on all possible
seven-letter endings, we'll have a proof by exhaustion.
/Eric
|
203.34 | | R2ME2::GILBERT | | Mon Sep 30 1985 20:03 | 12 |
| Well, "...CBCACBC" cannot occur, since the 'lead-in' must be either an "A"
or a "B", and either of these would prevent the algorithm from looking that
far (consider "..ACBCACBC" or "..BCBCACBC"). However, this is a nit, depending
on where you start 'backing up'.
Note that the substring "ACBCACB" *does* occur in the string. If there were
something that prevented an "A" from following this, we'd backup 7 or more.
I'd expected that the substring "CBCACB" wouldn't occur (it's so 'large') --
but it does. Now I wouldn't be surprised if "CBCACBAC" occurred somewhere
(BTW -- it doesn't occur in the first 131000 characters).
Does anyone have a faster way of generating the first string?
|
203.35 | | SPRITE::OSMAN | | Mon Sep 30 1985 20:50 | 14 |
| Formally, Peter's printout of the first 84992 characters might not be correct.
Until we prove that we'll never need to back up more than 5, or some other
small number, there's still the slight fear that some future backup will
be required that crosses back into what Peter's already printed.
The conjecture that we'll never have to back up more than 4 is probably
harder to prove than the one stating we'll never have to back up more
than 5.
That's because the one about 4 will have to include the singularity at
index=81, where we needed to back up 5. Fascinating !
/Eric
|
203.36 | | GOLLY::GILBERT | | Tue Oct 01 1985 15:27 | 2 |
| Ah, but it's fairly simple to show that the printout of the first 84992
characters is valid. The last paragraph of 203.15 shows how to do this.
|
203.37 | | SPRITE::OSMAN | | Wed Oct 02 1985 16:32 | 10 |
| That paragraph referred to ends with:
> than what we've generated so far. Now if we could only *guarantee*
> that "CBCACBACABC" never occurs in the lexically first string....
Since we can't yet *guarantee* blah blah, I still claim we *might* have
to back up into what you've already printed. (although I'm still working
on proving we'll never have to back up more than 5 characters)
/Eric
|
203.38 | | SPRITE::OSMAN | | Thu Oct 03 1985 16:53 | 20 |
| Define a "valid substring" as one that is not an "invalid substring".
"invalid substrings" are ones that either contain a duplicate repeating
substring OR ones that can't possibly occur as part of a longer string
due to not being able to be preceded or followed by anything. (i.e. "abacaba"
is invalid because neither a,b, or c may follow it).
So, listing all valid substrings by order of size, we have
a, b, c,
ab, ac, ba, bc, ca, cb,
aba, abc, aca, acb, bab, bac, bca, bcb, cab, cac, cba, cbc,
abac, abca, abcb, acab, acab, acba, acbc, babc, . . .
Considering valid substrings in this alphabetical ordering, what's the
first one that *doesn't* appear in the lexically-first string ?
What's the first one that doesn't appear in Gilbert's printout of
22000-or-so characters, for that matter ?
/Eric
|
203.39 | | GOLLY::GILBERT | | Fri Oct 04 1985 15:05 | 16 |
| re 203.38:
A fairly easy way to check for the occurence of these strings is
to consult the 'macroes' in 203.81.
new question:
Can somebody prove that the string "abaca" occurs an infinite number
of times in any infinitely long string in NR? Put another way, can
you show that there's some maximum distance that can separate two
occurrences of "abaca"? If that's too hard, how about "ab"? Or
how about finding some string whose occurences can be arbitrarily
far apart?
news:
Another large backup (of 5) was found near offset 190286. Also,
I'd like an independent verification of the lexically first string
-- have we any mathematicians/programmers up for this difficult task?
|
203.40 | | R2ME2::GILBERT | | Mon Oct 07 1985 13:54 | 42 |
| Noticing that large strings tend to repeat in the lexically first string,
perhaps there's some way to extend the string by large amounts, instead
of effectively duplicating previous work.
Consider the string that's been generated so far (less 10 characters, in
case we back up that much -- this we'll assume, pending Eric's proof). Now
find the shortest suffix string that doesn't occur elsewhere in the string.
Let this smallest suffix string be tUV, where t is either a, b, or c, and U
and V are strings, and the string tU is the shortest prefix of tUV that
doesn't occur previously in the string.
Thus, the string so far must look like: "...sUV...tUV", where s and t differ.
note that the string UV must occur previously, else tUV would would not be
the shortest unique suffix string.
Now, the next 'several' characters should depend only on the string UV, and
because the string UV has occurred before, we know what these characters
should be. How much is 'several'?
When searching back for a repeated string, we need look at t or earlier only
if we are about to produce a duplicated string of one of the forms:
V
...[...tUV...][...tUV...]
...[�V...t�][�V...t�]
...[...tUV...][...tUV...]
^
The arrows indicate where we 'are', and U has been broken into two parts: ��.
Note that the first cannot occur, since we're assured there are no occurences
of tUV to the left of where we are. Thus, we need only worry about the last
two forms. We can get an idea of how far we may proceed by searching back
for a previous occurence of V (the second form), or checking when we produce
another V (the third form).
The same considerations applied to the first occurence of the string sUV, which
must also have been a unique suffix string. We can find how far sUV proceeded
before it almost caused a duplication (again, look for a V, less 10 characters),
and proceed the same way after tUV (after checking that this would cause neither
the second or third forms above). Thus, we should be able to duplicate a large
chunk of previous effort, and a few more characters (generated the simple way)
should yield a shorter unique suffix string, whence we can repeat this extension
technique.
- Gilbert
|
203.41 | | SPRITE::OSMAN | | Mon Oct 07 1985 16:56 | 34 |
| I don't really think the macros in 203.31 help that much in deciding if
a given valid substring appears in s(84992) or not.
Sure, you can see if the substring is *completely* contained in a macro,
but checking for all possible double-macro overlaps is tedious.
I have started a C program that has s(22800) compiled into it as a string.
The program will use a procedure called BUMP, which when called with a
valid substring, will produce the next possibly-valid substring.
(I say "possibly-valid" because BUMP doesn't really check if A,B or C
can precede or follow the string)
The program will then make sure that substring is contained in s(22800),
and if not, the program will print it out. I'll go up through substrings
of eight characters, I suppose.
I'm having a bit of trouble coding BUMP. Perhaps some hotshot C programmer
can do it ?
To review, here are some sample inputs and what BUMP should produce as
output:
input BUMP should output
----- ------------------
a b
b c
c ab
cbc abac
abac abca
Can anyone write BUMP ?
/Eric
|
203.42 | | R2ME2::GILBERT | | Mon Oct 07 1985 20:32 | 143 |
| Actually, I think the macroes in 203.31 *are* pretty useful for answering
such questions. The macroes 'start off' with the same 10 characters, so
to determine whether "cbcabacbcab" appears, lop off the "ab" (which might
begin another macro), and search for "cbcabacbc" in the macroes. The reason
this is fairly useful is that it sure beats search though a 85K string.
Of course there are even better ways. Oh well.
The following 133 16-character strings are effectively what you've requested
in 203.41, except for the six simple substitutions of the characters.
abacabcacbabcaba
abacabcacbabcbac
abacabcacbacabac
abacabcacbacabca
abacabcacbacabcb
abacabcacbcabaca
abacabcacbcabacb
abacabcacbcabcba
abacabcbabcabaca
abacabcbabcabacb
abacabcbabcacbab
abacabcbabcacbac
abacabcbabcacbca
abacabcbacabacba
abacabcbacabacbc
abacabcbacbcabac
abacabcbacbcabcb
abacabcbacbcacba
abacbabcabacabca
abacbabcabacabcb
abacbabcabacbcab
abacbabcabacbcac
abacbabcacbacaba
abacbabcacbacabc
abacbabcacbcabac
abacbabcacbcabcb
abacbabcbacabacb
abacbabcbacabcac
abacbabcbacabcba
abacbabcbacbcaba
abacbabcbacbcabc
abacbabcbacbcacb
abacbcabacabcacb
abacbcabacabcbab
abacbcabacabcbac
abacbcabacbabcab
abacbcabacbabcac
abacbcabacbabcba
abacbcabcbabcaba
abacbcabcbabcacb
abacbcabcbacabac
abacbcabcbacabca
abacbcabcbacabcb
abacbcabcbacbcab
abacbcabcbacbcac
abacbcacbabcabac
abacbcacbabcacbc
abacbcacbabcbaca
abacbcacbabcbacb
abacbcacbacabacb
abacbcacbacabcac
abacbcacbacabcba
abcabacabcacbabc
abcabacabcacbaca
abcabacabcacbcab
abcabacabcbabcab
abcabacabcbabcac
abcabacabcbacaba
abcabacabcbacbca
abcabacbabcabaca
abcabacbabcacbab
abcabacbabcacbac
abcabacbabcacbca
abcabacbabcbacab
abcabacbabcbacbc
abcabacbcabcbabc
abcabacbcabcbaca
abcabacbcabcbacb
abcabacbcacbabca
abcabacbcacbabcb
abcabacbcacbacab
abcacbabcabacabc
abcacbabcabacbab
abcacbabcabacbca
abcacbabcbacabac
abcacbabcbacabca
abcacbabcbacabcb
abcacbabcbacbcab
abcacbabcbacbcac
abcacbacabacbabc
abcacbacabacbcab
abcacbacabacbcac
abcacbacabcacbab
abcacbacabcacbca
abcacbacabcbabca
abcacbacabcbacbc
abcacbcabacabcac
abcacbcabacabcba
abcacbcabacbabca
abcacbcabacbabcb
abcacbcabacbcacb
abcacbcabcbabcab
abcacbcabcbabcac
abcacbcabcbacaba
abcacbcabcbacabc
abcacbcabcbacbca
abcbabcabacabcac
abcbabcabacabcba
abcbabcabacbabca
abcbabcabacbabcb
abcbabcabacbcaba
abcbabcabacbcabc
abcbabcabacbcacb
abcbabcacbabcbac
abcbabcacbacabac
abcbabcacbacabca
abcbabcacbacabcb
abcbabcacbcabaca
abcbabcacbcabacb
abcbabcacbcabcba
abcbacabacbabcab
abcbacabacbabcac
abcbacabacbabcba
abcbacabacbcabac
abcbacabacbcabcb
abcbacabacbcacba
abcbacabcacbabca
abcbacabcacbabcb
abcbacabcacbacab
abcbacabcacbcaba
abcbacabcacbcabc
abcbacabcbabcaba
abcbacabcbabcacb
abcbacbcabacabca
abcbacbcabacabcb
abcbacbcabacbabc
abcbacbcabcbabca
abcbacbcabcbacab
abcbacbcacbabcab
abcbacbcacbabcac
abcbacbcacbabcba
abcbacbcacbacaba
abcbacbcacbacabc
|
203.43 | | BEING::POSTPISCHIL | | Tue Oct 08 1985 09:45 | 13 |
| I've been trying to figure out how to generate the lexically first string
(or any string for that matter), and I have a question about how Gilbert
is doing it.
I have thought of a few schemes to permit adding on large chunks, but I can't
seem to make anything workable. The best I have thought of so far is to
select a character and test to see if it causes a repeated substring. If
the string has a length l, this test requires something less than l/2
comparisons. Gilbert, is character-by-character essentially how you have
generated strings? How many comparisons do you make for each character?
-- edp
|
203.44 | | LATOUR::AMARTIN | | Tue Oct 08 1985 10:41 | 11 |
| I wonder what kind of loop-the-loop patterns develop as you build
an Aho-Corasick pattern matcher to recognize longer and longer prefixes
of the lexically first `string'? How about building a DFA to recognize
all members of NR of length less than k? (Or an NFA for that matter).
How does the number of states (for any of the above problems) grow as
a function of the length recognized?
What about context-free recognizers for that grammar that describes such
large leaps of the lexically first string.
/AHM
|
203.45 | | BEING::POSTPISCHIL | | Tue Oct 08 1985 12:40 | 13 |
| Re .44:
You're not using "DFA" for "deterministic finite-state automaton", are you?
If so, there exists a single automaton capable of recognizing all finite members
of NR, although it requires some sort of infinite memory (such as the tape for a
Turing Machine). On the other hand, things such as finite-state parsers will
get larger as the members of NR to be recognized increase. The increase will be
at least exponential: At a state in the middle of parsing a string, the parser
state will need to "encode" at least the last l/2 characters (where the length
of the string being parsed is l), requiring at least 3**(l/2) states.
-- edp
|
203.46 | | R2ME2::GILBERT | | Tue Oct 08 1985 14:08 | 18 |
| If we find a unique suffix string (i.e., the string so far is: "...tUV",
where tU doesn't occur previously in the string), then, when searching
back looking for a repeated string, we need only search back to the "t"
(or twice that far, depending on how you look at it), since we're assured
that there are no previous occurences of tU.
Another performance improvement is to create an auxiliary string, as follows.
wherever the string "abaca" occurs in the original, we put a zero byte in
the auxiliary string, the non-zero bytes in the auxiliary string are simply
the lengths between the occurences of "abaca". This auxiliary string can
be used to greatly speed the part of the single-character algorithm that
searches back looking for a match.
BTW - I think I've found a place where the string begins "abacaXcbacaX",
where X is several tens of thousands of characters in length. That is,
the whole damned string very nearly repeats!
If anyone wants to discuss this problem 'interactively', let me know....
|
203.47 | | SPRITE::OSMAN | | Tue Oct 08 1985 16:31 | 23 |
| A vax-specific performance idea in the standard case:
By standard case, I mean making sure the new character differs from previous,
then make sure the last pair differs from next to last pair, then make sure
last triplet differs from previous triplet, etc. through making sure
last half of string differs from first half.
From the vax point of view, perhaps we should represent the string as
8-bit bytes, one per character, and let the string grow "backwards", i.e.
towards smaller-valued addresses.
That way, we can start with newest character, and use a single vax
instruction to search for previous occurances of that character in
string. Only substrings ending at such spots need be checked.
For that matter, there's most likely a vax instruction that actually
searches for a substring. So we can search for previous occurance of
last "small piece". Only such occurances need be checked more carefully,
and for sufficiently large "small pieces", very few occurances are
likely to be found.
Is this useful ?
/Eric
|
203.48 | | BEING::POSTPISCHIL | | Tue Oct 08 1985 18:20 | 15 |
| Re .46:
Doesn't finding a unique suffix string take a long time?
Without any auxiliary strings, do you know how many comparisons you would
have to make to determine a character being added causes no repititions?
Do you use more than one auxiliary string or replacements for more than one
string in the auxiliary string?
How long (in CPU-time and real time) did it take to find however many characters
you have found so far?
-- edp
|
203.49 | | RAINBO::GRANT | | Tue Oct 08 1985 22:29 | 74 |
| Good work so far, everybody!
I've been away and just catching up. I have discovered and tested one
procedure for generating a long NR string that after 20000 characters has only
had single and double backups. It is certainly not the lexically first
string. The procedure goes something like this:
Start with a letter, and a direction.
(a direction is either A->B->C->A... or C->B->A->C..)
Generate the string by "travelling" in one direction as far as you can without
duplicating a substring, then switch directions. If you get stuck, you need
to backup, find what the old direction was, and switch that.
I have it coded in C (the code is carved out from the other C program in a
previous response). I will put the code as a response if I get a mail
request, or the printout of the first N characters.
It seemed like the procedure would be interesting to the people who are trying
to prove that some procedure generates an infinitely long member of NR. That
the procedure so far has only single and double backups suggests that it might
be more easily proven to be never-ending. There were 69 double backups in
20000 characters.
Here are the first few steps of the procedure:
Start with A, and travel "up".
A
AB
ABC
ABCA
ABCAB
ABCABC nope. switch direction.
ABCAB A
ABCAB AC
ABCAB ACB
ABCAB ACBA
ABCAB ACBAC nope. switch.
ABCAB ACBA B
ABCAB ACBA BC
...
ABCAB ACBA BCABC nope. switch.
ABCAB ACBA BCAB A
ABCAB ACBA BCAB AC
...
ABCAB ACBA BCAB AC ABCA CBAC ABCA
ABCAB ACBA BCAB AC ABCA CBAC ABCAB nope. switch.
ABCAB ACBA BCAB AC ABCA CBAC ABCA C oops. This doesn't work either. backup.
ABCAB ACBA BCAB AC ABCA CBAC ABCA
ABCAB ACBA BCAB AC ABCA CBAC ABC B We had been going "up" back here. switch.
ABCAB ACBA BCAB AC ABCA CBAC ABC BA
...
ABCAB ACBA BCAB AC ABCA CBAC ABC BACB CABC BAC ABCA CBAC ABC BACB CAB ACBA
....
The next few backups are around characters 99, 156, 228, 289, 361, 400, and
472.
The first double-backup happens at around character 2112.
Double backups usually occur in clumps of 4 or 5, each 101 apart. There are
many other near-regularities like this that are intriguing.
I am next seeing if there is a simple way to modify the procedure to have
either no double-backups or no backups at all. I am looking at more
complicated versions of "direction" that would somehow indicate usefully the
recent state of the string.
Looking at the print out and thinking about this puzzle has led to lots of new
thoughts. Thanks to the originator. I can see that this problem has only
begun!
-Jim Grant, Littleton MA.
|
203.50 | | R2ME2::GILBERT | | Tue Oct 08 1985 23:58 | 64 |
| If you want a fast program to generate an arbitrarily long string in NR
(and you aren't choosy about whether it's lexically first), the following
program will generate N characters in O(NlogN) time and O(logN) space.
Within a constant of proportionality, this is best possible.
Can we now return to matters of importance, like whether the regular
back-tracking algorithm will ever back up more than 5 characters, or how
to quickly generate the first N characters of the lexically first string?
module fast_nr (main=main, addressing_mode(external=general)) =
begin
external routine
lib$get_vm,
lib$put_output;
bind
convert = uplit(
uplit(%asciz'abacabcacbabcbacabacbabc'),
uplit(%asciz'abacabcacbabcabacbabcacbc'),
uplit(%asciz'abacabcacbabcabacbcabcbabc')): vector;
own
stack;
routine getch(ps) =
begin
bind s = .ps : ref vector[2];
local x;
if s[0] eql 0 then
begin
lib$get_vm(%ref(2*%upval), s);
s[0] = 0;
s[1] = 0;
return 'a';
end
else if .s[1] eql 0 then
begin
s[1] = .convert[getch(s[0])-'a'];
ch$rchar_a(s[1]); ! Skip the 'a'
end;
x = ch$rchar_a(s[1]);
if .x neq 0 then return .x;
s[1] = .convert[getch(s[0])-'a'];
x = ch$rchar_a(s[1]);
return .x;
end;
routine main =
begin
literal
k_line = 80;
local
buff: vector[k_line,byte],
desc: vector[2] initial(k_line, buff[0]);
stack = 0;
while 1 do
begin
incr i from 0 to k_line-1 do buff[.i] = getch(stack);
lib$put_output(desc[0]);
end;
end;
end
eludom
|
203.51 | | R2ME2::GILBERT | | Wed Oct 09 1985 00:09 | 6 |
| re 203.48:
Finding a unique suffix string takes about as long as the search-back to check
one character. The algorithm I'm using looks for a unique suffix every 1024
characters. The only use made of this is to limit the search-back amounts, and
this is the only automated optimization that's working so far.
|
203.52 | | LATOUR::AMARTIN | | Wed Oct 09 1985 00:48 | 39 |
| Re .45:
>You're not using "DFA" for "deterministic finite-state automaton", are you?
Yes, I am.
>If so, there exists a single automaton capable of recognizing all finite
>members of NR,
Can you prove that the set of all finite strings in NR is regular?
I haven't yet seen a proof that the language is context-sensitive
or context-free, let alone regular.
>although it requires some sort of infinite memory (such as the
>tape for a Turing Machine).
If so, it isn't a DFA. The only memory in a DFA is encoded in the current
state number, and the set of states is finite.
>On the other hand, things such as finite-state parsers will get larger
>as the members of NR to be recognized increase. The increase will be
>at least exponential:
Indeed, it can't be any worse than exponential. It only takes 3**k states
to recognize any subset of strings of length <= k from a 3 symbol alphabet.
>At a state in the middle of parsing a string, the parser
>state will need to "encode" at least the last l/2 characters (where the length
>of the string being parsed is l), requiring at least 3**(l/2) states.
Exactly the last l/2 characters, since the parser has only seen l/2
characters.
The proof claims that the DFA for finite members of NR of length <= k
has O(3**k) states. But the same proof can be used to prove that the DFA
for all members of (a+b+c)* with length <= k has O(3**k) states. (The proof
makes no distinctions about the set of strings being recognized). Yet we know
that the DFA has k+1 (O(k)) states. I don't believe the proof.
/AHM
|
203.54 | | R2ME2::GILBERT | | Wed Oct 09 1985 01:59 | 14 |
| re 203.48:
Here are the accumulated CPU times for the algorithm, at 10K intervals:
10K - 1:32
20K - 4:53
30K - 10:04
40K - 25:43
50K - 32:25
60K - 53:24
From about 40K to 60K, the unique suffix stayed at 40K, so the going's slow...,
it's practically duplicating the entire string! The algorithm will speed up
for a little while after this hurdle.
|
203.55 | | BEING::POSTPISCHIL | | Wed Oct 09 1985 10:01 | 48 |
| Re .51, .54:
After I entered my response, I came to the same conclusion you did about
the unique suffix string: It takes as long to find as the process of validating
one new character. In fact, the two processes can be performed simultaneously,
except that the search for the unique suffix apparently needs to continue
back to the exact middle of the string, rather than stopping at the previously
used unique suffix string. If we could find some way to show that the search
could stop at that unique suffix string, we would be finding unique suffixes
with almost no extra work.
Thanks for the CPU-times, at least the task seems to be moderately reasonable.
Re .52:
>> At a state in the middle of parsing a string, the parser
>> state will need to "encode" at least the last l/2 characters (where the length
>> of the string being parsed is l), requiring at least 3**(l/2) states.
>
> Exactly the last l/2 characters, since the parser has only seen l/2
> characters.
The "middle" was not intended to mean exactly the middle. Once the parser gets
beyond the exact middle, it must record more than the last l/2 characters. Some
extra information must be known about what characters have matched previous
characters.
> The proof claims that the DFA for finite members of NR of length <= k
> has O(3**k) states. But the same proof can be used to prove that the DFA
> for all members of (a+b+c)* with length <= k has O(3**k) states. (The proof
> makes no distinctions about the set of strings being recognized). Yet we know
> that the DFA has k+1 (O(k)) states. I don't believe the proof.
By "the proof", are you referring to the short remarks I made showing at least
O(3**k) states are required? If so, it certainly does make a distinction about
the set of strings being recognized, although it is implicit (when I present a
proof, you'll know it). When I stated 3**(l/2) states were required, I did not
say why. Consider the parser after at least l/2 characters have been read. For
each of the 3**(l/2) possible strings of l/2 characters that might have been
just read, there exists a string of l/2 characters that could come next to
invalidate the string. Each of these future-strings is different for each of
the past-strings, so the parser must know, as it starts to examine each of the
future-strings, _exactly_ which past-string was read. Hence, at least 3**(l/2)
states are required.
-- edp
|
203.56 | | GOLLY::GILBERT | | Wed Oct 09 1985 13:40 | 17 |
| Re .51, .54, .55 (unique suffix strings):
The starting position of the unique suffix string never moves leftward,
and tends to stay in the same place during the generation of thousands
of characters. Thus, looking for a new unique suffix once every 1000
generated characters is relatively cheap. It could be improved by checking
whether the previous unique suffix is still the shortest unique suffix,
by checking whether we're still generating the same stream of characters
as were generated at the occurence of an "sUV" in the string. But improving
this aspect doesn't seem particularly fruitful. Better would be to discover
why the string is repeating, and copy the entire repetition in one fell swoop.
By the way, I have some long equivalents of Bliss' ch$compare, ch$find_sub, and
ch$move, if anyone is interested. These work with lengths up to 4 billion,
instead of only working up to lengths of 65K. Send mail if you want 'em.
- Gilbert
|
203.57 | | TOOLS::STAN | | Wed Oct 09 1985 17:15 | 4 |
| Here's a reference to the original problem:
Yaglom and Yaglom, Challenging Mathematical Problems with Elementary
Solutions, volume 2, Holden-Day, problems 124 and 125.
|
203.58 | | SPRITE::OSMAN | | Thu Oct 10 1985 11:54 | 16 |
| My source was MIT Articifical Intelligence Lab Memo #239 entitled
"Hakmem". It's a collection of technical and mathematical problems,
discoveries, observations etc.
I have finally gotten BUMP working. The result is:
Searching s(22840) or whatever that long string was that Mike
Gilbert published, ALL VALID SUBSTRINGS through CBACB are found
in it. CBACB is the first one not found !
Question:
Will CBACB appear in lexically-first string ? Exactly where does
it appear. If not, can someone prove why it will never appear ?
/Eric
|
203.59 | | BEING::POSTPISCHIL | | Mon Oct 14 1985 19:11 | 28 |
| In a procedure that is looking for the lexically-first infinite string of NR,
suppose there is some sort of list of potential repetitions and that the current
length of the string is l. I think all the procedure needs to do is check the
(hopefully small) list of potential repetitions to see if each one continues,
completes, or fails at this point and then check to see if any new potential
repetitions have cropped up. To guarantee that any potential repetition is
found at or before the time it is completed, there are only a few comparisons
that need to be made for each position: Compare the character in position l-1
(with numbering starting at 0) to the character at position l-1-f, for each f
less than l that divides l. If no match is found, there can be no potential
repetition of length f ending anywhere in the next f characters, after which
time the procedure will again check for possible repetitions of length f,
because f divides l+f. If a match is found, it will be necessary to search back
some ways, since a repetition may have started quite some time ago, although if
it has completed, it must have completed only with the addition of the most
recent character.
There are some "minor" problems with this routine. For example, when a member
of the list of potential repetitions is disqualified because it does not match a
character, it cannot always be thrown away -- a conflict in the future might
cause the procedure to backup beyond the disqualifying point, so they must
be retained until they are relatively safe to discard (five positions?).
Other than similar things, it looks quite promising, and I am working on
implementing it.
-- edp
|
203.60 | | KOBAL::GILBERT | | Tue Oct 15 1985 12:30 | 66 |
| Re 203.59:
I'm not sure I understand the comment about f being a divisor of l. Is this
because the algorithm only checks for repeated substrings of length f every
f characters?
Re displaying strings:
A Lempel-Ziv style algorithm gives another convenient way to 'display'
the string. For example, if we let "(offset,length)" indicate the string
offset and length, then the first 1095657 characters can be written:
abacabcacbabcabacabcacbacab(7,17)bcabacb(17,29)(22,22)(24,26)(42,38)
(46,28)(42,29)bcab(28,22)(115,111)(126,84)c(220,22)ac(109,17)(226,124)
(226,108)(434,175)(22,58)ab(877,35)(464,233)(436,447)(1123,336)(200,105)
(213,75)(118,222)(3328,63)(1211,2139)(2125,293)(213,61)(6735,2416)
(1196,1156)(3350,2578)(3396,2429)(1196,837)(12815,5007)(935,1054)
(9182,15322)(5927,538)(40036,270)(6465,34115)(6465,33847)(981,2369)
(2125,73)(198,152)(226,94)(883,5045)(935,43)(912,2490)(883,2508)
(6750,68215)(40580,485)bacab(1208,538)(190287,268)(436,255)(2008,166)
(18697,4983)(196,33)b(874,5054)(877,2511)(195368,8684)(196534,6303)
(12815,3839)(3350,83)(118,222884)(190290,32709)(190294,32700)
(190293,32696)(24,543949)(949,1153)(6687,6557)
Please forgive any typos. Also, an independent verification of the first
million characters is requested.
Re commonly occuring substrings:
Can two occurences of the string "aba" be arbitrarily far apart in an infinitely
long string of NR? I think so. However, it appears that there must be an
infinite number of occurences of the string "abc" (that is, there's a small
bound on how far apart they can be).
Re backing up:
How far might we need to back up? Note that, to produce a 'bad' pattern
that causes a large backup, there must be several repeated substrings of
diffferent lengths (and with lengths greater than, say 6), all ending at
about the same place, e.g.:
[-----------][-----------]
[---------------][---------------]
Now we can equate parts of these strings, and we see:
[defghij----][defghij----]
[--------defghij][----defghij----]
-or- [----defghij----][----defghij----]
Thus, we see that we must've already produced a repeated substring (defgdefg).
That is, the longer of these two repeated substrings is 'too short'. At the
smallest it must be about twice as long as the shorter repeated substring:
[defghijklm-][defghijklm-]
[fghijklm---defghijklm-][fghijklm---defghijklm-]
With a little precision, we should be able to limit the number of repeated
substrings that can end within a small range of r characters to something
like log2(l)-log2(r), where l is the length of the string generated so far.
Because a backup of length greater than r requires 3r repeated substrings,
and a significant number of these repeated substrings (how many?) must have
length greater than 2r, we should be able to limit the maximum backup amount
to something like log2(l).
- Gilbert
|
203.61 | | BEING::POSTPISCHIL | | Tue Oct 15 1985 13:02 | 10 |
| Re .60:
> I'm not sure I understand the comment about f being a divisor of l. Is this
> because the algorithm only checks for repeated substrings of length f every
> f characters?
Yes.
-- edp
|
203.62 | | R2ME2::GILBERT | | Wed Oct 16 1985 13:28 | 15 |
| abacabcacbabcabacabcacbacab(7,17)bcabacb(17,29)(22,22)(24,16)(42,38)(46,28)
(42,29)bcab(28,22)(115,111)(126,84)c(220,22)ac(109,17)(226,124)(226,108)
(434,175)(22,58)ab(877,35)(464,233)(436,447)(1123,336)(200,105)(213,75)
(949,1176)(126,45)(6,18)(881,2515)(1243,503)(423,34)(118,222)(3328,63)
(1211,2139)(2125,293)(213,61)(6735,2416)(1196,1156)(3350,2578)(3396,2429)
(1196,837)(12815,5007)(935,1054)(9182,15322)(5927,538)(40036,270)(6465,34115)
(6465,33847)(981,2369)(2125,73)(198,152)(226,94)(883,5045)(935,43)(912,2490)
(883,2508)(6750,68215)(40580,485)bacab(1208,538)(190287,268)(436,255)
(2008,166)(18697,4983)(196,33)b(874,5054)(877,2511)(195368,8684)(196534,6303)
(12815,3839)(3350,83)(118,222871)(24,445831)(949,1153)(121523,68735)
(9182,30860)(892863,68736)(190258,255622)(222989,121424)(892864,544175)
(219087,120355)(201595,11189)(204052,1216)(111502,5053)(115,229)
(2115176,5272)(434,310)(2115176,10851)(6465,439415)(222989,5832)
(190292,38522)(992523,443260)(195368,14948)(2620660,453155)(126,224)
(874,445006)(222989,639)(6201,22108)
|
203.63 | | R2ME2::GILBERT | | Wed Oct 16 1985 15:05 | 50 |
| The preceding represents the first 4,000,000 characters of the lexically
first infinitely-long string in NR, and is displayed using the technique
described in 203.60 (routines for decrypting and displaying this format
will be available shortly). Excluding time spent debugging and tuning,
generating these 4M characters took under 4 CPU hours.
The biggest performance gain was by using the extension technique in 203.40.
Most of the extensions were of several thousand characters, and some were
of hundreds of thousands of characters. At some places in the string, the
search-back was taking 5 or 6 seconds to check a single character! (and it
only searched back to (twice) the length of the shortest unique suffix string).
There were many 'backup' amounts of 5 (one at offset 2582138 was interesting),
but no 'backups' of 6 or more. I need to re-run the program to collect
information about 'search-back' amounts, as I think they may be useful in
understanding how this string behaves, and in displaying the string.
The display used in 203.62 is not particularly useful, although it is terse
(compare with 203.7, which is only 0.5% the length). Because it searches
for longest matches, part of a long match may obscure the fact that the next
chunk starts at a very small offset; for example, I suspect (24,445831) really
represents a repetition that started at offset 2.
The macro display technique in 203.31 is better (patterns are readily visible),
but the particular choice of macroes is poor, because some macroes are prefixes
of others. Can a better choice of macroes be found?
It seems reasonable that there might be a small set of macroes that could be
used, such that the set has the 'uniquely identifying substring' property, as
described in 203.8. That is, any 'long' repeated substring must 'split'
between the two macroes. This should give a better set of macroes (so that
the string can be displayed better), and might mean that the the algorithms
could be applied to the macro names, instead of the characters a, b and c.
Looking at the 'search-back' amounts, and the strings that nearly repeat
should help find such a set of macroes (if they can indeed be found).
It is still interesting to wonder why there are so few different 'search-back'
amounts. There are only 10 different 'search-back' amounts greater than 64
in the first 85K characters (see notes 203.15 and 203.31). This means that,
within the first 85K characters, at most 74 (64+10) string comparisons are
needed to check whether a character can be appended to the string.
Notes 203.23 and .24 ask about counting the number of 'valid' substrings of
a particular length. Can any upper or lower bounds be given? How does the
function seem to grow?
Also, could Eric's BUMP program be used to determine whether there's a small
bound on how far apart occurences of "aba" may be?
- Gilbert
|
203.64 | | SPRITE::OSMAN | | Wed Oct 16 1985 17:18 | 32 |
| Here's an algorithm that I believe will test to see if a new last letter
is valid fairly efficiently.
-------------------------- beginning of algorithm ---------------------
Start with test substring equal to last letter.
Start with position to search backwards from as end of string.
[A:] Search backwards for second occurrence of substring.
If substring not found, entire string is valid.
If substring is found in position just to left of first occurence,
entire string is invalid.
We found second occurrence farther back than immediately abuting first
occurrence, so . . .
Set test substring to everything from just to right of second occurrence
through last letter.
Set position to search from to be just to right of that second occurence.
Go back to [A:].
--------------------------- end of algorithm -------------------------
If we arrange string BACKWARDS in memory, then the vax MATCHC instruction
can be used for the searching.
I'm experimenting with an implementation of this.
/Eric
|
203.65 | | BEING::POSTPISCHIL | | Wed Oct 16 1985 19:27 | 18 |
| The procedure I described previously (search for partial repetitions of length
f every f characters and observe them as more characters are added to make
sure they do not complete) requires, when finding the first 1000 characters,
an average of 10.5 comparisons of one character to another to determine whether
a new character makes the string invalid or not. However, it is still slower
than Gilbert's program. Either Gilbert makes up for comparisons by occasionally
being able to copy large stretches of the string or I lose by the overhead
involved in factoring numbers and handling data structures.
I will be working on the program some more, since there are lots of places
I know improvements can be made, but if anybody wishes to look at it, everything
is in BEING""::USER1:[POSTPISCHIL.PUB]. The sources are in *.H and *.C.
Ignore the X.C file, it is a version of NR.C that collects some extra statistics
as it works. The program is in NR.EXE. Run it and type in the number of
characters you would like it to find.
-- edp
|
203.66 | | LATOUR::AMARTIN | | Wed Oct 16 1985 23:48 | 26 |
| Re .63:
There is a paper titled "Analyzing and Compressing Assembly Code" by Christopher
W. Fraser, Eugene W. Meyers and Alan L. Wendt of U of Arizona on pp 117-121
of the proceedings of the Sigplan '84 Symposium on Compiler Construction
(which was published as Vol 19, #6 (Jun-85) issue of SIGPLAN Notices.
The authors describe an algorithm to compress straight-line assembly code
into subroutine calls to save space (the algorithm looks for patterns in
the code, and creates subroutines out of repeated code sequences). The paper
talks about using a Patricia tree to find the patterns. References for
this point to:
3. D. E. Knuth, "The Art of Computer Programming, Volume 3: Sorting and
Searching" (for Patricia trees).
4. E. M. McCreight, "A Space-Economical Suffix Tree Construction Algorithm",
JACM V23, #2 (Apr-76), pp 262-272 (also for Patricia trees).
5. M. Rodeh, V. R. Pratt and S. Even, "Linear Algorithm for Data Compression
via String Matching", JACM, V28, #1 (Jan-81), pp 16-24. (A different
approach?).
Perhaps there is something worthwhile in these references. I haven't looked
at the 3 latter papers, so I don't know if any of this is redundant.
/AHM
|
203.67 | | ALIEN::POSTPISCHIL | | Thu Oct 17 1985 14:37 | 20 |
| I've got my program into something resembling a decent shape. It takes 34
CPU-seconds (on a VAX-11/780) to find the first 10,000 characters, as opposed
to Gilbert's 92 seconds. Better than that, it appears to be increasing only
a little more than linearly:
number actual (number/2000)*
found time (time for 2000)
1000 -- 2.9 2.9
2000 -- 5.9 5.8
3000 -- 9.0 8.7
4000 -- 12.1 11.6
5000 -- 15.6 14.5
10000 -- 34.2 29.0
Gilbert, what type of CPU were your times for? What language did you work
in? If everything goes well, I will be ready to compare our first million
characters soon.
-- edp
|
203.68 | | ALIEN::POSTPISCHIL | | Thu Oct 17 1985 14:57 | 5 |
| Gilbert's first 22,528 characters agree with mine. It took my program 86.6
CPU-seconds to find them (86.6/34.2 = 2.53, 22528/10000 = 2.25).
-- edp
|
203.69 | | R2ME2::GILBERT | | Sun Oct 20 1985 00:28 | 8 |
| Two occurences of "abc", can be separated by no more than 25 characters.
But two occurences of "aba" can be 1000 characters apart (by construction).
Can we prove that they can be arbitrarily far apart? Note that the following
simultaneous substitutions could almost (but not quite) be used to prove this:
a -> abc
b -> ac
c -> b
|
203.70 | | TOOLS::STAN | | Sun Oct 20 1985 14:06 | 28 |
| Some more references:
Richard K. Guy, Unsolved Problems in Number Theory. Springer-Verlag.
New York: 1981. pp 124-125. Problem E21.
F. M. Dekking, On repetitions of blocks in binary sequences.
J. Combin. Theory Ser. A, 20(1976)292-299.
F. M. Dekking, Strongly non-repetitive sequences and progression-free sets.
J. Combin. Theory Ser. A 27(1979)181-185.
R. C. Etringer, D. E. Jackson and J. A. Schatz, On non-repetitive sequences.
J. Combin. Theory Ser. A 16(1974)159-164.
D. Hawkins and W. E. Mientka, On sequences which contain no repetitions.
Math. Student 24(1956)185-187.
G. A. Hedlung, Remarks on the work of Axel Thue on sequences.
Nordisk Mat. Tidskr. 15(1967)147-150. MR 37 #4454.
Helmut Prodinger and Friedrich J. Urbanek, Infinite 0-1 sequences without
long adjacent identical blocks. Discrete Math. 28(1979)277-289.
A. Thue, Uber unendliche Zeichenreihen, Norse Vid. Selsk. Skr. I Mat.-Nat.
Kl. Christiania (1906), No. 7, 1-22.
A. Thue, Uber die gegenseitige Lage gleicher Teile gewisser Zeichenreihen,
ibid. (1912), No. 1, 1-67.
|
203.71 | | DEMON::OSMAN | | Mon Oct 21 1985 16:46 | 6 |
| Unfortunately, the algorithm in 203.64 seems to be awful.
Apparently, the search matches in more places than I expected. Hence we
end up doing to much work.
/Eric
|
203.72 | | DEMON::OSMAN | | Mon Oct 21 1985 16:48 | 9 |
| I'm curious about that problem E21 in "Unsolved Problems in Number
Theory" (see 203.70).
Is the "unsolved problem" that of proving the existence of arbitrarily
long NR strings ? In which case Gilbert is now famous ?
If not, can someone type in the problem ?
/Eric
|
203.73 | | TOOLS::STAN | | Mon Oct 21 1985 17:06 | 3 |
| Problem E21 is a hodge-podge of known results and open problems.
A typical open problem is the one I entered here in MATH.NOT
as note 360.
|
203.74 | | ALIEN::POSTPISCHIL | | Mon Oct 21 1985 18:45 | 23 |
| Re .71:
I'm not sure I understand the algorithm in .64. I thought I did until I
saw .71. There is a simple way to check to see if the most recently-added
character caused a repetition, and it requires a number of
character-to-character comparisons less than or equal to half the length
of the string.
Set i to the last position in the string. Set j to the previous position.
[Label A] Compare character i to character j. If they are equal and the
length of the string minus j equals j minus i, the string has a repetition,
so quit. If they are equal, decrement i. Otherwise, set i to the last
position in the string. Decrement j. If j is non-negative, go to A.
The string has no repetition, quit.
Gilbert has improved upon this by limiting the search, so that it is not
necessary for j to go all the way to zero. He also is able to avoid the
check sometimes by copying many characters from previous places in the string.
I have improved on this by coordinating my checks for substrings based on
their lengths.
-- edp
|
203.75 | | SPRITE::OSMAN | | Tue Oct 22 1985 15:08 | 28 |
| There's something wrong with .74. Here's the beginning of it, reprinted:
>Set i to the last position in the string. Set j to the previous position.
>[Label A] Compare character i to character j. If they are equal and the
>length of the string minus j equals j minus i, the string has a repetition,
>so quit.
Let's try it. Suppose we've got
...AA
and we're using .74 to check it. Let's assume we've now got 100 characters.
"Set i to the last position in the string." i = 100
"Set j to the previous position." j = 99
"Compare character i to character j. Compare "A" with "A".
"If they are equal" yup!
"and the length the string minus j" 100-99 is 1
"equals j minus i" 99-100 is -1 Not equal!
Algorithm hence doesn't quit here, and I would have expected it to.
/Eric
|
203.76 | | BEING::POSTPISCHIL | | Tue Oct 22 1985 16:20 | 9 |
| Re .75:
Oops, that should read "and the length of the string minus i equals i minus
j, the string has a repetition . . .". Also, start counting positions at
zero (so the last position in a string of length 100 is 99) or insert a
"plus one" after "length of the string".
-- edp
|
203.77 | | R2ME2::GILBERT | | Tue Oct 29 1985 21:19 | 52 |
| I happened to run across this problem in "Jewels of Formal Language Theory",
by Arno Salomaa (Computer Science Press). I just happened to be looking
for a nearby book in the ZK library, happened to pick this up, and this was
the first problem! It's known as Thue's Problem, after Axel Thue, a formal
language theorist of the early 1900s.
There are actually three problems; we've been looking for square-free w-words
(omega-words are 'infinitely long' strings) over an alphabet of 3 letters.
Similarly, there's the problem of finding cube-free w-strings.
Here's an outline of how the problem was solved there:
Define the sequence of words w(i+1) = w(i)w'(i), w(0) = "a", and x' denotes
the string obtained from x, by interchanging "a" and "b". Thus, w(5) is:
a b ba baab baababba baababbaabbabaab (spaces for readability).
Let y be the w-word generated by this (note the similarity between this
string and the 'parity' of the integers). Then, it's shown that y is cube-free.
Now, we define another w-word, z, based on adjacent characters in y (not
really *pairs* though -- this is a length-preserving homomorphism), thus:
[aa] -> 1, [ab] -> 2, [ba] -> 3, [bb] -> 4.
In this notation, z begins:
2432312431232432312324312432312...
Note that 1 is always preceded by a 3 and followed by a 2; a 4 is always
preceded by a 2 and followed by a 3. The string z is square-free.
By replacing each occurence of 4 in z with a 1, we get a square-free string
over an alphabet of 3 letters.
A stronger notion that square-freeness is also considered. A w-word over
an alphabet of n letters is called "irreducible" if, whenever it contains
a subword xyx (where x is non-empty), then |y| >= n-2. For n=3, this is
equivalent to square-freeness.
Also, the following theorem is mentioned: There is a w-word w over an alphabet
of n letters such that, whenever xyx (where x is non-empty) is a subword of w,
then |y| >= |x|/3. This result is proved by starting with "a" and using the
substitutions (such a deterministic generation is called a D0L system):
a -> abc acb cab c bac bca cba
b -> bca bac abc a cba cab acb
c -> cab cba bca b acb abc bac
The 1/3 is best possible, and the above substitutions cannot be 'shortened'.
The string y above (2432...) can be generated by the D0L system:
1 -> 2431, 2 -> 2432, 3 -> 3123, 4 -> 3124, starting with 2.
The following morphism is what Thue had originally used:
a -> abcab, b -> acabcb, c -> acbcacb.
The 'permutation-square-free' problem isn't mentioned.
|
203.78 | | R2ME2::GILBERT | | Sat Nov 02 1985 12:17 | 13 |
| While it lasts, the file HARE::SYS$PUBLIC:ABC.A is a backup save-set containing
a variety of routines I've used in working on the NR and NP problems (including
routines to compress/decompress the strings in the LZ-style format). The save-
set also includes a DESCRIP.MMS file for building the various executables.
The programs for finding the (beginning of the) lexically first w-string in NR
and NP initialize the string with what's known so far (see NRSTRING.REQ and
NPSTRING.REQ), and continue from there.
If any errors are found (particularly in the UTIL$ or CH$ routines), please let
me know (preferably by MAIL).
- Gilbert
|
203.79 | | R2ME2::GILBERT | | Sat Nov 02 1985 13:08 | 42 |
| Since we've the first 6 million characters (I'm still puzzled about how Eric
Postpischil's program got so far so fast extending only a single character at
a time), it's about time we started to seriously analyze the string, trying to
find some patterns.
For example, assuming the string can be generated by a D0L system (i.e., just
a bunch of substitutions on successively longer strings), how long must the
substitution strings be?
What if each step of the substitutions takes the string (x), makes substitutions
for each of the characters, and then prefixes the result with some short string?
(this will solve the problem with the first 20 characters never reappearing).
What if, instead of doing substitutions of 4 different characters in each step,
we have (say) 10 different characters, and then to get the string, we do a
simple substitution of these to get just 4 characters?
With the decompress subroutines (see 203.78), Eric Osman should be able to find
the shortest substring that doesn't occur in the first 6M characters (but which
could occur). Could someone give a plausible argument to explain the character
frequencies (note that "b"s occur significantly more than 1/3 of the time);
also, what are the frequencies for letter pairs (do "ac" and "ca" occur with
roughly equal frequency?).
The ratios of "a"s/"b"s is not equal to the ratio of "b"s/"c"s. I'd expect
these to be about the same (since the probability that an "a" can be appended
should be about the same as that for appending a "b" or "c" -- or should it,
given the different frequencies of the characters?). Maybe the probabilities
start out about the same, but backing up causes "a"s to be more likely? This
idea may help explain the letter frequencies.
I'm unsure whether I was correctly figuring the backup amounts -- it now seems
that there were backups of 9 characters (within the first 1000 generated).
could someone independently check this?
Asymptotically, how many strings are there of length n? When appending a
character, there's a 1/3 chance (roughly) it's disallowed by a repetition of
length 1, and a 1/4 (?) chance it's disallowed by a repetition of length 2,
and a 1/8 (?) chance it's disallowed by a repetition of length 3. In all
the strings of length n, do "abca" and "abac" occur equally often?
- Gilbert
|
203.80 | | SPRITE::OSMAN | | Mon Nov 04 1985 15:44 | 73 |
| I have a respectable time now for a program that computes the LFS
(lexically-first string).
I never did understand you guy's "extension technique", so my program
doesn't use it. This, I believe adds to the respectability of my
time.
Here's a tally of the times I know about:
Who Number of characters time note, date
--- -------------------- ---- -----------------
SPRITE::OSMAN 10K 00:00:49 203.80 11/4/85
40K 00:09:49 * 203.80
BEING::POSTPISCHIL 10K 00:00:34 * 203.67 10/17/85
40K ???????
GOLLY::GILBERT 10K 00:01:32 203.54 10/9/85
40K 00:25:43 203.54
* indicates currently known record
Feel free to reproduce the table in other replies with updates as
necessary.
I've been working on my program gradually. Although it's not very long,
it has virtually no comments in it (yeah, I know, booo, hisss). It's
a combination of BLISS and C, plus Gilbert's CH$ module.
The only reason I use BLISS instead of all C, is C's refusal to allow
me to get at *in-line* use of INSQUE and CMPC3 and MATCHC instructions.
The main features of my program:
o The program works with a "group size"(gs), typically 4. This is
the number of characters appended at once.
o The program initializes a "groups" table which contains all possible
strings of gs characters. For gs=4, the table contains
ABAC, ABCA, ABCB, ACAB, ACBA, ACBC, BABC, etc.
Then program initializes "next_abut" table which indicates which
group of four characters to try next if a group fails. For instance,
the table indicates that if we've just decided that "...ABAC ABCB"
is invalid, the table tells us what to replace the "ABCB" with.
Even though "ACAB" is next in the group table, the next_abut table
automatically skims us over that "obvious" invalidity of "...ABAC ACAB".
o The program works with a "longsize", typically 8. When checking
for validity, substring lengths 1 through longsize are checked the
standard way, namely by using CH$EQL.
For longer substrings, CH$COMPARE_L (Gilbert's generalized
CH$COMPARE) is called *only* for substrings ending with the
same 8 characters as the current last 8. The locations of
these are *remembered* in linked lists as I go, rather than
having to be scanned or searched for.
The times I reported up above for my program were computed from:
$ show process/accounting
run program
$ show process/accounting
The "elapsed CPU time" is subtracted.
My time *includes* the time it took the program to print out the string,
which was done in batch, so we're dealing with PRINTF to SYS$OUTPUT which
is a .LOG file. I haven't measured how much time is spent doing just
the output, but I suspect it's not much. I use a loop, printing 80
characters per PRINTF.
/Eric
|
203.81 | | BEING::POSTPISCHIL | | Mon Nov 04 1985 17:02 | 25 |
| Who Number of characters time note, date
--- -------------------- ---- -----------------
SPRITE::OSMAN 10K 00:00:49 203.80 11/4/85
40K 00:09:49 203.80
BEING::POSTPISCHIL 10K 00:00:34 * 203.67 10/17/85
40K 00:02:48 * 203.81 11/4/85
1M 01:48:30 203.81 11/4/85
4M 08:52:42 203.81 11/4/85
GOLLY::GILBERT 10K 00:01:32 203.54 10/9/85
40K 00:25:43 203.54
4M 03:??:?? * ? ?
* indicates currently known record
My times are figured in the same way as Eric's, although I do not know that
we are using the same type of CPU. They are all for the same program; the
amount of time seems to be between n log n and n log**2 n. I think Gilbert's
times are for different programs, and it may be hard to predict how they
increase.
-- edp
|
203.82 | | BEING::POSTPISCHIL | | Mon Nov 04 1985 19:16 | 27 |
| Re .62, .63:
My results differ from Gilbert's. The compression of my result, using a
routine provided by Gilbert, follows the form-feed. Strangely, our compressions
differ in the second line but are then identical until the eleventh line.
-- edp
ABACABCACBABCABACABCACBACAB(7,17)BCABACB(17,29)(22,22)(24,16)(42,38)(46,28)
(42,29)B(114,12)(4,16)(118,108)(126,84)C(220,22)AC(109,17)(226,124)(226,108)
(434,175)(22,58)AB(877,35)(464,233)(436,447)(1123,336)(200,105)(213,75)
(949,1176)(126,45)(6,18)(881,2515)(1243,503)(423,34)(118,222)(3328,63)
(1211,2139)(2125,293)(213,61)(6735,2416)(1196,1156)(3350,2578)(3396,2429)
(1196,837)(12815,5007)(935,1054)(9182,15322)(5927,538)(40036,270)(6465,34115)
(6465,33847)(981,2369)(2125,73)(198,152)(226,94)(883,5045)(935,43)(912,2490)
(883,2508)(6750,68215)(40580,485)BACAB(1208,538)(190287,268)(436,255)
(2008,166)(18697,4983)(196,33)B(874,5054)(877,2511)(195368,8684)(196534,6303)
(12815,3839)(3350,83)(118,222871)(24,445831)(949,1153)(121523,68735)
(9182,30860)(892863,68736)(190258,255622)(222989,121424)(892864,743265)
(878,660)(191480,5035)(1209,519)(9182,30820)(7533,1649)(2218000,1243)
(154851,35435)(6680,33061)(40015,33837)(2218540,2189)(2221419,67893)
(9182,13273)(949,1084)(971076,21385)(8968,8819)(2391451,15669)(2377177,63411)
(2325994,189397)(118,445656)(2407120,110368)(2440588,39592)(9182,837)
(967895,24566)(8968,6462)(118,445762)(222989,9018)(965232,27229)(8968,2656)
(12815,13157)(1196,1098)(213,78)(121586,68700)(6680,19)(342,532)(350,234)
(434,51)(883,97100)
|
203.83 | | GOLLY::GILBERT | | Mon Nov 04 1985 22:12 | 12 |
| The difference on the second line is due to a minor change to the display
routine. Namely, the routine's been changed to use shorter matched strings.
If you check, you'll see that these are really the same.
Regarding the problem on the eleventh line, it appears that you missed the
545377-character repetition, ending at offset 1982418. Although this isn't
the first repetition that's greater than 65535 bytes in length (which is the
longest length comparable with a single VAX CMPC instruction), I suspect you
got lucky with the previous long repetitions, by deciding that you needed to
back up to those characters and replace them with the next larger character.
- Gilbert
|
203.84 | | BEING::POSTPISCHIL | | Tue Nov 05 1985 10:43 | 15 |
| Re .83:
I looked at the machine code produced by the C compiler, and, although I am not
familiar with the VAX instruction set, it does not seem to be using the CMPC
instruction or any other multiple-byte instruction. I guess that means there is
a bug in my program, although it is strange for it to show up so far into the
string. The length of the string at the point where our results diverge is
1982417, a prime number. I wonder if that has anything to do with it (my method
relies heavily on factoring numbers).
Also, the longest backup my program encountered was six (counting changing
an "A" to a "B" as a backup of one).
-- edp
|
203.85 | | R2ME2::GILBERT | | Tue Nov 05 1985 11:30 | 4 |
| Oh. Now I realize why I'm figuring longer backup amounts. This is because of
the way I've coded the extension technique, it sometimes shortens the string
(by up to 10 characters), and that's figured into the maximum backup amount.
I'll need to adjust the program. Sorry about the false alarm.
|
203.86 | | R2ME2::GILBERT | | Tue Nov 05 1985 20:04 | 19 |
| I changed my program so that it doesn't display the LZ-style string every
256 characters, and timed it.
Who Number of characters time note, date
--- -------------------- ---- -----------------
SPRITE::OSMAN 10K 00:00:49 203.80 11/4/85
40K 00:09:49 203.80
BEING::POSTPISCHIL 10K 00:00:34 203.67 10/17/85
40K 00:02:48 203.81 11/4/85
1M 01:48:30 203.81 11/4/85
4M 08:52:42 203.81 11/4/85
CLT::GILBERT 10K 00:00:11 * 203.86 11/5/85
40K 00:00:28 * 203.86 11/5/85
1M 00:13:39 * 203.86 11/5/85
4M 01:15:00 * 203.86 11/5/85
* indicates currently known record
|
203.87 | | R2ME2::GILBERT | | Tue Nov 19 1985 11:46 | 36 |
| The following gives the first 6000100 characters of the lexically
first omega-word in NR, where NR is the set of all square-free strings
over the alphabet {a,b,c}, and an omega-word is an arbitrarily long
(or infinitely long, if you will) string. See also Thue's problem.
Except for the parenthesized numbers, the notation is self-evident.
The parenthesized expressions (offset,length) represent a string of
'length' characters at that point in the string, that are the same
as the 'length' characters in the string, starting at offset 'offset'.
Note that the (offset,length) is only used to reference a previously
occuring substring.
For a short example, the string:
"abacabcacbabcabacabcacbacab"
Could be so represented by:
"abac(0,2)cacbabca(3,8)cab"
Since characters 0 thru 1 are "ab", and characters 3 thru 10 are
"cabcacba".
abacabcacbabcabacabcacbacab(7,17)bcabacb(17,29)(22,22)(24,16)(42,38)(46,28)
(42,29)bcab(28,22)(115,111)(126,84)c(220,22)ac(109,17)(226,124)(226,108)
(434,175)(22,58)ab(877,35)(464,233)(436,447)(1123,336)(200,105)(213,75)
(949,1176)(126,45)(6,18)(881,2515)(1243,503)(423,34)(118,222)(3328,63)
(1211,2139)(2125,293)(213,61)(6735,2416)(1196,1156)(3350,2578)(3396,2429)
(1196,837)(12815,5007)(935,1054)(9182,15322)(5927,538)(40036,270)(6465,34115)
(6465,33847)(981,2369)(2125,73)(198,152)(226,94)(883,5045)(935,43)(912,2490)
(883,2508)(6750,68215)(40580,485)bacab(1208,538)(190287,268)(436,255)
(2008,166)(18697,4983)(196,33)b(874,5054)(877,2511)(195368,8684)(196534,6303)
(12815,3839)(3350,83)(118,222871)(24,445831)(949,1153)(121523,68735)
(9182,30860)(892863,68736)(190258,255622)(222989,121424)(892864,544175)
(219087,120355)(201595,11189)(204052,1216)(111502,5053)(115,229)
(2115176,5272)(434,310)(2115176,10851)(6465,439415)(222989,5832)
(190292,38522)(992523,443260)(195368,14948)(2620660,453155)(126,224)
(874,445006)(222989,639)(6201,68368)(3532256,513853)(126,48)(1061231,377010)
(121522,68761)(190294,222870)(111500,78547)(4560153,445543)(190286,222406)
(6465,24802)
|
203.88 | | SPRITE::OSMAN | | Tue Nov 19 1985 14:12 | 5 |
| Does "CBACB" occur in that string ? If not, it's the shortest valid substring
that seems to never occur. It's also the *only* valid substring of length
5 that seems to never occur.
/Eric
|
203.89 | | SPRITE::OSMAN | | Tue Nov 19 1985 14:13 | 4 |
| Also, have you still never seen a backup of more than 6 places ? (or is
that no longer easily computable, given the extension technique)
/Eric
|
203.90 | | FUTBAL::GILBERT | | Tue Nov 19 1985 22:01 | 7 |
| Yes, "CBACB" occurs. The first occurence is at offset 2582135 in the string.
Yes, I had a backup of 7. I'd originally had doubts, since after this occured,
I changed the program (for other reasons) so that it misreported backup amounts,
and I wasn't sure whether it was misreporting them, or all the previous stuff
about backup amounts was wrong. Unfortunately, I've forgotten where in the
string I found this long backup amount, although I suspect it's near "CBACB".
|
203.91 | | SPRITE::OSMAN | | Wed Nov 27 1985 12:13 | 20 |
| Although I haven't broken any more records, here's an updated table showing
my current bests, still avoiding using th extension technique.
Who Number of characters time note, date
--- -------------------- ---- -----------------
SPRITE::OSMAN 10K 00:00:21 203.91 11/27/85
40K 00:04:56 203.91
BEING::POSTPISCHIL 10K 00:00:34 203.67 10/17/85
40K 00:02:48 203.81 11/4/85
1M 01:48:30 203.81 11/4/85
4M 08:52:42 203.81 11/4/85
CLT::GILBERT 10K 00:00:11 * 203.86 11/5/85
40K 00:00:28 * 203.86 11/5/85
1M 00:13:39 * 203.86 11/5/85
4M 01:15:00 * 203.86 11/5/85
* best time so far
/Eric
|
203.92 | what's the simplest substitution recipe that produces arbitrarily long NR strings ? | HANNAH::OSMAN | see HANNAH::IGLOO$:[OSMAN]ERIC.VT240 | Mon Dec 16 1991 10:25 | 46 |
|
I conjecture that if you start with a string of a's, b's, and c's in which
no substring adjacently repeats, the following substitutions will produce
a longer one:
a -> acb
b -> acab
c -> acbacab
I haven't "proved" this, however I have a simple program that starts with "a"
and makes and verifies the substitutions over and over until running out
of memory:
Good string, length = 1
Good string, length = 3
Good string, length = 14
Good string, length = 62
Good string, length = 276
Good string, length = 1228
Good string, length = 5464
Good string, length = 24312
Good string, length = 108176
Good string, length = 481328
Good string, length = 2141664
Good string, length = 9529312
Failed to malloc 42400576 bytes for new string.
Questions:
o Can someone provide a simple proof, using Gilbert's general proof in
.8, that my substrings indeed are valid generators, or else find
a counterexample of the form "start with X, and perform the
substitutions n times, and you'll have an invalid string".
o My substitutions are alot simpler than the ones Gilbert gave us in
.8. Can anyone find even simpler ones ? How simple can we be ?
A substitution must at *least* make the string longer in order to
be "interesting".
o <insert meaningful question here that relates the lexically-first
string (see previous replies for definition> with strings generated
by substitutions>
Thanks.
/Eric
|
203.93 | | ALIEN::EDP | Always mount a scratch monkey. | Mon Dec 16 1991 13:34 | 9 |
| Re .92:
Using your substitutions, "abc", which has no adajacent repeating
substring, goes to "acbacab acbacab". Also, doesn't "a" go to "acb"
which goes to "acb acbacab acab", which contains both "acb acb" and
"acab acab"? Hmm, was the string for "c" typed incorrectly?
-- edp
|
203.94 | | HANNAH::OSMAN | see HANNAH::IGLOO$:[OSMAN]ERIC.VT240 | Mon Dec 16 1991 14:54 | 11 |
|
Whoops, *embarrassment*, bug in my program. My routine called "invalid" was
coded to return 0 if string is valid, and i if string is invalid, with i
pointing to start of the substring that has an adjacent duplicate.
Unfortunately, if i was 0... I'll attempt to fix it by returning the i of
the *second* string instead of the first, which can never be 0, which will
hence disambiguate. Hang on for more news (if I indeed find some simple
generators with the fixed program).
/Eric
|
203.95 | simplest possible generators ? | HANNAH::OSMAN | see HANNAH::IGLOO$:[OSMAN]ERIC.VT240 | Tue Dec 17 1991 16:34 | 50 |
|
I wrote a program to first fill a table with the simplest valid strings that
start with "a" and end with "b", and then see which combinations of three
of them can be repetively simultaneously substituted for a,b,c and produce
a valid expanded string.
The simplest valid strings the program found are:
Filled strs[0] = "ab"
Filled strs[1] = "acb"
Filled strs[2] = "abcb"
Filled strs[3] = "acab"
Filled strs[4] = "abacb"
Filled strs[5] = "abcab"
Filled strs[6] = "acbab"
Filled strs[7] = "abacab"
Filled strs[8] = "abcacb"
Filled strs[9] = "abcbab"
Filled strs[10] = "acabcb"
Filled strs[11] = "acbcab"
Filled strs[12] = "abacbab"
Filled strs[13] = "abcbacb"
Filled strs[14] = "acabacb"
Filled strs[15] = "acbabcb"
Filled strs[16] = "acbacab"
Filled strs[17] = "acbcacb"
Filled strs[18] = "abacabcb"
Filled strs[19] = "abacbcab"
The program tried these in triplets, always starting with a seed of "a", and
used the members of the triplet to substitute for a,b,c. Whenever the program
could expand 6 times and still have a valid string, it printed out the
triplet. Only one triplet passed the test, so presumably this is the
simplest possible generator. That triplet is:
Suspected good recipe: a->abcab b->acabcb c->acbcacb
The program claims that all six possible triplets obtained by permuting the
assignments pass the 6-level test.
Of course, without more rigourous proof, it's still possible that this
generator shows a duplication at some level deeper than 6, so the same
sort of questions as before:
o Can someone apply Gilbert's proof and hence determine if this
generator is in fact valid ?
Thanks.
/Eric
|
203.96 | "Great minds ..." | CLT::TRACE::GILBERT | Ownership Obligates | Tue Dec 17 1991 18:36 | 10 |
| Nice work, Eric!
Could you apply this approach to the 'permutation-square-free' problem?
(i.e., adjacent words are not permutations of each other).
I checked note .77 to see what 'Gilbert's proof' might be. I found
the following at the bottom:
>The following morphism is what Thue had originally used:
> a -> abcab, b -> acabcb, c -> acbcacb.
|
203.97 | but proof is farther back, right ? | HANNAH::OSMAN | see HANNAH::IGLOO$:[OSMAN]ERIC.VT240 | Wed Dec 18 1991 09:11 | 16 |
|
>I checked note .77 to see what 'Gilbert's proof' might be. I found
>the following at the bottom:
But (your) Gilbert's proof isn't in .77, it's way back in .8, right?
I'd still like to see the proof explicitly applied to the shorter generator
I (and .77) specified, to make the proof a bit more clear.
Thanks.
/Eric
p.s. If anyone wants my program in order to explore the permutation problem,
I'll make you a copy for free.
|