[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

599.0. "String Distances" by CHOVAX::YOUNG (Dr. Memory...) Sat Oct 18 1986 14:03

    Define a function called S, where S returns an integer which is
    the "distance" between 2 strings of characters, equal to 
    the minimum number of characters that must be inserted in, deleted
    from, or changed in s1 to make it identical to s2.
    
    Thus:
    	XY
    	XYZW
    	XYA
    	
    All have a distance of 1 from "XYZ".
    
    Similarily, for any string T:	S (T, "") = Len (T)
    where "" denotes the null string.  Likewise, S is commutative.
    
    I think that this is similar to the notion of 'Hamming' distance,
    except for the addition of insertion/deletion as valid transformation
    operators.
    
    
    What is an efficient algorithim for calculating S (s1, s2) ?
    
    -- Barry
T.RTitleUserPersonal
Name
DateLines
599.1CLT::GILBERTeager like a childSat Oct 18 1986 17:264
    Back around 1980 or so, when DEC introduced ANSI terminals, there
    were a few papers (technical notes?) in CACM about minimal screen
    updating.  There, efficient algorithms were presented to determine
    the distance.
599.2Similar, but...CHOVAX::YOUNGDr. Memory...Sat Oct 18 1986 21:316
    This is a similar kind of distance also, however the "distance"
    of an insert/delete would not normally be that same as the "distance"
    of an overwrite in these schemes.  In the scheme I ask about, these
    two would be strictly the same.
    
    -- Barry
599.3SSDEVO::LARYSun Oct 19 1986 18:059
An efficient approximation to S is given by the algorithm used by the old
PDP-10 SRCCOM program, and more recently in the VAX DIFF program, if you
consider each character of the two strings to be a seperate "line". This
algorithm basically involves finding the first mismatch (in the CMPC sense)
and then looking for the first (meaning minimize the sum of the offsets
in the two strings) N-character match between subsequent substrings and
declare the strings back "in sync" at this point and go back to CMPC-style
matching. This algorithm is not foolproof, as its users will tell you, but
it is fairly efficient and produces good results in non-pathological cases.
599.4 Close = No_Cigar CHOVAX::YOUNGDr. Memory...Mon Oct 20 1986 14:028
    I have worked on programs like these.  As I recall, they are (1)
    heuristic (not exact), and (2) at best run in O(n^2) time, where
    n = S_distance.
    
    Extra Credit for anyone who can demonstrate that this problem defines
    a Metric Space.
    
    -- Barry
599.5Some references (I'll trade you . . .)NOBUGS::AMARTINAlan H. MartinThu Oct 30 1986 12:0121
Re .0:

See (as if you didn't know):

The String-to-String Correction Problem
Robert A. Wagner and Michael J. Fischer
JACM, V21, #1, Jan-74, pp168-173.

Re .1:

If anyone has any references on those screen update papers, I'd be
interested in having them recorded here for future reference.

Re .3:

See also:

A Linear Space Algorithm for Computing Maximal Common Subsequences
D. S. Hirschberg
CACM, V18, #6, Jan-75, pp341-343.
				/AHM