[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference rusure::math

Title:	Mathematics at DEC

Moderator:	RUSURE::EDP

Created:	Mon Feb 03 1986
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	2083
Total number of notes:	14613

599.0. "String Distances" by CHOVAX::YOUNG (Dr. Memory...) Sat Oct 18 1986 13:03

    Define a function called S, where S returns an integer which is
    the "distance" between 2 strings of characters, equal to 
    the minimum number of characters that must be inserted in, deleted
    from, or changed in s1 to make it identical to s2.
    
    Thus:
    	XY
    	XYZW
    	XYA
    	
    All have a distance of 1 from "XYZ".
    
    Similarily, for any string T:	S (T, "") = Len (T)
    where "" denotes the null string.  Likewise, S is commutative.
    
    I think that this is similar to the notion of 'Hamming' distance,
    except for the addition of insertion/deletion as valid transformation
    operators.
    
    
    What is an efficient algorithim for calculating S (s1, s2) ?
    
    -- Barry

T.R	Title	User	Personal Name	Date	Lines
599.1		CLT::GILBERT	eager like a child	`Sat Oct 18 1986 16:26`	4
	Back around 1980 or so, when DEC introduced ANSI terminals, there were a few papers (technical notes?) in CACM about minimal screen updating. There, efficient algorithms were presented to determine the distance.
599.2	Similar, but...	CHOVAX::YOUNG	Dr. Memory...	`Sat Oct 18 1986 20:31`	6
	This is a similar kind of distance also, however the "distance" of an insert/delete would not normally be that same as the "distance" of an overwrite in these schemes. In the scheme I ask about, these two would be strictly the same. -- Barry
599.3		SSDEVO::LARY		`Sun Oct 19 1986 17:05`	9
	An efficient approximation to S is given by the algorithm used by the old PDP-10 SRCCOM program, and more recently in the VAX DIFF program, if you consider each character of the two strings to be a seperate "line". This algorithm basically involves finding the first mismatch (in the CMPC sense) and then looking for the first (meaning minimize the sum of the offsets in the two strings) N-character match between subsequent substrings and declare the strings back "in sync" at this point and go back to CMPC-style matching. This algorithm is not foolproof, as its users will tell you, but it is fairly efficient and produces good results in non-pathological cases.
599.4	Close = No_Cigar	CHOVAX::YOUNG	Dr. Memory...	`Mon Oct 20 1986 13:02`	8
	I have worked on programs like these. As I recall, they are (1) heuristic (not exact), and (2) at best run in O(n^2) time, where n = S_distance. Extra Credit for anyone who can demonstrate that this problem defines a Metric Space. -- Barry
599.5	Some references (I'll trade you . . .)	NOBUGS::AMARTIN	Alan H. Martin	`Thu Oct 30 1986 12:01`	21
	Re .0: See (as if you didn't know): The String-to-String Correction Problem Robert A. Wagner and Michael J. Fischer JACM, V21, #1, Jan-74, pp168-173. Re .1: If anyone has any references on those screen update papers, I'd be interested in having them recorded here for future reference. Re .3: See also: A Linear Space Algorithm for Computing Maximal Common Subsequences D. S. Hirschberg CACM, V18, #6, Jan-75, pp341-343. /AHM