Conference turris::digital_unix

Title: DIGITAL UNIX (FORMERLY KNOWN AS DEC OSF/1)
Notice: Welcome to the Digital UNIX Conference
Moderator: SMURF::DENHAM
Created: Thu Mar 16 1995
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 10068
Total number of notes: 35879

9965.0. "Numeric sort not working" by BIGUN::nessus.cao.dec.com::Mayne (Meanwhile, back on Earth...) Wed May 28 1997 00:41

I'm trying to sort a list of TCP/IP addresses (UNIX versions 3.2C and 4.0A):

# cat y.y
168.132.11.1
168.132.11.25
168.132.11.3
168.132.128.28
168.132.13.10
168.132.13.2
# sort -t . -k 1n -k 2n -k 3n -k 4n y.y
168.132.11.1
168.132.11.25
168.132.11.3
168.132.128.28
168.132.13.10
168.132.13.2

What I should be seeing is:

168.132.11.1
168.132.11.3
168.132.11.25
168.132.13.2
168.132.13.10
168.132.128.28

I'm obviously missing something trivial, in that the sort is lexicographic rather
than numeric, even though I specified n on the key modifiers. (Putting -n at the
beginning doesn't help either.) What am I doing wrong?

PJDM
9965.1. "Decimal point is a decimal point?" by MARVIN::GOUGH (Raoul Gough) Wed May 28 1997 10:26 (10 lines)
Looks to me like sort doesn't work with "." as a field separator (maybe
because it's a decimal point?).  Anyway, you could work around it like
this:

sed 's/\./:/g' y.y | sort -t: -k 1n -k 2n -k 3n -k 4n | sed 's/:/./g'

(Couldn't resist looking at this one; I mustn't be busy enough.)

Ray.
9965.2. "yep - it is confused" by SMURF::WENDY Wed May 28 1997 11:44 (12 lines)
    I was curious too, and I agree: the numeric sort relies on the
    decimal/radix symbol, and that collides with your use of '.' as the
    field separator. If, for example, you set LANG=es_ES.ISO8859-1, the
    Spanish locale that has ',' as its radix symbol, and use your sort
    command as is, you get the expected results. I think you will need to
    either process the file to change the separator or use a different
    locale, but make sure you know what the decimal point is. You can tell
    by running the command "locale -k LC_NUMERIC".
    
    wendy
    UEG/I18N 
    
9965.3. "It's not the decimal point (exactly)!" by TEACH::SMITTY (Daylight come an' me wan' go home!) Wed May 28 1997 14:05 (59 lines)
    The problem is not with the decimal point -- at least not as described
    in the previous replies.  The command you are attempting is actually a
    victim of two problems.  The first is the periods in the numbers.  A
    numeric value is not allowed to have more than one period in it.  From
    the man pages for sort:
    
    "-n	[XPG4-UNIX]  Sorts any initial numeric strings (including regular
    	expressions consisting of optional spaces, optional dashes, and
    	zero (0) or more digits with optional radix character and thousands
    	separator, as defined by the current locale) by arithmetic value. 
    	An empty digit string is treated as zero; leading zeros and signs
    	on zeros do not affect ordering.  Only one period (.) can be used
    	in numeric strings.  All subsequent periods (.) and any character
    	to the right of the period (.) will be ignored."
    
    The second problem is the way the keys are specified.  The notation
    "-k1n" describes a key that starts with the first word on the line and
    ends with the last character.  Again from the sort man pages:
    
    	"[XPG4-UNIX]  The format of a key field definition is as follows:
    		field_start[type][,field_end[type]]
    	where the field_start and field_end arguments define a key field
    	that is restricted to a portion of the line, and type is a modifier
    	specified by b, d, f, i, n, or r...
    
    	...A missing field_end argument means the last character of the
    	line."
    
    The unfortunate combination of these two things is the cause of the
    strange behavior that we are all seeing.  I'm guessing here, but I
    suspect that sort processes the first key numerically, assumes all the
    periods it sees are part of the number, ignores everything from the
    second period to the end of the line, orders all the records as it
    found them (since they have all been evaluated as identical 168.132
    values), and somehow never correctly identifies the existence of the
    other keys.
    
    By isolating each field, however, using the XPG4-UNIX syntax
    "-k<start>,<end>", the problems go away.  Here's an example from my
    system:
    
    ahem- cat numbers.txt
    168.132.11.1
    168.132.11.25
    168.132.11.3
    168.132.128.28
    168.132.13.10
    168.132.13.2
    ahem- sort -t . -k1,1n -k2,2n -k3,3n -k4,4n numbers.txt
    168.132.11.1
    168.132.11.3
    168.132.11.25
    168.132.13.2
    168.132.13.10
    168.132.128.28
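
A reusable form of the command above might look like the sketch below. The
function name sort_ipv4 and the LC_ALL=C prefix are assumptions added here,
not part of the original reply; pinning the locale simply keeps the radix
character predictable.

```shell
# Hypothetical helper: sort dotted-quad addresses numerically, field by field.
# Each "-k N,Nn" key starts and ends in field N, so "." only ever acts as
# the field separator and never as a decimal point inside a key.
sort_ipv4() {
    LC_ALL=C sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n "$@"
}

# Example: read addresses from standard input.
printf '%s\n' 168.132.11.25 168.132.128.28 168.132.13.2 168.132.11.3 | sort_ipv4
```

Because the function passes its arguments through to sort, it can also be
given a file name, as in "sort_ipv4 numbers.txt".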
    
    Regards,
    
    Bill
9965.4. "It's doing what the spec says" by WIBBIN::NOYCE (Pulling weeds, pickin' stones) Wed May 28 1997 19:29 (13 lines)
> and somehow never correctly identifies the existence of the
>     other keys.

No, you don't have to assume that it forgets the other keys.
Instead, it sorts the file as if it looks like

168.132  132.11   11.1    1
168.132  132.11   11.25  25
168.132  132.11   11.3    3
168.132  132.128 128.28  28
168.132  132.13   13.10  10
168.132  132.13   13.2    2
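
The columns above can be reproduced with a small sketch. The function name
effective_key is hypothetical, and the cut pipeline only models the rule
quoted from the man page (a key with no end field runs to the end of the
line, and the number stops at the second period); it is not a description
of how sort is implemented.

```shell
# Hypothetical model of what "sort -t . -k Nn" compares for one line:
#   1. the key runs from field N to the end of the line
#   2. the numeric value stops at the second period inside that key
# Arguments: $1 = input line, $2 = starting field number N.
effective_key() {
    printf '%s\n' "$1" | cut -d . -f "$2"- | cut -d . -f 1-2
}

# Print the four values compared for one address, one per line.
for n in 1 2 3 4; do
    effective_key 168.132.128.28 "$n"
done
```

Under this model the first key is 168.132 for every line, so the second
key decides the order, and 132.128 sorts before 132.13 as a decimal number.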

9965.5. by BIGUN::nessus.cao.dec.com::Mayne (Meanwhile, back on Earth...) Wed May 28 1997 20:21 (4 lines)
So what does the spec say about using "." as a separator? (Hey, it's not *my* 
use of the "." that's stuffing things up. 8-)

PJDM
9965.6. by BIGUN::nessus.cao.dec.com::Mayne (Meanwhile, back on Earth...) Thu May 29 1997 19:53 (18 lines)
If sort works as in .4, and "all subsequent periods (.) and any character to the 
right of the period (.) will be ignored" (according to the man page), how is the 
following explained?

# cat a.a
1.2
1.22
1.3
# sort -t . -k 2n a.a
1.2
1.3
1.22
# sort -t . -k 1n -k 2n a.a
1.2
1.22
1.3

PJDM
9965.7. by IOSG::MARSHALL Fri May 30 1997 09:43 (28 lines)
>> how is the following explained?

The order of keys is significant.  Sorting on a later key won't undo the
ordering obtained by earlier sorts.  The sorting done by the later key only
affects consecutive records where the previous keys had identical values.

From the sort(1) man page:

      When there are multiple key fields, later keys are compared only after
      all earlier keys compare as equal.

This is generally desirable behaviour, enabling you to sort, for example,
forenames within surnames, or in this case subnets within a network.

As the -k1n had sorted the whole of each record numerically, and all the values
were distinct, the -k2n effectively did nothing.  As a previous reply observed,
to obtain the sorting required by the basenoter you have to tell the sort
command where each key ends with respect to the defined separator, e.g. -k1,1n.

So -k1n means:   the key starts with the first field and extends to the end of
                 the line; a '.' within this range will be treated as a
                 decimal point, not as a field separator

   -k1,1n means: the key field starts and ends with the first field, enabling
                 the use of '.' as field separator to override the use of '.'
                 as a decimal point within a numeric field.
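
The difference between the two forms can be checked against the three-line
a.a data from the earlier reply. This is a sketch that assumes the C locale
and a POSIX-style sort; the function names sort_unbounded and sort_bounded
are made up for the comparison.

```shell
# Keys without an end field: key 1 spans the whole line, so the records
# compare as the decimal numbers 1.2, 1.22 and 1.3.
sort_unbounded() {
    LC_ALL=C sort -t . -k 1n -k 2n "$@"
}

# Keys bounded to a single field: key 1 is "1" on every line, so key 2
# (2, 22, 3 as integers) decides the order.
sort_bounded() {
    LC_ALL=C sort -t . -k 1,1n -k 2,2n "$@"
}

echo "unbounded keys:"
printf '%s\n' 1.2 1.22 1.3 | sort_unbounded
echo "bounded keys:"
printf '%s\n' 1.2 1.22 1.3 | sort_bounded
```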

Scott