
Conference vaxaxp::alphanotes

Title: Alpha Support Conference
Notice: This is a new Alphanotes, please read note 2.2
Moderator: VAXAXP::BERNARDO
Created: Thu Jan 02 1997
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 128
Total number of notes: 617

99.0. "To shift or divide?" by METALX::SWANSON (Victim of Changes) Wed May 07 1997 12:31

Hi,

If I'm using Visual C++ 4.2, and I want to shift a 64-bit int to the right
32 bits, is it quicker to divide it by 4 Gig than to shift it to the right
32 bits?

In other words, which is quicker?:

__int64  i, x = 12345678901234;

i = x / 4294967296;   // This
i = x >> 32;          // or this?

Thanks,

Ken
99.1. "(x >> 32) & 0xFFFFFFFF" by STAR::KLEINSORGE (Fred Kleinsorge, OpenVMS Engineering) Wed May 07 1997 12:56
    Well, why not compile it with /machine and look at the output.  One
    would hope that they would result in the same code, but the shift is
    probably what I would use.  It is *much* more intuitive.  Also, you
    probably want it to be a uint64 unless what you really are looking for
    is a signed shift.
    
99.2. by WIBBIN::NOYCE (Pulling weeds, pickin' stones) Wed May 07 1997 12:59
First of all, notice that these two statements do different things if
the input is a negative number that isn't an exact multiple of 2**32.

The shift is faster.
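
A minimal sketch of the difference described above, written in standard C. It uses int64_t from <stdint.h> in place of Visual C++'s __int64, and the function names are made up for illustration. Right-shifting a negative signed value is implementation-defined in C; the sketch assumes the common arithmetic (sign-propagating) shift.

```c
#include <stdint.h>

/* Divide by 2**32. C integer division truncates toward zero. */
int64_t div_by_4gig(int64_t x) {
    return x / 4294967296LL;
}

/* Shift right by 32 bits. Assumption: the compiler performs an arithmetic
   shift on signed values, which rounds toward negative infinity. */
int64_t shift_by_32(int64_t x) {
    return x >> 32;
}
```

Under that assumption the two agree for non-negative inputs and for exact multiples of 2**32, but for an input such as -1 the divide gives 0 while the shift gives -1.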

99.3. by DECC::OUELLETTE (mudseason into blackfly season) Wed May 07 1997 15:17
They don't generate the same code (I wonder why).
The shift is much faster.
If you're concerned at all about performance, you should be using VC++ V5.0.

99.4. by DECC::OUELLETTE (mudseason into blackfly season) Wed May 07 1997 15:28
I noted the code quality problem in GEMGRP::GEM-CODE-QUALITY.NOTE.

99.5. by METALX::SWANSON (Victim of Changes) Wed May 07 1997 17:14
re: .1

>Well, why not compile it with /machine and look at the output.

I didn't know about that option!  I am using nmake on the command line...
I assume that switch works on the command line compiler?   Where does the
output go?

>the shift is probably what I would use.  It is *much* more intuitive.

I find it more intuitive too, and it's what I used.  But I was wondering if
the shift would take 32 clock cycles and the divide only 1.  

re: .2

>First of all, notice that these two statements do different things

Yes, I know.  I simplified it for entry into this notesfile.  It's actually
getting bitwise ANDed with 0xFFFFFFFF after the shift.  I'm breaking up
64 bits into upper and lower longwords.
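
A hedged sketch of that split, in standard C rather than the Visual C++ types used in the original program. The function names are hypothetical. Using an unsigned 64-bit value sidesteps the signed-shift question entirely, and the mask on the upper half becomes unnecessary.

```c
#include <stdint.h>

/* Upper longword: bits 63..32 of the 64-bit value. */
uint32_t high_longword(uint64_t v) {
    return (uint32_t)(v >> 32);
}

/* Lower longword: bits 31..0 of the 64-bit value. */
uint32_t low_longword(uint64_t v) {
    return (uint32_t)(v & 0xFFFFFFFFu);
}
```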

re: .3

>The shift is much faster.
>If you're concerned at all about performance, you should be using VC++ V5.0.

Well it's not that much of a concern really...  In fact the two 32 bit values
that I wind up with are used in a seek with SetFilePointer, and considering
how long the seek will take, I'm sure the performance hit will not be noticed!

I was mainly wondering for my own benefit as I've thought about it at
different times, but never really bothered to find out for sure.

I always use the shifts since it does make more sense!

Thanks for the info.

Ken

99.6. "Last machine to shift that slowly was the 780, I think" by WIBBIN::NOYCE (Pulling weeds, pickin' stones) Wed May 07 1997 17:21
The shift takes 1 cycle, or 2 on an EV4 (21064).

The divide will take somewhere around 40 cycles, since the compiler
currently does it with a subroutine call.  Even if it were optimized,
the signed divide would be about 4 cycles longer than the shift.
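
For illustration, one well-known way to turn a signed divide by 2**32 into a shift is to add a bias to negative inputs first, which is where a few extra instructions come from. This is a sketch of the general technique, not necessarily the sequence the compiler would emit; the function name is made up, and it assumes an arithmetic right shift on signed values.

```c
#include <stdint.h>

/* Signed divide by 2**32 without a divide instruction.
   Assumption: >> on a negative int64_t is an arithmetic shift. */
int64_t div_by_4gig_via_shift(int64_t x) {
    /* All ones if x is negative, zero otherwise. */
    int64_t sign_mask = x >> 63;
    /* Add 2**32 - 1 to negative inputs so the shift truncates toward zero. */
    int64_t bias = sign_mask & 0xFFFFFFFFLL;
    return (x + bias) >> 32;
}
```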

99.7. by METALX::SWANSON (Victim of Changes) Wed May 07 1997 17:46
>The shift takes 1 cycle, or 2 on an EV4 (21064)

Okay.  I guess I should have had more faith in the Alpha chip, huh?  :')

>The divide will take somewhere around 40 cycles,

40 Cycles?!  I thought I heard once that starting with EV45, floating point
operations took only 6 cycles.  I guess this is wrong if integer division takes
40!

I looked at the note entered in the GEM notesfile due to my question, and found
it interesting that the 4 Gig constant is "created" in the resulting machine 
code by shifting 1 to the left 32 places!  So a 32 bit shift is done, as well
as the divide in that case!

Ken

99.8. "Why so slow..." by CADSYS::GROSS (The bug stops here) Wed May 07 1997 18:31
My understanding is that the Alpha architecture does not include an
integer divide instruction. To achieve division, the software must
convert the integer to float, do the division in float, and convert
the answer back to integer. That is why it might be done by subroutine.
The compiler plays all kinds of tricks to avoid the need for division.
That is why it is likely to replace divide-by-a-power-of-two with a
shift.

Dave

99.9. by WIBBIN::NOYCE (Pulling weeds, pickin' stones) Wed May 07 1997 19:01
Actually, there's no hardware that can do a 64-bit divide -- the floating-point
hardware only gives you 53 bits of precision.  So the 64-bit subroutine
is a bit more complicated...  Even for 32 bits, it turns out to usually be
faster not to use the floating-point divide on current Alpha processors,
because FP divide is relatively slow, and because it takes a long time to
get the data to and from the FP register set.
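
The 53-bit limit is easy to see from C. The sketch below assumes the platform's double is an IEEE 754 double with a 53-bit significand; the function name is invented for illustration.

```c
#include <stdint.h>

/* Convert a 64-bit integer to double and back.
   Assumption: double has a 53-bit significand, so integers above 2**53
   cannot all be represented exactly. */
int64_t round_trip_through_double(int64_t x) {
    /* volatile keeps the value in a true double rather than a wider register. */
    volatile double d = (double)x;
    return (int64_t)d;
}
```

With that assumption, 2**53 (9007199254740992) survives the round trip, while 2**53 + 1 does not, so a 64-bit quotient cannot simply be computed in floating point.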

99.10. by DECCXL::OUELLETTE (mudseason into blackfly season) Wed May 07 1997 20:33
With Visual C++ you ask for a machine code listing with -FAcs.
You'll get a file with a .cod suffix.  Use a 133 column wide editor.

99.11. by METALX::SWANSON (Victim of Changes) Thu May 08 1997 18:36
    Thanks for the replies.
    
    .9 is interesting info. 
    
    I always thought the Alpha CPU was supposed to be extremely fast
    for floating point.  Was I wrong about that 6 cycle thing I mentioned
    in my previous note?
    
    I have run Povray on a few Alphas and compared to 486s and Pentiums
    it seemed like the Alphas were disproportionately faster for FP than
    integer stuff.   ...and this was a couple years ago.
    
    Ken
    
99.12. "the 6 cycles was probably add, sub, mult and friends" by DECC::OUELLETTE (mudseason into blackfly season) Thu May 08 1997 20:38
Divide doesn't pipeline well.  Cray left it out of his architecture
in favor of a Reciprocal Approximation instruction which does pipeline
and can be used to implement divide.  Nobody's divide is particularly
fast...  integer or floating.  Alpha's still a lot faster than x86
for a number of reasons though.
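
As a rough illustration of how a reciprocal approximation can be turned into a divide, here is the textbook Newton-Raphson refinement in C. This is a sketch of the general idea only, not a description of any particular machine's instruction; the function names and the caller-supplied initial guess are assumptions. Each step uses only multiplies and a subtract, which pipeline well.

```c
/* One Newton-Raphson step: improve an estimate r of 1/d.
   The error roughly squares on every step. */
double refine_reciprocal(double d, double r) {
    return r * (2.0 - d * r);
}

/* Compute n/d as n * (1/d), starting from a rough guess at 1/d.
   The guess must be close enough for the iteration to converge
   (between 0 and 2/d for positive d). */
double divide_via_reciprocal(double n, double d, double initial_guess, int steps) {
    double r = initial_guess;
    int i;
    for (i = 0; i < steps; i++) {
        r = refine_reciprocal(d, r);
    }
    return n * r;
}
```

For example, starting from a guess of 0.2 for 1/4, a handful of steps is enough to reach double precision.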

99.13. "Most FP ops take 4 cycles today, but divide is special" by WIBBIN::NOYCE (Pulling weeds, pickin' stones) Fri May 09 1997 09:21
Division is fundamentally more difficult than addition or multiplication.
It also occurs less frequently in applications.  It is normally implemented
in hardware using an iterative algorithm that produces one or a few quotient
bits each cycle, after a startup time.  By adding more hardware, you can
produce more bits per cycle, and so reduce the number of cycles needed for
a full result -- but you need to trade that chip area against other uses, such
as larger caches or faster integer multiply.
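
The "one quotient bit per cycle" idea can be sketched in software as classic shift-and-subtract (restoring) division. This is only an illustration of the iterative principle, not how any Alpha divider or library routine is actually built; the type and function names are made up.

```c
#include <stdint.h>

typedef struct {
    uint64_t quotient;
    uint64_t remainder;
} div_result;

/* Restoring division: each loop iteration produces one quotient bit,
   most significant first. The caller must ensure divisor != 0. */
div_result divide_bit_by_bit(uint64_t dividend, uint64_t divisor) {
    div_result out;
    int bit;
    out.quotient = 0;
    out.remainder = 0;
    for (bit = 63; bit >= 0; bit--) {
        /* Bring down the next bit of the dividend. */
        out.remainder = (out.remainder << 1) | ((dividend >> bit) & 1u);
        /* If the divisor fits, subtract it and record a 1 bit. */
        if (out.remainder >= divisor) {
            out.remainder -= divisor;
            out.quotient |= (uint64_t)1 << bit;
        }
    }
    return out;
}
```

Hardware that retires several quotient bits per cycle is doing the equivalent of several of these iterations at once, at the cost of more logic.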

EV4 (21064) takes 6 cycles for floating add, subtract, multiply, convert, move,
and conditional-move.  All these operations are pipelined: you can start a new
one in every cycle.  But single-precision divide takes about 30 cycles, and
double-precision divide takes about 60, and these are not pipelined: you can't
start a new divide until the previous divide has completed (though you can
start other operations, including other floating-point operations).
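
To put rough numbers on why pipelining matters, here is a deliberately idealized cycle count in C. It assumes a stream of independent operations with no stalls or issue restrictions, which real code rarely achieves; the function names are hypothetical.

```c
/* Pipelined unit: a new operation starts every cycle, so n independent
   operations finish after latency + (n - 1) cycles. */
unsigned long pipelined_cycles(unsigned long n, unsigned long latency) {
    return n == 0 ? 0 : latency + (n - 1);
}

/* Non-pipelined unit: each operation must finish before the next starts. */
unsigned long unpipelined_cycles(unsigned long n, unsigned long latency) {
    return n * latency;
}
```

Under this model, ten independent 6-cycle multiplies complete in 15 cycles, while ten 60-cycle double-precision divides take 600.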

EV45 (21064A) keeps the 6-cycle pipeline, but speeds up divide, to about
20 cycles for single-precision and about 30 for double-precision.  Because
of the algorithm used, the time depends on the exact data values.

EV5 (21164) improves the basic floating-point pipeline to 4 cycles, and provides
a separate pipe for multiply, but executes divides just like EV45.  Its
derivatives, EV56 (21164 aka 21164A) and PCA56 (21164PC) are just like EV5 as
far as floating-point is concerned.

EV6 (21264) improves divide substantially, to perhaps half as many cycles as
EV5 -- still not pipelined.  It adds a set of SQRT instructions that take
about the same time as an EV5 divide -- also not pipelined, though you can
be processing a divide and a SQRT concurrently.

Alpha's floating-point advantage over x86 comes from a combination of short
latencies, pipelining, high clock rate, and wide off-chip paths to cache and
memory.