T.R | Title | User | Personal Name | Date | Lines |
---|
99.1 | (x >> 32) & 0xFFFFFFFF | STAR::KLEINSORGE | Fred Kleinsorge, OpenVMS Engineering | Wed May 07 1997 12:56 | 6 |
| Well, why not compile it with /machine and look at the output. One
would hope that they would result in the same code, but the shift is
probably what I would use. It is *much* more intuitive. Also, you
probably want it to be a uint64 unless what you really are looking for
is a signed shift.
|
99.2 | | WIBBIN::NOYCE | Pulling weeds, pickin' stones | Wed May 07 1997 12:59 | 4 |
| First of all, notice that these two statements do different things if
the input is a negative number that isn't an exact multiple of 2**32.
The shift is faster.
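
A minimal sketch of the difference described here, assuming a C99 compiler and a 64-bit signed input. The function names are hypothetical, chosen only for illustration.

```c
#include <stdint.h>

/* Upper part by signed division. C99 and later truncate toward zero. */
static int64_t upper_by_divide(int64_t x)
{
    return x / 4294967296LL;   /* divide by 2**32 */
}

/* Upper part by signed shift. Right-shifting a negative value is
   implementation-defined in C; most compilers shift arithmetically,
   which rounds toward minus infinity instead of toward zero. */
static int64_t upper_by_shift(int64_t x)
{
    return x >> 32;
}
```

With an arithmetic shift, an input of -1 gives 0 from the divide but -1 from the shift. An exact multiple such as -2**32 gives -1 from both, and non-negative inputs always agree.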
|
99.3 | | DECC::OUELLETTE | mudseason into blackfly season | Wed May 07 1997 15:17 | 3 |
| They don't generate the same code (I wonder why).
The shift is much faster.
If you're concerned at all about performance, you should be using VC++ V5.0.
|
99.4 | | DECC::OUELLETTE | mudseason into blackfly season | Wed May 07 1997 15:28 | 1 |
| I noted the code quality problem in GEMGRP::GEM-CODE-QUALITY.NOTE.
|
99.5 | | METALX::SWANSON | Victim of Changes | Wed May 07 1997 17:14 | 39 |
| re: .1
>Well, why not compile it with /machine and look at the output.
I didn't know about that option! I am using nmake on the command line...
I assume that switch works on the command line compiler? Where does the
output go?
>the shift is probably what I would use. It is *much* more intuitive.
I find it more intuitive too, and it's what I used. But I was wondering if
the shift would take 32 clock cycles and the divide only 1.
re: .2
>First of all, notice that these two statements do different things
Yes, I know. I simplified it for entry into this notesfile. It's actually
getting bitwise ANDed with 0xFFFFFFFF after the shift. I'm breaking up
64 bits into upper and lower longwords.
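
The split described here might look roughly like the following, assuming the 64-bit value is held as an unsigned type (as .1 suggests). The helper names are hypothetical.

```c
#include <stdint.h>

/* Upper longword: shift down 32 places, then mask to 32 bits.
   The mask is redundant for an unsigned 64-bit input but matches
   the expression in the note title. */
static uint32_t upper_longword(uint64_t offset)
{
    return (uint32_t)((offset >> 32) & 0xFFFFFFFFu);
}

/* Lower longword: keep only the low 32 bits. */
static uint32_t lower_longword(uint64_t offset)
{
    return (uint32_t)(offset & 0xFFFFFFFFu);
}
```

The two halves would then be passed on as the low and high distance arguments of the seek.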
re: .3
>The shift is much faster.
>If you're concerned at all about performance, you should be using VC++ V5.0.
Well it's not that much of a concern really... In fact the two 32 bit values
that I wind up with are used in a seek with SetFilePointer, and considering
how long the seek will take, I'm sure the performance hit will not be noticed!
I was mainly wondering for my own benefit as I've thought about it at
different times, but never really bothered to find out for sure.
I always use the shifts since it does make more sense!
Thanks for the info.
Ken
|
99.6 | Last machine to shift that slowly was the 780, I think | WIBBIN::NOYCE | Pulling weeds, pickin' stones | Wed May 07 1997 17:21 | 5 |
| The shift takes 1 cycle, or 2 on an EV4 (21064).
The divide will take somewhere around 40 cycles, since the compiler
currently does it with a subroutine call. Even if it were optimized,
the signed divide would be about 4 cycles longer than the shift.
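
One common way to turn a signed divide by 2**32 into a shift, sketched here as an assumption about what an optimized sequence could look like (not a claim about what the compiler actually emits). The extra compare and add are where the few additional cycles would come from. The function name is hypothetical, and the final shift assumes the compiler shifts negative values arithmetically.

```c
#include <stdint.h>

/* Signed divide by 2**32 that truncates toward zero, built from a shift.
   Negative inputs get a bias of 2**32 - 1 added first so that the
   floor-style arithmetic shift lands on the truncated quotient. */
static int64_t divide_by_2_to_32(int64_t x)
{
    int64_t bias = (x < 0) ? 4294967295LL : 0;
    return (x + bias) >> 32;   /* assumes arithmetic right shift */
}
```

The addition cannot overflow, since the bias is only added to negative values.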
|
99.7 | | METALX::SWANSON | Victim of Changes | Wed May 07 1997 17:46 | 16 |
| >The shift takes 1 cycle, or 2 on an EV4 (21064)
Okay. I guess I should have had more faith in the Alpha chip huh? :')
>The divide will take somewhere around 40 cycles,
40 Cycles?! I thought I heard once that starting with EV45, floating point
operations took only 6 cycles. I guess this is wrong if integer division takes
40!
I looked at the note entered in the GEM notesfile due to my question, and found
it interesting that the 4 Gig constant is "created" in the resulting machine
code by shifting 1 to the left 32 places! So a 32 bit shift is done, as well
as the divide in that case!
Ken
|
99.8 | Why so slow... | CADSYS::GROSS | The bug stops here | Wed May 07 1997 18:31 | 9 |
| My understanding is that the Alpha architecture does not include an
integer divide instruction. To achieve division, the software must
convert the integer to float, do the division in float, and convert
the answer back to integer. That is why it might be done by subroutine.
The compiler plays all kinds of tricks to avoid the need for division.
That is why it is likely to replace divide-by-a-power-of-two with a
shift.
Dave
|
99.9 | | WIBBIN::NOYCE | Pulling weeds, pickin' stones | Wed May 07 1997 19:01 | 6 |
| Actually, there's no hardware that can do a 64-bit divide -- the floating-point
hardware only gives you 53 bits of precision. So the 64-bit subroutine
is a bit more complicated... Even for 32 bits, it turns out to usually be
faster not to use the floating-point divide on current Alpha processors,
because FP divide is relatively slow, and because it takes a long time to
get the data to and from the FP register set.
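
A small illustration of the 53-bit limit, assuming IEEE double precision (which is what C's double is on the platforms discussed here). The helper name is hypothetical.

```c
#include <stdint.h>

/* Returns 1 if v converts to double and back unchanged, 0 otherwise.
   Only call this with values below 2**63, so the conversion back to
   an integer is always in range. */
static int survives_double_round_trip(uint64_t v)
{
    return (uint64_t)(double)v == v;
}
```

2**53 survives the round trip, but 2**53 + 1 does not, because a double has no bit left to hold the final 1. That is why a 64-bit quotient cannot simply be taken from one floating-point divide.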
|
99.10 | | DECCXL::OUELLETTE | mudseason into blackfly season | Wed May 07 1997 20:33 | 2 |
| With Visual C++ you ask for a machine code listing with -FAcs.
You'll get a file with a .cod suffix. Use a 133 column wide editor.
|
99.11 | | METALX::SWANSON | Victim of Changes | Thu May 08 1997 18:36 | 14 |
| Thanks for the replies.
.9 is interesting info.
I always thought the Alpha CPU was supposed to be extremely fast
for floating point. Was I wrong about that 6 cycle thing I mentioned
in my previous note?
I have run Povray on a few Alphas and compared to 486s and Pentiums
it seemed like the Alphas were disproportionally faster for FP than
integer stuff. ...and this was a couple years ago.
Ken
|
99.12 | the 6 cycles was probably add, sub, mult and friends | DECC::OUELLETTE | mudseason into blackfly season | Thu May 08 1997 20:38 | 5 |
| Divide doesn't pipeline well. Cray left it out of his architecture
in favor of a Reciprocal Approximation instruction which does pipeline
and can be used to implement divide. Nobody's divide is particularly
fast... integer or floating. Alpha's still a lot faster than x86
for a number of reasons though.
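
A rough sketch of how a divide can be built from a reciprocal approximation plus multiplies and subtracts, all of which pipeline. This uses Newton-Raphson refinement with a linear seed in software; it is an illustration under those assumptions, not the Cray instruction sequence. The function names are hypothetical, and the input is assumed to be finite, positive, and of ordinary magnitude.

```c
/* Approximate 1/d for finite d > 0 using only multiply and subtract. */
static double reciprocal_newton(double d)
{
    double m = d, scale = 1.0, r;
    int i;

    /* Scale d into [0.5, 1) by exact powers of two; d == m / scale. */
    while (m >= 1.0) { m *= 0.5; scale *= 0.5; }
    while (m < 0.5)  { m *= 2.0; scale *= 2.0; }

    /* Linear seed for 1/m; relative error is at most 1/17. */
    r = 48.0 / 17.0 - (32.0 / 17.0) * m;

    /* Each Newton-Raphson step roughly squares the relative error. */
    for (i = 0; i < 4; ++i)
        r = r * (2.0 - m * r);

    return r * scale;   /* 1/d == scale * (1/m) */
}

/* Divide implemented as multiply by the refined reciprocal. */
static double divide_via_reciprocal(double n, double d)
{
    return n * reciprocal_newton(d);
}
```

The result is close to, but not guaranteed identical to, a correctly rounded hardware divide.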
|
99.13 | Most FP ops take 4 cycles today, but divide is special | WIBBIN::NOYCE | Pulling weeds, pickin' stones | Fri May 09 1997 09:21 | 32 |
| Division is fundamentally more difficult than addition or multiplication.
It also occurs less frequently in applications. It is normally implemented
in hardware using an iterative algorithm that produces one or a few quotient
bits each cycle, after a startup time. By adding more hardware, you can
produce more bits per cycle, and so reduce the number of cycles needed for
a full result -- but you need to trade that chip area against other uses, such
as larger caches or faster integer multiply.
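
The simplest form of such an iterative algorithm, producing one quotient bit per step, can be sketched in software as shift-and-subtract (restoring) division. This is an illustration of the idea only, not a description of any particular Alpha divider; the type and function names are hypothetical.

```c
#include <stdint.h>

struct divmod64 {
    uint64_t quotient;
    uint64_t remainder;
};

/* Unsigned 64-bit divide, one quotient bit per loop iteration.
   The divisor d must be nonzero. */
static struct divmod64 divide_bit_serial(uint64_t n, uint64_t d)
{
    struct divmod64 out = { 0, 0 };
    int i;

    for (i = 63; i >= 0; --i) {
        /* Remember the bit that the shift below pushes out the top. */
        uint64_t carry = out.remainder >> 63;

        /* Bring down the next dividend bit, most significant first. */
        out.remainder = (out.remainder << 1) | ((n >> i) & 1u);

        /* If the divisor fits, subtract it and set this quotient bit. */
        if (carry || out.remainder >= d) {
            out.remainder -= d;
            out.quotient |= (uint64_t)1 << i;
        }
    }
    return out;
}
```

Sixty-four iterations for a 64-bit quotient is the software analogue of the cycle counts quoted below; retiring several bits per step shortens the loop at the cost of more logic per step.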
EV4 (21064) takes 6 cycles for floating add, subtract, multiply, convert, move,
and conditional-move. All these operations are pipelined: you can start a new
one in every cycle. But single-precision divide takes about 30 cycles, and
double-precision divide takes about 60, and these are not pipelined: you can't
start a new divide until the previous divide has completed (though you can
start other operations, including other floating-point operations).
EV45 (21064A) keeps the 6-cycle pipeline, but speeds up divide, to about
20 cycles for single-precision and about 30 for double-precision. Because
of the algorithm used, the time depends on the exact data values.
EV5 (21164) improves the basic floating-point pipeline to 4 cycles, and provides
a separate pipe for multiply, but executes divides just like EV45. Its
derivatives, EV56 (21164 aka 21164A) and PCA56 (21164PC) are just like EV5 as
far as floating-point is concerned.
EV6 (21264) improves divide substantially, to perhaps half as many cycles as
EV5 -- still not pipelined. It adds a set of SQRT instructions that take
about the same time as an EV5 divide -- also not pipelined, though you can
be processing a divide and a SQRT concurrently.
Alpha's floating-point advantage over x86 comes from a combination of short
latencies, pipelining, high clock rate, and wide off-chip paths to cache and
memory.
|