
Conference vaxaxp::alphanotes

Title:	Alpha Support Conference
Notice:	This is a new Alphanotes, please read note 2.2
Moderator:	VAXAXP::BERNARDO

Created:	Thu Jan 02 1997
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	128
Total number of notes:	617

99.0. "To shift or divide?" by METALX::SWANSON (Victim of Changes) Wed May 07 1997 12:31

Hi,

If I'm using Visual C++ 4.2, and I want to shift a 64 bit int to the right
32 bits, is it quicker to divide it by 4 Gig than to shift it to the right
32 bits?

In other words, which is quicker?:

__int64  i, x = 12345678901234;

i = x / 4294967296;   // This
i = x >> 32;          // or this?

Thanks,

Ken

T.R	Title	User	Personal Name	Date	Lines
99.1	(x >> 32) & 0xFFFFFFFF	STAR::KLEINSORGE	Fred Kleinsorge, OpenVMS Engineering	`Wed May 07 1997 12:56`	6
	Well, why not compile it with /machine and look at the output. One would hope that they would result in the same code, but the shift is probably what I would use. It is much more intuitive. Also, you probably want it to be a uint64 unless what you really are looking for is a signed shift.
99.2		WIBBIN::NOYCE	Pulling weeds, pickin' stones	`Wed May 07 1997 12:59`	4
	First of all, notice that these two statements do different things if the input is a negative number that isn't an exact multiple of 2**32. The shift is faster.
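The difference described in .2 can be seen with a small sketch. The function names `div_by_4gig` and `shift_by_32` are made up for illustration, and the shift result for negative values assumes the compiler uses an arithmetic (sign-propagating) right shift, which C leaves implementation-defined.

```c
#include <stdint.h>

/* Quotient by 2**32. C division truncates toward zero. */
int64_t div_by_4gig(int64_t x)
{
    return x / 4294967296LL;
}

/* Right shift by 32. For negative x this is implementation-defined in C;
   assuming an arithmetic shift, it rounds toward minus infinity instead. */
int64_t shift_by_32(int64_t x)
{
    return x >> 32;
}
```

For non-negative inputs, and for negative exact multiples of 2**32, the two agree. For an input such as -1 the divide gives 0 while an arithmetic shift gives -1.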
99.3		DECC::OUELLETTE	mudseason into blackfly season	`Wed May 07 1997 15:17`	3
	They don't generate the same code (I wonder why). The shift is much faster. If you're concerned at all about performance, you should be using VC++ V5.0.
99.4		DECC::OUELLETTE	mudseason into blackfly season	`Wed May 07 1997 15:28`	1
	I noted the code quality problem in GEMGRP::GEM-CODE-QUALITY.NOTE.
99.5		METALX::SWANSON	Victim of Changes	`Wed May 07 1997 17:14`	39
	re: .1

	>Well, why not compile it with /machine and look at the output.

	I didn't know about that option! I am using nmake on the command line... I assume that switch works on the command line compiler? Where does the output go?

	>the shift is probably what I would use. It is much more intuitive.

	I find it more intuitive too, and it's what I used. But I was wondering if the shift would take 32 clock cycles and the divide only 1.

	re: .2

	>First of all, notice that these two statements do different things

	Yes, I know. I simplified it for entry into this notesfile. It's actually getting bitwise ANDed with 0xFFFFFFFF after the shift. I'm breaking up 64 bits into upper and lower longwords.

	re: .3

	>The shift is much faster.
	>If you're concerned at all about performance, you should be using VC++ V5.0.

	Well it's not that much of a concern really... In fact the two 32 bit values that I wind up with are used in a seek with SetFilePointer, and considering how long the seek will take, I'm sure the performance hit will not be noticed!

	I was mainly wondering for my own benefit as I've thought about it at different times, but never really bothered to find out for sure. I always use the shifts since it does make more sense!

	Thanks for the info.

	Ken
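A minimal sketch of the split Ken describes in .5, breaking a 64-bit offset into upper and lower longwords with a shift and a mask. The helper names `low_longword` and `high_longword` are hypothetical, and the cast to an unsigned type is an assumption made here so that the shift is well defined for any input.

```c
#include <stdint.h>

/* Lower 32 bits of a 64-bit file offset. */
uint32_t low_longword(int64_t offset)
{
    return (uint32_t)((uint64_t)offset & 0xFFFFFFFFu);
}

/* Upper 32 bits. Shifting the unsigned value avoids any dependence
   on how the compiler shifts negative numbers. */
uint32_t high_longword(int64_t offset)
{
    return (uint32_t)((uint64_t)offset >> 32);
}
```

With the value from the base note, 12345678901234, the upper longword is 2874 and the lower is 1942892530. These are the two halves that would then be handed to a seek call such as SetFilePointer.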
99.6	Last machine to shift that slowly was the 780, I think	WIBBIN::NOYCE	Pulling weeds, pickin' stones	`Wed May 07 1997 17:21`	5
	The shift takes 1 cycle, or 2 on an EV4 (21064). The divide will take somewhere around 40 cycles, since the compiler currently does it with a subroutine call. Even if it were optimized, the signed divide would be about 4 cycles longer than the shift.
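One common way a compiler can turn a signed divide by a power of two into shifts is to add a bias to negative inputs first, which accounts for the few extra cycles over a plain shift. The sketch below is an illustration of that general technique, not the code any particular compiler emits. The name `div_4gig_by_shift` is made up, and an arithmetic right shift is assumed.

```c
#include <stdint.h>

/* Signed divide by 2**32 done with shifts, truncating toward zero
   like the C '/' operator. Assumes >> on a negative value is an
   arithmetic shift (implementation-defined in C). */
int64_t div_4gig_by_shift(int64_t x)
{
    int64_t sign = x >> 63;               /* all ones if x < 0, else 0 */
    int64_t bias = sign & 0xFFFFFFFFLL;   /* 2**32 - 1 for negative x */
    return (x + bias) >> 32;              /* bias makes the shift round toward zero */
}
```

The bias is only added to negative inputs, so the addition cannot overflow.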
99.7		METALX::SWANSON	Victim of Changes	`Wed May 07 1997 17:46`	16
	>The shift takes 1 cycle, or 2 on an EV4 (21064)

	Okay. I guess I should have had more faith in the Alpha chip huh? :')

	>The divide will take somewhere around 40 cycles,

	40 Cycles?! I thought I heard once that starting with EV45, floating point operations took only 6 cycles. I guess this is wrong if integer division takes 40!

	I looked at the note entered in the GEM notesfile due to my question, and found it interesting that the 4 Gig constant is "created" in the resulting machine code by shifting 1 to the left 32 places! So a 32 bit shift is done, as well as the divide in that case!

	Ken
99.8	Why so slow...	CADSYS::GROSS	The bug stops here	`Wed May 07 1997 18:31`	9
	My understanding is that the Alpha architecture does not include an integer divide instruction. To achieve division, the software must convert the integer to float, do the division in float, and convert the answer back to integer. That is why it might be done by subroutine. The compiler plays all kinds of tricks to avoid the need for division. That is why it is likely to replace divide-by-a-power-of-two with a shift. Dave
99.9		WIBBIN::NOYCE	Pulling weeds, pickin' stones	`Wed May 07 1997 19:01`	6
	Actually, there's no hardware that can do a 64-bit divide -- the floating-point hardware only gives you 53 bits of precision. So the 64-bit subroutine is a bit more complicated... Even for 32-bits, it turns out to usually be faster not to use the floating-point divide on current Alpha processors, because FP divide is relatively slow, and because it takes a long time to get the data to and from the FP register set.
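The 53-bit limit in .9 is easy to demonstrate: a 64-bit integer that needs more than 53 significant bits does not survive a round trip through a double. The helper name `survives_double` is made up for this sketch, and it should only be called with values well inside the 64-bit range, since converting an out-of-range double back to an integer is undefined.

```c
#include <stdint.h>

/* Returns 1 if x comes back unchanged after conversion to double.
   The volatile forces the value to be stored as a true 64-bit double.
   Intended for |x| well below 2**63. */
int survives_double(int64_t x)
{
    volatile double d = (double)x;
    return (int64_t)d == x;
}
```

2**53 survives the trip, but 2**53 + 1 does not, because its lowest bit falls outside the 53-bit significand.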
99.10		DECCXL::OUELLETTE	mudseason into blackfly season	`Wed May 07 1997 20:33`	2
	With Visual C++ you ask for a machine code listing with -FAcs. You'll get a file with a .cod suffix. Use a 133 column wide editor.
99.11		METALX::SWANSON	Victim of Changes	`Thu May 08 1997 18:36`	14
	Thanks for the replies. .9 is interesting info. I always thought the Alpha CPU was supposed to be extremely fast for floating point. Was I wrong about that 6 cycle thing I mentioned in my previous note? I have run Povray on a few Alphas, and compared to 486s and Pentiums it seemed like the Alphas were disproportionately faster for FP than integer stuff. ...and this was a couple years ago. Ken
99.12	the 6 cycles was probably add, sub, mult and friends	DECC::OUELLETTE	mudseason into blackfly season	`Thu May 08 1997 20:38`	5
	Divide doesn't pipeline well. Cray left it out of his architecture in favor of a Reciprocal Approximation instruction which does pipeline and can be used to implement divide. Nobody's divide is particularly fast... integer or floating. Alpha's still a lot faster than x86 for a number of reasons though.
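As a rough illustration of how a reciprocal approximation can be used to implement divide, the sketch below refines a first guess with Newton-Raphson steps, each built only from multiplies and subtracts (which do pipeline). This is the textbook technique, not a description of any specific machine. The function names, the linear first guess, and the restriction of the divisor to the range [0.5, 1.0) are all assumptions made to keep the example short.

```c
/* Approximate 1/d for d in [0.5, 1.0) using only multiply and subtract.
   Each Newton-Raphson step roughly squares the relative error. */
double reciprocal(double d)
{
    double r = 48.0 / 17.0 - (32.0 / 17.0) * d;  /* linear first guess (constants fold at compile time) */
    for (int i = 0; i < 4; i++) {
        r = r * (2.0 - d * r);                    /* refinement step */
    }
    return r;
}

/* Divide n by d (d in [0.5, 1.0)) as a multiply by the reciprocal.
   The result may differ from a true divide in the last bit or so. */
double divide_by_reciprocal(double n, double d)
{
    return n * reciprocal(d);
}
```

A general-purpose version would first scale the divisor into that range by adjusting its exponent.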
99.13	Most FP ops take 4 cycles today, but divide is special	WIBBIN::NOYCE	Pulling weeds, pickin' stones	`Fri May 09 1997 09:21`	32
	Division is fundamentally more difficult than addition or multiplication. It also occurs less frequently in applications. It is normally implemented in hardware using an iterative algorithm that produces one or a few quotient bits each cycle, after a startup time. By adding more hardware, you can produce more bits per cycle, and so reduce the number of cycles needed for a full result -- but you need to trade that chip area against other uses, such as larger caches or faster integer multiply.

	EV4 (21064) takes 6 cycles for floating add, subtract, multiply, convert, move, and conditional-move. All these operations are pipelined: you can start a new one in every cycle. But single-precision divide takes about 30 cycles, and double-precision divide takes about 60, and these are not pipelined: you can't start a new divide until the previous divide has completed (though you can start other operations, including other floating-point operations).

	EV45 (21064A) keeps the 6-cycle pipeline, but speeds up divide, to about 20 cycles for single-precision and about 30 for double-precision. Because of the algorithm used, the time depends on the exact data values.

	EV5 (21164) improves the basic floating-point pipeline to 4 cycles, and provides a separate pipe for multiply, but executes divides just like EV45. Its derivatives, EV56 (21164 aka 21164A) and PCA56 (21164PC) are just like EV5 as far as floating-point is concerned.

	EV6 (21264) improves divide substantially, to perhaps half as many cycles as EV5 -- still not pipelined. It adds a set of SQRT instructions that take about the same time as an EV5 divide -- also not pipelined, though you can be processing a divide and a SQRT concurrently.

	Alpha's floating-point advantage over x86 comes from a combination of short latencies, pipelining, high clock rate, and wide off-chip paths to cache and memory.
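The idea of an iterative divider that produces one quotient bit per step can be sketched in software as a shift-and-subtract loop. This is a simplified integer illustration only. Real hardware dividers are more elaborate, and the function name `divide_bitwise` is made up here. The divisor must be nonzero.

```c
#include <stdint.h>

/* Unsigned 64-bit divide producing one quotient bit per iteration.
   Returns the quotient and stores the remainder in *rem. d must be nonzero. */
uint64_t divide_bitwise(uint64_t n, uint64_t d, uint64_t *rem)
{
    uint64_t q = 0;
    uint64_t r = 0;
    for (int i = 63; i >= 0; i--) {
        uint64_t carry = r >> 63;           /* bit about to be shifted out of r */
        r = (r << 1) | ((n >> i) & 1);      /* bring down the next dividend bit */
        if (carry || r >= d) {              /* partial remainder holds the divisor */
            r -= d;
            q |= (uint64_t)1 << i;          /* this quotient bit is 1 */
        }
    }
    *rem = r;
    return q;
}
```

Each pass through the loop depends on the remainder from the previous pass, which is one way to see why a divide is hard to pipeline and why producing more bits per cycle costs extra hardware.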