| I think it's fair to say that the market for this has never
been large enough to really justify a full hardware implementation,
H_Floating notwithstanding. As software implementations go, this
appears to be a particularly fast one, but of course it can never
compete with real*8 or any other hardware-implemented mode.
My own view on this is that if speed becomes a serious problem
for the customer, their time will be well spent figuring out
how to use real*8 instead of real*16 - not always possible, but
often a sloppy real*16 algorithm can be replaced with a clever
real*8 one...
Cheers!
Dave Eklund
|
| Reaching back into a predecessor FORTRAN notes file, I found the following
notes about our REAL*16 implementation on Alpha.
cw hobbs reports that some very demanding CERN users were amazed at the speed.
/Rich Grove
PS: When you look at the absolute timings, notice that the dates are 1994,
so most of this stuff is on a 200MHz EV4. It should be considerably faster on
your 500MHz or 622MHz EV56.
Dwight Manley's note discusses the merits of an "E-float" implementation,
which is what the IBM RS/6000 does. Note that X-float compares very favorably
to the semi-hardware E-float.
<<< TURRIS::DISK$NOTES_PACK:[NOTES$LIBRARY]DEC_FORTRAN_ALPHA.NOTE;1 >>>
-< DEC Fortran on ALPHA >-
================================================================================
Note 1693.0 REAL*16 - why is it so fast? 2 replies
TLE::WHITLOCK "Stan Whitlock" 17 lines 1-AUG-1994 11:09
--------------------------------------------------------------------------------
================================================================================
Note 1689.4 Problem with external function declaration 4 of 5
CERN::HOBBS "Budweiser - official embarrassment of " 10 lines 1-AUG-1994 08:04
-< REAL*16 - why is it so fast? >-
--------------------------------------------------------------------------------
I just got stopped by someone at CERN who wanted to comment about how fast
the REAL*16 support is in VMS 6.1. Apparently, he had to double-check his
figures because it seemed too good to be true.
Is there any information available to explain what tricks are pulled in the
RTL to support X-float?
I'd like to toot our own horn a bit here ;-)
-cw
================================================================================
WIBBIN::NOYCE "DEC 21064-200DX5 : 138 SPECint @ $36K" 5 lines 1-AUG-1994 09:56
--------------------------------------------------------------------------------
Well, having 64-bit integer registers and 64-bit integer arithmetic
helps a lot. The routines for the fundamental operations add, sub, mul, div
are carefully hand-coded, with a great deal of overlap, especially in the
multiply routine. The routines receive arguments and return results in
registers, not memory.
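
[Purely for illustration, here is a minimal free-form Fortran sketch of why
64-bit integer arithmetic matters: a wide significand becomes just a few
register-sized digits, and an add is a couple of integer adds plus a carry.
This is not the actual RTL code (the hand-coded routines presumably work on
full 64-bit words and also handle signs, exponents, and rounding); the sketch
uses 62-bit digits only so the carry can be extracted without signed overflow.]

program wide_add_sketch
  ! Illustrative only: add two 124-bit "significands", each held as
  ! two 62-bit digits in INTEGER*8 variables (value = hi*2**62 + lo).
  implicit none
  integer(8), parameter :: base = 4611686018427387904_8   ! 2**62
  integer(8) :: alo, ahi, blo, bhi, rlo, rhi, carry

  alo = 123456789_8 ;  ahi = 42_8
  blo = base - 1_8  ;  bhi = 7_8

  rlo   = alo + blo              ! low digits; cannot overflow 64 bits
  carry = ishft(rlo, -62)        ! carry out of the low digit
  rlo   = iand(rlo, base - 1_8)  ! keep the low 62 bits
  rhi   = ahi + bhi + carry      ! high digits plus the carry

  print *, 'high digit:', rhi, '  low digit:', rlo
end program wide_add_sketch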
================================================================================
GEMGRP::GROVE 7 lines 5-AUG-1994 15:58
--------------------------------------------------------------------------------
See note 1011, esp 1011.6-1011.8
The RTL primitives that Steve Root wrote are really outstanding,
and the compiled code does a good job on register allocation and
very lightweight linkages to the RTL routines.
/Rich
================================================================================
Note 1011.3 REAL*16 White Paper 3 of 10
NICCTR::MANLEY 628 lines 4-AUG-1993 14:00
-< Enhanced Precision vs Extended Precision >-
--------------------------------------------------------------------------------
I encourage the DEC-FORTRAN group to seriously consider supporting both IEEE
extended precision floating point (this is being done) and IBM RS/6000 enhanced
precision floating point (this is not being done). Perhaps this could be
provided by KAP-FORTRAN. Kuck and Associates, Inc. has provided exactly this
sort of REAL*16 support for other hardware vendors in the past. Now let me
explain the motivation for my proposal.
Clearly, IEEE 754 extended precision floating point support is very important.
Our customers expect it and HP provides it. We must support it to meet customer
needs and to compete with HP. However, IEEE extended precision support carries
a large performance penalty. Extended precision floating point will not be
competitive with IBM's enhanced precision floating point. To compete with IBM,
we must also support enhanced precision floating point.
Enhanced precision floating point operands are represented by pairs of double
precision floating point operands. Arithmetic operations are carried out using
double precision floating point instructions. Either IEEE or VAX G_FLOAT format
double precision operands may be used to represent enhanced floating point
operands. The exponent field size of an enhanced floating point operand is
identical to that of a double precision floating point operand. The effective
significand field of enhanced precision floating point is nearly twice that of
double precision floating point. Thus, enhanced precision floating point is
much more accurate than double precision floating point. Enhanced precision
floating point is not IEEE 754 compliant.
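
[To make the representation concrete, here is a minimal sketch of an enhanced
precision add in the style described above. It is not Dwight's posted code,
just the textbook pair-of-doubles scheme: the value is the unevaluated sum
hi + lo, Knuth's exact two-sum captures the bits of ahi + bhi that do not fit
in one double, and the result is renormalized back into a pair. It assumes
round-to-nearest and a compiler that does not reassociate the expressions.]

subroutine e_add(ahi, alo, bhi, blo, chi, clo)
  ! Illustrative enhanced-precision add: (chi,clo) := (ahi,alo) + (bhi,blo).
  implicit none
  real(8), intent(in)  :: ahi, alo, bhi, blo
  real(8), intent(out) :: chi, clo
  real(8) :: s, v, e
  s = ahi + bhi                       ! leading part of the sum
  v = s - ahi
  e = (ahi - (s - v)) + (bhi - v)     ! exact rounding error of s (two-sum)
  e = e + (alo + blo)                 ! fold in the low-order words
  chi = s + e                         ! renormalize: chi gets the leading
  clo = e - (chi - s)                 ! bits, clo whatever is left over
end subroutine e_add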
Extended precision floating point, on the other hand, is IEEE 754 compliant.
With extended precision floating point, both exponent and significand field
sizes are larger than they are for enhanced precision floating point. Thus
accuracy is improved and range is extended. Unfortunately, there is very
little support in the Alpha architecture to aid in making software emulation
of IEEE extended precision perform well. We need an alternative.
The remainder of the note contains four subroutines and two programs. They
support enhanced precision floating point arithmetic. The subroutines
perform enhanced floating point add, subtract, multiply, and divide arithmetic
operations. The first program tests the accuracy of enhanced precision floating
point arithmetic relative to H_FLOAT arithmetic. Run it to see that most folks
don't need H_FLOAT for accuracy! A second program times enhanced floating point
arithmetic operations. Run it to compare enhanced floating point performance
to that of both H_FLOAT and G_FLOAT.
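
[Dwight's actual subroutines are in the note itself and are not reproduced in
this excerpt. As a hedged illustration of what the multiply has to do, the
sketch below uses Dekker splitting to recover the exact rounding error of the
leading product on hardware without a fused multiply-add, then folds in the
cross terms. Names and details are illustrative only, and the same caveats
about rounding mode and reassociation apply.]

subroutine e_mul(ahi, alo, bhi, blo, chi, clo)
  ! Illustrative enhanced-precision multiply: (chi,clo) := (ahi,alo)*(bhi,blo).
  implicit none
  real(8), intent(in)  :: ahi, alo, bhi, blo
  real(8), intent(out) :: chi, clo
  real(8), parameter :: split = 134217729.0_8    ! 2**27 + 1 (Dekker)
  real(8) :: a1, a2, b1, b2, p, e, t
  t = split*ahi ;  a1 = t - (t - ahi) ;  a2 = ahi - a1   ! split ahi
  t = split*bhi ;  b1 = t - (t - bhi) ;  b2 = bhi - b1   ! split bhi
  p = ahi*bhi                                   ! leading product
  e = ((a1*b1 - p) + a1*b2 + a2*b1) + a2*b2     ! its exact rounding error
  e = e + (ahi*blo + alo*bhi)                   ! cross terms (alo*blo dropped)
  chi = p + e                                   ! renormalize into a pair
  clo = e - (chi - p)
end subroutine e_mul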
- Dwight -
================================================================================
Note 1011.6 REAL*16 White Paper 6 of 10
HPCGRP::MANLEY 50 lines 26-APR-1994 19:58
-< REAL*16 - Nice Job! >-
--------------------------------------------------------------------------------
Re: .4,.5
I just completed a quick performance sanity test comparing the REAL*16 and
unpipelined REAL*8 basic operations (Add, Subtract, Multiply, and Divide).
I also compared the REAL*16 operations to a FORTRAN implementation using
pairs of floating point values a la IBM.
On our VAX 6000, the same FORTRAN code beats the pants off ALL H_Floating
operations. I expected the worst!
Guess what, I'm pleasantly surprised!
The REAL*16 primitives on Alpha have about the same performance for Add,
Subtract, and Multiply as the FORTRAN routines. The REAL*16 divide operation
is about 3 times faster than the Newton-Raphson reciprocal (quick and dirty)
approach.
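
[For reference, below is a heavily simplified sketch of the kind of "quick and
dirty" Newton-Raphson reciprocal divide being compared against. It is not
Dwight's code: a production version would compute the residuals with exact
products (Dekker splitting, as in a multiply routine), which is exactly the
extra work that makes this approach slower than the hand-coded X_Float divide.]

subroutine e_div_nr(ahi, alo, bhi, blo, chi, clo)
  ! Illustrative only: divide (ahi,alo) by (bhi,blo) via a Newton-Raphson
  ! refined reciprocal plus one residual correction.  Residuals here are
  ! formed in plain double precision, so low-order bits are approximate.
  implicit none
  real(8), intent(in)  :: ahi, alo, bhi, blo
  real(8), intent(out) :: chi, clo
  real(8) :: y, qhi, qlo, r
  y   = 1.0_8/bhi                          ! double-precision seed for 1/b
  y   = y + y*(1.0_8 - (bhi*y + blo*y))    ! one Newton step toward 1/b
  qhi = ahi*y                              ! leading quotient word
  r   = ((ahi - qhi*bhi) + alo) - qhi*blo  ! residual a - qhi*b (approximate)
  qlo = r*y                                ! low-order correction
  chi = qhi + qlo                          ! renormalize into a pair
  clo = qlo - (chi - qhi)
end subroutine e_div_nr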
The performance results follow (E_Float uses two 64-bit floats):
Timing --- 1000000 E-Precision Arithmetic Operations
Time for E_Float Add Operations 0.5397949 secs.
Time for X_Float Add Operations 0.8000488 secs.
Time for G_Float Add Operations 4.9804688E-02 secs.
Time for E_Float Sub Operations 0.5798340 secs.
Time for X_Float Sub Operations 0.6699219 secs.
Time for G_Float Sub Operations 5.0048828E-02 secs.
Time for E_Float Mul Operations 0.9299316 secs.
Time for X_Float Mul Operations 0.8598633 secs.
Time for G_Float Mul Operations 6.0058594E-02 secs.
Time for E_Float Div Operations 7.009766 secs.
Time for X_Float Div Operations 2.300049 secs.
Time for G_Float Div Operations 0.4399414 secs.
You've done a nice job, especially on the divide operation. (I heard
a rumor that there's some "scrabble magic" imbedded in that code.)
- Dwight -
================================================================================
Note 1011.7 REAL*16 White Paper 7 of 10
GEMGRP::GROVE 12 lines 27-APR-1994 08:35
-< Roll the credits! >-
--------------------------------------------------------------------------------
Dwight, thanks for the measurements and the posting.
Steve Root (from the VSSAD group in Hudson) and Lucy Hamnett (GEM)
did a great job on the REAL*16 implementation:
Steve did the high-performance arithmetic primitives
Lucy was project leader for the whole effort, and designed
and implemented the GEM compiler support for X-float
This is a really nice piece of work!
Rich Grove
================================================================================
Note 1011.8 REAL*16 White Paper 8 of 10
AD::ROOT 36 lines 3-MAY-1994 23:56
-< updated perf. data >-
--------------------------------------------------------------------------------
UPDATED PERFORMANCE DATA
I made some changes to Dwight's benchmark to reduce unnecessary
overhead. The results for 10^7 operations are given below. The
times make sense, except that the X_Float div to mul ratio seems
high, and E_Float Mul became more expensive.
(Note that the G_Float operations, except for div, 'should' check
in at 0.4 seconds.)
These numbers for X_Float and E_Float are slightly optimistic,
compared to what a customer may achieve in a real job, in that
I_Cache traffic is minimized. The E_Float numbers are slightly
optimistic in that they incur no call overhead. The X_Float add/sub
numbers are slightly optimistic in that exponent mispredicts are
probably a little low compared to a real job. The X_Float sub
numbers are probably a little low in that fewer complete
normalizations are incurred.
Time for E_Float Add Operations 4.750000 secs.
Time for X_Float Add Operations 4.260000 secs.
Time for G_Float Add Operations 0.3999996 secs.
Time for E_Float Sub Operations 4.780000 secs.
Time for X_Float Sub Operations 4.380001 secs.
Time for G_Float Sub Operations 0.4099998 secs.
Time for E_Float Mul Operations 12.63000 secs.
Time for X_Float Mul Operations 7.429998 secs.
Time for G_Float Mul Operations 0.4000015 secs.
Time for E_Float Div Operations 69.91000 secs.
Time for X_Float Div Operations 21.80000 secs.
Time for G_Float Div Operations 4.240005 secs.
(Code available upon request.)
|
| Hello,
Customer suggests the following:
Regarding the 16-byte mod, I came up with a bitwise multiply-
and-mod code last night which works; I have enclosed it below
(one could also use a similar approach for modding the product of
two standard 4-byte integers, but this is less crucial since the
Alpha has the 8-byte integer type). A fast machine-code implementation
of something similar would be great; it might even be considered as an
extension to the F90 intrinsic function library in the DEC F90
compiler:
program bitwise_mod
  !...performs a bitwise multiply-and-mod on 3 integers x,y,z
  !   to obtain x*y mod z without risking integer overflow.
  !   Assumes x, y, z are non-negative and z is no larger than 2**62,
  !   so that doubling a value already reduced mod z cannot overflow.
  integer*8 :: x,y,z,sum
  logical :: flag
  print*,'enter x,y,z'
  read*,x,y,z
  if(x>=z)x=mod(x,z)
  if(y>=z)y=mod(y,z)
  ! this next line is for comparison only, and only works if x*y
  ! CAN be stored in an 8-byte integer...
  print*,'exact mod =',mod(x*y,z)
  sum=0
  if(btest(y,0))sum=x
  do
    y=y/2 ! could also use y=ishft(y,-1) here, if it's faster
    if(y==0)exit
    flag=btest(y,0)
    x=ishft(x,1)          ! x -> 2*x, then reduce mod z
    if(x>=z)x=x-z
    if(flag)then
      sum=sum+x           ! accumulate this power-of-two multiple of x
      if(sum>=z)sum=sum-z
    endif
  enddo
  print*,'x*y mod z =',sum
end program bitwise_mod
|