
Conference turris::fortran

Title:Digital Fortran
Notice:Read notes 1.* for important information
Moderator:QUARK::LIONEL
Created:Thu Jun 01 1995
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1333
Total number of notes:6734

1251.0. "Real*16 Performance?" by RHETT::HALETKY () Mon Apr 07 1997 13:58

    Hello,
    
    We have a customer with questions about the performance of REAL*16
    and whether there is any way to speed it up.
    
    
    Best regards,
    Ed Haletky
    Digital CSC
1251.1  QUARK::LIONEL "Free advice is worth every cent"      8 lines   Mon Apr 07 1997 14:08
What are the questions?  What platform are we talking about?

On most VAX and all Alpha systems, REAL*16 is software-emulated.  The Alpha
support is through specially optimized routines and is really quite good,
all things considered.  There's no way to make it faster.  The VAX support
is through instruction emulation and it's not speedy.

					Steve
1251.2  "Benchmarks for real*16?"  RHETT::HALETKY      8 lines   Tue Apr 08 1997 13:55
    
    The system is an Alpha. The customer claims that the emulation is
    much, much slower than REAL*8. Are there any benchmarks?
    
    
    Best regards,
    Ed Haletky
    Digital CSC
1251.3  QUARK::LIONEL "Free advice is worth every cent"      5 lines   Tue Apr 08 1997 14:10
Yes, it is much slower than REAL*8, since the latter is done in hardware. No,
we don't have benchmarks - we never claimed REAL*16 was fast.  It is really
very fast for a software implementation.

				Steve
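For a rough sense of the REAL*8 vs. REAL*16 ratio, the customer could time
the two types directly with something along the following lines. This is a
minimal sketch of my own, not an official benchmark; SECNDS and the Q-suffix
quad constants are DEC Fortran extensions, and the loop count and operand
values are arbitrary choices.

      ! Rough REAL*8 vs. REAL*16 timing comparison (illustrative sketch).
      program time_real16
      integer, parameter :: n = 1000000
      real*8  :: a8, b8, s8
      real*16 :: a16, b16, s16
      real    :: t0, t8, t16
      integer :: i

      a8 = 1.0000001d0
      b8 = 0.9999999d0
      s8 = 0.0d0
      t0 = secnds(0.0)
      do i = 1, n
         s8 = s8 + a8*b8                ! hardware floating point
      end do
      t8 = secnds(t0)

      a16 = 1.0000001q0
      b16 = 0.9999999q0
      s16 = 0.0q0
      t0  = secnds(0.0)
      do i = 1, n
         s16 = s16 + a16*b16            ! software-emulated X-float
      end do
      t16 = secnds(t0)

      print *, 'REAL*8  time:', t8,  '  sum:', s8
      print *, 'REAL*16 time:', t16, '  sum:', s16
      end program time_real16

The sums are printed so the compiler cannot discard the loops; the ratio of
the two times is the number the customer is really asking about.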
1251.4  "What would they consider to be fast?"  WIBBIN::NOYCE "Pulling weeds, pickin' stones"      1 line    Tue Apr 08 1997 16:04
Are they comparing it to a competitor's REAL*16?  Which one?
1251.5  TLE::EKLUND "Always smiling on the inside!"     15 lines   Tue Apr 08 1997 16:07
    	I think it's fair to say that the market for this has never
    been large enough to really justify a full hardware implementation,
    H_Floating notwithstanding.  As software implementations go, this
    appears to be a particularly fast one, but of course it can never
    compete with real*8 or any other hardware-implemented mode.
    
    	My own view on this is that if speed becomes a serious problem
    for the customer, their time will be well spent figuring out how
    to use real*8 instead of real*16 - not always possible, but often
    a sloppy real*16 algorithm can be replaced with a clever real*8
    one (a sketch of one such trick follows this note)...
    
    Cheers!
    Dave Eklund
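One concrete illustration of the kind of real*8 substitution Dave describes
(my own example, not anything from the customer's code): compensated
summation recovers much of the benefit of a REAL*16 accumulator for long
sums while staying entirely in hardware REAL*8.

      ! Compensated (Kahan) summation - a "clever REAL*8" technique that
      ! often removes the need for a REAL*16 accumulator.  Illustrative
      ! sketch only; the function and variable names are mine.
      real*8 function kahan_sum(x, n)
      integer :: n, i
      real*8  :: x(n), s, c, y, t
      s = 0.0d0
      c = 0.0d0                 ! running compensation for lost low-order bits
      do i = 1, n
         y = x(i) - c           ! apply the correction from the previous step
         t = s + y              ! low-order bits of y may be lost here
         c = (t - s) - y        ! recover what was lost
         s = t
      end do
      kahan_sum = s
      end function kahan_sum

Like any compensated-arithmetic trick, this only works when the compiler is
not allowed to reassociate the floating-point operations, since a
value-unsafe optimizer would cancel the correction term algebraically.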
    
1251.6  "Alpha REAL*16 is very fast"  GEMEVN::GROVE    224 lines   Tue Apr 08 1997 19:17
Reaching back into a predecessor FORTRAN notes file, I found the following
notes about our REAL*16 implementation on Alpha.

cw hobbs reports that some very demanding CERN users were amazed at the speed.

/Rich Grove

PS: When you look at the absolute timings, notice that the dates are 1994,
so most of this was measured on a 200 MHz EV4. It should be considerably
faster on your 500 MHz or 622 MHz EV56.

Dwight Manley's note discusses the merits of an "E-float" implementation,
which is what the IBM RS/6000 does. Note that X-float compares very favorably
to the semi-hardware E-float.

     <<< TURRIS::DISK$NOTES_PACK:[NOTES$LIBRARY]DEC_FORTRAN_ALPHA.NOTE;1 >>>
                           -< DEC Fortran on ALPHA >-
================================================================================
Note 1693.0               REAL*16 - why is it so fast?                 2 replies
TLE::WHITLOCK "Stan Whitlock"                        17 lines   1-AUG-1994 11:09
--------------------------------------------------------------------------------
================================================================================
Note 1689.4         Problem with external function declaration            4 of 5
CERN::HOBBS "Budweiser - official embarrassment of " 10 lines   1-AUG-1994 08:04
                       -< REAL*16 - why is it so fast? >-
--------------------------------------------------------------------------------
I just got stopped by someone at CERN who wanted to comment about how fast
the REAL*16 support is in VMS 6.1.  Apparently, he had to double-check his
figures because it seemed too good to be true.

Is there any information available to explain what tricks are pulled in the
RTL to support X-float?

I'd like to toot our own horn a bit here ;-)

-cw

================================================================================
WIBBIN::NOYCE "DEC 21064-200DX5 : 138 SPECint @ $36K" 5 lines   1-AUG-1994 09:56
--------------------------------------------------------------------------------
Well, having 64-bit integer registers and 64-bit integer arithmetic
helps a lot.  The routines for the fundamental operations add, sub, mul, div
are carefully hand-coded, with a great deal of overlap, especially in the
multiply routine.  The routines receive arguments and return results in
registers, not memory.

================================================================================
GEMGRP::GROVE                                         7 lines   5-AUG-1994 15:58
--------------------------------------------------------------------------------
    See note 1011, esp 1011.6-1011.8
    
    The RTL primitives that Steve Root wrote are really outstanding,
    and the compiled code does a good job on register allocation and
    very lightweight linkages to the RTL routines.
    
    /Rich

================================================================================
Note 1011.3                    REAL*16 White Paper                       3 of 10
NICCTR::MANLEY                                      628 lines   4-AUG-1993 14:00
                 -< Enhanced Precision vs Extended Precision >-
--------------------------------------------------------------------------------


I encourage the DEC-FORTRAN group to seriously consider supporting both IEEE
extended precision floating point (this is being done) and IBM RS/6000 enhanced
precision floating point (this is not being done). Perhaps this could be 
provided by KAP-FORTRAN. Kuck and Associates, Inc. has provided exactly this
sort of REAL*16 support for other hardware vendors in the past. Now let me
explain the motivation for my proposal.


Clearly, IEEE 754 extended precision floating point support is very important.
Our customers expect it and HP provides it. We must support it to meet customer
needs and to compete with HP. However, IEEE extended precision support carries
a large performance penalty. Extended precision floating point will not be
competitive with IBM's enhanced precision floating point. To compete with IBM,
we must also support enhanced precision floating point.


Enhanced precision floating point operands are represented by pairs of double
precision floating point operands. Arithmetic operations are carried out using
double precision floating point instructions. Either IEEE or VAX G_FLOAT format
double precision operands may be used to represent enhanced floating point
operands. The exponent field size of an enhanced floating point operand is
identical to that of a double precision floating point operand. The effective
significand field of enhanced precision floating point is nearly twice that of
double precision floating point. Thus, enhanced precision floating point is
much more accurate than double precision floating point. Enhanced precision
floating point is not IEEE 754 compliant.


Extended precision floating point, on the other hand, is IEEE 754 compliant.
With extended precision floating point, both exponent and significand field
sizes are larger than they are for enhanced precision floating point. Thus
accuracy is improved and range is extended. Unfortunately, there is very
little support in the Alpha architecture to aid in making software emulation
of IEEE extended precision perform well. We need an alternative.


The remainder of the note contains four subroutines and two programs. They 
support enhanced precision floating point arithmetic. The subroutines
perform enhanced floating point add, subtract, multiply, and divide arithmetic
operations. The first program tests the accuracy of enhanced precision floating
point arithmetic relative to H_FLOAT arithmetic. Run it to see that most folks
don't need H_FLOAT for accuracy! A second program times enhanced floating point 
arithmetic operations. Run it to compare enhanced floating point performance
to that of both H_FLOAT and G_FLOAT.


	- Dwight -
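To make the pair-of-doubles idea concrete, here is a sketch of a single
addition step in that style.  This is the textbook "two-sum" construction,
not Dwight's actual subroutines, and the routine name is mine.

      ! One "enhanced precision" (pair-of-doubles) addition step:
      ! (ahi,alo) + (bhi,blo) -> (chi,clo), where each pair represents an
      ! unevaluated sum hi + lo with |lo| much smaller than |hi|.
      subroutine e_add(ahi, alo, bhi, blo, chi, clo)
      real*8 :: ahi, alo, bhi, blo, chi, clo
      real*8 :: s, v, e
      s = ahi + bhi
      v = s - ahi
      e = (ahi - (s - v)) + (bhi - v)   ! exact rounding error of ahi + bhi
      e = e + (alo + blo)               ! fold in the low-order parts
      chi = s + e                       ! renormalize the result pair
      clo = e - (chi - s)
      end subroutine e_add

As with any error-free transformation, the intermediate subtractions must not
be reassociated by the compiler, so code like this is normally built without
value-unsafe floating-point optimizations.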
================================================================================
Note 1011.6                    REAL*16 White Paper                       6 of 10
HPCGRP::MANLEY                                       50 lines  26-APR-1994 19:58
                            -< REAL*16 - Nice Job! >-
--------------------------------------------------------------------------------


Re: .4,.5


I just completed a quick performance sanity test comparing the REAL*16 and
unpipelined REAL*8 basic operations (Add, Subtract, Multiply, and Divide).

I also compared the REAL*16 operations to a FORTRAN implementation using
pairs of floating point values ala IBM.

On our VAX 6000, the same FORTRAN code beats the pants off ALL H_Floating
operations. I expected the worst!

Guess what, I'm pleasantly surprised!

The REAL*16 primitives on Alpha have about the same performance for Add,
Subtract, and Multiply as the FORTRAN routines. The REAL*16 divide operation
is about 3 times faster than the Newton-Raphson reciprocal (quick and dirty)
approach.

The performance results follow (E_Float uses two 64 bit floats):

  
  Timing ---      1000000 E-Precision Arithmetic Operations 
  
  
  Time for E_Float Add Operations  0.5397949     secs.
  Time for X_Float Add Operations  0.8000488     secs.
  Time for G_Float Add Operations  4.9804688E-02 secs.
  
  Time for E_Float Sub Operations  0.5798340     secs.
  Time for X_Float Sub Operations  0.6699219     secs.
  Time for G_Float Sub Operations  5.0048828E-02 secs.
  
  Time for E_Float Mul Operations  0.9299316     secs.
  Time for X_Float Mul Operations  0.8598633     secs.
  Time for G_Float Mul Operations  6.0058594E-02 secs.
  
  Time for E_Float Div Operations   7.009766     secs.
  Time for X_Float Div Operations   2.300049     secs.
  Time for G_Float Div Operations  0.4399414     secs.
  

You've done a nice job, especially on the divide operation. (I heard
a rumor that there's some "scrabble magic" imbedded in that code.) 


	- Dwight -

================================================================================
Note 1011.7                    REAL*16 White Paper                       7 of 10
GEMGRP::GROVE                                        12 lines  27-APR-1994 08:35
                             -< Roll the credits! >-
--------------------------------------------------------------------------------
    Dwight, thanks for the measurements and the posting.
    
    Steve Root (from the VSSAD group in Hudson) and Lucy Hamnett (GEM)
    did a great job on the REAL*16 implementation:

    	Steve did the high-performance arithmetic primitives

    	Lucy was project leader for the whole effort, and designed
    	and implemented the GEM compiler support for X-float

    This is a really nice piece of work!
    Rich Grove
================================================================================
Note 1011.8                    REAL*16 White Paper                       8 of 10
AD::ROOT                                             36 lines   3-MAY-1994 23:56
                            -< updated perf. data >-
--------------------------------------------------------------------------------
		UPDATED PERFORMANCE DATA

I made some changes to Dwight's benchmark to reduce unnecessary
overhead. The results for 10^7 operations are given below. The
times make sense, except that the X_Float div to mul ratio seems
high, and E_Float Mul became more expensive.
(Note that the G_Float operations, except for div, 'should' check
in at 0.4 seconds.)

These numbers for X_Float and E_Float are slightly optimistic,
compared to what a customer may achieve in a real job, in that
I_Cache traffic is minimized. The E_Float numbers are slightly
optimistic in that they incur no call overhead. The X_Float add/sub
numbers are slightly optimistic in that exponent mispredicts are
probably a little low compared to a real job. The X_Float sub
numbers are probably a little low in that fewer complete
normalizations are incurred.


 Time for E_Float Add Operations   4.750000     secs.
 Time for X_Float Add Operations   4.260000     secs.
 Time for G_Float Add Operations  0.3999996     secs.
 
 Time for E_Float Sub Operations   4.780000     secs.
 Time for X_Float Sub Operations   4.380001     secs.
 Time for G_Float Sub Operations  0.4099998     secs.
 
 Time for E_Float Mul Operations   12.63000     secs.
 Time for X_Float Mul Operations   7.429998     secs.
 Time for G_Float Mul Operations  0.4000015     secs.
 
 Time for E_Float Div Operations   69.91000     secs.
 Time for X_Float Div Operations   21.80000     secs.
 Time for G_Float Div Operations   4.240005     secs.

(Code available upon request.)
1251.7  "real*16 mod"  RHETT::HALETKY     42 lines   Wed Apr 09 1997 12:12
    Hello,
    
    
    The customer suggests the following:
    
    Regarding the 16-byte mod, I came up with a bitwise multiply-and-mod
    code last night which works; I have enclosed it below. (One could
    also use a similar approach for modding the product of two standard
    4-byte integers, but this is less crucial since the Alpha has the
    8-byte integer type.) A fast machine-code implementation of something
    similar would be great; it might even be considered as an extension
    to the F90 intrinsic function library in the DEC F90 compiler:
    
            program bitwise_mod
    !...performs a bitwise-multiply-and-mod on 3 integers x,y,z
    !   to obtain x*y mod z without risking integer overflow.
            integer*8 :: x,y,z,sum
            logical   :: flag
            print*,'enter x,y,z'
            read*,x,y,z
            if(x>=z)x=mod(x,z)
            if(y>=z)y=mod(y,z)
    !   this next line is for comparison only, and only works if x*y
    !   CAN be stored in an 8-byte integer...
            print*,'exact mod =',mod(x*y,z)
            sum=0
            if(btest(y,0))sum=x
            do
              y=y/2 ! could also use y=ishft(y,-1) here, if it's faster
              if(y==0)exit
              flag=btest(y,0)
              x=ishft(x,1)
              if(x>=z)x=x-z
              if(flag)then
                sum=sum+x
                if(sum>=z)sum=sum-z
              endif
            enddo
            print*,'x*y mod z =',sum
            end program bitwise_mod
    
1251.8  COMEUP::SIMMONDS "loose canon"      7 lines   Thu Apr 10 1997 00:19
    Re: .7
    
|                              -< real*16 mod >-
    [...]
|            integer*8 :: x,y,z,sum
    
    Huh?
1251.9  QUARK::LIONEL "Free advice is worth every cent"      6 lines   Thu Apr 10 1997 12:17
I think the customer was using real*16 as a way of manipulating 64-bit
integers without worrying about overflow.  real*8 won't cut it.  If this is
the case, then specialized routines for mod or whatever they want are the
more appropriate solution.

				Steve
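For what it's worth, the shift-and-add idea in .7 packages naturally as the
kind of specialized routine suggested above.  A sketch follows; the function
name and interface are mine, not an existing library routine.

      ! Reusable 64-bit multiply-and-mod:  mulmod(a,b,m) = mod(a*b, m)
      ! without intermediate overflow.  Assumes a >= 0, b >= 0, m > 0,
      ! and that 2*m still fits in an INTEGER*8.
      integer*8 function mulmod(a, b, m)
      integer*8 :: a, b, m
      integer*8 :: x, y, s
      x = mod(a, m)
      y = mod(b, m)
      s = 0
      do while (y /= 0)
         if (btest(y, 0)) then
            s = s + x
            if (s >= m) s = s - m
         end if
         y = ishft(y, -1)       ! next bit of b
         x = ishft(x, 1)        ! double x, keeping it reduced mod m
         if (x >= m) x = x - m
      end do
      mulmod = s
      end function mulmod

A caller declares INTEGER*8 MULMOD and uses it wherever x*y mod z is needed;
the restriction that 2*m must be representable is the same one the program
in .7 lives with.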