[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference turris::fortran

Title:Digital Fortran
Notice:Read notes 1.* for important information
Moderator:QUARK::LIONEL
Created:Thu Jun 01 1995
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1333
Total number of notes:6734

747.0. "Validating REAL data on input (e.g., checking for denormalized numbers)" by HDLITE::NEWMAN (Chuck Newman, 508/467-5499 (DTN 297), MRO1-3/F28) Wed Apr 24 1996 18:01

T.RTitleUserPersonal
Name
DateLines
747.1TLE::EKLUNDAlways smiling on the inside!Wed Apr 24 1996 18:3219
747.2How about an environment variable?HDLITE::NEWMANChuck Newman, 508/467-5499 (DTN 297), MRO1-3/F26Wed Apr 24 1996 23:3419
747.3Allow user control, if implementedSUBPAC::FARICELLIThu Apr 25 1996 07:549
747.4How about passing the info to the rtlFORMAT::KIMBALLKeith Kimball 381-0573Thu Apr 25 1996 10:136
747.5Why penalize all IO?SUBPAC::FARICELLIThu Apr 25 1996 11:1111
747.6EXP does it right, why can't READSTEVEN::hobbsSteve HobbsThu Apr 25 1996 15:5032
747.7Use the RTL flag parameterELBANE::GROVEThu Apr 25 1996 19:0013
747.8workingTLE::WHITLOCKStan WhitlockFri Apr 26 1996 10:2616
747.9What's the current status with this?HDLITE::NEWMANChuck Newman, 508/467-5499 (DTN 297), MRO1-3/F26Tue Aug 13 1996 14:499
747.10when minutes seem like hoursTLE::WHITLOCKStan WhitlockWed Aug 21 1996 09:0510
747.11Half year ticklerHYDRA::NEWMANChuck Newman, 508/467-5499 (DTN 297), MRO1-3/F26Tue Feb 04 1997 11:5962
The vendor is going to have to build their product with
-fpe1 so that it doesn't die when reading data which was
written by other systems (e.g., 1.0e-45 on SGI).

Any ideas as to whether it would be reasonable to write a
"fixup" routine?  I'm not sure what it should look like for
optimal performance.  Here's the routine I was thinking of.
I compiled it with Fortran V4.1 as follows (I'm really only
concerned with denorms):

	f77 -fpe1 -tune ev5 -unroll 1

        subroutine fixup(array, length)
        integer length
        real array(length)
        real ZERO/0.0/
        integer ndx
        integer UNROLL
        parameter(UNROLL=16)
        do ndx = 1, length-(UNROLL-1), UNROLL
          array(ndx+  0) = array(ndx+  0) + ZERO
          array(ndx+  1) = array(ndx+  1) + ZERO
          array(ndx+  2) = array(ndx+  2) + ZERO
          array(ndx+  3) = array(ndx+  3) + ZERO
          array(ndx+  4) = array(ndx+  4) + ZERO
          array(ndx+  5) = array(ndx+  5) + ZERO
          array(ndx+  6) = array(ndx+  6) + ZERO
          array(ndx+  7) = array(ndx+  7) + ZERO
          array(ndx+  8) = array(ndx+  8) + ZERO
          array(ndx+  9) = array(ndx+  9) + ZERO
          array(ndx+ 10) = array(ndx+ 10) + ZERO
          array(ndx+ 11) = array(ndx+ 11) + ZERO
          array(ndx+ 12) = array(ndx+ 12) + ZERO
          array(ndx+ 13) = array(ndx+ 13) + ZERO
          array(ndx+ 14) = array(ndx+ 14) + ZERO
          array(ndx+ 15) = array(ndx+ 15) + ZERO
          array(ndx+ 1) = array(ndx+ 1) + ZERO
        end do
        do ndx = ndx, length
          array(ndx) = array(ndx) + ZERO
        end do
        return
        end

Looking at the machine instructions, there are 4 trapb instructions,
which I wouldn't think would be optimal (i.e., it will have to wait
for a trap that will likely never happen.

Compiling -fpe0 (which of course has no trapbs and has adds instead
of addssu) schedules the instructions VERY differently.

The only other thing I can think of that might hurt performance
with the code I have above would be all the stores.
Optimally, the stores would only happen if there was a trap, but
I would think the overhead of doing the check would be greater
than the overhead of doing the store.

Would it make sense to do the check?  If so, how would it be done?

Would it be a function of how frequently there are denorms?

								-- Chuck Newman
747.12Treat reals5 as integers in your filter ...HPCGRP::MANLEYTue Feb 04 1997 14:366
Since "fixup" is a subroutine, "array" may just as well be declared as an array
of integers. Then, for each array element, you may extract the exponent, test
it for zero, and clear the element if need be. You will store elements only
when they're cleared and you'll never trap.

747.13FP_CLASS makes it easierTLE::WHITLOCKStan WhitlockTue Feb 04 1997 16:4242
So what you have is IEEE binary data read into an array from some unclean
system and you want to stomp the unclean data {denorms, infinities, NaNs}
to 0 {because you want to run with -fpe0 and you don't wanna die}.

There is the intrinsic FP_CLASS

        -   FP_CLASS(x) returns an integer that indicates the nature of its
            REAL or DOUBLE PRECISION argument.  /usr/include/fordef.f contains
            the definitions of the return value:

                PARAMETER FOR_K_FP_SNAN = '00000000'X
                PARAMETER FOR_K_FP_QNAN = '00000001'X
                PARAMETER FOR_K_FP_POS_INF = '00000002'X
                PARAMETER FOR_K_FP_NEG_INF = '00000003'X
                PARAMETER FOR_K_FP_POS_NORM = '00000004'X
                PARAMETER FOR_K_FP_NEG_NORM = '00000005'X
                PARAMETER FOR_K_FP_POS_DENORM = '00000006'X
                PARAMETER FOR_K_FP_NEG_DENORM = '00000007'X
                PARAMETER FOR_K_FP_POS_ZERO = '00000008'X
                PARAMETER FOR_K_FP_NEG_ZERO = '00000009'X

You could code

        subroutine fixup(array, length)
        integer length
        real array(length)
        integer i
        do i = 1, length
which:	  select case (fp_class (array (i)))
	  case (4,5,8)
	    exit which
	  case (0,1,2,3,6,7,9)
	    array (i) = 0.0
	  end select which
        end do
        return
        end

FP_CLASS can touch a floating point number {it's generic so either REAL*4 or *8}
and not get an exception.

/Stan
747.14Using float seems faster, esp when few denormsHYDRA::NEWMANChuck Newman, 508/467-5499 (DTN 297), MRO1-3/F26Tue Feb 04 1997 18:1726
Well, my code in .11 (without the extra "array(ndx+1)" statement)
seems to be faster by a factor of ~2 - 5, depending on the number of
denorms, than .12 (with statements like one of the following):

      if (((array(ndx) .and. '7f800000'x) .eq. 0)
     1  .and. (array(ndx) .ne. 0) ) array(ndx) = 0


      if (((array(ndx) .and. '7f800000'x) .eq. 0) ) array(ndx) = 0


This is from stuffing the "bad" value into an integer
equivalenced to a real.

If the denorm comes as the result of a calculation
(e.g., if I add 0.0 to the denorm in the main program
and compile it -fpe3), my routine takes much longer�, and
has the bonus feature that the denorm doesn't get fixed up!

I link with -fpe0, but I suspect there is one handler for
the whole image, and it's the one associated with the
"highest" -fpe level specified.


� Takes much longer when there are lots of denorms.  For few
  denorms .11 is still faster
747.152nd try at FP_CLASSTLE::WHITLOCKStan WhitlockWed Feb 05 1997 08:3948
My code in .13 is wrong... SELECT CASE is a Fortran 90 construct that is
supported by f77 but I didn't spell it right:

    o	"WHICH:" is a construct name that has to start in the statement
	field {> column 6}

    o	EXIT exits DO loops - it's an extension I never got implemented that
	EXIT should take you out of named constructs like CASE.

So the correct code is

        subroutine fixup(array, length)
        integer length
        real array(length)
        integer i
        do i = 1, length
	  select case (fp_class (array (i)))
!	  case (4,5,8)		! +/- normal numbers and +0
!	    nothing to do
	  case (0,1,2,3,6,7,9)	! NaNs, +/- infinities, +/- denorms, -0
	    array (i) = 0.0
	  end select
        end do
        return
        end

RE: .14

-fpe does 2 things:

    o	compiles code that does or does not allow trap shadows {which allow
	exceptions to be fixed up}

    o	when used on the main program, makes a call to the RTL handler to tell
	it what fpe semantics are requested

So if you compile the main program -fpe3 and a subroutine -fpe0 and get an
exception inside the subroutine, bad things may happen since the subroutine code
is not structured to allow the handler to fix up the exception.

In your case, you want the main program compiled -fpe0 and the subroutine at
-fpe1.  Then inside your subroutine, you have to call for_set_fpe to set the
handler to fpe 1 {and then reset before you exit}.  See for_set_fpe(3f) for
details.

But FP_CLASS is a much easier way to do it and everybody can be compiled -fpe0.

/Stan
747.16Use 5IAND not .AND.HPCGRP::MANLEYWed Feb 05 1997 09:5221
> Well, my code in .11 (without the extra "array(ndx+1)" statement)
> seems to be faster by a factor of ~2 - 5, depending on the number of
> denorms, than .12 (with statements like one of the following):
>
>      if (((array(ndx) .and. '7f800000'x) .eq. 0)
>     1  .and. (array(ndx) .ne. 0) ) array(ndx) = 0
>
>
>      if (((array(ndx) .and. '7f800000'x) .eq. 0) ) array(ndx) = 0

I had something like

	if( iand( array(i),'7f800000'x ).eq.0 ) array(i)=0

in mind.

Frankly, I can't predict what will happen when logical "and"s are used where
boolean "and" are needed. Also, if you want to minimize stores, don't forget
that IEEE -0.0 does not equal integer 0.

747.17Floating ADD fastest when no denorms, otherwise integer exponent checkHYDRA::NEWMANChuck Newman, 508/467-5499 (DTN 297), MRO1-3/F26Wed Feb 05 1997 09:5939
re: .15:

�In your case, you want the main program compiled -fpe0 and the subroutine at
�-fpe1.  Then inside your subroutine, you have to call for_set_fpe to set the
�handler to fpe 1 {and then reset before you exit}.  See for_set_fpe(3f) for
�details.

I compiled with those switches, but I didn't do any calls to for_set_fpe
in my subroutine and it works handsomely.  When I add the calls to for_set_fpe
the routine stops cleaning up the denorms.  I've tried passing it FPE_M_TRAP_UND
and FPE_M_TRAP_INV.The documentation here is kinda sketchy.

It looks like the fastest thing would be to use the Floating case until I get
a denorm, then use the integer case.  How can I count the traps?  Is there a
data structure I can hook up to?

Here's the timings for now:

		Floating	Integer		Case
Case A1		8.148438	.0078125	.0054688
Case B1		8.941406	2.476563	4.789063
Case C1		1.535156	3.054688	4.929688

Case A2		.0078125	.0117188	.0585938
Case B2		.3906250	2.539063	4.769531
Case B2		.6250000	3.046875	4.976563


Case A1:  Clean    16 reals; the first 8 are denorms.  Do this many times
Case B1:  Clean  1600 reals; the first 8 are denorms.  Do this many times
Case C1:  Clean 16000 reals; the first 8 are denorms.  Do this many times

Case A2:  Clean    16 reals; no denorms.  Do this many times
Case B2:  Clean  1600 reals; no denorms.  Do this many times
Case C2:  Clean 16000 reals; no denorms.  Do this many times

Floating:  Add 0.0 to the number and store it back.  Compile -fpe1
Integer:   Check the exponent.  If zero, store a zero.
Case:      Check fp_class.  If Nan, infinity, denorm, or -0 store a zero.
747.18who said speed is beautiful?TLE::WHITLOCKStan WhitlockWed Feb 05 1997 16:3545
>>I compiled with those switches, but I didn't do any calls to for_set_fpe
>>in my subroutine and it works handsomely.  When I add the calls to for_set_fpe
>>the routine stops cleaning up the denorms.  I've tried passing it
>>FPE_M_TRAP_UND and FPE_M_TRAP_INV.The documentation here is kinda sketchy.

You want to set FPE_M_ABRUPT_UND - that's what says "stomp underflow to 0.0".
But on Alpha, at -fpe0, it's not the handler that does the stomping, it's the
hardware;  that's why it's fast.  At -fpe1, there are trap shadows, the handler
is called, and the ABRUPT_UND is done.

>>It looks like the fastest thing would be to use the Floating case until I get
>>a denorm, then use the integer case.  How can I count the traps?  Is there a
>>data structure I can hook up to?

The RTL counts exceptions but those counts aren't available to you.

Something doesn't make sense here... when you load a denorm, it's not an
Underflow, it's an Invalid.  That's why you need -fpe1 in order to touch the
data.  -fpe1 has to have trap shadows and that slows stuff down.  When the
"denorm + 0.0" is done, you get an underflow and the handler stomps it to 0.0.

So what we want is -fpe0 and a way to touch the data so it won't get an Invalid
if the data is a denorm.  And we want speed...

	do i = 1,n
	  if ((a(i).and. exponent-mask) .eq. 0) a(i) = 0
	end do

won't unroll the loop.  Plus the fastest thing to do is get the IF turned into
a CMOVE instruction.  So the speed wizards suggest

	do i=1,n,16		! hand unrolled by 16, for example
	  ! use DFOR$PREFETCH to prefetch elements I+17 thru I+32
	  ! like CALL DFOR$PREFETCH (A(I+17)) etc
	  K0  = A(I   ) .AND. '7f800000'X
	  K1  = A(I+ 1) .AND. '7f800000'X
	  ...
	  K15 = A(I+15) .AND. '7f800000'X
	  IF (K0  .EQ. 0) A(I   ) = 0		! these should become CMOVEs
	  IF (K1  .EQ. 0) A(I+ 1) = 0
	  ...
	  IF (K15 .EQ. 0) A(I+15) = 0
	end do

I hope something here has helped.			/Stan
747.19Sure would be nice if I could get to the RTL's exception counts...HYDRA::NEWMANChuck Newman, 508/467-5499 (DTN 297), MRO1-3/F26Thu Feb 06 1997 09:4687
�>>I compiled with those switches, but I didn't do any calls to for_set_fpe
�>>in my subroutine and it works handsomely.  When I add the calls to for_set_fpe
�>>the routine stops cleaning up the denorms.  I've tried passing it
�>>FPE_M_TRAP_UND and FPE_M_TRAP_INV.The documentation here is kinda sketchy.
�
�You want to set FPE_M_ABRUPT_UND - that's what says "stomp underflow to 0.0".
�But on Alpha, at -fpe0, it's not the handler that does the stomping, it's the
�hardware;  that's why it's fast.  At -fpe1, there are trap shadows, the handler
�is called, and the ABRUPT_UND is done.

Thanks, using FPE_M_ABRUPT_UND does it.  Is there an explaination of what these
mean anywhere?

I understand about who squashes the denorms.  The hardware will only do it if
the RESULT of an operation is a denorm.  If the INPUT to an operation is a
denorm then a handler is needed.  My reasoning for doing what I did was as
follows:

1)  If there aren't any invalid data, this should scream right along at one
datum per cycle -- 2 memory accesses and one floating operation.

2)  If I unroll by 16, there should only need to be 1 trapb at the end of each
iteration of the loop since there are enough unique registers for the trap
handler to figure out what's going on.  I should even be able to unroll quite a
bit more.  Seems, however, that the compiler doesn't do scheduling to minimize
the number of trapb instructions, so I can only unroll by 6 if I only want one
trapb (Well, I can understand 2 trapb's -- 1 before the first store (leaving a
few more add's to do, and one before the stores for those last few adds -- but I
don't think the compiler should generate any more than that).

�>>It looks like the fastest thing would be to use the Floating case until I get
�>>a denorm, then use the integer case.  How can I count the traps?  Is there a
�>>data structure I can hook up to?
�
�The RTL counts exceptions but those counts aren't available to you.

Rats!!!  That may be my downfall.  I could do something with RPCCs at
the head and tail of the loop and compare that against some number, but that's
getting a little to ugly even for me.  Any besides, Digital UNIX occasionally
schedules "stuff" that runs in the process's context which will inflate this
number from time to time.  And there's cache misses to think about...

The penalty for those cases where there *are* denorms is too high to ignore.

�Something doesn't make sense here... when you load a denorm, it's not an
�Underflow, it's an Invalid.  That's why you need -fpe1 in order to touch the
�data.  -fpe1 has to have trap shadows and that slows stuff down.  When the
�"denorm + 0.0" is done, you get an underflow and the handler stomps it to 0.0.
�
�So what we want is -fpe0 and a way to touch the data so it won't get an Invalid
�if the data is a denorm.  And we want speed...
�
�       do i = 1,n
�         if ((a(i).and. exponent-mask) .eq. 0) a(i) = 0
�       end do
�
�won't unroll the loop.

I would think that you'll pay for the trap shadow a little from the trapb as
long as you don't have any denorms (but no more than 4 cycles -- the latency of
an adds).  Looking at the exponent requires a couple of integer operations and
some combination of (conditional move/store or test/conditional branch/ store).
If I remember correctly, you can only have two of (load/store/integer operation)
per cycle, so these won't multi-issue as well as using a floating add).

�Plus the fastest thing to do is get the IF turned into
�a CMOVE instruction.  So the speed wizards suggest
�
�       do i=1,n,16             ! hand unrolled by 16, for example
�         ! use DFOR$PREFETCH to prefetch elements I+17 thru I+32
�         ! like CALL DFOR$PREFETCH (A(I+17)) etc
�         K0  = A(I   ) .AND. '7f800000'X
�         K1  = A(I+ 1) .AND. '7f800000'X
�         ...
�         K15 = A(I+15) .AND. '7f800000'X
�         IF (K0  .EQ. 0) A(I   ) = 0           ! these should become CMOVEs
�         IF (K1  .EQ. 0) A(I+ 1) = 0
�         ...
�         IF (K15 .EQ. 0) A(I+15) = 0
�       end do
�
�I hope something here has helped.                      /Stan

I'll have to try this and see what the compiler generates, but it looks like
you're trying to generate a conditional STORE, not a conditional MOVE.  If this
doesn't do it, however, I'm sure I can write it to induce a CMOV -- I've done it
before.  I'll let you know what that does to the performance.
747.20Latest integer code is fastestHYDRA::NEWMANChuck Newman, 508/467-5499 (DTN 297), MRO1-3/F26Thu Feb 06 1997 10:4229
As I suspected, Stan's suggestion didn't induce CMOV instructions, but it did
run pretty quick, and only did stores if it needed to.

I was able to induce CMOVs as follows:

          f0  = array(ndx)
          if (iand(f0  , '7f800000'x) .eq. 0) f0  = 0
          array(ndx) = f0

... which always did the store (as does the version with the floating add)

Both of these integer versions are about the same speed.

The two calls to for_set_fpe() has slowed down the floating code when there are
no exceptions, but sped it up when there are.

Looks like I'll use the integer code.



One (hopefully) final question, which will determine which variant of the
integer code to use:

Since the Alpha chips are read-allocate, is there any penalty for doing the
stores even when they aren't needed?  Doesn't seem to affect my toy program,
but what will the effect be (if any) on a real application?  If no impact on
ev4 and ev5, how about ev6?  Learning all the time...

								-- Chuck Newman
747.21Doing stores doubles your mem bandwidth requirementsWIDTH::MDAVISMark Davis - compiler maniacThu Feb 06 1997 10:575
and since you're doing almost nothing with each element, for large
arrays you will be running at mem bandwidth speed - doubling the 
requirement will double your time.  Try running your toy program
on an 8meg array.

747.22A better version when there are lots of zerosHYDRA::NEWMANChuck Newman, 508/467-5499 (DTN 297), MRO1-3/F26Wed Apr 02 1997 16:4387
My vendor is expecting mostly zeros, so Stan's routine would still run
at memory speed since it does the store for denorms and zeros.

The following is better when there are mostly zeros for two reasons --
it avoids the store and uses fewer instructions.  It also has a
single precision version and a double precision version.

F.W.I.W., The "mask" variable gave me troubles (I think I used V4.1)

1)  If I didn't use it but had the constant in-line, the compiler generated
	it via an LDAH for each datum.
2)  If I used the mask variable and initialized it with an implicit data
	statement, the compiler loaded it from memory for each datum.
3)  Only if I used the mask variable and set it with an assignment statement
	could I get the compiler to keep it in a register.

      options /extend_source
      subroutine fixup_4(array, length)
      implicit none
      integer length
      integer array(length)
      integer UNROLL
      parameter (UNROLL=16)
      integer ndx
      integer f0, f1, f2, f3, f4, f5, f6, f7, f8
      integer f9, f10, f11, f12, f13, f14, f15, f16
      integer mask_4
      mask_4 = '7f800000'x
      do ndx = 1, length-(UNROLL-1), UNROLL
        if ((iand(array(ndx +  0), mask_4) .eq. 0) .and. (array(ndx +  0) .ne. 0)) array(ndx +  0) = 0
        if ((iand(array(ndx +  1), mask_4) .eq. 0) .and. (array(ndx +  1) .ne. 0)) array(ndx +  1) = 0
        if ((iand(array(ndx +  2), mask_4) .eq. 0) .and. (array(ndx +  2) .ne. 0)) array(ndx +  2) = 0
        if ((iand(array(ndx +  3), mask_4) .eq. 0) .and. (array(ndx +  3) .ne. 0)) array(ndx +  3) = 0
        if ((iand(array(ndx +  4), mask_4) .eq. 0) .and. (array(ndx +  4) .ne. 0)) array(ndx +  4) = 0
        if ((iand(array(ndx +  5), mask_4) .eq. 0) .and. (array(ndx +  5) .ne. 0)) array(ndx +  5) = 0
        if ((iand(array(ndx +  6), mask_4) .eq. 0) .and. (array(ndx +  6) .ne. 0)) array(ndx +  6) = 0
        if ((iand(array(ndx +  7), mask_4) .eq. 0) .and. (array(ndx +  7) .ne. 0)) array(ndx +  7) = 0
        if ((iand(array(ndx +  8), mask_4) .eq. 0) .and. (array(ndx +  8) .ne. 0)) array(ndx +  8) = 0
        if ((iand(array(ndx +  9), mask_4) .eq. 0) .and. (array(ndx +  9) .ne. 0)) array(ndx +  9) = 0
        if ((iand(array(ndx + 10), mask_4) .eq. 0) .and. (array(ndx + 10) .ne. 0)) array(ndx + 10) = 0
        if ((iand(array(ndx + 11), mask_4) .eq. 0) .and. (array(ndx + 11) .ne. 0)) array(ndx + 11) = 0
        if ((iand(array(ndx + 12), mask_4) .eq. 0) .and. (array(ndx + 12) .ne. 0)) array(ndx + 12) = 0
        if ((iand(array(ndx + 13), mask_4) .eq. 0) .and. (array(ndx + 13) .ne. 0)) array(ndx + 13) = 0
        if ((iand(array(ndx + 14), mask_4) .eq. 0) .and. (array(ndx + 14) .ne. 0)) array(ndx + 14) = 0
        if ((iand(array(ndx + 15), mask_4) .eq. 0) .and. (array(ndx + 15) .ne. 0)) array(ndx + 15) = 0
      end do
      do ndx = ndx, length
        if ((iand(array(ndx), mask_4)  .eq. 0) .and. (array(ndx) .ne. 0)) array(ndx) = 0
      end do
      return
      end
c
      options /extend_source
      subroutine fixup_8(array, length)
      implicit none
      integer length
      integer*8 array(length)
      integer UNROLL
      parameter (UNROLL=16)
      integer ndx
      integer*8 f0, f1, f2, f3, f4, f5, f6, f7, f8
      integer*8 f9, f10, f11, f12, f13, f14, f15, f16
      integer*8 mask_8
      mask_8 = '7ff0000000000000'x
      do ndx = 1, length-(UNROLL-1), UNROLL
        if ((iand(array(ndx +  0), mask_8) .eq. 0) .and. (array(ndx +  0) .ne. 0)) array(ndx +  0) = 0
        if ((iand(array(ndx +  1), mask_8) .eq. 0) .and. (array(ndx +  1) .ne. 0)) array(ndx +  1) = 0
        if ((iand(array(ndx +  2), mask_8) .eq. 0) .and. (array(ndx +  2) .ne. 0)) array(ndx +  2) = 0
        if ((iand(array(ndx +  3), mask_8) .eq. 0) .and. (array(ndx +  3) .ne. 0)) array(ndx +  3) = 0
        if ((iand(array(ndx +  4), mask_8) .eq. 0) .and. (array(ndx +  4) .ne. 0)) array(ndx +  4) = 0
        if ((iand(array(ndx +  5), mask_8) .eq. 0) .and. (array(ndx +  5) .ne. 0)) array(ndx +  5) = 0
        if ((iand(array(ndx +  6), mask_8) .eq. 0) .and. (array(ndx +  6) .ne. 0)) array(ndx +  6) = 0
        if ((iand(array(ndx +  7), mask_8) .eq. 0) .and. (array(ndx +  7) .ne. 0)) array(ndx +  7) = 0
        if ((iand(array(ndx +  8), mask_8) .eq. 0) .and. (array(ndx +  8) .ne. 0)) array(ndx +  8) = 0
        if ((iand(array(ndx +  9), mask_8) .eq. 0) .and. (array(ndx +  9) .ne. 0)) array(ndx +  9) = 0
        if ((iand(array(ndx + 10), mask_8) .eq. 0) .and. (array(ndx + 10) .ne. 0)) array(ndx + 10) = 0
        if ((iand(array(ndx + 11), mask_8) .eq. 0) .and. (array(ndx + 11) .ne. 0)) array(ndx + 11) = 0
        if ((iand(array(ndx + 12), mask_8) .eq. 0) .and. (array(ndx + 12) .ne. 0)) array(ndx + 12) = 0
        if ((iand(array(ndx + 13), mask_8) .eq. 0) .and. (array(ndx + 13) .ne. 0)) array(ndx + 13) = 0
        if ((iand(array(ndx + 14), mask_8) .eq. 0) .and. (array(ndx + 14) .ne. 0)) array(ndx + 14) = 0
        if ((iand(array(ndx + 15), mask_8) .eq. 0) .and. (array(ndx + 15) .ne. 0)) array(ndx + 15) = 0
      end do
      do ndx = ndx, length
        if ((iand(array(ndx), mask_8)  .eq. 0) .and. (array(ndx) .ne. 0)) array(ndx) = 0
      end do
      return
      end