Instruction scheduling primarily does two things:
- Position operations with non-zero latency so that more of the latency is
buried under independent work, making better use of hardware parallelism.
(This is the primary issue, since latencies are frequently multi-cycle.)
- Pack operations to satisfy asymmetric issue rules. (In most cases this is
a relatively minor effect compared to the first, and it only matters on
machines that can issue more than one operation per cycle, but only in
particular combinations.)
For the load/store/load/store case, the effect is to hide the load latency.
Graphically, since the latency from a load to a dependent store is 3 cycles,
and EV4 is a dual-issue machine that can issue only 1 memory op per cycle:
Subcycle 0 Subcycle 1
------------------------ ------------------------
C 0 ldq r0,(r16) - -
y 1 - - - -
c 2 - - - -
l 3 - - stq r0,(r17)
e 4 ldq r1,(r18) - -
5 - - - -
6 - - - -
7 - - stq r1,(r19)
Now, after scheduling, this looks like:
Subcycle 0 Subcycle 1
------------------------ ------------------------
C 0 ldq r0,(r16) - -
y 1 - - ldq r1,(r18)
c 2 - - - -
l 3 stq r0,(r17) - -
e 4 - - stq r1,(r19)
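The same effect can be reproduced with a toy model of the rules used in the
two displays above: dual issue, at most one memory operation per cycle, and
a 3-cycle latency from a load to a store of the loaded register. The sketch
below (Python, purely for illustration; it is not the compiler's scheduler,
and the function and variable names are invented here) recomputes both
displays from the two instruction orders:

LOAD_TO_STORE = 3   # cycles from a load issuing to a dependent store issuing

def issue_cycles(insns):
    """insns: (op, reg) pairs in program order, op in {'ldq', 'stq'}.
    Assumes every stq stores a register loaded earlier in the list.
    Returns the cycle in which each instruction issues under the toy rules:
    in-order issue, two instructions per cycle, one memory op per cycle."""
    loaded_at = {}                  # reg -> cycle its ldq issued
    mem_busy = set()                # cycles whose memory slot is taken
    issued = {}                     # cycle -> instructions issued that cycle
    result, cycle = [], 0
    for op, reg in insns:
        earliest = loaded_at[reg] + LOAD_TO_STORE if op == 'stq' else cycle
        c = max(cycle, earliest)
        # ldq and stq are both memory ops, so at most one per cycle
        while c in mem_busy or issued.get(c, 0) == 2:
            c += 1
        mem_busy.add(c)
        issued[c] = issued.get(c, 0) + 1
        if op == 'ldq':
            loaded_at[reg] = c
        result.append(c)
        cycle = c                   # in-order: later instructions cannot pass
    return result

unscheduled = [('ldq', 'r0'), ('stq', 'r0'), ('ldq', 'r1'), ('stq', 'r1')]
scheduled   = [('ldq', 'r0'), ('ldq', 'r1'), ('stq', 'r0'), ('stq', 'r1')]
print(issue_cycles(unscheduled))    # [0, 3, 4, 7] -- 8 cycles, as above
print(issue_cycles(scheduled))      # [0, 1, 3, 4] -- 5 cycles, as above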
Now, consider a slightly more complex example:
subroutine s(a, b, c, d, r, a1, b1, c1, d1, r1)
double precision a, b, c, d, r, a1, b1, c1, d1, r1
r = (a + b) * (c + d)
r1 = (a1 + b1) * (c1 + d1)
end
Each assignment is a double precision floating point computation. For the
21064 (EV4), the latencies of interest are: ldt, 3 cycles (for a primary
cache hit); floating operations, 6 cycles to another float op and 5 cycles
to a floating store.
The display used below was generated by the instruction scheduler. (This
was compiled with only the later scheduler enabled to show the full effect
of scheduling at one time. For more information on reading scheduling dumps,
see gemgrp""::smop$:<glossop.public>reading-a-sched-dump.txt. To reproduce
this specific case, use /switch=noil_schedule/dump=sched_*.)
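To see where those floating latencies land, it helps to walk the dependence
chain of the first statement, r = (a + b) * (c + d), using only the latencies
quoted above and ignoring issue restrictions. The little longest-path sketch
below (Python, just an illustration; the node labels are informal names for
the operations in the chain) gives the earliest cycle each operation could
issue if all four loads issued in cycle 0:

# Dependence edges for r = (a + b) * (c + d) with the quoted latencies:
# ldt = 3 cycles, float op -> float op = 6, float op -> floating store = 5.
edges = [
    ('ldt a', 'addt a+b', 3), ('ldt b', 'addt a+b', 3),
    ('ldt c', 'addt c+d', 3), ('ldt d', 'addt c+d', 3),
    ('addt a+b', 'mult', 6), ('addt c+d', 'mult', 6),
    ('mult', 'stt r', 5),
]

# Longest path from the loads to each operation; the edge list is already
# in topological order, so one pass is enough.
earliest = {}
for producer, consumer, latency in edges:
    earliest[consumer] = max(earliest.get(consumer, 0),
                             earliest.get(producer, 0) + latency)
print(earliest['mult'])     # 9  -- the multiply can issue no earlier than cycle 9
print(earliest['stt r'])    # 14 -- the store of r, no earlier than cycle 14

So even with unlimited issue width, each statement carries roughly 14 cycles
of dependence latency. As the displays below show, the unscheduled code pays
the two chains essentially back-to-back, while the scheduled code overlaps
them.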
Since EV4 issues instructions in pairs, the first instruction in a pair is
always shown in the left-hand column and the second in the right-hand column,
even if the two instructions issue in different cycles.
In the unscheduled code you see the "expected" code generation: load the
operands for an operation from memory, perform the operation, then finally
store the result. The net result is 49 cycles on EV4 (if all memory
references are cache hits).
Unscheduled code (packed display)
Subcycle 0 Subcycle 1
------------------------ ------------------------
0 ldt f0,(r16) - -
1 - - ldt f1,(r17)
2 - - - -
3 - - - -
4 addt f0,f1,f0 ldt f10,(r18)
5 ldt f11,(r19) - -
6 - - - -
7 - - - -
8 - - addt f10,f11,f10
9 - - - -
10 - - - -
11 - - - -
12 - - - -
13 - - - -
14 mult f0,f10,f0 - -
15 - - - -
16 - - - -
17 - - - -
18 - - stt f0,(r20)
19 ldt f12,(r21) - -
20 - - ldq r1,(sp)
21 - - - -
22 - - - -
23 ldt f13,(r1) - -
24 - - - -
25 - - - -
26 - - addt f12,f13,f12
27 ldq r2,8(sp) - -
28 - - - -
29 - - - -
30 - - ldt f14,(r2)
31 ldq r3,16(sp) - -
32 - - - -
33 - - - -
34 - - ldt f15,(r3)
35 - - - -
36 - - - -
37 addt f14,f15,f14 - -
38 - - - -
39 - - - -
40 - - - -
41 - - - -
42 - - - -
43 - - mult f12,f14,f12
44 ldq r4,24(sp) - -
45 - - - -
46 - - - -
47 - - stt f12,(r4)
48 ret r26
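The issue cycles in this display follow directly from the latencies quoted
earlier; a quick back-of-the-envelope check of a few of them (Python purely
for convenience, with the cycle numbers read off the display above):

LDT, FLOAT_TO_FLOAT = 3, 6      # latencies quoted above

def ready(producer_issue_cycle, latency):
    """Earliest cycle a consumer may issue after its producer issues."""
    return producer_issue_cycle + latency

assert ready(1, LDT) == 4              # ldt f1 in cycle 1 -> addt f0,f1,f0 in cycle 4
assert ready(5, LDT) == 8              # ldt f11 in cycle 5 -> addt f10,f11,f10 in cycle 8
assert ready(8, FLOAT_TO_FLOAT) == 14  # addt f10,f11,f10 in cycle 8 -> mult in cycle 14

Because the code has not been reordered, nothing independent sits in those
gaps, and the empty slots are wasted issue opportunities.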
With instruction scheduling, you see both the effect of hiding latency and,
in cycles such as 8 and 9, the effect of the compiler packing different
instruction types into a single issue group. (If the processor could issue
any 2 instructions without functional-unit restrictions, the loads of f13 and
f14 could issue together, and the scheduler wouldn't have to worry nearly as
much about issue positions.) Note that this example was contrived mainly to
show the scheduler burying latency; there are equivalent, but less dramatic,
examples where issue pairing matters more.
Scheduled code (packed display)
Subcycle 0 Subcycle 1
------------------------ ------------------------
0 ldq r1,(sp) - -
1 - - ldq r2,8(sp)
2 ldq r3,16(sp) - -
3 - - ldt f0,(r16)
4 ldt f1,(r17) - -
5 - - ldt f10,(r18)
6 ldt f11,(r19) - -
7 - - ldt f12,(r21)
8 ldt f13,(r1) addt f0,f1,f0
9 ldt f14,(r2) addt f10,f11,f10
10 ldt f15,(r3) - -
11 - - addt f12,f13,f12
12 ldq r4,24(sp) - -
13 - - addt f14,f15,f14
14 - - - -
15 mult f0,f10,f0 - -
16 - - - -
17 - - - -
18 - - - -
19 - - mult f12,f14,f12
20 stt f0,(r20) - -
21 - - - -
22 - - - -
23 - - stt f12,(r4)
24 ret r26
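Tallying the two displays gives a feel for the overall win: both versions
issue the same 21 instructions, but the unscheduled code takes 49 cycles
(last issue in cycle 48) while the scheduled code takes 25 (last issue in
cycle 24). In terms of EV4's two issue slots per cycle, the arithmetic is
just this (Python only for convenience):

instructions = 21                # the same 21 instructions appear in both displays

for name, cycles in [('unscheduled', 49), ('scheduled', 25)]:
    slots = 2 * cycles           # EV4 offers two issue slots per cycle
    print(f"{name}: {cycles} cycles, "
          f"{instructions / slots:.0%} of issue slots used")
# unscheduled: 49 cycles, 21% of issue slots used
# scheduled: 25 cycles, 42% of issue slots used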