| Title: | Languages |
| Notice: | Speaking In Tongues |
| Moderator: | TLE::TOKLAS::FELDMAN |
| Created: | Sat Jan 25 1986 |
| Last Modified: | Wed May 21 1997 |
| Last Successful Update: | Fri Jun 06 1997 |
| Number of topics: | 394 |
| Total number of notes: | 2683 |
(For a presentation I'm giving in my class at school.) Cross-posted in Languages and Alphanotes. This is an Alpha architecture question, but I couldn't find a conference dedicated to that subject.

Why is instruction scheduling important? I think I understand the problems of multiple issue, but I don't have any solid examples. The paper we're working from (Blickstein et al., "The GEM Optimizing Compiler System") talks at some length about HOW it does instruction scheduling, in two passes: once on the intermediate language, then once again at the instruction level. What is missing is a good, clear explanation of WHY, and what the criteria are for deciding when to schedule instructions. Is there a good description available that I can refer to?

Specifically, the paper gives an example similar to this. (This is from memory, so there may be syntax errors, but the idea is right.) To implement B = A; D = C:

Bad:

```
    ldq r0, a(fp)   ; get a
    stq r0, b(fp)   ; b = a
    ldq r0, c(fp)   ; get c
    stq r0, d(fp)   ; d = c
```

Good:

```
    ldq r0, a(fp)   ; get a
    ldq r1, c(fp)   ; get c
    stq r0, b(fp)   ; b = a
    stq r1, d(fp)   ; d = c
```

The text claims this saves 3 CPU cycles, but doesn't explain how. I can see, qualitatively, how the reuse of r0 blocks multiple issue, but I don't know quantitatively how to calculate that, nor how to spot other bottlenecking sequences. I am looking for pointers to notes conferences or other documents.

Thank you,
JPB
| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 389.1 | Some comments and an example | GEMGRP::GLOSSOP | Low volume == Endangered species | Tue Apr 18 1995 13:46 | 156 |
Instruction scheduling primarily does two different things:
- Better utilize hardware parallelism by having operations with non-zero
latency positioned so more of the latency is buried. (This is
the primary issue, since latencies are frequently multi-cycle.)
- Pack operations for asymmetric issue rules. (This is a relatively
minor effect in most cases relative to the first, and only matters
for machines that can issue more than one operation per cycle,
and further, can only issue particular combinations.)
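The asymmetric-issue point can be made concrete with a toy pairing predicate. The rules below are illustrative assumptions for a generic dual-issue machine (at most one memory op and at most one floating-point op per cycle), not EV4's actual slotting table:

```python
# Toy dual-issue pairing check. The rules are assumptions for illustration:
# a cycle may contain at most one memory op and at most one floating-point op.

def can_pair(a, b):
    """a, b: instruction classes ('mem', 'fp', 'int').
    Return True if the two may issue in the same cycle under the toy rules."""
    pair = (a, b)
    return pair.count('mem') <= 1 and pair.count('fp') <= 1

print(can_pair('mem', 'fp'))   # True:  a load can pair with a float add
print(can_pair('mem', 'mem'))  # False: two memory ops cannot pair
```

Under rules like these, a scheduler that alternates instruction classes gets more dual-issue cycles than one that clusters same-class instructions together.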
For the load/store/load/store case, the effect is to remove latency.
Graphically, since load latency feeding store is 3, and EV4 is
a dual-issue machine that can issue 1 memory op/cycle:
```
              Subcycle 0                Subcycle 1
              ------------------------  ------------------------
  C      0    ldq   r0,(r16)            -     -
  y      1    -     -                   -     -
  c      2    -     -                   -     -
  l      3    -     -                   stq   r0,(r17)
  e      4    ldq   r1,(r18)            -     -
         5    -     -                   -     -
         6    -     -                   -     -
         7    -     -                   stq   r1,(r19)
```
Now, after scheduling, this looks like:
```
              Subcycle 0                Subcycle 1
              ------------------------  ------------------------
  C      0    ldq   r0,(r16)            -     -
  y      1    -     -                   ldq   r1,(r18)
  c      2    -     -                   -     -
  l      3    stq   r0,(r17)            -     -
  e      4    -     -                   stq   r1,(r19)
```
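The two timelines above can be reproduced with a toy in-order issue model. The model below is an illustration, not the real EV4 pipeline: it assumes at most one memory op issues per cycle, and that a load's result is usable 3 cycles after the load issues (the load-to-store latency quoted above):

```python
# Toy in-order issue model (an assumption for illustration, not the real
# EV4 pipeline): one memory op per cycle; a load's result becomes usable
# LOAD_LATENCY cycles after the load issues.

LOAD_LATENCY = 3  # cycles from a load's issue to its result feeding a store

def issue_cycles(ops):
    """ops: list of ('ld' | 'st', register) tuples in program order.
    Return the cycle at which each op issues."""
    ready = {}          # register -> cycle its value becomes available
    cycle = 0
    issued = []
    for kind, reg in ops:
        if kind == 'st':
            # a store must wait until the register it stores is ready
            cycle = max(cycle, ready.get(reg, 0))
        issued.append(cycle)
        if kind == 'ld':
            ready[reg] = cycle + LOAD_LATENCY
        cycle += 1      # in-order: at most one memory op issues per cycle
    return issued

unscheduled = [('ld', 'r0'), ('st', 'r0'), ('ld', 'r0'), ('st', 'r0')]
scheduled   = [('ld', 'r0'), ('ld', 'r1'), ('st', 'r0'), ('st', 'r1')]

print(issue_cycles(unscheduled))  # [0, 3, 4, 7] - last store issues at cycle 7
print(issue_cycles(scheduled))    # [0, 1, 3, 4] - last store issues at cycle 4
```

The last store moves from cycle 7 to cycle 4, which is where the claimed 3-cycle saving comes from: the second load's latency now overlaps the first load's, instead of being serialized behind it by the reuse of r0.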
Now, consider a slightly more complex example:
```
      subroutine s(a, b, c, d, r, a1, b1, c1, d1, r1)
      double precision a, b, c, d, r, a1, b1, c1, d1, r1
      r  = (a + b) * (c + d)
      r1 = (a1 + b1) * (c1 + d1)
      end
```
These are double precision floating point operations. For the 21064 (EV4), the latencies of interest are: ldt, 3 cycles (on a primary cache hit); floating-point ops, 6 cycles to another floating-point op, 5 cycles to a floating-point store.
The display used below was generated by the instruction scheduler. (This
was compiled with only the later scheduler enabled to show the full effect
of scheduling at one time. For more information on reading scheduling dumps,
see gemgrp""::smop$:<glossop.public>reading-a-sched-dump.txt. To reproduce
this specific case, use /switch=noil_schedule/dump=sched_*.)
Since EV4 issues instructions in pairs, the first instruction of a pair always appears in the left-hand column and the second in the right-hand column, even if the two instructions issue in different cycles.
In the unscheduled code you see the "expected" code generation: load the operands from memory for an operation, perform the operation, then, finally, store the result. The net result is 49 cycles on EV4 (if all memory references are cache hits).
Unscheduled code (packed display)
```
        Subcycle 0                Subcycle 1
        ------------------------  ------------------------
   0    ldt   f0,(r16)            -     -
   1    -     -                   ldt   f1,(r17)
   2    -     -                   -     -
   3    -     -                   -     -
   4    addt  f0,f1,f0            ldt   f10,(r18)
   5    ldt   f11,(r19)           -     -
   6    -     -                   -     -
   7    -     -                   -     -
   8    -     -                   addt  f10,f11,f10
   9    -     -                   -     -
  10    -     -                   -     -
  11    -     -                   -     -
  12    -     -                   -     -
  13    -     -                   -     -
  14    mult  f0,f10,f0           -     -
  15    -     -                   -     -
  16    -     -                   -     -
  17    -     -                   -     -
  18    -     -                   stt   f0,(r20)
  19    ldt   f12,(r21)           -     -
  20    -     -                   ldq   r1,(sp)
  21    -     -                   -     -
  22    -     -                   -     -
  23    ldt   f13,(r1)            -     -
  24    -     -                   -     -
  25    -     -                   -     -
  26    -     -                   addt  f12,f13,f12
  27    ldq   r2,8(sp)            -     -
  28    -     -                   -     -
  29    -     -                   -     -
  30    -     -                   ldt   f14,(r2)
  31    ldq   r3,16(sp)           -     -
  32    -     -                   -     -
  33    -     -                   -     -
  34    -     -                   ldt   f15,(r3)
  35    -     -                   -     -
  36    -     -                   -     -
  37    addt  f14,f15,f14         -     -
  38    -     -                   -     -
  39    -     -                   -     -
  40    -     -                   -     -
  41    -     -                   -     -
  42    -     -                   -     -
  43    -     -                   mult  f12,f14,f12
  44    ldq   r4,24(sp)           -     -
  45    -     -                   -     -
  46    -     -                   -     -
  47    -     -                   stt   f12,(r4)
  48    ret   r26
```
With instruction scheduling, you see both the effects of trying to remove
latency, and, in some cycles like 8 and 9, you see the effects of the
compiler packing different instruction types to a single instruction
issue group. (If the processor could issue 2 of anything without
functional unit restrictions, the loads for f13/f14 would be able to issue
together, and the scheduler wouldn't have to worry about the issue positions
nearly as much.) Note that this example was contrived mainly to show
the scheduler burying latency, though there are equivalent, but less
dramatic, examples where issue pairing matters more.
Scheduled code (packed display)
```
        Subcycle 0                Subcycle 1
        ------------------------  ------------------------
   0    ldq   r1,(sp)             -     -
   1    -     -                   ldq   r2,8(sp)
   2    ldq   r3,16(sp)           -     -
   3    -     -                   ldt   f0,(r16)
   4    ldt   f1,(r17)            -     -
   5    -     -                   ldt   f10,(r18)
   6    ldt   f11,(r19)           -     -
   7    -     -                   ldt   f12,(r21)
   8    ldt   f13,(r1)            addt  f0,f1,f0
   9    ldt   f14,(r2)            addt  f10,f11,f10
  10    ldt   f15,(r3)            -     -
  11    -     -                   addt  f12,f13,f12
  12    ldq   r4,24(sp)           -     -
  13    -     -                   addt  f14,f15,f14
  14    -     -                   -     -
  15    mult  f0,f10,f0           -     -
  16    -     -                   -     -
  17    -     -                   -     -
  18    -     -                   -     -
  19    -     -                   mult  f12,f14,f12
  20    stt   f0,(r20)            -     -
  21    -     -                   -     -
  22    -     -                   -     -
  23    -     -                   stt   f12,(r4)
  24    ret   r26
```
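The kind of reordering shown in these displays can be sketched as a minimal greedy list scheduler. This is an illustration of the general technique, not GEM's actual algorithm; the op names, dependence map, and latencies below are made up to model the B = A; D = C example from the base note:

```python
# Minimal greedy list scheduler (a sketch of the general technique, not
# GEM's algorithm). Each cycle it issues the ready op whose result sits on
# the longest remaining latency chain, assuming one op issues per cycle.

def list_schedule(ops, deps, latency):
    """ops: op names in program order.
    deps: {op: [ops whose results it consumes]}.
    latency: {op: cycles until op's result is usable}.
    Return {op: issue cycle}."""
    # height = longest latency chain from an op to the end (issue priority)
    height = {}
    def h(op):
        if op not in height:
            users = [u for u, ds in deps.items() if op in ds]
            height[op] = latency[op] + max((h(u) for u in users), default=0)
        return height[op]
    for op in ops:
        h(op)

    issued, cycle = {}, 0
    while len(issued) < len(ops):
        # an op is ready once every input has issued and its latency elapsed
        ready = [op for op in ops if op not in issued and
                 all(d in issued and issued[d] + latency[d] <= cycle
                     for d in deps.get(op, []))]
        if ready:
            op = max(ready, key=lambda o: height[o])
            issued[op] = cycle
        cycle += 1
    return issued

# The B = A; D = C example: loads take 3 cycles, stores 1 (assumed values).
ops = ['ld_a', 'st_b', 'ld_c', 'st_d']
deps = {'st_b': ['ld_a'], 'st_d': ['ld_c']}
latency = {'ld_a': 3, 'ld_c': 3, 'st_b': 1, 'st_d': 1}
print(list_schedule(ops, deps, latency))
# -> ld_a@0, ld_c@1, st_b@3, st_d@4: the same order as the "Good" sequence
```

Hoisting the second load into the first load's shadow is exactly what the height priority buys: both loads rank above both stores, so their latencies overlap instead of serializing.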