Instruction scheduling primarily does two things:
- Position operations with non-zero latency so that more of the latency is
buried under independent work, making better use of hardware parallelism.
(This is the primary issue, since latencies are frequently multi-cycle.)
- Pack operations to satisfy asymmetric issue rules. (In most cases this is
a relatively minor effect compared to the first, and it only matters on
machines that can issue more than one operation per cycle, but only in
particular combinations.)
For the load/store/load/store case, the effect is to hide the load latency.
Graphically, since the latency from a load to a dependent store is 3 cycles,
and EV4 is a dual-issue machine that can issue only 1 memory op per cycle:
Subcycle 0 Subcycle 1
------------------------ ------------------------
C 0 ldq r0,(r16) - -
y 1 - - - -
c 2 - - - -
l 3 - - stq r0,(r17)
e 4 ldq r1,(r18) - -
5 - - - -
6 - - - -
7 - - stq r1,(r19)
Now, after scheduling, this looks like:
Subcycle 0 Subcycle 1
------------------------ ------------------------
C 0 ldq r0,(r16) - -
y 1 - - ldq r1,(r18)
c 2 - - - -
l 3 stq r0,(r17) - -
e 4 - - stq r1,(r19)
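The same effect can be reproduced with a toy model of the rules used in the
two displays above: dual issue, at most one memory operation per cycle, and
a 3-cycle latency from a load to a store of the loaded register. The sketch
below (Python, purely for illustration; it is not the compiler's scheduler,
and the function and variable names are invented here) recomputes both
displays from the two instruction orders:

LOAD_TO_STORE = 3   # cycles from a load issuing to a dependent store issuing

def issue_cycles(insns):
    """insns: (op, reg) pairs in program order, op in {'ldq', 'stq'}.
    Assumes every stq stores a register loaded earlier in the list.
    Returns the cycle in which each instruction issues under the toy rules:
    in-order issue, two instructions per cycle, one memory op per cycle."""
    loaded_at = {}                  # reg -> cycle its ldq issued
    mem_busy = set()                # cycles whose memory slot is taken
    issued = {}                     # cycle -> instructions issued that cycle
    result, cycle = [], 0
    for op, reg in insns:
        earliest = loaded_at[reg] + LOAD_TO_STORE if op == 'stq' else cycle
        c = max(cycle, earliest)
        # ldq and stq are both memory ops, so at most one per cycle
        while c in mem_busy or issued.get(c, 0) == 2:
            c += 1
        mem_busy.add(c)
        issued[c] = issued.get(c, 0) + 1
        if op == 'ldq':
            loaded_at[reg] = c
        result.append(c)
        cycle = c                   # in-order: later instructions cannot pass
    return result

unscheduled = [('ldq', 'r0'), ('stq', 'r0'), ('ldq', 'r1'), ('stq', 'r1')]
scheduled   = [('ldq', 'r0'), ('ldq', 'r1'), ('stq', 'r0'), ('stq', 'r1')]
print(issue_cycles(unscheduled))    # [0, 3, 4, 7] -- 8 cycles, as above
print(issue_cycles(scheduled))      # [0, 1, 3, 4] -- 5 cycles, as above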
Now, consider a slightly more complex example:
subroutine s(a, b, c, d, r, a1, b1, c1, d1, r1)
double precision a, b, c, d, r, a1, b1, c1, d1, r1
r = (a + b) * (c + d)
r1 = (a1 + b1) * (c1 + d1)
end
Each assignment is a double precision floating point computation. For the
21064 (EV4), the latencies of interest are: ldt, 3 cycles (for a primary
cache hit); floating operations, 6 cycles to another float op and 5 cycles
to a floating store.
The display used below was generated by the instruction scheduler. (This
was compiled with only the later scheduler enabled to show the full effect
of scheduling at one time. For more information on reading scheduling dumps,
see gemgrp""::smop$:<glossop.public>reading-a-sched-dump.txt. To reproduce
this specific case, use /switch=noil_schedule/dump=sched_*.)
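To see where those floating latencies land, it helps to walk the dependence
chain of the first statement, r = (a + b) * (c + d), using only the latencies
quoted above and ignoring issue restrictions. The little longest-path sketch
below (Python, just an illustration; the node labels are informal names for
the operations in the chain) gives the earliest cycle each operation could
issue if all four loads issued in cycle 0:

# Dependence edges for r = (a + b) * (c + d) with the quoted latencies:
# ldt = 3 cycles, float op -> float op = 6, float op -> floating store = 5.
edges = [
    ('ldt a', 'addt a+b', 3), ('ldt b', 'addt a+b', 3),
    ('ldt c', 'addt c+d', 3), ('ldt d', 'addt c+d', 3),
    ('addt a+b', 'mult', 6), ('addt c+d', 'mult', 6),
    ('mult', 'stt r', 5),
]

# Longest path from the loads to each operation; the edge list is already
# in topological order, so one pass is enough.
earliest = {}
for producer, consumer, latency in edges:
    earliest[consumer] = max(earliest.get(consumer, 0),
                             earliest.get(producer, 0) + latency)
print(earliest['mult'])     # 9  -- the multiply can issue no earlier than cycle 9
print(earliest['stt r'])    # 14 -- the store of r, no earlier than cycle 14

So even with unlimited issue width, each statement carries roughly 14 cycles
of dependence latency. As the displays below show, the unscheduled code pays
the two chains essentially back-to-back, while the scheduled code overlaps
them.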
Since EV4 issues instructions in pairs, the first instruction in a pair is
always shown in the left-hand column and the second in the right-hand column,
even if the two instructions issue in different cycles.
In the unscheduled code you see the "expected" code generation: load the
operands for an operation from memory, perform the operation, then finally
store the result. The net result is 49 cycles on EV4 (if all memory
references are cache hits).
Unscheduled code (packed display)
Subcycle 0 Subcycle 1
------------------------ ------------------------
0 ldt f0,(r16) - -
1 - - ldt f1,(r17)
2 - - - -
3 - - - -
4 addt f0,f1,f0 ldt f10,(r18)
5 ldt f11,(r19) - -
6 - - - -
7 - - - -
8 - - addt f10,f11,f10
9 - - - -
10 - - - -
11 - - - -
12 - - - -
13 - - - -
14 mult f0,f10,f0 - -
15 - - - -
16 - - - -
17 - - - -
18 - - stt f0,(r20)
19 ldt f12,(r21) - -
20 - - ldq r1,(sp)
21 - - - -
22 - - - -
23 ldt f13,(r1) - -
24 - - - -
25 - - - -
26 - - addt f12,f13,f12
27 ldq r2,8(sp) - -
28 - - - -
29 - - - -
30 - - ldt f14,(r2)
31 ldq r3,16(sp) - -
32 - - - -
33 - - - -
34 - - ldt f15,(r3)
35 - - - -
36 - - - -
37 addt f14,f15,f14 - -
38 - - - -
39 - - - -
40 - - - -
41 - - - -
42 - - - -
43 - - mult f12,f14,f12
44 ldq r4,24(sp) - -
45 - - - -
46 - - - -
47 - - stt f12,(r4)
48 ret r26
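The issue cycles in this display follow directly from the latencies quoted
earlier; a quick back-of-the-envelope check of a few of them (Python purely
for convenience, with the cycle numbers read off the display above):

LDT, FLOAT_TO_FLOAT = 3, 6      # latencies quoted above

def ready(producer_issue_cycle, latency):
    """Earliest cycle a consumer may issue after its producer issues."""
    return producer_issue_cycle + latency

assert ready(1, LDT) == 4              # ldt f1 in cycle 1 -> addt f0,f1,f0 in cycle 4
assert ready(5, LDT) == 8              # ldt f11 in cycle 5 -> addt f10,f11,f10 in cycle 8
assert ready(8, FLOAT_TO_FLOAT) == 14  # addt f10,f11,f10 in cycle 8 -> mult in cycle 14

Because the code has not been reordered, nothing independent sits in those
gaps, and the empty slots are wasted issue opportunities.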
With instruction scheduling, you see both the effect of hiding latency and,
in cycles such as 8 and 9, the effect of the compiler packing different
instruction types into a single issue group. (If the processor could issue
any 2 instructions without functional-unit restrictions, the loads of f13 and
f14 could issue together, and the scheduler wouldn't have to worry nearly as
much about issue positions.) Note that this example was contrived mainly to
show the scheduler burying latency; there are equivalent, but less dramatic,
examples where issue pairing matters more.
Scheduled code (packed display)
Subcycle 0 Subcycle 1
------------------------ ------------------------
0 ldq r1,(sp) - -
1 - - ldq r2,8(sp)
2 ldq r3,16(sp) - -
3 - - ldt f0,(r16)
4 ldt f1,(r17) - -
5 - - ldt f10,(r18)
6 ldt f11,(r19) - -
7 - - ldt f12,(r21)
8 ldt f13,(r1) addt f0,f1,f0
9 ldt f14,(r2) addt f10,f11,f10
10 ldt f15,(r3) - -
11 - - addt f12,f13,f12
12 ldq r4,24(sp) - -
13 - - addt f14,f15,f14
14 - - - -
15 mult f0,f10,f0 - -
16 - - - -
17 - - - -
18 - - - -
19 - - mult f12,f14,f12
20 stt f0,(r20) - -
21 - - - -
22 - - - -
23 - - stt f12,(r4)
24 ret r26
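Tallying the two displays gives a feel for the overall win: both versions
issue the same 21 instructions, but the unscheduled code takes 49 cycles
(last issue in cycle 48) while the scheduled code takes 25 (last issue in
cycle 24). In terms of EV4's two issue slots per cycle, the arithmetic is
just this (Python only for convenience):

instructions = 21                # the same 21 instructions appear in both displays

for name, cycles in [('unscheduled', 49), ('scheduled', 25)]:
    slots = 2 * cycles           # EV4 offers two issue slots per cycle
    print(f"{name}: {cycles} cycles, "
          f"{instructions / slots:.0%} of issue slots used")
# unscheduled: 49 cycles, 21% of issue slots used
# scheduled: 25 cycles, 42% of issue slots used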