CmpE 110

Homework 7

Due Monday, November 24, 2003
 

Data Hazards, Branch Hazards and Prediction and Caches

Figure 1: Pipeline Implementation

                   add $8, $0, $0

                   add $7, $0, $0

                   addi $17, $0, 5

mainloop:          lw $2, 12($1)

                   addi $3, $2, 32

                   add $4, $3, $2

                   add $5, $3, $4

                   addi $16, $0, 5

innerloop:         lw  $6, 16($1)

                   add $7, $7, $6

                   sub $8, $8, $2

                   addi $16, $16, -1

                   bne $0, $16, innerloop

                   add $9, $7, $4

                   sub $10, $5, $8

                   sw $19, 12($1) 

                   sw $10, 16($1)

                   addi $17, $17, -1

                   bne $0, $17, mainloop

end:               addi $20, $0, 40

                   addi $21, $0, 55

Code Segment 1: A nested loop

Problems:

1.) Using code segment 1 and figure 1, identify all data dependencies that are also hazards, and indicate which can be resolved by forwarding, stalling, or both.  In your answer indicate the total number of stalls caused by these non-control hazards. (1 point)

Data Dependencies         H = Hazard S = avoided by Stalling F = avoided by Forwarding

add $8, $0, $0: sub $8, $8, $2

add $7, $0, $0: add $7, $7, $6

addi $17, $0, 5: addi $17, $17, -1

lw $2, 12($1): addi $3, $2, 32  H,  F, S (5 stalls total)

lw $2, 12($1): add $4, $3, $2

lw $2, 12($1): sub $8, $8, $2

addi $3, $2, 32: add $4, $3, $2 H, F

addi $3, $2, 32: add $5, $3, $4 H, F

add $4, $3, $2:  add $5, $3, $4 H, F

add $4, $3, $2:  add $9, $7, $4

add $5, $3, $4:  sub $10, $5, $8

addi $16, $0, 5: addi $16, $16, -1

lw  $6, 16($1): add $7, $7, $6 H, F, S   (25 stalls total)

add $7, $7, $6: add $9, $7, $4

sub $8, $8, $2: sub $10, $5, $8

addi $16, $16, -1: bne $0, $16, innerloop H, F

sub $10, $5, $8: sw $10, 16($1) H, F *The Result needs to be forwarded to the ID/EX.Dout2 or EX/MEM.Dout2 register because the value at register address Rt will be incorrect when its initially read in ID.

addi $17, $17, -1: bne $0, $17, mainloop H, F

 

Total Stalls = 30 stalls for the entire length of the code

2.) Using code segment 1, figure 1, and a prediction not-taken (PNT) scheme calculate the total number of wasted clock cycles for this program caused by using this scheme. Wasted clock cycles are caused by flushing the pipeline because of an incorrect prediction.  Assume branches are resolved in EX.   (1 point)

The first 4 loops of innerloop are taken and thus cause 8 wasted clock cycles during each of the 5 loops of the mainloop.  The first four loops of the main loop are taken and cause a total of 8 wasted clockcycles.  The total amount of wasted clock cycles from using PNT = 48 clock cycles. 

3.) Using code segment 1 and figure 1, calculate the execution time of the code segment.  Use the PNT scheme from problem 2 in calculating your answer.  Assume the clock frequency of the pipeline is 100 MHz.  (1 point)

Total execution time =  (#number of instructions executed + total number of stalls + depth of pipeline - 1) * clock cycle length

                                   =  (5(11 + 5(5)) + 5 + (48+30) + 5 -1)* 10ns = 2.670 micro seconds

4.)Using code segment 1 and figure 1,  act as the compiler and reorder the code to fill the branch delay slots without affecting program functionality.  Explain why each instruction can be used in the branch delay slot.   Assume the branches are resolved in EX.     Include in your answer the total number of unused delay slots. (1 point)

For the inner loop branch, we have only 2 possible instructions we could move, (add $7, $7, $6 : sub $8, $8, $2) because the only other instructions within the loop are not valid because the BNE is dependent on them.

The add $7, $7, $6 instruction is a great choice because of the lack of dependency with the BNE instruction and it removes the stall that would occur with the previous lw instruction which removes 25 stalls from the program.

The sub $8, $8, $2 is great for its lack of dependency on the BNE and other instructions.

For the main loop branch, we have only 2 possible instructions, (sw $19, 12($1):sw $10, 16($1)), any remaining instructions have dependencies with instructions that come after them.

Both of these instructions do not cause a dependency with the BNE instruction.

Total number of wasted cycles = 0 (or -25 for removing 25 other stalls)

5.) Using figure 1, and the reordered code from problem 4, calculate the execution time of the newly reordered code.  Assume a clock frequency of the pipeline is 100 MHz.  (1 point)

Total execution time =  (#number of instructions executed + total number of stalls + depth of pipeline - 1) * clock cycle length

                                   =  (5(11 + 5(5))  + 5 + (0+5) + 5 -1)* 10ns = 1.940 micro seconds

6.)  Using code segment 1, figure 1, and a dynamic 1-bit branch predictor scheme initially set to 0 (prediction not taken), calculate the total number of wasted clock cycles for this program.  Wasted clock cycles are caused by flushing the pipeline because of an incorrect prediction.   Assume that a branch prediction buffer stores the 1 bit branch predictor, and the target address.  The buffer is referenced during the IF stage.  This means you have the predicted address, either PC+4 or the branches target address, at the end of IF.  This means the pipeline will not have to stall.  Assume the branch is resolved in the EX stage.  (1 point)

Incorrect prediction = 2 wasted clock cycles

The first innerloop BNE of the first mainloop predicts incorrectly.  The prediction bit flips, and subsequent predictions are predicted as branch taken.  The last inner loop will also predict incorrectly.  (4 wasted cycles per mainloop)

The initial mainloop BNE will predict incorrectly.  The prediction bit flips and subsequent predictions are predicted as branch taken.  The last mainloop will also predict incorrectly. (4 wasted cycles)

Total number of wasted cycles = 24 cycles

7.) Using code segment 1 and figure 1, calculate the execution time of the code segment.  Use the 1-bit dynamic branch prediction scheme from problem 6 in calculating your answer.  Assume the clock frequency of the pipeline is 100 MHz. (1 point)

Total execution time =  (#number of instructions executed + total number of stalls + depth of pipeline - 1) * clock cycle length

                                   =  (5(11 + 5(5)) + 5+ (24+30) + 5 -1)* 10ns = 2.430 microseconds

8.)  Using code segment 1, figure 1, and a dynamic 2-bit branch predictor initially set to 00 (prediction not taken for at least 2 times); calculate the total number of wasted clock cycles for this program. Wasted clock cycles are caused by flushing the pipeline because of an incorrect prediction. . Assume that a branch prediction buffer stores the 2-bit branch predictor, and the target address.  The buffer is referenced during the IF stage.  This means you have the predicted address, either PC+4 or the branches target address, at the end of IF.  This means the pipeline will not have to stall.  Assume the branch is resolved in the EX stage. (1 point)

Incorrect prediction = 2 wasted clock cycles

The first and second innerloop BNE will predict incorrectly, before switching the prediction to taken.  Each final innerloop will then predict incorrectly, but will not change the prediction.  So the next mainloop, the prediction will remain as taken the rest of the programs length.  (4 + 5(2) = 14 wasted cycles)

The first and second mainloop BNE, will predict incorrectly before changing the prediction to taken.  On the final loop, the prediction will be incorrect.  (6 total wasted clock cycles)

Total number of wasted cycles = 20 clock cycles

9.) Using code segment 1 and figure 1, calculate the execution time of the code segment.  Use the 2-bit dynamic branch prediction scheme from problem 8 in calculating your answer.  Assume the clock frequency of the pipeline is 100 MHz.  (1 point)

Total execution time =  (#number of instructions executed + total number of stalls + depth of pipeline - 1) * clock cycle length

                                   =  (5(11 + 5(5)) + 5 + (20+30) + 5 -1)* 10ns = 2.390 micro seconds

10.) An example cache has a 4-way set associativity with 4096 cache lines, with a data block holding 4 words, and a write-back policy.  How many bits are needed to build the entire cache?  (1 point) 

4 words = 16 bytes = 128 bits per data block

The management field has 1 valid bit, 1 dirty bit and a Tag field = #address bits - log2 (# of sets) - log2 (# of bytes to a data word) - log2(# of data words in a data block)

Management field = 1 + 1 + #addressbits - 10 - 2 - 2 = #addressbits - 12

Each cache line = 128 + #of addressbits - 12 = 116 + #of addressbits

Total cache size = #number of bits in cache line * # of cache lines

                         = 475,136 + 4096 *#of address bits

Since it was not explicitly stated about the number of address bits, you may have assumed 32 bits.  Thus the total cache size would be 606,208 bits.