CmpE 110

Homework 7

Due Monday, November 24, 2003
 

Data Hazards, Branch Hazards and Prediction and Caches

Figure 1: Pipeline Implementation

                   add $8, $0, $0

                   add $7, $0, $0

                   addi $17, $0, 5

mainloop:          lw $2, 12($1)

                   addi $3, $2, 32

                   add $4, $3, $2

                   add $5, $3, $4

                   addi $16, $0, 5

innerloop:         lw  $6, 16($1)

                   add $7, $7, $6

                   sub $8, $8, $2

                   addi $16, $16, -1

                   bne $0, $16, innerloop

                   add $9, $7, $4

                   sub $10, $5, $8

                   sw $19, 12($1)

                   sw $10, 16($1)

                   addi $17, $17, -1

                   bne $0, $17, mainloop

end:               addi $20, $0, 40

                   addi $21, $0, 55

Code Segment 1: A nested loop

Problems:

1.) Using code segment 1 and figure 1, identify all data dependencies that are also hazards, and indicate which can be resolved by forwarding, stalling, or both.  In your answer indicate the total number of stalls caused by these non-control hazards. (1 point)

2.) Using code segment 1, figure 1, and a prediction not-taken (PNT) scheme calculate the total number of wasted clock cycles for this program caused by using this scheme. Wasted clock cycles are caused by flushing the pipeline because of an incorrect prediction.  Assume branches are resolved in EX.   (1 point)

3.) Using code segment 1 and figure 1, calculate the execution time of the code segment.  Use the PNT scheme from problem 2 in calculating your answer.  Assume the clock frequency of the pipeline is 100 MHz.  (1 point)

4.)Using code segment 1 and figure 1,  act as the compiler and reorder the code to fill the branch delay slots without affecting program functionality.  Explain why each instruction can be used in the branch delay slot.   Assume the branches are resolved in EX.     Include in your answer the total number of unused delay slots. (1 point)

5.) Using figure 1, and the reordered code from problem 4, calculate the execution time of the newly reordered code.  Assume a clock frequency of the pipeline is 100 MHz.  (1 point)

6.)  Using code segment 1, figure 1, and a dynamic 1-bit branch predictor scheme initially set to 0 (prediction not taken), calculate the total number of wasted clock cycles for this program.  Wasted clock cycles are caused by flushing the pipeline because of an incorrect prediction.   Assume that a branch prediction buffer stores the 1 bit branch predictor, and the target address.  The buffer is referenced during the IF stage.  This means you have the predicted address, either PC+4 or the branches target address, at the end of IF.  This means the pipeline will not have to stall.  Assume the branch is resolved in the EX stage.  (1 point)

7.) Using code segment 1 and figure 1, calculate the execution time of the code segment.  Use the 1-bit dynamic branch prediction scheme from problem 6 in calculating your answer.  Assume the clock frequency of the pipeline is 100 MHz. (1 point)

8.)  Using code segment 1, figure 1, and a dynamic 2-bit branch predictor initially set to 00 (prediction not taken for at least 2 times); calculate the total number of wasted clock cycles for this program. Wasted clock cycles are caused by flushing the pipeline because of an incorrect prediction. . Assume that a branch prediction buffer stores the 2-bit branch predictor, and the target address.  The buffer is referenced during the IF stage.  This means you have the predicted address, either PC+4 or the branches target address, at the end of IF.  This means the pipeline will not have to stall.  Assume the branch is resolved in the EX stage. (1 point)

9.) Using code segment 1 and figure 1, calculate the execution time of the code segment.  Use the 2-bit dynamic branch prediction scheme from problem 8 in calculating your answer.  Assume the clock frequency of the pipeline is 100 MHz.  (1 point)

10.) An example cache has a 4-way set associativity with 4096 cache lines, with a data block holding 4 words, and a write-back policy.  How many bits are needed to build the entire cache?  (1 point)