CmpE 110 HW 6

Date Due: Monday, Nov. 17, 2003 Beginning of Class

Timing Analysis of the 3 Data Paths

Figure 1: Single Cycle Implementation from slide 8-18

Figure 2: Multi-Cycle Implementation from slide 9-7

Figure 3: Pipeline Implementation from slide 10-8

Memory Reads or Writes take 3ns. Register File Reads or Writes take 2ns. Register (data or pipeline) Writes take 1ns. The ALU takes 4ns to propagate a result. Multiplexors and simple adders take 1ns to propagate an input. Shifting and sign extending take negligible time.

Code Segment 1:   

BeginLoop: lw $1, 100($10);

        lw $2, 108($10);

        addi $11, $2, 50;

        lw $3, 104($10);

        addi $12, $3, 55;

        addi $13, $1, 60;

        sub $14, $11, $3;

        add $15, $12, $1;

        add $16, $13, $2;

        slti $17, $2, 275;

        addi $18, $0, 1;

        sw $14, 100($10);

        sw $15, 104($10);

        sw $16, 108($10);

        beq $17, $18, BeginLoop;

Assume initially that Mem[100 + $10] = 25, Mem[104 + $10] = 45, Mem[108 + $10] = 10;

Course-relevant instruction set : add, sub, and, or, addi, andi, ori, beq, j, lw, sw, slt, slti

Questions

1.) For Single Cycle implementation, calculate the clock period length, and its corresponding maximum frequency using the given timing parameters listed above for all course-relevant instructions. (1/2 point each)

Clock period length = PC read (not required, but not wrong) + Instruction Memory read + Register File read + ALUsrcB MUX + ALU + Data Memory Read or Write (not both) + Mem2Reg Mux +  Register File Write

*All PC calculations are done in parallel and hidden by other latencies.  The RegDst Mux is also hidden by latencies.

Clock period length = 1ns + 3ns + 2ns + 1ns + 4ns + 3ns + 1ns + 2ns

Clock period length =  16 ns or 17ns (with PC register Read only)

Frequency =  1 / Clock period length = 62.5 MHz or 58.8 MHz
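The sum above can be checked with a short Python sketch. The dictionary keys and function name are illustrative labels for the delays given in the problem statement, not names from the slides.

```python
# Delays in ns along the longest single-cycle path listed above
# (values from the problem statement; the labels are illustrative).
LONGEST_PATH_NS = {
    "instruction memory read": 3,
    "register file read": 2,
    "ALUsrcB mux": 1,
    "ALU": 4,
    "data memory read or write": 3,
    "Mem2Reg mux": 1,
    "register file write": 2,
}
PC_READ_NS = 1  # optional term: counted only if the PC register read is charged


def frequency_mhz(period_ns):
    """Convert a clock period in ns to a frequency in MHz."""
    return 1000.0 / period_ns


period_ns = sum(LONGEST_PATH_NS.values())
period_with_pc_ns = period_ns + PC_READ_NS

for p in (period_ns, period_with_pc_ns):
    print(f"period = {p} ns, frequency = {frequency_mhz(p):.1f} MHz")
```

Keeping the PC read as a separate constant makes it easy to report both accepted answers.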

2.) For Multi-Cycle implementation, calculate the clock period length, and its corresponding maximum frequency using the given timing parameters listed above for all course-relevant instructions.  Also compute the CPI for each course-relevant instruction using this implementation. (3 points)

IF Stage (same for every instruction)

Fetch length = PC read (optional) + IorD Mux + Memory Read + IR Register Write

                   = 1ns (optional) + 1ns + 3ns + 1ns = 5ns or 6ns (w/ PC read only)

PC + 4 calculation in IF stage = PC read (optional) + ALUsrc (A and B) + ALU + PCsrc Mux + PC Register Write

                                              =  1ns (optional) + 1ns + 4ns + 1ns + 1ns = 7ns or 8ns (w/ PC read only)

IF Stage clock length = 7ns or 8ns (w/ PC read only)

ID Stage (same for every instruction)

Decode length = IR Register Read (optional) + Register File Read + A/B Register Write

                       = 1ns (optional) + 2ns + 1ns = 3ns or 4ns (w/ IR read only)

BT address Calculation time = IR/PC Register Read (optional) + ALUsrc(A and B) Mux + ALU + ALUOut Register Write

                                            = 1 ns (optional) + 1ns + 4ns + 1ns = 6ns or 7ns (w/ PC/IR Register Read Only)

ID stage Clock Length = 6ns or 7ns  (w/ PC/IR Register Read Only)

EX Stage (Varies on instruction)

Arith/Logic calculation time = A/B Register Read (optional) + ALUsrc (A and B) Mux + ALU + ALUOut Register Write

                                          = 1ns (optional) + 1 ns + 4ns + 1ns = 6ns or 7ns (w/ A/B Register Read Only)

Load/Store Addr Calculation = A/IR Register Read (optional) + ALUsrc (A and B) Mux + ALU + ALUOut Register Write

                                             = 1ns (optional) + 1 ns + 4ns + 1ns = 6ns or 7ns (w/ A/IR Register Read Only)

Branch Completion Time = A/B Register Read (optional) + ALUsrc (A and B) + ALU + PCsrc Mux + PC Register Write

                                      = 1ns (optional) + 1ns + 4ns + 1ns + 1ns = 7ns or 8ns (w/ A/B Register Read Only)

Jump Completion Time = IR Register Read (optional) + PCsrc Mux + PC Write

                                    = 1ns (optional) + 1ns + 1ns = 2ns or 3ns (w/ IR Register Read Only)

Ex Stage Clock length = 7ns or 8ns (w/ A/B Register Read Only)

MEM Stage (only occurs with Loads and Stores)

Load Time = ALUOut Register Read (optional) + IorD Mux + Memory Read + MDR Register Write

                  = 1ns (optional) + 1ns + 3ns + 1ns = 5ns or 6ns (w/ ALUOut Register Read Only)

Store Time = ALUOut/B Register Read (optional) + IorD Mux + Memory Write

                 = 1ns (optional) + 1ns + 3ns = 4ns or 5ns (w/ ALUOut/B Register Read Only)

Mem Stage Clock length = 5ns or 6ns (w/ ALUOut Register Read Only)

WB Stage (only occurs with Loads and Arith/Logic instructions)

WB time = ALUOut/MDR/IR Register Read (optional) + RegDst/Mem2Reg Mux + Register File Write

              = 1ns (optional) + 1ns + 2ns = 3ns or 4ns (w/ ALUOut/MDR/IR Register Read Only)

WB Clock length = 3ns or 4ns  (w/ ALUOut/MDR/IR Register Read Only)

To get the multicycle clock period length, choose the stage with the longest clock length. Here the IF and EX stages tie for the longest.

The Clock Period Length is therefore 7ns, or 8ns if you included Register Reads.

Therefore, the Frequency = 1 / Clock Period Length = 142.9 MHz, or 125 MHz if Register Reads have a 1ns delay
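As a quick check of the stage comparison, the sketch below takes the per-stage lengths derived above (without the optional 1ns register read) and picks the maximum. The variable names are assumptions made for illustration.

```python
# Longest path through each multi-cycle stage in ns, without the optional
# 1 ns register read (values copied from the stage analysis above).
STAGE_NS = {"IF": 7, "ID": 6, "EX": 7, "MEM": 5, "WB": 3}
REGISTER_READ_NS = 1  # optional extra delay if register reads are charged

# The clock must be long enough for the slowest stage.
period_ns = max(STAGE_NS.values())
critical_stages = [name for name, t in STAGE_NS.items() if t == period_ns]
period_with_read_ns = period_ns + REGISTER_READ_NS

print("critical stages:", critical_stages)
for p in (period_ns, period_with_read_ns):
    print(f"period = {p} ns, frequency = {1000 / p:.1f} MHz")
```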

CPI

Arith/Logic Instructions = IF + ID + EX + WB = 4 clock cycles

Load  Instructions = IF + ID + EX + MEM + WB = 5 clock cycles

Store Instructions = IF + ID + EX + MEM = 4 clock cycles

Branches = IF + ID + EX  = 3 clock cycles

Jumps = IF + ID + EX = 3 clock cycles

3.) For the Pipeline implementation, calculate the clock period length and its corresponding maximum frequency so that each pipeline stage may finish its work.  Please note that a pipeline implementation cycle ends by writing data to the pipeline register. (1/2 point each)

IF Stage

IF stage length = PC Register Read (optional) + Memory Read + IF/ID Pipeline Register Write; the PC + 4 calculation takes 3ns, which is hidden by the memory read

                          = 1ns (optional) + 3ns + 1ns = 4ns or 5ns (with PC Register Read Only)

ID / WB Stage (The WB stage for a previous instruction overlaps with the ID stage of current instruction, so WB happens, then ID)

ID/WB stage length = IF/ID or MEM/WB Pipeline Register Read (optional) + WBsrc Mux + Register File Write (WB stage) + Register File Read + ID/EX Pipeline Register Write.

                                  = 1ns (optional) + 1ns + 2ns + 2ns + 1ns = 6ns or 7ns (with IF/ID MEM/WB Pipeline Register Read only)

EX Stage

EX stage length = ID/EX Pipeline Register Read (optional) + ALUsrc mux + ALU + EX/MEM Pipeline Register Write

                           = 1ns (optional) + 1ns + 4ns + 1ns = 6ns or 7ns (with ID/EX Pipeline Register Read)

 Mem Stage

Mem Stage length = EX/MEM Pipeline Register Read (optional) + Memory Read or Write + MEM/WB Pipeline Register Write

                               = 1ns (optional) + 3ns + 1ns  = 4ns or 5ns (with EX/MEM Pipeline Register Read only)

 

To determine the period length, we must look at our longest stages, which are ID/WB and EX, each with a clock period length of 6ns, or 7ns with a Pipeline Register Read.

Frequency = 1 / Clock Period Length = 166.67 MHz (for 6ns) or 142.9 MHz (for 7ns)
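The pipeline stage lengths can also be rebuilt from the component delays. This is a minimal sketch that leaves out the optional 1ns pipeline register read; the constant names are illustrative.

```python
# Component delays in ns from the problem statement.
MEMORY_NS = 3         # memory read or write
REGFILE_NS = 2        # register file read or write
REGISTER_WRITE_NS = 1 # data or pipeline register write
ALU_NS = 4
MUX_NS = 1

# Each stage ends by writing its pipeline register.
stage_ns = {
    "IF": MEMORY_NS + REGISTER_WRITE_NS,
    # WB of an older instruction, then ID of the current one, in the same cycle.
    "ID/WB": MUX_NS + REGFILE_NS + REGFILE_NS + REGISTER_WRITE_NS,
    "EX": MUX_NS + ALU_NS + REGISTER_WRITE_NS,
    "MEM": MEMORY_NS + REGISTER_WRITE_NS,
}

period_ns = max(stage_ns.values())
critical_stages = [name for name, t in stage_ns.items() if t == period_ns]

print(stage_ns)
print("critical stages:", critical_stages)
print(f"period = {period_ns} ns, frequency = {1000 / period_ns:.2f} MHz")
```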

4.) Using Code Segment 1 and answers from above, determine the execution time of the program for each implementation.  The pipeline is finished when no more instructions remain within the pipeline.   *note for the pipeline implementation you may use forwarding to avoid wasted clock cycles    (3 points)

The program runs through its loop body 5 whole times; on the 5th pass the branch is not taken.  Total Instructions = 5 * 15 = 75
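The iteration count can be confirmed by interpreting Code Segment 1 directly. The sketch below is a hand translation of the loop into Python, assuming memory is modeled as a dictionary keyed by the offset from $10; the function name is hypothetical.

```python
def run_code_segment_1(mem):
    """Interpret Code Segment 1. mem maps offsets 100/104/108 (from $10) to words.

    Returns (loop iterations, instructions executed).
    """
    iterations = 0
    executed = 0
    while True:
        r1 = mem[100]                 # lw   $1, 100($10)
        r2 = mem[108]                 # lw   $2, 108($10)
        r11 = r2 + 50                 # addi $11, $2, 50
        r3 = mem[104]                 # lw   $3, 104($10)
        r12 = r3 + 55                 # addi $12, $3, 55
        r13 = r1 + 60                 # addi $13, $1, 60
        r14 = r11 - r3                # sub  $14, $11, $3
        r15 = r12 + r1                # add  $15, $12, $1
        r16 = r13 + r2                # add  $16, $13, $2
        r17 = 1 if r2 < 275 else 0    # slti $17, $2, 275
        r18 = 1                       # addi $18, $0, 1
        mem[100] = r14                # sw   $14, 100($10)
        mem[104] = r15                # sw   $15, 104($10)
        mem[108] = r16                # sw   $16, 108($10)
        iterations += 1
        executed += 15
        if r17 != r18:                # beq  $17, $18, BeginLoop (falls through)
            break
    return iterations, executed


print(run_code_segment_1({100: 25, 104: 45, 108: 10}))
```

The value loaded into $2 on successive passes is 10, 95, 170, 250, 335, so slti first produces 0 on the fifth pass.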

For Single Cycle Implementation the Total execution time = number of instructions executed * clock cycles per instruction (which is 1) * clock cycle length

        = 75 * 1 * 16ns (w/o Register Reads) = 1.2 microseconds

        = 75 * 1 * 17ns (w/ Register Reads) = 1.275 microseconds

For Multi-Cycle Implementation the Total Execution Time = (number of Arith/Logic executed * CPI + number of Loads * CPI + number of Stores * CPI + number of Branches * CPI) * clock cycle length

        = (40 * 4 + 15 * 5 + 15 * 4 + 5 * 3) * 7ns (w/o Register Reads) = 2.17 microseconds

        = (40 * 4 + 15 * 5 + 15 * 4 + 5 * 3) * 8ns (w/ Register Reads) = 2.48 microseconds
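The multi-cycle total follows from the per-class CPI values and the instruction mix of one loop pass (8 arith/logic, 3 loads, 3 stores, 1 branch). A small sketch, with illustrative names:

```python
# Cycles per instruction class in the multi-cycle implementation.
CPI = {"arith_logic": 4, "load": 5, "store": 4, "branch": 3}
# Instruction mix of one pass through Code Segment 1 (15 instructions).
PER_ITERATION = {"arith_logic": 8, "load": 3, "store": 3, "branch": 1}
ITERATIONS = 5

total_cycles = sum(CPI[k] * PER_ITERATION[k] * ITERATIONS for k in CPI)


def execution_time_us(cycles, period_ns):
    """Total time in microseconds for a cycle count and a clock period in ns."""
    return cycles * period_ns / 1000


for period_ns in (7, 8):
    print(f"{period_ns} ns clock: {total_cycles} cycles, "
          f"{execution_time_us(total_cycles, period_ns):.2f} us")
```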

For Pipeline Implementation the Total Execution Time = (number of instructions + number of stalls generated by hazards + depth of pipeline - 1) * clock cycle length

As stated in my post, you should try to fill the branch delay slots (because branches are still calculated in EX).  With our code, you can fill the branch delay slots with the previous 2 store word instructions, since they do not generate any dependency hazards with the subsequent loads when the branch is taken and do not cause a dependency with the branch instruction.

However, filling branch delay slots was not covered in time for this assignment, so each branch delay slot is filled with a bubble.

Each loop iteration also has 2 load hazards, where the instruction following a load needs the loaded value earlier than we can forward it.  Thus each load hazard generates 1 stall.  Each branch generates 2 stalls in the first 4 iterations and none in the last iteration, because that final branch is the end of our program.  So we have 4 stalls in each of the first 4 iterations (4 * 4 = 16) and 2 stalls in the last, for a total of 18.

        Total Execution Time = (75 + 18 + 5 - 1) * 6ns (w/o Register Reads) = 0.582 microseconds

        Total Execution Time = (75 + 18 + 5 - 1) * 7ns (w/ Register Reads) = 0.679 microseconds
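The stall count and cycle total can be reproduced with the sketch below. It assumes, as described above, 1 stall per load hazard, 2 bubbles per taken branch, and a 5-stage pipeline; the names are illustrative.

```python
INSTRUCTIONS_PER_ITERATION = 15
ITERATIONS = 5
PIPELINE_DEPTH = 5
LOAD_STALLS_PER_ITERATION = 2   # lw $2 -> addi $11, and lw $3 -> addi $12
BUBBLES_PER_TAKEN_BRANCH = 2    # branch resolved in EX, delay slots hold bubbles

instructions = INSTRUCTIONS_PER_ITERATION * ITERATIONS
taken_branches = ITERATIONS - 1  # the final branch ends the program
stalls = (LOAD_STALLS_PER_ITERATION * ITERATIONS
          + BUBBLES_PER_TAKEN_BRANCH * taken_branches)

# The first instruction needs PIPELINE_DEPTH cycles; each later instruction
# or stall adds one more cycle.
total_cycles = instructions + stalls + PIPELINE_DEPTH - 1


def execution_time_us(cycles, period_ns):
    """Total time in microseconds for a cycle count and a clock period in ns."""
    return cycles * period_ns / 1000


for period_ns in (6, 7):
    print(f"{period_ns} ns clock: {total_cycles} cycles, "
          f"{execution_time_us(total_cycles, period_ns):.3f} us")
```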

5.) Using the Pipeline implementation and Code Segment 1, what is the value of the EX/MEM.ALUresult pipeline register during the 8th clock cycle?  (1 point)

 

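Instruction       1 CC   2 CC   3 CC   4 CC   5 CC   6 CC   7 CC   8 CC
lw $1, 100($10)   IF     ID     EX     MEM    WB     -      -      -
lw $2, 108($10)   -      IF     ID     EX     MEM    WB     -      -
addi $11, $2, 50  -      -      IF     ID     Bubble EX     MEM    WB
lw $3, 104($10)   -      -      -      IF     Bubble ID     EX     MEM
addi $12, $3, 55  -      -      -      -      Bubble IF     ID     Bubble
addi $13, $1, 60  -      -      -      -      -      -      IF     Bubble
sub $14, $11, $3  -      -      -      -      -      -      -      Bubble

(A dash marks a cycle in which the instruction is not in the pipeline.)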

During the 8th CC, the EX/MEM.ALUresult register would hold the address calculation of the lw $3, 104($10) instruction.

So the value = 104 + Reg[$10];

6.) Using the Pipeline implementation and Code Segment 1, what is the value of the ID/EX.Read1out pipeline register during the 10th clock cycle?  (1 point)

 

 

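Instruction       1 CC   2 CC   3 CC   4 CC   5 CC   6 CC   7 CC   8 CC   9 CC   10 CC
lw $1, 100($10)   IF     ID     EX     MEM    WB     -      -      -      -      -
lw $2, 108($10)   -      IF     ID     EX     MEM    WB     -      -      -      -
addi $11, $2, 50  -      -      IF     ID     Bubble EX     MEM    WB     -      -
lw $3, 104($10)   -      -      -      IF     Bubble ID     EX     MEM    WB     -
addi $12, $3, 55  -      -      -      -      Bubble IF     ID     Bubble EX     MEM
addi $13, $1, 60  -      -      -      -      -      -      IF     Bubble ID     EX
sub $14, $11, $3  -      -      -      -      -      -      -      Bubble IF     ID

(A dash marks a cycle in which the instruction is not in the pipeline.)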

During the 10th CC, the ID/EX.Read1out pipeline register would hold the Rs operand of addi $13, $1, 60.

This value was loaded in initially by the first load (lw $1, 100($10)).  The value = 25
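The answers to questions 5 and 6 can be cross-checked with a simplified scheduling sketch. It assumes an in-order 5-stage pipeline with forwarding in which the only hazard in these first seven instructions is a 1-cycle stall when an instruction uses the result of the load immediately before it. The tuple layout and function names are hypothetical, and this is not a full pipeline simulator.

```python
# (text, destination register, source registers, is_load)
PROGRAM = [
    ("lw $1, 100($10)", 1, (10,), True),
    ("lw $2, 108($10)", 2, (10,), True),
    ("addi $11, $2, 50", 11, (2,), False),
    ("lw $3, 104($10)", 3, (10,), True),
    ("addi $12, $3, 55", 12, (3,), False),
    ("addi $13, $1, 60", 13, (1,), False),
    ("sub $14, $11, $3", 14, (11, 3), False),
]


def ex_cycles(program):
    """Return the clock cycle in which each instruction occupies EX."""
    cycles = []
    for i, (_, _, sources, _) in enumerate(program):
        if i == 0:
            cycles.append(3)  # IF in CC 1, ID in CC 2, EX in CC 3
            continue
        _, prev_dest, _, prev_is_load = program[i - 1]
        # Load-use hazard: the loaded value is not available in time to forward.
        stall = 1 if prev_is_load and prev_dest in sources else 0
        cycles.append(cycles[-1] + 1 + stall)
    return cycles


def instruction_in_stage(program, stage, cycle):
    """Return the instruction occupying EX, MEM, or WB in a cycle, or None."""
    offset = {"EX": 0, "MEM": 1, "WB": 2}[stage]
    for (text, *_), ex in zip(program, ex_cycles(program)):
        if ex + offset == cycle:
            return text
    return None  # a bubble occupies the stage


# EX/MEM holds the result of the instruction that is in MEM during the cycle.
print("EX/MEM in CC 8 belongs to:", instruction_in_stage(PROGRAM, "MEM", 8))
# ID/EX holds the operands of the instruction that is in EX during the cycle.
print("ID/EX in CC 10 belongs to:", instruction_in_stage(PROGRAM, "EX", 10))
```

Under these assumptions the sketch agrees with the diagrams: lw $3 is in MEM during the 8th cycle and addi $13 is in EX during the 10th.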