CmpE 110 HW 6
Date Due: Monday, Nov. 17, 2003 Beginning of Class
Timing Analysis of the 3 Data Paths

Figure 1: Single Cycle Implementation from slide 8-18

Figure 2: Multi-Cycle Implementation from slide 9-7

Figure 3: Pipeline Implementation from slide 10-8
Memory Reads or Writes take 3ns. Register File Reads or Writes take 2ns. Register (data or pipeline) Writes take 1ns. The ALU takes 4ns to propagate a result. Multiplexors and simple adders take 1ns to propagate an input. Shifting and sign extending take negligible time.
Code Segment 1:
BeginLoop: lw $1, 100($10);
lw $2, 108($10);
addi $11, $2, 50;
lw $3, 104($10);
addi $12, $3, 55;
addi $13, $1, 60;
sub $14, $11, $3;
add $15, $12, $1;
add $16, $13, $2;
slti $17, $2, 275;
addi $18, $0, 1;
sw $14, 100($10);
sw $15, 104($10);
sw $16, 108($10);
beq $17, $18, BeginLoop;
Assume initially that Mem[100 + $10] = 25, Mem[104 + $10] = 45, Mem[108 + $10] = 10;
Course-relevant instruction set: add, sub, and, or, addi, andi, ori, beq, j, lw, sw, slt, slti
Questions
1.) For Single Cycle implementation, calculate the clock period length, and its corresponding maximum frequency using the given timing parameters listed above for all course-relevant instructions. (1/2 point each)
Clock period length = PC read (not required, but not wrong) + Instruction Memory read + Register File read + ALUsrcB MUX + ALU + Data Memory Read or Write (not both) + Mem2Reg Mux + Register File Write
*All PC calculations are done in parallel and hidden by other latencies. The RegDst Mux is also hidden by latencies.
Clock period length = 1ns (optional PC read) + 3ns + 2ns + 1ns + 4ns + 3ns + 1ns + 2ns
Clock period length = 16ns, or 17ns if the PC register read is counted
Frequency = 1 / Clock period length = 62.5 MHz (16ns) or 58.8 MHz (17ns)
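The sum above can be checked with a short script. This is only a sketch: the constant and function names are made up for illustration, and the critical path is assumed to be the load path listed in the answer, with the PC read as the optional 1ns term.

```python
# Component delays in ns, taken from the problem statement.
MEM = 3       # instruction or data memory access
REGFILE = 2   # register file read or write
REG = 1       # single register access (used here for the optional PC read)
ALU = 4
MUX = 1


def single_cycle_period(include_pc_read: bool) -> int:
    """Sum the load path: IMem, RegFile read, ALUsrcB mux, ALU, DMem, Mem2Reg mux, RegFile write."""
    path = [MEM, REGFILE, MUX, ALU, MEM, MUX, REGFILE]
    if include_pc_read:
        path.insert(0, REG)
    return sum(path)


def freq_mhz(period_ns: float) -> float:
    """Convert a period in ns to a frequency in MHz."""
    return 1000.0 / period_ns


for include_pc in (False, True):
    period = single_cycle_period(include_pc)
    print(f"PC read counted={include_pc}: {period} ns, {freq_mhz(period):.1f} MHz")
```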
2.) For Multi-Cycle implementation, calculate the clock period length, and its corresponding maximum frequency using the given timing parameters listed above for all course-relevant instructions. Also compute the CPI for each course-relevant instruction using this implementation. (3 points)
IF Stage (same for every instruction)
Fetch length = PC read (optional) + IorD Mux + Memory Read + IR Register Write
= 1ns (optional) + 1ns + 3ns + 1ns = 5ns or 6ns (w/ PC read only)
PC + 4 calculation in IF stage = PC read (optional) + ALUsrc (A and B) Mux + ALU + PCsrc Mux + PC Register Write
= 1ns (optional) + 1ns + 4ns + 1ns + 1ns = 7ns or 8ns (w/ PC read only)
IF Stage clock length = 7ns or 8ns (w/ PC read only)
ID Stage (same for every instruction)
Decode length = IR Register Read (optional) + Register File Read + A/B Register Write
= 1ns (optional) + 2ns + 1ns = 3ns or 4ns (w/ IR read only)
BT address Calculation time = IR/PC Register Read (optional) + ALUsrc(A and B) Mux + ALU + ALUOut Register Write
= 1 ns (optional) + 1ns + 4ns + 1ns = 6ns or 7ns (w/ PC/IR Register Read Only)
ID stage Clock Length = 6ns or 7ns (w/ PC/IR Register Read Only)
EX Stage (Varies on instruction)
Arith/Logic calculation time = A/B Register Read (optional) + ALUsrc (A and B) Mux + ALU + ALUOut Register Write
= 1ns (optional) + 1 ns + 4ns + 1ns = 6ns or 7ns (w/ A/B Register Read Only)
Load/Store Addr Calculation = A/IR Register Read (optional) + ALUsrc (A and B) Mux + ALU + ALUOut Register Write
= 1ns (optional) + 1 ns + 4ns + 1ns = 6ns or 7ns (w/ A/IR Register Read Only)
Branch Completion Time = A/B Register Read (optional) + ALUsrc (A and B) Mux + ALU + PCsrc Mux + PC Register Write
= 1ns (optional) + 1ns + 4ns + 1ns + 1ns = 7ns or 8ns (w/ A/B Register Read Only)
Jump Completion Time = IR Register Read (optional) + PCsrc Mux + PC Write
= 1ns (optional) + 1ns + 1ns = 2ns or 3ns (w/ IR Register Read Only)
Ex Stage Clock length = 7ns or 8ns (w/ A/B Register Read Only)
MEM Stage (only occurs with Loads and Stores)
Load Time = ALUOut Register Read (optional) + IorD Mux + Memory Read + MDR Register Write
= 1ns (optional) + 1ns + 3ns + 1ns = 5ns or 6ns (w/ ALUOut Register Read Only)
Store Time = ALUOut/B Register Read (optional) + IorD Mux + Memory Write
= 1ns (optional) + 1ns + 3ns = 4ns or 5ns (w/ ALUOut/B Register Read Only)
Mem Stage Clock length = 5ns or 6ns (w/ ALUOut Register Read Only)
WB Stage (only occurs with Loads and Arith/Logic instructions)
WB time = ALUOut/MDR/IR Register Read (optional) + RegDst/Mem2Reg Mux + Register File Write
= 1ns (optional) + 1ns + 2ns = 3ns or 4ns (w/ ALUOut/MDR/IR Register Read Only)
WB Clock length = 3ns or 4ns (w/ ALUOut/MDR/IR Register Read Only)
To get the multi-cycle clock period length, choose the stage with the longest clock length. Here the IF and EX stages tie for the longest.
The Clock Period Length is therefore 7ns, or 8ns if you included Register Reads.
Therefore, the Frequency = 1 / Clock Period Length = 142.9 MHz (7ns), or 125 MHz (8ns) when Register Reads are given a 1ns delay
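The stage-by-stage comparison can be sketched as below. The dictionary layout and the names are assumptions made for illustration. Each stage lists the parallel paths worked out above, without the optional 1ns register read, which is added back through read_delay.

```python
# Component delays in ns, from the problem statement.
MEM, REGFILE, REG, ALU, MUX = 3, 2, 1, 4, 1

# Parallel paths inside each multi-cycle stage; the stage length is the longest one.
STAGES = {
    "IF":  [[MUX, MEM, REG],          # instruction fetch into IR
            [MUX, ALU, MUX, REG]],    # PC + 4 written back to PC
    "ID":  [[REGFILE, REG],           # register read into A/B
            [MUX, ALU, REG]],         # branch target into ALUOut
    "EX":  [[MUX, ALU, REG],          # arith/logic or address calculation
            [MUX, ALU, MUX, REG],     # branch completion
            [MUX, REG]],              # jump completion
    "MEM": [[MUX, MEM, REG],          # load into MDR
            [MUX, MEM]],              # store
    "WB":  [[MUX, REGFILE]],          # register file write
}


def stage_lengths(read_delay: int = 0) -> dict:
    """Longest path per stage, plus the optional register-read delay."""
    return {name: max(sum(p) for p in paths) + read_delay
            for name, paths in STAGES.items()}


def clock_period(read_delay: int = 0) -> int:
    """The clock must fit the slowest stage."""
    return max(stage_lengths(read_delay).values())


for delay in (0, 1):
    t = clock_period(delay)
    print(stage_lengths(delay), f"-> period {t} ns, {1000 / t:.1f} MHz")
```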
CPI
Arith/Logic Instructions = IF + ID + EX + WB = 4 clock cycles
Load Instructions = IF + ID + EX + MEM + WB = 5 clock cycles
Store Instructions = IF + ID + EX + MEM = 4 clock cycles
Branches = IF + ID + EX = 3 clock cycles
Jumps = IF + ID + EX = 3 clock cycles
3.) For the Pipeline implementation, calculate the clock period length and its corresponding maximum frequency so that each pipeline stage may finish its work. Please note that a pipeline implementation cycle ends by writing data to the pipeline register. (1/2 point each)
IF Stage
IF stage length = PC Register Read (optional) + Memory Read + IF/ID Pipeline Register Write; the PC + 4 calculation takes 3ns, which is hidden by the memory read
= 1ns (optional) + 3ns + 1ns = 4ns or 5ns (with PC Register Read Only)
ID / WB Stage (The WB stage for a previous instruction overlaps with the ID stage of current instruction, so WB happens, then ID)
ID/WB stage length = IF/ID or MEM/WB Pipeline Register Read (optional) + WBsrc Mux + Register File Write (WB stage) + Register File Read + ID/EX Pipeline Register Write.
= 1ns (optional) + 1ns + 2ns + 2ns + 1ns = 6ns or 7ns (with IF/ID MEM/WB Pipeline Register Read only)
EX Stage
EX stage length = ID/EX Pipeline Register Read (optional) + ALUsrc mux + ALU + EX/MEM Pipeline Register Write
= 1ns (optional) + 1ns + 4ns + 1ns = 6ns or 7ns (with ID/EX Pipeline Register Read)
Mem Stage
Mem Stage length = EX/MEM Pipeline Register Read (optional) + Memory Read or Write + MEM/WB Pipeline Register Write
= 1ns (optional) + 3ns + 1ns = 4ns or 5ns (with EX/MEM Pipeline Register Read only)
To determine the Period length, we must look at our longest stage(s), which are ID/WB and EX, each with a clock period length of 6ns, or 7ns with a Pipeline Register Read.
Frequency = 1 / Clock Period Length = 166.67 MHz (for 6ns) and 142.86 MHz (for 7ns)
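As a quick check of the pipeline stage lengths, here is a sketch that also reports which stages set the clock. The names are illustrative only, and the optional pipeline register read is again modeled as a flat read_delay.

```python
# Component delays in ns, from the problem statement.
MEM, REGFILE, REG, ALU, MUX = 3, 2, 1, 4, 1

# One path per pipeline stage, ending with the pipeline register write.
PIPE_STAGES = {
    "IF":    [MEM, REG],
    "ID/WB": [MUX, REGFILE, REGFILE, REG],  # WB write first, then ID read
    "EX":    [MUX, ALU, REG],
    "MEM":   [MEM, REG],
}


def pipeline_period(read_delay: int = 0) -> int:
    """Clock period = slowest stage, plus the optional register-read delay."""
    return max(sum(path) for path in PIPE_STAGES.values()) + read_delay


def critical_stages() -> list:
    """Names of the stages that tie for the longest delay."""
    longest = max(sum(path) for path in PIPE_STAGES.values())
    return [name for name, path in PIPE_STAGES.items() if sum(path) == longest]


print(critical_stages())
for delay in (0, 1):
    t = pipeline_period(delay)
    print(f"{t} ns -> {1000 / t:.2f} MHz")
```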
4.) Using Code Segment 1 and answers from above, determine the execution time of the program for each implementation. The pipeline is finished when no more instructions remain within the pipeline. *note for the pipeline implementation you may use forwarding to avoid wasted clock cycles (3 points)
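The program runs through all 15 of its instructions 5 whole times before the branch is not taken: the value loaded into $2 is 10, 95, 170, 250, and then 335 on successive passes, so slti first produces 0 on the fifth pass. Total Instructions = 5 * 15 = 75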
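The iteration count can be confirmed by stepping through Code Segment 1 in software. The sketch below is not a MIPS simulator: it hand-translates the loop body, keys memory by the offset from $10, and uses a made-up function name.

```python
def run_code_segment_1(m100=25, m104=45, m108=10):
    """Execute the loop body until beq falls through; return (passes, final memory)."""
    mem = {100: m100, 104: m104, 108: m108}  # keys are offsets from $10
    passes = 0
    while True:
        passes += 1
        r1 = mem[100]                # lw   $1, 100($10)
        r2 = mem[108]                # lw   $2, 108($10)
        r11 = r2 + 50                # addi $11, $2, 50
        r3 = mem[104]                # lw   $3, 104($10)
        r12 = r3 + 55                # addi $12, $3, 55
        r13 = r1 + 60                # addi $13, $1, 60
        r14 = r11 - r3               # sub  $14, $11, $3
        r15 = r12 + r1               # add  $15, $12, $1
        r16 = r13 + r2               # add  $16, $13, $2
        r17 = 1 if r2 < 275 else 0   # slti $17, $2, 275
        r18 = 1                      # addi $18, $0, 1
        mem[100] = r14               # sw   $14, 100($10)
        mem[104] = r15               # sw   $15, 104($10)
        mem[108] = r16               # sw   $16, 108($10)
        if r17 != r18:               # beq  $17, $18, BeginLoop (not taken)
            return passes, mem


passes, final_mem = run_code_segment_1()
print(passes, passes * 15, final_mem)
```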
For Single Cycle Implementation the Total execution time = number of instructions executed * clock cycles per instruction (which is 1) * clock cycle length
= 75 * 1 * 16ns (w/o Register Reads) = 1.2 microseconds
= 75 * 1 * 17ns (w/ Register Reads) = 1.275 microseconds
For Multi-Cycle Implementation the Total Execution Time = (number of Arith/Logic executed * CPI + number of Loads * CPI + number of Stores * CPI + number of Branches * CPI) * clock cycle length
= (40 * 4 + 15 * 5 + 15 * 4 + 5 * 3) * 7ns (w/o Register Reads) = 2.17 microseconds
= (40 * 4 + 15 * 5 + 15 * 4 + 5 * 3) * 8ns (w/ Register Reads) = 2.48 microseconds
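The multi-cycle total can be tallied from the per-pass instruction mix. In this sketch the class names and the per-pass counts (8 arith/logic, 3 loads, 3 stores, 1 branch) are read off Code Segment 1, and the variable names are illustrative.

```python
# Cycles per instruction class in the multi-cycle datapath (from Question 2).
CPI = {"arith": 4, "load": 5, "store": 4, "branch": 3}

# Instruction mix of one pass through Code Segment 1.
PER_PASS = {"arith": 8, "load": 3, "store": 3, "branch": 1}
PASSES = 5


def multicycle_cycles() -> int:
    """Total clock cycles = passes * sum(count * CPI) over the classes."""
    return PASSES * sum(PER_PASS[k] * CPI[k] for k in CPI)


cycles = multicycle_cycles()
for period_ns in (7, 8):
    print(f"{cycles} cycles * {period_ns} ns = {cycles * period_ns / 1000} microseconds")
```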
For Pipeline Implementation the Total Execution Time = (number of instructions + number of stalls generated by hazards + depth of pipeline - 1) * clock cycle length
As stated in my post, you should try to fill the branch delay slots (because branches are still resolved in EX). With our code, you can fill the two branch delay slots with the previous 2 store word instructions, since they do not create a dependency with the branch instruction or a hazard with the subsequent loads when the branch is taken.
However, filling branch delay slots was not covered in time for this assignment, so each branch delay slot is filled with a bubble.
Each loop iteration also has 2 load hazards, where the instruction following a load needs the loaded value earlier than forwarding can supply it (lw $2 followed by addi $11, and lw $3 followed by addi $12). Thus each load hazard generates 1 stall. Each branch generates 2 stalls in the first 4 iterations and none in the last, because the final branch is the end of our program. So we have 4 * 4 stalls for the first 4 iterations and 2 stalls for the last, for a total of 18.
Total Execution Time = (75 + 18 + 5 - 1) * 6ns (w/o Register Reads) = 0.582 microseconds
Total Execution Time = (75 + 18 + 5 - 1) * 7ns (w/ Register Reads) = 0.679 microseconds
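The stall count and total can be reproduced with the formula above. This sketch uses illustrative constant names and assumes, as in the answer, 2 load stalls per pass and 2 branch bubbles after every taken branch.

```python
INSTRUCTIONS = 75
PASSES = 5
LOAD_STALLS_PER_PASS = 2        # lw $2 -> addi $11, lw $3 -> addi $12
BUBBLES_PER_TAKEN_BRANCH = 2    # branch resolved in EX
PIPELINE_DEPTH = 5


def total_stalls() -> int:
    """Load stalls occur on every pass; branch bubbles only after taken branches."""
    load_stalls = PASSES * LOAD_STALLS_PER_PASS
    branch_stalls = (PASSES - 1) * BUBBLES_PER_TAKEN_BRANCH
    return load_stalls + branch_stalls


def pipeline_cycles() -> int:
    """Instructions + stalls + cycles needed to drain the pipeline."""
    return INSTRUCTIONS + total_stalls() + PIPELINE_DEPTH - 1


for period_ns in (6, 7):
    print(f"{pipeline_cycles()} cycles * {period_ns} ns = {pipeline_cycles() * period_ns} ns")
```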
5.) Using the Pipeline implementation and Code Segment 1. What is the value of EX/MEM.ALUresult pipeline register during the 8th clock cycle? (1 point)
| Instruction      | CC 1 | CC 2 | CC 3 | CC 4 | CC 5   | CC 6 | CC 7 | CC 8   |
| lw $1, 100($10)  | IF   | ID   | EX   | MEM  | WB     |      |      |        |
| lw $2, 108($10)  |      | IF   | ID   | EX   | MEM    | WB   |      |        |
| addi $11, $2, 50 |      |      | IF   | ID   | Bubble | EX   | MEM  | WB     |
| lw $3, 104($10)  |      |      |      | IF   | Bubble | ID   | EX   | MEM    |
| addi $12, $3, 55 |      |      |      |      | Bubble | IF   | ID   | Bubble |
| addi $13, $1, 60 |      |      |      |      |        |      | IF   | Bubble |
| sub $14, $11, $3 |      |      |      |      |        |      |      | Bubble |
During the 8th CC, the EX/MEM.ALUresult register would hold the address calculation of the lw $3, 104($10) instruction.
So the value = 104 + Reg[$10];
6.) Using the Pipeline implementation and Code Segment 1. What is the value of ID/EX.Read1out pipeline register during the 10th clock cycle? (1 point)
| Instruction      | CC 1 | CC 2 | CC 3 | CC 4 | CC 5   | CC 6 | CC 7 | CC 8   | CC 9 | CC 10 |
| lw $1, 100($10)  | IF   | ID   | EX   | MEM  | WB     |      |      |        |      |       |
| lw $2, 108($10)  |      | IF   | ID   | EX   | MEM    | WB   |      |        |      |       |
| addi $11, $2, 50 |      |      | IF   | ID   | Bubble | EX   | MEM  | WB     |      |       |
| lw $3, 104($10)  |      |      |      | IF   | Bubble | ID   | EX   | MEM    | WB   |       |
| addi $12, $3, 55 |      |      |      |      | Bubble | IF   | ID   | Bubble | EX   | MEM   |
| addi $13, $1, 60 |      |      |      |      |        |      | IF   | Bubble | ID   | EX    |
| sub $14, $11, $3 |      |      |      |      |        |      |      | Bubble | IF   | ID    |
During the 10th CC, the ID/EX.Read1out pipeline register would hold the Rs operand of addi $13, $1, 60.
This value was loaded initially by the first load, lw $1, 100($10). The value = 25
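The stage occupancy used in Questions 5 and 6 can be reproduced with a small scheduling sketch. It is a simplified model, not the course datapath: it assumes in-order issue, full forwarding, and a one-cycle load-use stall, and the Instr type and function names are hypothetical.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str
    dest: str
    srcs: tuple
    is_load: bool


PROGRAM = [
    Instr("lw $1, 100($10)", "$1", ("$10",), True),
    Instr("lw $2, 108($10)", "$2", ("$10",), True),
    Instr("addi $11, $2, 50", "$11", ("$2",), False),
    Instr("lw $3, 104($10)", "$3", ("$10",), True),
    Instr("addi $12, $3, 55", "$12", ("$3",), False),
    Instr("addi $13, $1, 60", "$13", ("$1",), False),
    Instr("sub $14, $11, $3", "$14", ("$11", "$3"), False),
]


def schedule(program):
    """Return (IF, ID, EX) clock cycles per instruction, 1-based."""
    times = []
    last_writer = {}  # register name -> index of its most recent producer
    for i, ins in enumerate(program):
        if i == 0:
            if_c, id_c = 1, 2
        else:
            _, prev_id, prev_ex = times[i - 1]
            if_c = prev_id  # IF frees up when the previous instruction enters ID
            id_c = prev_ex  # ID frees up when the previous instruction enters EX
        ex_c = id_c + 1
        for src in ins.srcs:
            j = last_writer.get(src)
            if j is not None and program[j].is_load:
                # Loaded data is only available after the load's MEM cycle.
                ex_c = max(ex_c, times[j][2] + 2)
        times.append((if_c, id_c, ex_c))
        last_writer[ins.dest] = i
    return times


def in_stage(program, times, stage, cycle):
    """Text of the instruction occupying EX, MEM, or WB in the given cycle, or None."""
    offset = {"EX": 0, "MEM": 1, "WB": 2}[stage]
    for ins, (_, _, ex_c) in zip(program, times):
        if ex_c + offset == cycle:
            return ins.text
    return None


times = schedule(PROGRAM)
# EX/MEM is read by the instruction in MEM; ID/EX is read by the instruction in EX.
print("CC 8, MEM:", in_stage(PROGRAM, times, "MEM", 8))
print("CC 10, EX:", in_stage(PROGRAM, times, "EX", 10))
```

Under these assumptions the instruction in MEM during CC 8 is lw $3, 104($10), and the instruction in EX during CC 10 is addi $13, $1, 60, matching the two answers.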