CE202 - Computer Architecture
Problem Set 2


Here's part of last year's midterm.

1.
(80 points) The McKinley implementation of IA64 has an 8-stage core (integer) pipeline. The stages can approximately be called Instruction Fetch (IF), Instruction Issue 1 (IS1), IS2, IS3, Register Read (RF), Execute and L1 cache access (EXE), Exception Detect and Branch Correction (DET), and Write Back (WB). IS1, IS2, and IS3 analyze a packet of instructions for issue and perform register renaming. Branch targets are calculated in IS1. Branches are resolved in DET. L1 cache access, instruction or data, takes a single cycle. After the RF stage, floating-point instructions have stages FP1, FP2, FP3, FP4, and FWB, replacing EXE, DET, and WB.

Consider the instructions Jump (J), Jump Register (JR), Branch, Integer, FP, Load, and Store.

(a)
(10 points) What data and control forwarding is needed? Use an IF IS1 ... diagram to show several instructions executing, and label and comment on each arrow. Assume single instruction issue.

(b)
(10 points) Make a symbolic pipeline diagram by showing the IF...RF stages (with latches in between), and a split into two possible pipelines: EXE...WB and FP1...FWB. Draw forwarding paths between the stages for data and control forwarding.

(c)
(10 points) What data stalls remain? Provide a neat list of the form "Instructions X, Y, and Z with result R1 followed by W or X with source R1: 2 intervening clock cycles." (Or more compactly, {X,Y,Z} followed by {W,X}: 2.)

(d)
(10 points) Control hazards. Assume 10% of instructions are branches, 2% are jumps and calls, and 2% are returns (JR). The McKinley has a Branch Target Buffer (BTB) and a Branch Prediction Table (BPT). The BPT is 90% accurate. Assume that both correctly and incorrectly predicted branches are 60% taken. The Branch Target Buffer has a hit rate of 70%, regardless of the accuracy of the prediction. What are the costs in clocks for the various branch possibilities (i.e., presence or absence in BTB, correct or incorrect BPT, taken or not taken)? What is the average cost in clocks for a control statement?

(e)
(10 points) McKinley tries to issue 6 instructions per clock cycle (ideal IPC = 6). What is the reduction in IPC considering only control hazards? Next, assume that of the remaining 86% of instructions, only 50% of the peak issue rates can be achieved (roughly based on Figure 4.57) due to structural and data hazard considerations. What is the overall IPC of the machine?

(f)
(10 points) If a cache miss happens, an additional 8 cycles are required to access the on-chip 256K L2 cache. The L1 caches are 16K. If L2 fits all memory necessary for a program, and the instruction cache miss rate is 0.4%, and the data cache miss rate is 6% (estimated from table 5.7 for a 16K cache), what is IPC considering cache misses? Assume that instruction cache misses fully stall the pipeline, but that there are sufficient reservation stations so that other instructions may execute during a data cache miss.

(g)
(10 points) Consider the merging of DET and WB. What would the implications of this be? What issues would come up in deciding whether or not to merge the two stages?

(h)
(10 points) Why is instruction issue rate insufficient for comparing different processors?

(i)
(Extra credit) Design a VLSI layout for this architecture.
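
For the bookkeeping in parts (d) and (e), a minimal Python sketch may help. It enumerates the eight BTB/BPT/taken cases with the probabilities stated in part (d), treating them as independent, and forms a probability-weighted average of a per-case penalty. The function names and the `placeholder_penalty` values are hypothetical and are not the answer to part (d); substitute the clock costs you derive from the pipeline. The helper `ipc_with_stalls` assumes stall clocks simply add to the ideal CPI, which is one possible model rather than the required one.

```python
from itertools import product

# Probabilities stated in part (d), treated here as independent.
P_BTB_HIT = 0.70
P_BPT_CORRECT = 0.90
P_TAKEN = 0.60


def branch_outcomes():
    """Yield ((btb_hit, bpt_correct, taken), probability) for all 8 cases."""
    for btb_hit, correct, taken in product((True, False), repeat=3):
        p = P_BTB_HIT if btb_hit else 1 - P_BTB_HIT
        p *= P_BPT_CORRECT if correct else 1 - P_BPT_CORRECT
        p *= P_TAKEN if taken else 1 - P_TAKEN
        yield (btb_hit, correct, taken), p


def average_cost(penalty):
    """Probability-weighted penalty in clocks; `penalty` maps a case to clocks."""
    return sum(p * penalty(case) for case, p in branch_outcomes())


def ipc_with_stalls(ideal_ipc, stall_clocks_per_instruction):
    """IPC when stall clocks are added to the ideal CPI (a modeling assumption)."""
    return 1.0 / (1.0 / ideal_ipc + stall_clocks_per_instruction)


def placeholder_penalty(case):
    """Hypothetical clock costs, NOT the answer to part (d); replace with your own."""
    btb_hit, correct, taken = case
    if not correct:
        return 5
    return 0 if (btb_hit or not taken) else 1


if __name__ == "__main__":
    avg = average_cost(placeholder_penalty)
    print(f"average branch cost (placeholder penalties): {avg:.3f} clocks")
    # 10% of instructions are branches, per part (d).
    print(f"IPC with branch stalls only: {ipc_with_stalls(6, 0.10 * avg):.3f}")
```

The same `ipc_with_stalls` helper could be reused for any event that fully stalls the pipeline, by passing the event frequency times its cost in clocks.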

2.
(10 points) Dynamic instruction scheduling 

Consider a Tomasulo pipeline (similar to PowerPC 620 in text) with stages IF, ID, IS (the process of moving instructions to reservation stations), EX (1-cycle integer, 2-cycle LSU, 3-cycle FP mult or add), and WB (commit). There is one integer unit with two reservation stations (I1, I2), one FP unit with two reservation stations (F1, F2), one Load/Store unit with two reservation stations (LS1, LS2), and one Branch unit with two reservation stations (B1, B2) that takes 1 cycle.

Indicate the clock cycle in which each of the following instructions would be in the various stages. Assume 1 instruction issue per cycle and separate FP and integer result buses. Assume that

In the RS# column indicate the reservation station slot (e.g., F1, I2, B1) that was used. For an instruction already in a reservation station, EX1 can commence during the same cycle as the common data bus write in WB. EX2 and EX3 are not used for all instructions -- leave blank for those that do not need them. A new instruction can be loaded into a reservation station when the old one is in WB -- WB and IS can overlap.

Instruction               IF   ID   IS   EX1  EX2  EX3  WB   RS#
foo: LD F2, 0(R1)
     MULTD F5, F2, F2
     MULTD F4, F5, F0
     LD F6, 0(R2)
     ADDD F6, F4, F6
     SD 0(R2), F6
     ADDI R1, R1, #8
     ADDI R2, R2, #8
     SGTI R3, R1, done
     BNEQZ R3, foo


Spare diagram:


Instruction               IF   ID   IS   EX1  EX2  EX3  WB   RS#
foo: LD F2, 0(R1)
     MULTD F5, F2, F2
     MULTD F4, F5, F0
     LD F6, 0(R2)
     ADDD F6, F4, F6
     SD 0(R2), F6
     ADDI R1, R1, #8
     ADDI R2, R2, #8
     SGTI R3, R1, done
     BNEQZ R3, foo
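
As a reminder of which EX columns each row of the worksheet uses, the unit latencies and reservation stations stated in problem 2 can be written down as a small Python lookup. This is only a sketch: the names `EX_CYCLES`, `STATIONS`, `UNIT_OF`, and `ex_columns` are hypothetical, and the opcode-to-unit mapping is an assumption based on the usual meaning of these mnemonics, not something the problem states. It does not schedule the instructions.

```python
# EX latency in cycles per functional unit, as stated in problem 2.
EX_CYCLES = {"INT": 1, "LSU": 2, "FP": 3, "BR": 1}

# Reservation-station slots per functional unit, as stated in problem 2.
STATIONS = {
    "INT": ("I1", "I2"),
    "FP": ("F1", "F2"),
    "LSU": ("LS1", "LS2"),
    "BR": ("B1", "B2"),
}

# Assumed mapping from opcode to functional unit (not given in the problem).
UNIT_OF = {
    "LD": "LSU", "SD": "LSU",
    "MULTD": "FP", "ADDD": "FP",
    "ADDI": "INT", "SGTI": "INT",
    "BNEQZ": "BR",
}


def ex_columns(opcode):
    """Return the EX columns an opcode fills; the remaining ones stay blank."""
    cycles = EX_CYCLES[UNIT_OF[opcode]]
    return [f"EX{i}" for i in range(1, cycles + 1)]


if __name__ == "__main__":
    for op in ("LD", "MULTD", "ADDI", "BNEQZ"):
        print(op, ex_columns(op), "stations:", STATIONS[UNIT_OF[op]])
```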

3.
(10 points) How can performance be improved in problem 2?


Richard Hughey
2003-10-09