

#### **COMPUTER ARCHITECTURE**

Chapter 4 – Complex Pipelining

Prof. Dr.-Ing. Stefan Wallentowitz

Department 07 – Munich University of Applied Sciences



# **Course Organization**





Computer Architecture - Chapter 4 - Complex Pipelining










Goal: Bring IPC up (close to one or even above)

#### Speculative Execution

- Branch prediction: Reduce the impact of branch decisions
- Other kinds of speculation: Address, data, ...

#### Parallelism

- Instruction Level Parallelism (ILP)
  - Pipelining
  - Superscalar execution, out-of-order execution (lecture part 4)
- Data parallelism
  - Data vectors, single instruction multiple data
- Thread parallelism
  - Execution of multiple different instruction streams











Assumption: Each instruction takes one cycle per stage

General exception: Memory accesses take multiple cycles

Implementation of execute stage:

- Basically: Arithmetic and Logical Unit (ALU)
- But also:
  - Branch offset ALU
  - Multiplier/Divider (RISC-V M extension)
  - Floating Point Unit (RISC-V F/D extension)



# **Pipeline: Functional Units**

Split EX into functional units (FU): Different hardware building blocks

Multicycle instructions: Some instructions don't complete in one cycle, for example on the Multiplier, Divider, or Floating Point Unit (FPU)









Multicycle FUs may be pipelined: The operation is decomposed into stages

Sometimes not possible: DIV often reuses one unit over multiple cycles

Pipelining allows for parallel execution of multiple instructions in one FU (not always the case)

(Note: In the diagram, each block corresponds to one clock cycle, although the blocks are drawn at different scales)



# **Multicycle Metrics**

#### Latency

Minimum time for an instruction to traverse a functional unit

#### Initiation Interval

Minimum time between the starts of two consecutive instructions in the same functional unit

|               | Latency | Initiation Interval |
|---------------|---------|---------------------|
| (Integer) ALU |         |                     |
| Multiplier    |         |                     |
| Divider       |         |                     |
| FPU           |         |                     |
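The two metrics combine into a simple throughput estimate. The following is a minimal sketch, assuming independent operations that are sent back to back to a single FU. The function name and the example numbers are hypothetical and are not values from the lecture.

```python
def cycles_for_back_to_back_ops(n_ops: int, latency: int, initiation_interval: int) -> int:
    """Cycles until the last of n_ops independent operations leaves one functional unit."""
    if n_ops <= 0:
        return 0
    # The first operation needs the full latency; every further operation
    # can start initiation_interval cycles after its predecessor.
    return latency + (n_ops - 1) * initiation_interval


# Hypothetical numbers, chosen only for illustration:
pipelined_mul = cycles_for_back_to_back_ops(4, latency=3, initiation_interval=1)
iterative_div = cycles_for_back_to_back_ops(4, latency=8, initiation_interval=8)
print(pipelined_mul, iterative_div)
```

With these assumed numbers, a fully pipelined unit (initiation interval 1) finishes four operations shortly after the first one, while a non-pipelined unit (initiation interval equal to latency) handles them strictly one after another.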









Memory access can be optional, as most operations don't use it, specifically after the split into functional units

In a pipeline with FUs, MA can be an optional extra stage after the ALU

- Other paths are not affected
- For ALU operations it is optional to traverse MA







As before: Only one instruction can be in any FU at any time

Structural hazard for multicycle operations

| Instruction     | 1  | 2  | 3   | 4   | 5   | 6   | 7   | 8  | 9  |
|-----------------|----|----|-----|-----|-----|-----|-----|----|----|
| xor x10, x1, x2 | FE | DE | ALU | WB  |     |     |     |    |    |
| mul x3, x7, x8  |    | FE | DE  | MUL | MUL | MUL | WB  |    |    |
| sw x3, 4(x10)   |    |    | FE  | DE  | DE  | DE  | ALU | MA | WB |





Issue at most one instruction per cycle

But: multicycle instructions may still be ongoing

| Instruction     | 1  | 2  | 3   | 4   | 5   | 6   | 7  |
|-----------------|----|----|-----|-----|-----|-----|----|
| xor x10, x1, x2 | FE | DE | ALU | WB  |     |     |    |
| mul x3, x7, x8  |    | FE | DE  | MUL | MUL | MUL | WB |
| addi x2, x2, 1  |    |    | FE  | DE  | ALU | WB  |    |

Due to different latencies: Instructions can "overtake" others
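The overtaking effect can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming one instruction enters FE per cycle, FE and DE take one cycle each, there are no stalls, and cycles are counted from 1. The function names and the `(name, ex_latency)` tuple format are illustrative assumptions.

```python
def writeback_cycles(program):
    """program: list of (name, ex_latency). Returns {name: cycle in which WB happens}."""
    result = {}
    for index, (name, ex_latency) in enumerate(program):
        fetch_cycle = index + 1            # one instruction enters FE per cycle
        first_ex_cycle = fetch_cycle + 2   # FE and DE take one cycle each
        result[name] = first_ex_cycle + ex_latency
    return result


def overtakers(program):
    """Pairs (later, earlier) where the later instruction writes back first."""
    wb = writeback_cycles(program)
    names = [name for name, _ in program]
    return [(later, earlier)
            for i, earlier in enumerate(names)
            for later in names[i + 1:]
            if wb[later] < wb[earlier]]


# Execute latencies taken from the diagram above: ALU 1 cycle, MUL 3 cycles.
program = [("xor", 1), ("mul", 3), ("addi", 1)]
print(writeback_cycles(program))
print(overtakers(program))
```

For this program the sketch reports that `addi` writes back before `mul`, which matches the diagram.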







Even when started in correct order, instructions can complete *out-of-order*

Structural hazard on writeback stage, can be resolved

#### Example:

| Instruction    | 1  | 2  | 3   | 4   | 5   | 6  | 7  |
|----------------|----|----|-----|-----|-----|----|----|
| lw x3, 8(x2)   | FE | DE | ALU | MA  | MA  | MA | WB |
| addi x2, x2, 1 |    | FE | DE  | ALU | WB  |    |    |
| bnez x2, loop  |    |    | FE  | DE  | ALU | WB |    |

#### **Problems?**

- What happens if the load raises an exception?
- Example: Access fault, handled by the OS, then execution continues










#### Problem:

- The following instructions have already completed when the load exception occurs
- The exception is handled, for example by the operating system
- Processing continues by re-issuing instructions, starting with lw
- addi and bnez are executed again: functional error

#### Potential solutions

- Imprecise exceptions: The exception handler needs to clean up
- Start processing an instruction only once it is certain that no exception can occur
- Buffer results and commit in correct order (forwarding needs to look there too!)



### **Re-Order Buffer**

Split between instruction retire and architectural commit

The re-order buffer (ROB) buffers results after out-of-order retire and commits them in order
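The split between retire and commit can be sketched in software. This is a minimal illustration, not a hardware description: the class name, the method names, and the entry layout are assumptions, and exceptions and forwarding from the buffer are left out.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # oldest instruction at the left
        self.regfile = {}        # architectural register state

    def allocate(self, tag, rd):
        """Called in program order when an instruction is started."""
        self.entries.append({"tag": tag, "rd": rd, "value": None, "done": False})

    def retire(self, tag, value):
        """Called when a functional unit delivers a result, possibly out of order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"] = value
                entry["done"] = True
                return
        raise KeyError(tag)

    def commit(self):
        """Write finished results to the register file, oldest first."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.regfile[entry["rd"]] = entry["value"]
            committed.append(entry["tag"])
        return committed


rob = ReorderBuffer()
rob.allocate("lw", "x3")
rob.allocate("addi", "x2")
rob.retire("addi", 5)     # addi finishes first (the value 5 is arbitrary)
first = rob.commit()      # nothing commits: lw is still outstanding
rob.retire("lw", 42)      # the value 42 is arbitrary
second = rob.commit()     # both commit, lw before addi
```

Because `commit` stops at the first unfinished entry, a faulting `lw` would leave the register file untouched by the younger `addi`, which is what makes the exception precise.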











Execute multiple instructions in parallel

Usually: replicate FUs

- ALU is often used
- LSU as separate FU

Raises the theoretical maximum IPC to the number of instructions that can be issued in parallel (**issue width**, here: 2)







#### Exploit instruction level parallelism

| Instruction      | 1  | 2  | 3   | 4   | 5   | 6   | 7   | 8  |
|------------------|----|----|-----|-----|-----|-----|-----|----|
| xor x10, x1, x2  | FE | DE | ALU | ROB | WB  |     |     |    |
| addi x13, x13, 1 | FE | DE | ALU | ROB | WB  |     |     |    |
| mul x3, x7, x8   |    | FE | DE  | MUL | MUL | MUL | ROB | WB |
| addi x2, x2, 1   |    | FE | DE  | ALU | ROB | ROB | ROB | WB |

Every part of the pipeline needs to support the issue width; otherwise the narrowest part limits the speedup

The split of the instruction stream is obvious here, but how do we schedule instructions in general?





Scheduling: Select instructions to be started in EX stage

- From sequential order (as stored in memory)
- Need to obey data dependencies

#### **Static Scheduling**

Execution order of instructions is pre-determined (by the compiler)

#### **Dynamic Scheduling**

- Selection of instructions at runtime
  - In-order (in sequential program order)
  - Out-of-order (whenever instructions are ready)

Scheduling and superscalarity are independent concepts









Very Long Instruction Word (VLIW)

Combine multiple instructions in slots of a long "packet"

Start complete packet once all instructions in it can be started

Little hardware overhead, but the compiler is hard to build; the challenge is to fill the slots
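To make the slot-filling challenge concrete, here is a minimal sketch of a greedy, order-preserving packer. It is an illustrative assumption, not the algorithm of any particular VLIW compiler: an instruction goes into the current packet unless the packet is full or it has a RAW or WAW dependency on an instruction already in that packet. The tuple format `(name, rd, sources)` and the `nop` filler are hypothetical.

```python
def pack(instructions, slots):
    """Greedily pack (name, rd, sources) tuples into packets of `slots` entries."""
    packets, current = [], []
    for name, rd, sources in instructions:
        # RAW: reads a register written in this packet; WAW: writes the same register.
        conflict = any(prev_rd == rd or prev_rd in sources
                       for _, prev_rd, _ in current)
        if conflict or len(current) == slots:
            packets.append(current)
            current = []
        current.append((name, rd, sources))
    if current:
        packets.append(current)
    # Unused slots are filled with nops.
    return [[name for name, _, _ in packet] + ["nop"] * (slots - len(packet))
            for packet in packets]


example = [
    ("xor",  "x10", ("x1", "x2")),
    ("addi", "x13", ("x13",)),
    ("mul",  "x3",  ("x7", "x8")),
    ("add",  "x4",  ("x3", "x10")),   # depends on mul, so it cannot share its packet
]
packets = pack(example, slots=2)
print(packets)
```

In this example the dependent `add` forces two half-empty packets, which shows why unused slots are the main cost of the approach.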









Approach: Start instructions as soon as possible

- Once structural hazards are resolved: the functional unit is free
- Once data hazards are resolved: the dependencies are satisfied

Keep next instruction(s) in buffer

- Instructions are still issued in order (FIFO instruction buffer)
- Where to put it? (IF-DE or DE-EX)







#### Issue

- Decode instruction
- Check for structural hazards (a full instruction buffer also counts)

#### Read Operands

- Wait until data dependencies are resolved
- Then read the operands







Scoreboard: Data structure to track execution

Keeps all active instructions (in FUs)

Check current instruction for conflict

| Instruction | FU | rd | rs1 | rs2 |
|-------------|----|----|-----|-----|
|             |    |    |     |     |
|             |    |    |     |     |
|             |    |    |     |     |
|             |    |    |     |     |







### Check scoreboard for every instruction

- Structural hazard: Check if the FU can start the instruction
- Read-after-Write (RAW): Check SB destination registers against the instruction's source registers
- Write-after-Read (WAR): Check SB source registers against the instruction's destination register
- Write-after-Write (WAW): Check SB destination registers against the instruction's destination register

### Lifecycle of entries

- Start instruction from instruction buffer once no hazards are left, add to SB
- Remove from SB once its result has been written
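The four checks above can be sketched as a single function over the scoreboard entries. This is a minimal illustration under stated assumptions: the `Entry` fields mirror the scoreboard columns (Instruction, FU, rd, rs1, rs2), while the `hazards` function, the `free_fus` set, and the use of `None` for a missing register are hypothetical choices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    instruction: str
    fu: str
    rd: Optional[str]          # None if the instruction writes no register
    rs1: Optional[str]
    rs2: Optional[str] = None


def hazards(scoreboard, new, free_fus):
    """Return the set of hazards that keep `new` from starting."""
    found = set()
    if new.fu not in free_fus:
        found.add("structural")
    sources = {new.rs1, new.rs2} - {None}
    for active in scoreboard:
        if active.rd in sources:
            found.add("RAW")   # new reads what an active instruction will write
        if new.rd is not None and new.rd in {active.rs1, active.rs2}:
            found.add("WAR")   # new writes what an active instruction still reads
        if new.rd is not None and new.rd == active.rd:
            found.add("WAW")   # both write the same register
    return found


scoreboard = [Entry("mul x3, x7, x8", "MUL", "x3", "x7", "x8")]
store = Entry("sw x3, 4(x10)", "LSU", None, "x10", "x3")
print(hazards(scoreboard, store, free_fus={"ALU", "LSU"}))
```

An instruction may be started and added to the scoreboard only when the returned set is empty.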



# **Scoreboard: Integration**









Use scoreboard for in-order dynamic scheduling

Typical: No bypassing (complexity of many functional units)

Scoreboard for in-order can be simplified

Only the next instruction in buffer can be used

Can we have WAR and WAW for the next instruction?

- No WAR as we are in order
- WAW can occur, hence need to track destination registers
- We only need to know the pending writes (destination registers)
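Since only the pending writes matter, the simplified scoreboard can shrink to one busy bit per register. The following is a minimal sketch of that idea; the bit-mask representation and the function names are assumptions made for illustration, and registers are given by their index (x12 is 12).

```python
def reg_bit(reg: int) -> int:
    """Bit mask with only the bit of register x<reg> set."""
    return 1 << reg


def can_issue(pending_writes: int, rd: int, sources: tuple) -> bool:
    """True if neither a source (RAW) nor the destination (WAW) has a pending write."""
    needed = reg_bit(rd)
    for src in sources:
        needed |= reg_bit(src)
    return (pending_writes & needed) == 0


def issue(pending_writes: int, rd: int) -> int:
    """Mark rd as pending when the instruction is started."""
    return pending_writes | reg_bit(rd)


def writeback(pending_writes: int, rd: int) -> int:
    """Clear the pending bit once the result has been written."""
    return pending_writes & ~reg_bit(rd)


pending = issue(0, 12)                         # ld x12, 8(x9) is in flight
blocked = not can_issue(pending, 17, (13, 12)) # mul x17, x13, x12 must wait
pending = writeback(pending, 12)
ready = can_issue(pending, 17, (13, 12))
```

No source registers are stored, which reflects that WAR hazards do not need to be checked in the in-order case.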





| ld z | x12,  | 8(x9) |     |
|------|-------|-------|-----|
| ld z | x13,  | 0(x7) |     |
| mul  | x17,  | x13,  | x12 |
| sub  | i x18 | , x12 | , 2 |
| mul  | x13,  | x12,  | x18 |
| add  | x10,  | x17,  | x13 |

| Instruction | FU | rd |
|-------------|----|----|
|             |    |    |
|             |    |    |
|             |    |    |
|             |    |    |

| Exc | ec | uti | ior | 1 [ | Dia | gı | ar | n |  |  |  |  |  |  |  |
|-----|----|-----|-----|-----|-----|----|----|---|--|--|--|--|--|--|--|
|     |    |     |     |     |     |    |    |   |  |  |  |  |  |  |  |
|     |    |     |     |     |     |    |    |   |  |  |  |  |  |  |  |
|     |    |     |     |     |     |    |    |   |  |  |  |  |  |  |  |
|     |    |     |     |     |     |    |    |   |  |  |  |  |  |  |  |
|     |    |     |     |     |     |    |    |   |  |  |  |  |  |  |  |
|     |    |     |     |     |     |    |    |   |  |  |  |  |  |  |  |



1 ld x12, 8(x9)
 ld x13, 0(x7)
 mul x17, x13, x12
 subi x18, x12, 2
 mul x13, x12, x18
 add x10, x17, x13

| Instruction | FU | rd |
|-------------|----|----|
|             |    |    |
|             |    |    |
|             |    |    |
|             |    |    |

#### **Execution Diagram**



0: #1 can be started

Computer Architecture - Chapter 4 - Complex Pipelining



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
  mul x17, x13, x12
  subi x18, x12, 2
  mul x13, x12, x18
  add x10, x17, x13

| Instruction | FU  | rd  |
|-------------|-----|-----|
| 1           | LSU | x12 |
|             |     |     |
|             |     |     |
|             |     |     |

#### **Execution Diagram**



0: #1 can be started

1: #2 can be started, #1 reads operands



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
  mul x17, x13, x12
  subi x18, x12, 2
  mul x13, x12, x18
  add x10, x17, x13

| Instruction | FU  | rd  |
|-------------|-----|-----|
| 1           | LSU | x12 |
| 2           | LSU | x13 |
|             |     |     |
|             |     |     |

#### **Execution Diagram**



0: #1 can be started

1: #2 can be started, #1 reads operands

2: #3 is blocked (hazards)



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
  mul x17, x13, x12
  subi x18, x12, 2
  mul x13, x12, x18
  add x10, x17, x13

| Instruction | FU  | rd  |
|-------------|-----|-----|
| 1           | LSU | x12 |
| 2           | LSU | x13 |
|             |     |     |
|             |     |     |



- 0: #1 can be started
- 1: #2 can be started, #1 reads operands
- 2: #3 is blocked (hazards)
- 3: #1 will complete now



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
  mul x17, x13, x12
  subi x18, x12, 2
  mul x13, x12, x18
  add x10, x17, x13

| Instruction | FU    | rd    |
|-------------|-------|-------|
| 1           | -LSU- | -x12- |
| 2           | LSU   | x13   |
|             |       |       |
|             |       |       |



- 0: #1 can be started
- 1: #2 can be started, #1 reads operands
- 2: #3 is blocked (hazards)
- 3: #1 will complete now
- 4: #1 writes result, #2 will complete too



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
   subi x18, x12, 2
   mul x13, x12, x18
   add x10, x17, x13

| Instruction | FU    | rd    |
|-------------|-------|-------|
|             |       |       |
| 2           | -LSU- | -x13- |
|             |       |       |
|             |       |       |



- 0: #1 can be started
- 1: #2 can be started, #1 reads operands
- 2: #3 is blocked (hazards)
- 3: #1 will complete now
- 4: #1 writes result, #2 will complete too
- 5: #2 in WB, #3 can be started



- 1 ld x12, 8(x9)
- 2 1d x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
  mul x13, x12, x18
  add x10, x17, x13

| Instruction | FU  | rd  |
|-------------|-----|-----|
| 3           | MUL | x17 |
|             |     |     |
|             |     |     |
|             |     |     |



- 0: #1 can be started
- 1: #2 can be started, #1 reads operands
- 2: #3 is blocked (hazards)
- 3: #1 will complete now
- 4: #1 writes result, #2 will complete too
- 5: #2 in WB, #3 can be started
- 6: #4 can be started



- 1 ld x12, 8(x9)
- 2 1d x13, 0(x7)
- 3 mul x17, x13, x12
- a subi x18, x12, 2
  mul x13, x12, x18
  add x10, x17, x13

| Instruction | FU  | rd  |
|-------------|-----|-----|
| 3           | MUL | x17 |
| 4           | ALU | x18 |
|             |     |     |
|             |     |     |



- 0: #1 can be started
- 1: #2 can be started, #1 reads operands
- 2: #3 is blocked (hazards)
- 3: #1 will complete now
- 4: #1 writes result, #2 will complete too
- 5: #2 in WB, #3 can be started
- 6: #4 can be started
- 7: #5 blocked





- 1 ld x12, 8(x9)
- 2 1d x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2 mul x13, x12, x18 add x10, x17, x13

| Instruction | FU  | rd  |
|-------------|-----|-----|
| 3           | MUL | x17 |
| 4           | ALU | x18 |
|             |     |     |
|             |     |     |



- 0: #1 can be started
- 1: #2 can be started, #1 reads operands
- 2: #3 is blocked (hazards)
- 3: #1 will complete now
- 4: #1 writes result, #2 will complete too
- 5: #2 in WB, #3 can be started
- 6: #4 can be started
- 7: #5 blocked





- 1 ld x12, 8(x9)
- 2 1d x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
  add x10, x17, x13

| Instruction | FU    | rd    |
|-------------|-------|-------|
| 3           | MUL   | x17   |
| 4           | -ALU- | -x18- |
|             |       |       |
|             |       |       |



- 0: #1 can be started
- 1: #2 can be started, #1 reads operands
- 2: #3 is blocked (hazards)
- 3: #1 will complete now
- 4: #1 writes result, #2 will complete too
- 5: #2 in WB, #3 can be started
- 6: #4 can be started
- 7: #5 blocked
- 9: #4 in WB, #5 can be started





- 1 ld x12, 8(x9)
- 2 1d x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
  add x10, x17, x13

| Instruction | FU    | rd              |
|-------------|-------|-----------------|
| 3           | -MUL- | <del>-x17</del> |
| 5           | MUL   | x13             |
|             |       |                 |
|             |       |                 |



- 0: #1 can be started
- 1: #2 can be started, #1 reads operands
- 2: #3 is blocked (hazards)
- 3: #1 will complete now
- 4: #1 writes result, #2 will complete too
- 5: #2 in WB, #3 can be started
- 6: #4 can be started
- 7: #5 blocked
- 9: #4 in WB, #5 can be started
- 10: #6 is blocked



- 1 ld x12, 8(x9)
- 2 1d x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
  add x10, x17, x13

| Instruction | FU  | rd  |
|-------------|-----|-----|
|             |     |     |
| 5           | MUL | x13 |
|             |     |     |
|             |     |     |



- 0: #1 can be started
- 1: #2 can be started, #1 reads operands
- 2: #3 is blocked (hazards)
- 3: #1 will complete now
- 4: #1 writes result, #2 will complete too
- 5: #2 in WB, #3 can be started
- 6: #4 can be started
- 7: #5 blocked
- 9: #4 in WB, #5 can be started
- 10: #6 is blocked



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13

| Instruction | FU    | rd    |
|-------------|-------|-------|
|             |       |       |
| <del>5</del> | <del>MUL</del> | <del>x13</del> |
|             |       |       |
|             |       |       |



- 0: #1 can be started
- 1: #2 can be started, #1 reads operands
- 2: #3 is blocked (hazards)
- 3: #1 will complete now
- 4: #1 writes result, #2 will complete too
- 5: #2 in WB, #3 can be started
- 6: #4 can be started
- 7: #5 blocked
- 9: #4 in WB, #5 can be started
- 10: #6 is blocked
- 14: #5 in WB, #6 can be started



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13

| Instruction | FU  | rd  |
|-------------|-----|-----|
| 6           | ALU | x10 |
|             |     |     |
|             |     |     |
|             |     |     |



- 0: #1 can be started
- 1: #2 can be started, #1 reads operands
- 2: #3 is blocked (hazards)
- 3: #1 will complete now
- 4: #1 writes result, #2 will complete too
- 5: #2 in WB, #3 can be started
- 6: #4 can be started
- 7: #5 blocked
- 9: #4 in WB, #5 can be started
- 10: #6 is blocked
- 14: #5 in WB, #6 can be started



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13

| Instruction | FU    | rd    |
|-------------|-------|-------|
| <del>6</del> | <del>ALU</del> | <del>x10</del> |
|             |       |       |
|             |       |       |
|             |       |       |



- 0: #1 can be started
- 1: #2 can be started, #1 reads operands
- 2: #3 is blocked (hazards)
- 3: #1 will complete now
- 4: #1 writes result, #2 will complete too
- 5: #2 in WB, #3 can be started
- 6: #4 can be started
- 7: #5 blocked
- 9: #4 in WB, #5 can be started
- 10: #6 is blocked
- 14: #5 in WB, #6 can be started
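
The stalls in the trace above can be approximated with a small issue check. This is a minimal sketch, not the lecture's implementation: it assumes the scoreboard only records the destination register of every in-flight instruction, that an instruction stalls on a pending source (RAW) or a pending destination (WAW), that functional units are pipelined (no structural check), and that WAR cannot occur because operands are read in program order. The names `Insn`, `Scoreboard`, `can_issue`, `issue`, and `write_back` are illustrative.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Insn:
    num: int      # position in program order (1-based, as in the listing)
    fu: str       # functional unit: "LSU", "MUL" or "ALU"
    rd: str       # destination register
    srcs: tuple   # source registers (immediates and offsets are omitted)


@dataclass
class Scoreboard:
    # Maps each pending destination register to the instruction producing it.
    pending: dict = field(default_factory=dict)

    def can_issue(self, insn: Insn) -> bool:
        raw = any(src in self.pending for src in insn.srcs)  # source not yet written
        waw = insn.rd in self.pending                        # older write still in flight
        return not (raw or waw)

    def issue(self, insn: Insn) -> None:
        if not self.can_issue(insn):
            raise RuntimeError(f"#{insn.num} must stall")
        self.pending[insn.rd] = insn

    def write_back(self, insn: Insn) -> None:
        del self.pending[insn.rd]


program = [
    Insn(1, "LSU", "x12", ("x9",)),
    Insn(2, "LSU", "x13", ("x7",)),
    Insn(3, "MUL", "x17", ("x13", "x12")),
    Insn(4, "ALU", "x18", ("x12",)),
    Insn(5, "MUL", "x13", ("x12", "x18")),
    Insn(6, "ALU", "x10", ("x17", "x13")),
]

sb = Scoreboard()
sb.issue(program[0])
sb.issue(program[1])
blocked_3 = not sb.can_issue(program[2])  # RAW on x12 and x13
sb.write_back(program[0])
sb.write_back(program[1])
sb.issue(program[2])
sb.issue(program[3])
blocked_5 = not sb.can_issue(program[4])  # RAW on x18, produced by #4
sb.write_back(program[3])
sb.issue(program[4])
blocked_6 = not sb.can_issue(program[5])  # RAW on x17 and x13
```

The sketch ignores cycle timing entirely; it only shows why #3, #5, and #6 have to wait for earlier results before they can be started.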

# **Example: Ariane CPU**





https://github.com/pulp-platform/ariane/

### Limits of In-Order Issue



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



#### **Delayed Load Instruction**



Recap: data flow model, approach: reorder instructions







Start of instructions in arbitrary order

- As soon as hazards of each instruction are resolved
- Instruction buffer has window of next N instructions

But: Out-of-order issue alone does not improve performance much!



# **Out-of-Order: Example**



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13











WAW and WAR limit further reordering

- Not true data dependencies (name dependencies only)
- Artificially introduced by the limited number of register names

Problem with limited registers

- Number of registers limited by ISA
- Compiler optimizations limited
- Especially with different execution paths

Approach: CPU solves problem by register renaming
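
To make the distinction between true and name dependencies concrete, the following sketch classifies the hazards in the six-instruction example. It is an illustration under simplifying assumptions: each instruction is reduced to a `(rd, srcs)` pair, immediates and address offsets are dropped, and the function name `classify_hazards` is hypothetical.

```python
def classify_hazards(program):
    """Return (earlier, later, kind, register) tuples with 1-based instruction numbers."""
    hazards = []
    for j, (rd_j, srcs_j) in enumerate(program):
        # RAW: each source depends on its most recent earlier writer.
        for src in srcs_j:
            for i in range(j - 1, -1, -1):
                if program[i][0] == src:
                    hazards.append((i + 1, j + 1, "RAW", src))
                    break
        # WAR / WAW: the later instruction reuses a register name.
        for i in range(j):
            rd_i, srcs_i = program[i]
            if rd_j in srcs_i:
                hazards.append((i + 1, j + 1, "WAR", rd_j))
            if rd_j == rd_i:
                hazards.append((i + 1, j + 1, "WAW", rd_j))
    return hazards


program = [
    ("x12", ("x9",)),          # 1 ld   x12, 8(x9)
    ("x13", ("x7",)),          # 2 ld   x13, 0(x7)
    ("x17", ("x13", "x12")),   # 3 mul  x17, x13, x12
    ("x18", ("x12",)),         # 4 subi x18, x12, 2
    ("x13", ("x12", "x18")),   # 5 mul  x13, x12, x18
    ("x10", ("x17", "x13")),   # 6 add  x10, x17, x13
]

hazards = classify_hazards(program)
name_hazards = [h for h in hazards if h[2] != "RAW"]
```

In this example the only name hazards involve `x13`: instruction 5 overwrites the register that instruction 2 writes (WAW) and instruction 3 still has to read (WAR).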



# Register Renaming: Example



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13
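
One way to picture renaming on this example is to give every destination a fresh name. The sketch below is an assumption-laden illustration rather than a description of real hardware: it assumes an unlimited supply of physical names `p0, p1, ...`, never frees them, and the function name `rename` is hypothetical.

```python
from itertools import count


def rename(program):
    """Rename (rd, srcs) pairs so that every write gets a fresh physical name."""
    fresh = (f"p{n}" for n in count())
    mapping = {}  # architectural register -> current physical name
    renamed = []
    for rd, srcs in program:
        new_srcs = []
        for src in srcs:
            if src not in mapping:          # live-in value: allocate a name on first use
                mapping[src] = next(fresh)
            new_srcs.append(mapping[src])
        mapping[rd] = next(fresh)           # fresh name removes WAR and WAW on rd
        renamed.append((mapping[rd], tuple(new_srcs)))
    return renamed


program = [
    ("x12", ("x9",)),          # 1 ld   x12, 8(x9)
    ("x13", ("x7",)),          # 2 ld   x13, 0(x7)
    ("x17", ("x13", "x12")),   # 3 mul  x17, x13, x12
    ("x18", ("x12",)),         # 4 subi x18, x12, 2
    ("x13", ("x12", "x18")),   # 5 mul  x13, x12, x18
    ("x10", ("x17", "x13")),   # 6 add  x10, x17, x13
]

renamed = rename(program)
```

After renaming, instruction 5 writes a different name than the one instructions 2 and 3 use for `x13`, so only the true (RAW) dependencies remain, and instruction 6 reads the name written by instruction 5.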







# **Register Renaming**





Approach: Use microarchitectural ("virtual") register names

- Entirely eliminates WAR and WAW hazards
- Not visible to the outside world

Introduced by Robert Tomasulo (1967)

- Reservation stations store instructions and renames
- Format of reservation stations (multiple entries per FU)
  - Op: Operation
  - Qj, Qk: Reservation station that produces source registers (pending)
  - Vj, Vk: Value of source register (once available)
  - Busy: Reservation station is active
- Additionally: Register result status shows which RS produces registers
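
The fields listed above map naturally onto a small record. The following is a hedged sketch of one reservation station entry plus the register result status. The helper names `issue` and `broadcast`, the dictionary-based register file, and the load values 6 and 7 are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str                   # e.g. "MUL0"
    busy: bool = False          # Busy: entry is in use
    op: Optional[str] = None    # Op: operation to perform
    vj: Optional[int] = None    # Vj, Vk: operand values once available
    vk: Optional[int] = None
    qj: Optional[str] = None    # Qj, Qk: RS that will produce the operand (pending)
    qk: Optional[str] = None

    def ready(self) -> bool:
        return self.busy and self.qj is None and self.qk is None


def issue(rs, op, rd, src_j, src_k, regfile, result_status):
    """Fill an entry: take values that exist, otherwise record the producing RS."""
    rs.busy, rs.op = True, op
    rs.qj = result_status.get(src_j)
    rs.vj = regfile[src_j] if rs.qj is None else None
    rs.qk = result_status.get(src_k)
    rs.vk = regfile[src_k] if rs.qk is None else None
    result_status[rd] = rs.name  # rd will be produced by this entry


def broadcast(tag, value, stations, regfile, result_status):
    """Deliver a finished result to waiting entries and to the register file."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg in [r for r, producer in result_status.items() if producer == tag]:
        regfile[reg] = value
        del result_status[reg]


# Both loads are in flight; operand order is chosen so that Qj = LSU0, Qk = LSU1.
regfile = {"x12": 0, "x13": 0, "x17": 0}
result_status = {"x12": "LSU0", "x13": "LSU1"}
mul0 = ReservationStation("MUL0")
issue(mul0, "mul", "x17", "x12", "x13", regfile, result_status)
waiting = (mul0.qj, mul0.qk)
broadcast("LSU0", 6, [mul0], regfile, result_status)   # hypothetical load value
half_ready = mul0.ready()
broadcast("LSU1", 7, [mul0], regfile, result_status)   # hypothetical load value
```

Once both loads have broadcast their results, `mul0.ready()` is true and the entry holds both operand values, while `result_status` still names `MUL0` as the producer of `x17`.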





- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13

| 1 |  |  |  |  |  |  |  |  |  |  |  |  |  |
|---|--|--|--|--|--|--|--|--|--|--|--|--|--|
| 2 |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 3 |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 4 |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 5 |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 6 |  |  |  |  |  |  |  |  |  |  |  |  |  |

|   | ALU  |    |    |    |    | LSU  |    |    |    |    | MUL  |    |    |    |    |
|---|------|----|----|----|----|------|----|----|----|----|------|----|----|----|----|
|   | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk |
| 0 |      |    |    |    |    |      |    |    |    |    |      |    |    |    |    |
| 1 |      |    |    |    |    |      |    |    |    |    |      |    |    |    |    |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |    |    | LSU  |    |    |    |    | MUL  |    |    |    |    |
|---|------|----|----|----|----|------|----|----|----|----|------|----|----|----|----|
|   | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk |
| 0 |      |    |    |    |    | 1    | 8  |    | -  | -  |      |    |    |    |    |
| 1 |      |    |    |    |    |      |    |    |    |    |      |    |    |    |    |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |    |    | LSU  |    |    |    |    | MUL  |    |    |    |    |
|---|------|----|----|----|----|------|----|----|----|----|------|----|----|----|----|
|   | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk |
| 0 |      |    |    |    |    | 1    | 8  |    | -  | -  |      |    |    |    |    |
| 1 |      |    |    |    |    | 2    | 0  |    | -  | -  |      |    |    |    |    |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |    |    | LSU  |    |    |    |    | MUL  |    |    |      |      |
|---|------|----|----|----|----|------|----|----|----|----|------|----|----|------|------|
|   | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj   | Qk   |
| 0 |      |    |    |    |    | 1    | 8  |    | -  | -  | 3    | -  | -  | LSU0 | LSU1 |
| 1 |      |    |    |    |    | 2    | 0  |    | -  | -  |      |    |    |      |      |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |      |    | LSU  |    |    |    |    | MUL  |    |    |      |      |
|---|------|----|----|------|----|------|----|----|----|----|------|----|----|------|------|
|   | Insn | Vj | Vk | Qj   | Qk | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj   | Qk   |
| 0 | 4    | -  | 2  | LSU0 | -  | 1    | 8  |    | -  | -  | 3    | -  | -  | LSU0 | LSU1 |
| 1 |      |    |    |      |    | 2    | 0  |    | -  | -  |      |    |    |      |      |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |                 |    | LSU  |    |    |    |    | MUL  |    |    |                 |      |
|---|------|----|----|-----------------|----|------|----|----|----|----|------|----|----|-----------------|------|
|   | Insn | Vj | Vk | Qj              | Qk | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj              | Qk   |
| 0 | 4    |    | 2  | <del>LSU0</del> | -  | 1    | 8  |    | -  | -  | 3    |    | -  | <del>LSU0</del> | LSU1 |
| 1 |      |    |    |                 |    | 2    | 0  |    | -  | -  | 5    |    | -  | -               | ALU0 |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |      |      | LSU  |    |    |    |    | MUL  |    |    |    |      |
|---|------|----|----|------|------|------|----|----|----|----|------|----|----|----|------|
|   | Insn | Vj | Vk | Qj   | Qk   | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk   |
| 0 | 4    |    | 2  | -    | -    |      |    |    |    |    | 3    |    | -  | -  | LSU1 |
| 1 | 6    | -  | -  | MUL0 | MUL1 | 2    | 0  |    | -  | -  | 5    |    | -  | -  | ALU0 |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |      |      | LSU  |    |    |    |    | MUL  |    |    |    |      |
|---|------|----|----|------|------|------|----|----|----|----|------|----|----|----|------|
|   | Insn | Vj | Vk | Qj   | Qk   | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk   |
| 0 | 4    |    | 2  | -    | -    |      |    |    |    |    | 3    |    | -  | -  | LSU1 |
| 1 | 6    | -  | -  | MUL0 | MUL1 | 2    | 0  |    | -  | -  | 5    |    | -  | -  | ALU0 |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |      |      | LSU  |    |    |    |    | MUL  |    |    |    |      |
|---|------|----|----|------|------|------|----|----|----|----|------|----|----|----|------|
|   | Insn | Vj | Vk | Qj   | Qk   | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk   |
| 0 | 4    |    | 2  | -    | -    |      |    |    |    |    | 3    |    | -  | -  | LSU1 |
| 1 | 6    | -  | -  | MUL0 | MUL1 | 2    | 0  |    | -  | -  | 5    |    |    | -  | ALU0 |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |      |      | LSU  |    |    |    |    | MUL  |    |    |    |      |
|---|------|----|----|------|------|------|----|----|----|----|------|----|----|----|------|
|   | Insn | Vj | Vk | Qj   | Qk   | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk   |
| 0 |      |    |    |      |      |      |    |    |    |    | 3    |    | -  | -  | LSU1 |
| 1 | 6    | -  | -  | MUL0 | MUL1 | 2    | 0  |    | -  | -  | 5    |    |    | -  | -    |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |      |      | LSU  |    |    |    |    | MUL  |    |    |    |      |
|---|------|----|----|------|------|------|----|----|----|----|------|----|----|----|------|
|   | Insn | Vj | Vk | Qj   | Qk   | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk   |
| 0 |      |    |    |      |      |      |    |    |    |    | 3    |    | -  | -  | LSU1 |
| 1 | 6    | -  | -  | MUL0 | MUL1 | 2    | 0  |    | -  | -  | 5    |    |    | -  | -    |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |      |      | LSU  |    |    |    |    | MUL  |    |    |    |      |
|---|------|----|----|------|------|------|----|----|----|----|------|----|----|----|------|
|   | Insn | Vj | Vk | Qj   | Qk   | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk   |
| 0 |      |    |    |      |      |      |    |    |    |    | 3    |    |    | -  | LSU1 |
| 1 | 6    | -  | -  | MUL0 | MUL1 | 2    | 0  |    | -  | -  | 5    |    |    | -  | -    |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |      |      | LSU  |    |    |    |    | MUL  |    |    |    |    |
|---|------|----|----|------|------|------|----|----|----|----|------|----|----|----|----|
|   | Insn | Vj | Vk | Qj   | Qk   | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk |
| 0 |      |    |    |      |      |      |    |    |    |    | 3    |    |    | -  | -  |
| 1 | 6    | -  | -  | MUL0 | MUL1 |      |    |    |    |    | 5    |    |    | -  | -  |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |      |      | LSU  |    |    |    |    | MUL  |    |    |    |    |
|---|------|----|----|------|------|------|----|----|----|----|------|----|----|----|----|
|   | Insn | Vj | Vk | Qj   | Qk   | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk |
| 0 |      |    |    |      |      |      |    |    |    |    | 3    |    |    | -  | -  |
| 1 | 6    | -  |    | MUL0 | MUL1 |      |    |    |    |    | 5    |    |    | -  | -  |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |      |    | LSU  |    |    |    |    | MUL  |    |    |    |    |
|---|------|----|----|------|----|------|----|----|----|----|------|----|----|----|----|
|   | Insn | Vj | Vk | Qj   | Qk | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk |
| 0 |      |    |    |      |    |      |    |    |    |    | 3    |    |    | -  | -  |
| 1 | 6    | -  |    | MUL0 |    |      |    |    |    |    |      |    |    |    |    |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |      |    | LSU  |    |    |    |    | MUL  |    |    |    |    |
|---|------|----|----|------|----|------|----|----|----|----|------|----|----|----|----|
|   | Insn | Vj | Vk | Qj   | Qk | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk |
| 0 |      |    |    |      |    |      |    |    |    |    | 3    |    |    | -  | -  |
| 1 | 6    | -  |    | MUL0 |    |      |    |    |    |    |      |    |    |    |    |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |      |    | LSU  |    |    |    |    | MUL  |    |    |    |    |
|---|------|----|----|------|----|------|----|----|----|----|------|----|----|----|----|
|   | Insn | Vj | Vk | Qj   | Qk | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk |
| 0 |      |    |    |      |    |      |    |    |    |    | 3    |    |    | -  | -  |
| 1 | 6    |    |    | MUL0 |    |      |    |    |    |    |      |    |    |    |    |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |    |    | LSU  |    |    |    |    | MUL  |    |    |    |    |
|---|------|----|----|----|----|------|----|----|----|----|------|----|----|----|----|
|   | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk |
| 0 |      |    |    |    |    |      |    |    |    |    |      |    |    |    |    |
| 1 | 6    |    |    |    |    |      |    |    |    |    |      |    |    |    |    |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |    |    | LSU  |    |    |    |    | MUL  |    |    |    |    |
|---|------|----|----|----|----|------|----|----|----|----|------|----|----|----|----|
|   | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk |
| 0 |      |    |    |    |    |      |    |    |    |    |      |    |    |    |    |
| 1 | 6    |    |    |    |    |      |    |    |    |    |      |    |    |    |    |



- 1 ld x12, 8(x9)
- 2 ld x13, 0(x7)
- 3 mul x17, x13, x12
- 4 subi x18, x12, 2
- 5 mul x13, x12, x18
- 6 add x10, x17, x13



|   | ALU  |    |    |    |    | LSU  |    |    |    |    | MUL  |    |    |    |    |
|---|------|----|----|----|----|------|----|----|----|----|------|----|----|----|----|
|   | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk | Insn | Vj | Vk | Qj | Qk |
| 0 |      |    |    |    |    |      |    |    |    |    |      |    |    |    |    |
| 1 | 6    |    |    |    |    |      |    |    |    |    |      |    |    |    |    |