

### COMPUTER ORGANIZATION AND DESIGN

The Hardware/Software Interface



## **Chapter 1**

# Computer Abstractions and Technology

Updated by Dr. Waleed Dweik



## **The Computer Revolution**

- Progress in computer technology
  - Underpinned by Moore's Law
- Makes novel applications feasible
  - Computers in automobiles
  - Cell phones
  - Human genome project
  - World Wide Web
  - Search Engines
- Computers are pervasive



### Moore's Law



# transisters per unit area doubles every
18 - to- 24 months

### **Classes of Computers**

- Personal computers
  - General purpose, variety of software
  - Subject to cost/performance tradeoff
- Server computers
  - Network based
  - High capacity, performance, reliability

### **Classes of Computers**

### Supercomputers



- High-end scientific and engineering calculations
- Highest capability but represent a small fraction of the overall computer market
- Embedded computers
  - Hidden as components of systems
  - Stringent performance/cost constraints
  - Run one set of related applications that are normally integrated with the hardware



### The PostPC Era

- Personal Mobile Device (PMD)
  - Battery operated
  - Connects to the Internet
  - Hundreds of dollars
  - Smart phones, tablets, electronic glasses
- Cloud computing
  - Warehouse Scale Computers (WSC)
  - Software as a Service (SaaS)
  - Portion of software run on a PMD and a portion run in the Cloud
  - Amazon and Google



# **Eight Great Ideas (1)**

Design for *Moore's Law* 



- Transistors Integrated circuits resources double every 18-24 months
- Computer design takes years, the resources can double or quadruple between the start and finish of the project  $(Z) \xrightarrow{2021} \longrightarrow 2023 (2*Z)$
- Use *abstraction* to simplify design
  - Represent the design at multiple levels such that low-level details are hidden to offer simpler model at high-level
  - Increase productivity for computer architects and programmers
- Make the common case fast



- Enhance the performance is better than optimizing the rare case
- Example?



### Make the Common Case Fast

```
\frac{1}{1} \frac{ddikion(+)}{division(+)} \Rightarrow 2 \text{ ns} = 2 \times 10^9 \text{ seconds}
           division (=) => 20 ns = 20x109 seconds
      90% of the operation; it performs addition (common)
10% of a good division (vare)
      Time = 2 × 0.94 + 20 × 0.14 = 1.84 + 24
                                               = (3.87) NS
       ( Y = number of operations)
                                         Time = 1x0.9y + 20x0.1y
               addition (+) => 1 ns
 Solution 1:
               division (+) => 20 ns
                                               = 0.9y+ Zy=(P.9y)
Solution 2.
                addition (+) =) 2 ms
                                        Time = 2x0.9y + 15x0.14.
                division (+) => 15 ms
                                               = 1.87 + 1.57
                                                 = (3.3y)ns
```

#### Eight Great Ideas (2) CPU, Performance via parallelism + Performance via pipelining Special pattern of parallelism Example: Human bridge to fight fires Performance via prediction Better ask for forgiveness than ask for permission accuracy *Hierarchy* of memories Speed, size, and cost Fastest, smallest, and most expensive at the top of the hierarchy (Cache) Slowest, largest, and least expensive at the bottom of the hierarchy (Hard Drives) Dependability via redundancy

# **Eight Great Ideas (3)**

- Why these ideas are considered great?
  - Lasted long after the 1st computer that used them
  - They have been around for the last 60 years

# **Below Your Program**

- Application software
  - Written in high-level language (HLL)
    - Allow the programmer to think in a more natural language (look much more like a text than tables of cryptic symbols)
    - Improved productivity (conciseness)
    - Programs become independent from hardware
- System software
  - Compiler: translates HLL code to machine code
  - Operating System: service code
    - Handling input/output
    - Managing memory and storage
    - Scheduling tasks & sharing resources
    - E.g. Windows, Linux, iOS
- Hardware
  - Processor, memory, I/O controllers





a= b+c;



# **Levels of Program Code**

#### High-level language

- Level of abstraction closer to problem domain
- Provides for productivity and portability
- E.g. A + B

#### Assembly language

- Textual representation of machine instructions
  - E.g. Add A,B

#### Hardware representation

- Binary digits (bits)
- Encoded instructions and data program
- E.g. 11100011010



(for RISC-V)

### Components of a Computer

#### **The BIG Picture**



- Same components for all kinds of computer
  - Desktop, server, embedded
- Five components
  - Input, output, memory,

    datapath and control 3->000cessor
  - Input/output includes
    - User-interface devices
      - Display, keyboard, mouse
- Storage devices
  - Hard disk, CD/DVD, flash
  - Network adapters
    - For communicating with other computers



# **Through the Looking Glass**

- LCD (liquid crystal display) screen: picture elements (pixels)
- An image is composed of a matrix of pixels called a bit map
  - Bit map size is based on screen size and resolution
  - 1024x768 to 2048x1536
- Hardware support for graphics
  - Frame buffer (raster refresh buffer) to store the bit map image to be displayed

of the

00000-0

Bit pattern for pixels is read out at the refresh rate



### Touchscreen

- PostPC device
- Supersedes keyboard and mouse
- Resistive and
- Capacitive types
  - Most tablets, smart phones use capacitive
  - Capacitive allows multiple touches simultaneously



# Inside the Processor (CPU)

- Datapath: performs operations on data Control: sequences datapath, memory, ...
- Cache memory
  - Small fast SRAM memory for immediate access to data

adder/sub Varapath

multiplear

adolls-b Control unit

select lines

## A Safe Place for Data (1)

- Volatile main (primary) memory
  - DRAMs: Dynamic Random Access Memory
  - Holds data and programs while running
  - Loses instructions and data when power off
- Non-volatile secondary memory
  - Magnetic disk
  - Flash memory in PMDs
  - Optical disk (CDROM, DVD)















### A Safe Place for Data (2)



# **Defining Performance**

Which airplane has the best performance?









# Response Time and Throughput

- Response or execution time
  - How long it takes to do a task
  - Individual users target reducing the response time
- Throughput
  - Total work done per unit time
    - e.g., tasks/transactions/... per hour
  - Datacenter managers target increasing throughput
- How are response time and throughput affected by
  - Replacing the processor with a faster version?
- Adding more processors?
- We'll focus on response time for now... response time





### **Relative Performance**

- Define Performance = 1/Execution Time
- "X is n time faster than Y" Performancex

Performance<sub>x</sub>/Performance<sub>y</sub>

= Execution time $_{\times}$  /Execution time $_{\times} = n$ 



- 10s on A, 15s on B
- Execution Time<sub>B</sub> / Execution Time<sub>A</sub>

$$= 15s / 10s = (1.5)$$

So A is 1.5 times faster than B

### **Measuring Execution Time**

Elapsed time (response or wall clock time) of Free dia



- Total response time, including all aspects OProcessing, I/OOOS overhead, idle time
- System performance = 1 / Elapsed time
- CPU time = User CPUtine + Syshem CPU time
  - Time spent processing a given job
    - Discounts I/O time, other jobs' shares idle time
  - Comprises of:
    - User CPU time CPU time spent in the program
    - System CPU time CPU time spent in the OS performing tasks on behalf of the program
  - CPU performance = 1 / User CPU time



### **CPU Clocking**

Digital Synchronous System

Operation of digital hardware governed by a constant-rate clock



- Clock period: duration of a clock cycle
  - e.g.,  $250ps = 0.25ns = 250 \times 10^{-12}s$

Clock frequency (rate): cycles per second

e.g., 
$$4.0GHz = 4000MHz = 4.0 \times 10^{9}Hz$$

4×107 HZ

Chapter 1 — Computer Abstractions and Technology — 25

### **CPU Time**



- Performance improved by
  - Reducing number of clock cycles
  - Increasing clock rate
  - Hardware designer must often trade off clock rate against cycle count

### **CPU Time Example**

- Computer A: 2GHz clock, 10s CPU time
   Designing Computer B
   Aim for 6s CPU time

  CPU clock (syle = 1.2 × 2×10 a)

  CPU
  - Can do faster clock, but causes 1.2 × clock cycles 2×10
- How fast must Computer B clock be?

Clock Rate<sub>B</sub> = 
$$\frac{\text{Clock Cycles}_{B}}{\text{CPU Time}_{B}} = \frac{1.2 \times \text{Clock Cycles}_{A}}{6s}$$

Clock Cycles<sub>A</sub> = CPU Time<sub>A</sub> × Clock Rate<sub>A</sub>

$$= 10s \times 2GHz = 20 \times 10^{9}$$

$$= 1.2 \times 20 \times 10^{9}$$
Clock Rate<sub>B</sub> =  $\frac{1.2 \times 20 \times 10^{9}}{6s} = \frac{24 \times 10^{9}}{6s} = 4GHz$ 

$$= 4GHz$$

$$= 4GHz$$

Computer Abstractions and Technology — 27

### Instruction Count and CPI

Clock Cycles = Instruction Count × Cycles per Instruction CPU Time = Instruction Count × CP × Clock Cycle Time Instruction Count × CPI Clock Rate

- Instruction Count for a program Consists of 100 Instructions

  Determined by program, ISA and compiler \*each instruction
  require 3 cycles
  to complete
- Average cycles per instruction
  - Determined by CPU hardware
  - If different instructions have different CPI
    - Average CPI affected by instruction mix

CPU time = 300 X T



# CPI Example on Averye, each instruction vernives 2 cycles to execute

- Computer A: Cycle Time = 250ps, CPI = 2.0
- Computer B: Cycle Time = 500ps, CPI = 1.2
- Same ISA and Complier
- Which is faster, and by how much?

$$\begin{array}{lll} & \text{CPU Time}_{A} = \text{Instruction Count} \times \text{CPI}_{A} \times \text{Cycle Time}_{A} \\ & = I \times 2.0 \times 250 \text{ps} = I \times 500 \text{ps} \\ & \text{CPU Time}_{B} = \text{Instruction Count} \times \text{CPI}_{B} \times \text{Cycle Time}_{B} \\ & = I \times 1.2 \times 500 \text{ps} = I \times 600 \text{ps} \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{I \times 500 \text{ps}} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{A} = 1.2 \\ & \text{CPU Time}_{A} = \underbrace{I \times 600 \text{ps}}_{A} = 1.2 \\ & \text{CPU Time}_{A} = 1.2 \\ & \text{CPU Time$$

ICA = ICR

Same ISA + Same Compiler = Same (Same ) ICi

Same Hardware => Same CPI i Same T (Same R)

### **CPI in More Detail**

If different instruction classes take different numbers of cycles clock Godes = ICXCPI

clock Cycles = 
$$\sum_{i=1}^{n} (CPl_i \times Instruction Count_i)$$
with the second state of the second seco

19/4X = 1×10 + 1 ×5 + 10×3 + 7 × 4
Weighted average CPI = 10+5+30+28= 73

$$CPI = \frac{Clock \ Cycles}{Instruction \ Count} = \sum_{i=1}^{n} \left( CPI_i \times \frac{Instruction \ Count_i}{Instruction \ Count} \right)$$

Relative frequency

total



### **CPI Example**

Alternative compiled code sequences using instructions in classes A B C

| CPIST = E CPIi × | ICI = CPIA | $\times \frac{\Sigma C}{1} + C$ | PÎB X ICB   | > + CP. |
|------------------|------------|---------------------------------|-------------|---------|
| Class            | Α          | В                               | С           | ,       |
| CPI for class    | 100        | 2 ~                             | 3           |         |
| IC in sequence 1 | 2          | 1                               | 2 1         | Jc'= ;  |
| IC in sequence 2 | 4          | 1                               | 1           | Ics:    |
|                  | 105        | . ~ 🗸                           | 2 X \ 1 2 X | 1       |

Sequence 1: IC = 5 Sequence 2: 
$$IC = 6$$

• Clock Cycles = 
$$2 \times 1 + 1 \times 2 + 2 \times 3$$
 =  $4 \times 1 + 1 \times 2$   
= 10

• Avg. CPI = 
$$10/5 = 2.0$$
 • Avg. CPI =  $9/6 = 1.5$ 

\* seg 2 is 10 Faster than seg 1 Program Clock byle s CPIany = Clack Cycles, X T = 10T = IC1 x CPlay X T = 5x2xT=10T Clock Cycles, XT = 9T CPO time seq = 6×1.5×T IC2 XCPI2aux7

Chapter 1 — Computer Abstractions and Technology — 35

MORGAN KAUFMANN

Given Two Compiled Sequences of the same program with the IC, XCPIAVS XT, Rollowing information. CP Utime = = TC1 × 2.4 × T1 CPIarg, = & CPIi × ICi class 10% Relative of Seq 1 Grequery  $= 3 \times 0.2 +$ 5% - Segz 2 x 0.1 + 1× 6.4 + CPU Sam on the 4 × 0.3 = 0.6+ 6.2+0.4 41.2 = 2.4  $3 \times 0.5 + 2 \times 0.05$ CPlang = +4x0.45 = 1.5 + 6.1 + 1.8 = 3.4 ICZXCPJay 2

### One More Example ...

Two compiled sequences of the same program are given below. The upper table gives the number of instructions from each type for sequence-1 and sequence-2. The lower table gives the CPI for each instruction type. Given that the two sequences are running on the <a href="mailto:same computer">same computer</a>, What is the number of instructions of type B in sequence-2 that will make sequence-1 two times faster than sequence-2?

$$\frac{ET_2}{ET_1} = \frac{CPU\ Clock\ Cycles_2}{CPU\ Clock\ Cycles_1} = 2$$

$$\frac{2 \times 1 + ? \times 2 + 2 \times 3}{1 \times 1 + 2 \times 2 + 4 \times 3} = \frac{8 + ? \times 2}{17} = 2$$

$$8 + ? \times 2 = 34$$

$$? = \frac{34 - 8}{2} = 13$$

|   | В | С             |
|---|---|---------------|
|   |   |               |
|   |   | 4             |
| 2 |   | 2             |
|   |   |               |
|   | В | \             |
|   | 2 | 3             |
|   |   | (?)<br>B<br>2 |

CPUtime<sub>1</sub>

CPUtime<sub>1</sub>

Clock Cycle 2 X 
$$\sqrt{7}$$
 = 2

Clock Cycle, X  $\sqrt{7}$ ,



#### COMPUTER ORGANIZATION AND DESIGN

The Hardware/Software Interface



### **Chapter 2**

# Instructions: Language of the Computer

#### **Instruction Set**

- The repertoire of instructions of a computer
- Instructions represent the computer language
  - Every language has vocabulary → Instruction Set
- Different computers have different instruction sets
  - But with many aspects in common (Why?)
    - All computers are constructed from hardware technologies based on similar underlying principles
    - There are a few basic operations that all computers must provide
- Early computers had very simple instruction sets
- Many modern computers also have simple instruction sets
- RISC: Reduced Instruction Set Computer
  - Fixed length, fast execution, and simple functionality
- CISC: Complex Instruction Set Computer
  - Variable length, slow execution, and complex functionality



#### **Famous Commercial ISA**

- MIPS is an elegant example of the instruction sets designed since the 1980s. (RISC)
- ARM instruction set from ARM Holdings plc introduced in 1985. (RISC)
  - More than 14 billion chips with ARM processors were manufactured in 2015, making them the most popular instruction sets in the world.
- Intel x86, which powers both the PC and the Cloud of the post-PC era. (CISC)

#### The RISC-V Instruction Set

- Used as the example throughout the book
- Developed at UC Berkeley as open ISA
- Now managed by the RISC-V Foundation (<u>riscv.org</u>)
- Typical of many modern ISAs
  - See RISC-V Reference Data tear-out card
- Similar ISAs have a large share of embedded core market
  - Applications in consumer electronics, network/storage equipment, cameras, printers, ...
- RISC-V has three design principles



#### **Arithmetic Operations**

- Each arithmetic instruction performs only one operation (Add, subtract, etc.) and must always have exactly three operands
  - Two sources and one destination

- All arithmetic operations have this form
  - Example: We want to add four numbers (b, c, d, and e) and

```
addto,b,c add a, b, c a-b+c b+c add a, a, d a=a+d b+c+d b+c+d add a, a, e a=a+e b+c+d and as Source

Design Principle 1: Simplicity force
```

- Design Principle 1: Simplicity favors regularity
  - Regularity makes implementation simpler
  - Simplicity enables higher performance at lower cost



#### **Arithmetic Example**

C code:

$$f = (g + h) - (i + j);$$

Compiled RISC-V code:

```
add t0, g, h // temp t0 = g + h add t1, i, j // temp t1 = i + j subfifigure for t0, t1 // f = t0 - t1

Source 1 operation Source 2
```

### Register Operands

- Arithmetic instructions use register operands

- RISC-V has a 32 × 64-bit register file
  - Use for frequently accessed data
  - 64-bit data is called a "doubleword"
    - 32 x 64-bit general purpose registers x0 to x31
  - 32-bit data is called a "word"



- Design Principle 2: (Smaller is faster)
  - Variables in HLL are unlimited while there are only 32 registers in RISC-V
  - Designers must balance the need for more registers by the programmers and the desire to keep the clock rate fast (c.f. main memory: millions of locations)
  - Number of registers also affects the number of bits it would take to represent a specific register in the instruction format







## RISC-V Registers RISC-

RISC-V V-via ble length

- x0: the constant value 0
- x1: return address
- x2: stack pointer
- x3: global pointer
- x4: thread pointer
- x5 x7, x28 x31: temporaries
- x8: frame pointer
- x9, x18 x27: saved registers
- x10 x11: function arguments/results
- x12 x17: function arguments







data -> 32-bit instruction > 1/ RISC-V 64-bit version

data \_ 5 64-bit instructions \_ 32-bit

Chapter 2 — Instructions: Language of the Computer —

#### Register Operand Example

C code:

$$f = (g + h) - (i + j);$$
  
 $f,9.4.9, j in x19, x20, ..., x23$ 

Compiled RISC-V code:

#### **Memory Operands**

- Main memory used for composite data
  - Arrays, structures, dynamic data
- To apply arithmetic operations
  - Load values from memory into registers
  - Store result from register to memory
- RISC-V provides <u>data transfer</u> instructions to transfer data from memory to registers and vice-versa

### **Memory Organization**

Memory can be thought of as a single dimensional array
 Memory[0] = 1, Memory[1] = 101, ...
 Index used with HLL

Memory address = index × 8 | Index |

access a double word (8 bytes) in memory, the instruction

To access a double word (8 bytes) in memory, the instruction must provide the memory address for that doubleword

Memory is byte addressed

Memory [8]

Each address identifies an 8-bit byte

RISC-V is Little Endian

Least-significant byte at least address of a word

c.f. Big Endian: most-significant byte at least address

RISC-V does not require words/doublewords to be aligned in memory

Unlike some other ISAs



Chapter 2 — Instructions: Language of the Computer — 12

Data

100

10

101

1

Data

Memory address = Starting address + (index \* Data Size) in bytes) (RISC-V) Little Endian Us. Big Endian 64-bit data: (1ABZ OCDF 98E5 468C) 64/4 = 16 digits havad = 0X, 1AB2 OCDF 98E5 46,8C MSB OX IA hexa decima little OXES Endisn Endian 1000 1100 - 0x 8C

Chapter 2 — Instructions: Language of the Computer — 13

#### **Data Transfer Instructions**

- Load: copies data from a memory location to a register

  | Id = | bad | doubleword | contents of base registers |
  | Id x28, 32 (x18) = (x28) ← Memory [32 + (x18)] |
  | Vegish | offset | source | base register | contents of base registers |
  | represent the starting address | contents of base registers |
  | represent the starting address | contents of base registers |
  | represent the starting address | contents of base registers |
  | represent the starting address | contents of base registers |
  | represent the starting address | contents of base registers |
  | represent the starting address | contents of base registers |
  | represent the starting address | contents of base registers |
  | represent the starting address | contents of base registers |
  | represent the starting address | contents of base registers |
  | represent the starting address | contents of base registers |
  | represent the starting address | contents of base registers |
  | represent the starting address | contents of base registers |
  | represent the starting address | contents of base registers |
  | represent the starting address | contents of base registers |
  | represent the starting address | contents of base registers |
  | represent the starting address | contents of base registers |
  | representation | contents of base registers | contents of base registers |
  | representation | contents of base registers | contents of base registers |
  | representation | contents of base registers | contents of base registers | contents | co
  - Store: copies data from a register to a memory location
    - sd ≡ store doubleword
- $sd(x28) 32 (x18) = Memory [32 + (x18)] \leftarrow (x28)$ offset base register

Double word = 64-bit

Let 
$$X_{28}$$
,  $O(X_{13})$ 

(contains of the second of the second

- (1) memory address = 4 + (x7) = 4 + 4 = 8
- (2) Memory [8] = (X25)



Chapter 2 — Instructions: Language of the Computer — 16

## **Memory Operand Example 1**

array called "A" of type long long int C code: 64-bit g = h + A[8];g in x20, h in x21, base address of A in x22 Compiled RISC-V code: Index 8 requires offset of 64 8 bytes per doubleword  $1d \times 9, 64(\times 22)$  # 1oad doublewordadd x20, x21, x9base register offset -

Chapter 2 — Instructions: Language of the Computer — 17

### **Memory Operand Example 2**

C code: 
$$A[12] = h + A[8];$$

h in x21, base address of A in x22

- Compiled RISC-V code:
  - Index 8 requires offset of 64
  - Index 12 requires offset of 96
    - 8 bytes per doubleword

1d 
$$x9$$
,  $64(x22)(x9) = A[8]$   
add  $x9$ ,  $x21$ ,  $x9(x9) = h + A[8]$   
sd  $x9$ ,  $96(x22)$   $A[n] = h + A[8]$ 



base registr of "A" is X22 Ld ×5, 64 (×22) add x6, x21, x5 50 ×6, 96 (422) offset = index \* DataSize 5ub x19, x20, X5

#### Registers vs. Memory

- Registers are faster to access than memory and consume less energy
- Operating on memory data requires loads and stores
  - More instructions to be executed
- Compiler must use registers for variables as much as possible to get better performance and conserve energy
  - Only spill to memory for less frequently used variables (i.e. register spilling)
  - Register optimization is important!



## Immediate Operands



Constant data specified in an instruction

section 1.6 up to slide 20

No subtract immediate instruction

Just use a negative constant addi x22, x22, -1

Example of the great idea "Make the common case fast"

Small constants are common

More than half of the RISC-V arithmetic instructions have constants as variables

Immediate operand avoids a load instruction

addi X25, X23, 7

#### **The Constant Zero**

- RISC-V dedicates register "x0" to be hardwired to the value zero
  - Cannot be overwritten

- Useful for common operations
  - E.g., negate the value in a register sub x22, x0, x21



## **Binary Integers**

- Humans use decimal system (Base = 10)
  - Digits: 0, 1, 2, ..., 9
- In computers, numbers are stored as series of high and low electronic signals → Binary system (Base = 2)
  - Digits: 0, 1
  - Digit = bit

In any number with base (r), the value of the ith digit (d)

i starts from 0 and increases from right to left

Chapter 2 — Instructions: Language of the Computer — 23



## Unsigned Binary Integers n-bir

 $\Rightarrow (r=2) \xrightarrow{\chi_{n-1} - \dots \times_2 \times_1 \times_0}$ 

n=4 (0)10=(0000)2

Given an n-bit number

$$x = x_{n-1}2^{n-1} + x_{n-2}2^{n-2} + \dots + x_12^1 + x_0 2^0$$

- Range: 0 to +2<sup>n</sup> 1
- Example
  - $0000 0000 \dots 0000 1011_{2}$   $= 0 + \dots + 1 \times 2^{3} + 0 \times 2^{2} + 1 \times 2^{1} + 1 \times 2^{0}$   $= 0 + \dots + 8 + 0 + 2 + 1 = 11_{10}$
- RISC-V uses 64 bits )
  - There are 2<sup>64</sup> combinations
  - 0 to  $(2^{64} 1) \rightarrow 0$  to +18,446,774,073,709,551,615
  - Bit 0 is the least significant bit
  - Bit 63 is the most significant bit



$$2-1 = 2^{4}-1$$
  
= 16-1=15

#### 2s-Complement Signed Integers

Given an n-bit number

iven an n-bit number 
$$x = -x_{n-1} 2^{n-1} + x_{n-2} 2^{n-2} + \dots + x_1 2^1 + x_0 2^0$$

- Xh-1 = 0 positive Range:  $(-2^{n-1})$  to  $(+2^{n-1}-1)$ Xn-1-1- negatile
- Example (32-bit www be)
  - $= 1111 1111 \dots 1111 1100_{2}$   $= -1 \times 2^{31} + 1 \times 2^{30} + \dots + 1 \times 2^{2} + 0 \times 2^{1} + 0 \times 2^{0}$  $= -2,147,483,648 + 2,147,483,644 = -4_{10}$
- Using 64 bits: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

$$N=4$$
 maximum =  $2^3-1=+7$  minimum =  $-2^3=-8$ 

#### **2s-Complement Signed Integers**

- Bit 63 is sign bit
  - 1 for negative numbers
  - 0 for non-negative numbers
- $-(-2^{n-1})$  can't be represented  $-(2^{n-1})$   $\rightarrow +(2^{n-1}-1)$ 
  - Non-negative numbers have the same unsigned and 2s-complement representation
  - Some specific numbers
    - **0**: 0000 0000 ... 0000
    - −1: (1111 1111 ... 1111
    - Most-negative: 1000 0000 ... 0000

### 2s-Complement Characteristics

- Simple <u>hardware</u> can be used for <u>both</u> signed and unsigned numbers
  - Why not always signed?
    - Some computation deals with numbers as unsigned. For example, memory addresses start at 0 and continue to the largest address. In other words, negative addresses make no sense
- Leading 0 means positive number and leading 1 means negative number
  - So, hardware treats this bit as sign bit
  - Single zero representation



- Imbalance between positive and negative numbers
- Overflow occurs when the sign bit is incorrect



## Signed Negation in 2s-Complement

- First approach: Complement and add 1
  - Complement means 1 → 0, 0 → 1

$$x + x = 1111...111_2 = -1$$

- Second approach:
  - Start from the right
  - Search for the first bit with value 1
  - Complement all bits to the left of the rightmost bit with value 1
- Example: negate +2

$$-2 = 1111 \ 1111 \dots \ 1101_2 + 1$$
$$= 1111 \ 1111 \dots \ 1110_2$$

Why 2s-Complement is called like this? (5)

$$x + (-x) = 10000 \dots .000_2 = 2^n$$
  
 $-x - 2^n - x$ 

$$-x = 2^n - x$$





Chapter 2 — instructions: Language of the Computer — 28



2's complement representation:  $-2^{n-1}$  by  $+(2^{n-1}-1)$ 25 complement operation: negationin the z'r romplement x + (-x) = 10 - - 00002's complement X + (-X) = 260101  $(-X) = 2^n - X$ 00000

Chapter 2 — Instructions: Language of the Computer — 29

\ \ \

## Sign Extension

- Representing a number using more bits
  - Preserve the numeric value
- Replicate the sign bit to the left
  - c.f. unsigned values: extend with 0s = 200 extension
- Examples: 8-bit to 16-bit
  - +2: 0000 00**1**0 => 0000 0000 0000 00**1**0
  - -2: 1111 1110 => (111 1111) 1111 11<sub>1</sub>0 = (-2)<sub>10</sub>
- In RISC-V instruction set -- 00 000 10)-(2)
  - 1b: sign-extend loaded byte
  - 1bu: zero-extend loaded byter (0600 1000) 2 (8)

#### Representing Instructions

- Instructions are encoded in binary
  - Called machine code
- RISC-V instructions
  - Encoded as 32-bit instruction words
  - Small number of formats encoding operation code (opcode), register numbers, ...
  - Regularity!

RISC-V 32-bit instruction

#### Hexadecimal

- Base 16

  Compact representation of bit strings

  - 4 bits per hex digit

| Q | 0000 | 4 | 0100 | 8 | 1000 | С | 1100 |
|---|------|---|------|---|------|---|------|
| 1 | 0001 | 5 | 0101 | 9 | 1001 | d | 1101 |
| 2 | 0010 | 6 | 0110 | а | 1010 | е | 1110 |
| 3 | 0011 | 7 | 0111 | b | 1011 | f | 1111 |

- Example: eca8 6420
  - 1110 1100 1010 1000 0110 0100 0010 0000

#### **RISC-V R-format Instructions**



- rd: destination register number
- funct3: 3-bit function code (additional opcode)
- rs1: the first source register number
- rs2: the second source register number
- funct7: 7-bit function code (additional opcode)



#### R- Lovenet

Instr [31:0] opcode: Instr [6:0] 7-bit 5-bit rd: 1 [11:7] 3-bit Functs: 1/ [14:12] 5-bit :15] (S): 19 5-bit :20] rs2: 1 7 24 :25) 7-bit funct 7: 1 /31

#### R-format Example





#### RISC-V I-format Instructions



- Design Principle 3: Good design demands good compromises
  - Different formats complicate decoding, but allow 32-bit instructions uniformly
  - Keep formats as similar as possible





6 + 2047 -2048 rd 32) (Xo) => (X10) = Memay [32 + 2 (X0) 000000 10 0000 00000 Runch

# RISC-V S-format Instructions



- Different immediate format for store instructions
  - rs1: base address register number
  - rs2: source operand register number
  - immediate: offset added to base address
    - Split so that rs1 and rs2 fields always in the same place



#### **Instructions Opcodes**

| Instruction           | Format | funct7         | rs2   | rs1 | funct3 | rd             | opcode      |
|-----------------------|--------|----------------|-------|-----|--------|----------------|-------------|
| add (add)             | R      | 0000000        | reg   | reg | 000    | reg            | 0110011 -5  |
| sub (sub)             | R      | 0100000        | reg   | reg | 000    | reg            | 0110011 = 5 |
| Instruction           | Format | immed          | liate | rs1 | funct3 | rd             | opcode      |
| add1 (add immediate)  | M      | consta         |       | reg | 000    | reg            | 0010011 =   |
| 1d (load doubleword)  |        | ollo addre     | ss    | reg | 011    | reg            | 0000011 = 3 |
| Instruction           | Format | immed<br>-iate | rs2   | rs1 | funct3 | immed<br>-iate | opcode      |
| sd (store doubleword) | S      | address        | reg   | reg | 011    | address        | 0100011     |

Opcode helps to distinguish which format should be used opened R- format

#### Instruction Representation Example

- Translate the C-statement below to RISC-V machine code:
  - A[30] = h + A[30] + 1;
  - Assume that x10 contains the base address of A and h is mapped to x21
- Solution:
  - First: translate to RISC-V assembly

    | Second of the secon
  - Second: translate to RISC-V machine code

0000



#### **Stored Program Computers**



C compiler

(machine code)

Payroll data

Book text

Source code in C for editor program

- Instructions represented in binary, just like data
- Instructions and data stored in memory
- Programs can operate on programs
  - e.g., compilers, linkers, ...
- Binary compatibility allows compiled programs to work on different computers
  - Standardized ISAs

**Processor** 

CPU

#### **Logical Operations**

Instructions for bitwise manipulation

| Operation      | C           | Java | shift INRISC-V |
|----------------|-------------|------|----------------|
| Shift left     | <b>/</b> << | <<   | sìl, sili      |
| Shift right    | <b>/</b> >> | >>>  | srl, srli      |
| Bit-by-bit AND | &           | &    | and, andi      |
| Bit-by-bit OR  |             |      | or, ori        |
| Bit-by-bit XOR | ٨           | ^    | xor, xori      |
| Bit-by-bit NOT | ~           | ~    | Xori           |



Useful for extracting and inserting groups of bits in a word



# Shift Immediate Operations



- immed: how many positions to shift 00000=(0)
- Shift left logical
  - Shift left and fill with 0 bits
  - slli by i bits multiplies by 2i
- Shift right logical
  - Shift right and fill with 0 bits
  - srli by i bits divides by 2i (unsigned only)



< 64 511i rd, rs, constant shift left logical immediate X7 = 000, 0000 0001 1106 slli x5, x7, 3 80 224 = 28 × 2 = 28 x 8 = 224 511i X23, X22, 4 slli rd, rs1, 64 => (rd) = 0

MORGAN KAUFHANN

add rd, Xo, Xo addi rd, Xo, Chapter 2 — Instructions: Language of the Computer — 45 Srlird, vs1, constant

Chapter 2 — Instructions: Language of the Computer — 46

511 rd, rs, rs2 } R-Format (152) represents the shift amount X6=0--- 0000 1000 511 X5, X7, X6 = (8)10 = slli xs, x7, 8 arithmetic shift. The inserted bits equal the sign bit of sla rd, rs1, rs2 sra

#### **AND Operations**

Biturise AND operation

- Useful to mask bits in a word
  - Select some bits, clear others to 0

- - There is an AND Immediate instruction (andi)
    1-formed
    - Sign-extension for the immediate value

hexadecimal

andi 
$$x9$$
,  $x10$ ,  $0xFFF$ 

64-67

/// ~~--- 1111 Tht Phy ///)

addi xq,x10,0x00F 0101 0000 0000 AND 0000 0006 0006 0101

# OR Operations Bitwise operation

- Useful to include bits in a word
  - Set some bits to 1, leave others unchanged

- x10
- x11
- **x9** 
  - There is an OR Immediate instruction (ori)

### XOR Operations "Bitwise

Differencing operation

$$1 \times = \overline{X}$$

$$0 \times = X$$

T- Format

Invert some bits, leave others unchanged

$$\Rightarrow$$
 \*xor x9,x10,x11 // NOT operation

x10 

x11 1111111 11111111

**x**9

$$(\times q) = (\times_{10})$$

- There is an XOR Immediate instruction (xori)
  - Sign-extension for the immediate value

xori x9, x10, 
$$0xFFF \rightarrow (X_9) = (X_{10})$$



# **Conditional Operations**

- Branch to a labeled instruction if a condition is true
- Otherwise, continue sequentially branch if equal
- beq rs1, rs2, L1

if (rs1) == (rs2) branch to instruction labeled L1

otherwise continue sequentially

branch if not earns

- \_^bne rs1, rs2, L1
  - if (rs1)!=(rs2) branch to instruction labeled L1





#### **Compiling If Statements**

C code:

- f, g, h; in x19, x20, ×21, ×22 (×23)
- Compiled RISC-V code:

```
bne x22, x23, Else beq \times 22, \times 23, L1 add \times 19, x20, x21 beq x0,x0,Exit // unconditional Else: sub x19, x20, x21 L1: add \times 19, \times 20, \times 21 Exit: ...
```

Assembler calculates addresses

i≠j

Else:

ifj

Exit:

f = q - h

(beg -, -, L1 >) \

#### **Compiling Loop Statements**



# **More Conditional Operations**

- blt rs1, rs2, L1 branchif less than
  - if (rs1)< rs2) branch to instruction labeled L1</p>
- bge rs1, rs2, L1 branch if greater than
  - if ((rs1)>= (rs2)) branch to instruction labeled L1
- Example
  - if (a > b) a += 1;
  - a in x22, b in x23

bge x23, x22, Exit

addi x22, x22, 1

Exit:

// branch if b >= a

beq 
$$\times 0$$
,  $\times 23$ ,  $\times 22$ , L1

L1: addi  $\times 22$ ,  $\times 22$ , 1

Exit: - - --

### Signed vs. Unsigned

- Signed comparison: blt, bge
- Unsigned comparison: bltu, bgeu
- Example (-)

  - x22 < x23 // signed blt x22, x23, L1= -1 < +1
  - x22 > x23 // unsigned Hu x22, x23, L1
     +4,294,967,295 > +1

#### **Bounds Check Shortcut**

Treating signed numbers as if they were unsigned gives us a low cost way of checking if 0 ≤x<y, which matches the index out-of-bounds ample

| Sold |

#### Example

Use this shortcut to reduce an index-out-of-bounds check: branch to IndexOutOfBounds if  $x20 \ge x11$  or if x20 is negative.

#### Answer

The checking code just uses unsigned greater than or equal to do both checks:

bgeu x20, x11, IndexOutOfBounds // if x20 >= x11 or x20 < 0, goto IndexOutOfBounds X20>=0 AND X20<X11

continue sequentially

#### **Compiling Loop Statements**

For Loop (C code):

for (i = 0; 
$$i < 10$$
; i++)  
save[i] = save[i] \* 2

- Case/Switch:
  - Simplest way is via sequence of conditional tests → chain of if-else statements
  - Or use branch address table (branch table)
    - Array of double words containing addresses that corresponds to labels in the code
    - The program loads the appropriate entry from the branch table into a register.
    - Then use jump-and-link register (jalr) instruction (Unconditional)

for (i=0; (<10; i++) 1-> ×20 Base address & array save (xe) = (\* 8  $(x_6) = (x_6) + (x_{10})$ (X7) = Save[i] Loop: blt x20, x5, L1 beg xo, xo, Exit L1: 5/11 X6, X20,3

(oad lett RISC-V Assembly add x20, x0, x0 addi X5, X0, 10 Loop: bge X20, X5, Exit slli x6, x20, 3 add ×6, ×6, ×10 Ld X7, 0 (x6) 511i X7, X7, 1 sd x7 ) O(x6). addi x20/ X20/1 beg xo, Xo, Loop



#### **Basic Blocks**

- A basic block is a sequence of instructions with:
  - No embedded branches (except at end)
  - No branch targets (except at beginning)



- A compiler identifies basic blocks for optimization
- An advanced processor can accelerate execution of basic blocks

#### **Procedure Calling**

- Steps required
  - 1. Place parameters in registers x10 to x17
  - 2. (Transfer control to procedure
  - 3 Acquire storage for procedure
  - 4. Perform procedure's operations
  - Place result in register for caller ⇒ <sup>x(0) x 11</sup>
  - 6. Return to place of call (address in x1)

longlong int Summation (intx)
intz)
longlong int sum;

#### **Procedure Call Instructions**

- Procedure call: jump and link jal rd, Label
   jal x1, ProcedureLabel
  - Address of following instruction (PC+4) put in x1
  - 2 Jumps to target address
    - Can also be used for unconditional branch
      - e.g., jal x0, Label (x0 cannot be changed)
- Procedure return: jump and link register jump jalr x0, 0(x1) jalr rd, offset ( $r_{s_i}$ )  $\times_{s_i}$   $\times_{s_i}$ 
  - Like jal, but jumps to 0 + address in x1
  - Use x0 as rd (x0 cannot be changed)
  - e.g., for case/switch statements

Instruction, Memory (IM)

Instructions are encoded using 32-bit PC saves the address of the instruction to be executed PC = PC +4 8-bir



$$(rd) = pc + 4$$
  
 $(rd) = (0 + 4 = 64)$   
 $(x_1) = 64$ 

# **Using More Registers: Stack**

Stack is needed to:

\$50

Stack

Push

Stack

Pop

- Spill registers during procedure execution
  - More registers are needed for execution other than the eight argument registers
- Saving return address or arguments (Non-leaf procedures)

  Define local arrays or structures inside the procedure
- Stack is part of the program memory space and it grows in the direction of decreasing addresses (i.e. from higher to lower addresses
  - Last-In First-Out (LIFO)
  - Access using stack pointer (SP)
     which is register x2
  - SP always points to the top of the stack
  - (Push) → Store in Stack (Subtract from SP)
  - (Pop) → Load from Stack (Add to SP)



SP

Stack

**Empty** 

#### Leaf Procedure Example

```
C code:
 long long int leaf_example (
    long long int g, long long int h,
    long long int i, long long int j) {
   long long int f;
   f = (g + h) - (i + j); body & procedure
   return f:
 Arguments g, h, i, j in, x10, x12, x13
                     return ((g+h)-(i+).
 f in x20
  temporaries x5, x6
  Result in x10
```

# Leaf Procedure Example

#### RISC-V code:

```
% leaf_example:
     addi sp, sp, -24
     x5,16(sp)
104
         x6.8(sp)
 108
     sd
         x20, 0(sp)
     sd
         x5, x10, x11
     add
         x6, x12, x13
     add
120
          \times 20 (x5, x6
     sub
124
          (x10, x20, 0)
     addi
128
          x20,0(sp)
132
     ٦d
         x6,8(sp)
136
          x5,16(sp)
140
     addi sp, sp, 24
    jalr x0,0(x1)
```

Save x5, x6, x20 on stack

$$x5 = g + h$$
  
 $x6 = i + j$   
 $f = x5 - x6$   
copy f to return register

Restore x5, x6, x20 from stack

Return to caller

1 M main: Z = leaf-example (5,7,3,4); // Peall addi sp, sp, -24 addi x10, x0, 5 addix, xo,5 X30, X10, 0

addi sp, sp, -8 sd x20, o(sp) body of procedure sub X10, X5, X6 (d x20,0(5p) addi SPISP18 1al/x0,0(X1) jalv Ko, o(XI)

#### **Local Data on the Stack**





#### Register Usage

- x5 x7, x28 x31: temporary registers
  - Not preserved by the callee
- $\mathbf{x}$ 8  $\mathbf{x}$ 9,  $\mathbf{x}$ 18  $\mathbf{x}$ 27: saved registers
  - If used, the callee saves and restores them
- In the previous example, the caller does not expect x5 and x6 to be preserved. Hence, we can drop two stores and two loads.
- We still must save and restore x20



#### **Non-Leaf Procedures**

- Procedures that call other procedures
- For nested call, caller needs to save on the stack:
- Its return address
- Any arguments and temporaries needed after the call
- Restore from the stack after the call

Non leaf procedures

- "Saved" registers

- "Saved" registers are

- "Saved" registers are

written in the procedure

written in the procedure

Meed to be saved in

need to be saved in

Need to be saved in

Chapter 2—Instructions: Language of the Computer — 75

#### Non-Leaf Procedure Example

```
Iong long int fact (long long int n)

if (n < 1) return 1;

else return n * fact(n - 1);

Multipliation
```

- Argument n in x10
- Result in x10

X0 must besaved in the stack



n = Factorial ofn N1 = N \* (n-1) \* (n-2) \* - - - \* 1 = N \* (n-1)!= N \* (N-1)\*(N-5) 21 - 2 \* 1 - 2 3! = 3\*2 \* 1 = 6 4 x 3! = 4x6 = 24

X30 , main procedure return 4 \* fact(3); return 3x fact(2); long long int fact (longlow int n) N=1 /1 return 1 \* fact(0); return n\* fact(n-1); reburn 1; procedure call Chapter 2 — Instructions: Language of the Computer — 78

### Non-Leaf Procedure Example



addi X5, X10, -1 bge X5, X0, Else addi X10, X0,1 jal Xo, o(X1) addi 59/59,-16 sd X,,8(59) sd X,0(59) addi x (0, x 10, -1 jal X1, fact addi x6, x10,0

Ld X1, 8(sp) Ld X10,0(sp) addi SP, SP, 16 mul X10, X10, X6 jal x , o (x,)

addi 59,59,-16 addi X10/X0, 4 104 1,5d XT , **8**(sp) } Jal X, fact ,5d × (0, o (sp) 108 addi x30/x10,01 Laddi X5, X10, -1 112 -bge K5, X0, L1 120 | addi X10, X0, 1 124 | addi spispi+16 28/3/19/V X0,0(X1) L1: addi ->addi X6, X10,0 140 X1,8(5P) ( -> Ld X10,0(5p) 148 152 addi 59,59,16 X10, X10, X6 156 mul 4x3x2x1=24 X0,0(X1) Chapter 2 — Instructions: Language of the Computer — 81

#### **Local Data on the Stack**



- Local data allocated by callee
  - Local to the procedure but don't fit in the registers (e.g. local arrays and structures)
  - Similar to C automatic variables
- Procedure frame (activation record)
  - The part of the stack containing the procedure saved registers and local variables
  - The Frame Pointer (FP), which is register x8, points to the 1<sup>st</sup> double word of the procedure frame and offers a stable base within a procedure for local memory references



### **Memory Layout**

- 1 Text: program code machine language
- Static data: global variables
  - e.g., <u>static variables</u> in C, constant arrays and <u>strings</u>
  - x3 (global pointer) initialized to address allowing ±offsets into this segment
- Dynamic data: heap
  - E.g., malloc in C, new in Java
  - De-allocation in C using free();
     otherwise, there will be memory leak or dangling pointers
  - Java uses <u>automatic garbage</u> collection

Stack: automatic storage





#### Iteration Vs. Recursion

recursive procedures Some be implemented can iteratively without recursion

Example: C code:

long long int sum (long long int n, long long int ac()

(n > 0)

return sum (n = 1 ) acc +

e]se

return acc:

n is x10, acc is x11, and result in x12

**Iteration** Sum: bge x0, x10, Exit add x11, x11, x10 addi x10, x10, -1 jal <u>x0,</u> Sum <u>z b**e**q ⊁,ו,></u> Exit:

add x12, x11, x0 jalr x0, 0(x1)

return acc;

Recursion

Hacking

Sum:

addi sp, sp, -8 sd x1, 0(sp)

bge x0, x10, Else

add x11, x11, x10 🦇

addi x10, x10, -1

jal x1, Sum

-beg x0, x0, Exit 🧲

Else:

addi x12, x11, 0

Exit:

ld x1, 0(sp) 7 addi sp, sp. 8 jalr x0, 0(x1)









long long int fact ( long long int n)

1 if (n>0)

return n\* fact (ur)

else

return 1;

return ans;

#### RISC-V Register Conventions

| Name             | Register<br>number | Usage                          | Preserved on call? |
|------------------|--------------------|--------------------------------|--------------------|
| x0               | 0                  | The constant value 0           | n.a.               |
| x1 ( <u>ra</u> ) | 1                  | Return address (link register) | X , Caspes         |
| x2 (sp)          | 2                  | Stack pointer Jalv             | Yo , a yes')       |
| x3 (qp)          | 3                  | Global pointer static dala     | yes                |
| x4 (tp)          | 4                  | Thread pointer                 | yes                |
| x5-x7            | 5–7                | Temporaries                    | no                 |
| x8-x9            | 8–9                | Saved                          | yes                |
| x10-x17          | 10–17              | Arguments/results              | no                 |
| x18-x27          | 18–27              | Saved                          | yes                |
| x28-x31          | 28–31              | Temporaries                    | no                 |

Example of "Make the Common Case Fast": 12 saved, 7 temporary, and 8 argument is sufficient most of the time



#### **Character Data**

- Byte-encoded character sets
  - ASCII: 128 characters 7-bir / character
    - ASCII: American Standard Code for Information Interchange
    - 95 graphic, 33 control
    - Size of (1000000000) in ASCII = 10 char \* 8 bits = 80 bits = 10 by hes
    - Size of (1000000000) in Binary = 32 bits
- Latin-1: 256 characters
  - ASCII, +96 more graphic characters
  - Unicode: 32-bit character set
    - Used in Java, C++ wide characters, ...
    - Most of the world's alphabets, plus symbols
    - UTF-8, UTF-16: variable-length encodings

chare cru

UTF-32

1000000 000

> Dala size= 10000



#### Byte/Halfword/Word Operations

- RISC-V byte/halfword/word load/store
  - Load byte/halfword/word: Sign extend to 64 bits in rd

Th rd, offset(rs1)



- lw rd, offset(rs1)

  Load byte/halfword/word unsigned: Zero extend to 64 bits in rd
- lbu rd, offset(rs1)
  - lhu rd, offset(rs1)
  - Twu rd, offset(rs1)
  - Store byte/halfword/word: Store rightmost 8/16/32 bits

$$|| \sum_{x \in S} \frac{7}{7} (x_0)||_{x \in S} = \frac{7}{7} + (x_0) = \frac{7}{7}$$



### **String Copy Example**

> array of characters C code:

Null-terminated string

X[i] - Y[i]

void strcpy (char x[], char y[]) a doesn't en!

while  $((x[i]=y[i()])!='\0')$ 

size-t= unsig

Base addresses of arrays x, y are in x10, x11

i is in x19

X19 is a saved registers

### **String Copy Example**

RISC-V code:

```
ASCII
strcpy:
    addi sp.sp,-8
                           // adjust stack for 1 doubleword
                           // push x19
// Ri=000 di X19, X0, 0
    x19(0)
    add x19, x0, x0
                           // x5 = addr of y[i]
L1: add x5, x19, x11
    Tbu x6.0(x5)
                           //x6 = y[i]
                           // x7 = addr of x[i] Memory - offst, (rs,)
// x[i] = v[i] address
    add x7, x19, x10
        x6,0(x7)
                           // x[i] = y[i<mark>]</mark>
    sb
                           // if y[i] == 0 then exh = 1
    beq x_6, x_0, L_2
    addi x19,x19,1
                           // i = i + 1
    jal x0, L1 ber x0, L1// next iteration of loop//sch i
L2: 1d \times 19,0(sp)
                           // restore saved x19
                           // pop 1 doubleword from stack
    addi sp,sp,8
                          // and return
\rightarrowjalr x0,0(x1)
                                                 memory = (+(VS1))
address
            PC= 0+CXI)
       menory - Mosty baseadless
address i + (X10) < X(i)
```

CXID

### 32-bit Constants (1)

- Most constants are small
  - 12-bit immediate is sufficient
- For the occasional 32-bit constant: lui rd, constant
  - Copies 20-bit constant to bits [31:12] of rd
  - Extends bit 31 to bits [63:32]
  - Clears bits [11:0] of rd to <u>0</u>
- in x19. Formet of "Inti" Constant Ind | oprobe |

lui  $\times 19$ , 976 //  $0 \times 00$ 

addi x19.x19.128 // 0x500

0000 0000 0000 0000 | 0000 0000 0000 0000 | 0000 0000 0011 1101 0000 | 0101 0000 0000

X5, 0x F832A =832A X5= OX FFFFFFF 12-6.7 lui x5, 0x F832A addi x5, x5, 0x471 X5 - [ OX FFFF FFFF F832 A 471] Chapter 2 — Instructions: Language of the Computer — 94

### 32-bit Constants (2)

Example: Write RISC-V instructions to load 0x003D0800 in x19.

$$0 \times FFFF FFFF FFFF = (-1)_{0}$$

$$0 \times 003 D0 + 1 = 0 \times 003 D1$$

lui x19, 977 // 0x003b1

0000 0000 0000 0000 | 0000 0000 0000 0000 | 0000 0000 0011 1101 0001 | 0000 0000 0000

0000 0000 0000 0000 | 0000 0000 0000 0000 | 0000 0000 0011 1101 0000 | 1000 0000 0000

0 × 0000 0000 003 Do 800

X20 the constant 0x97AB2/C3L 0x97AB2 +1 = 0x97AB3 32-bit constant

Addi X20, 0X97AB3

addi X20, X20, 0XC34

X20 = 0XFFFFFFFFFFFFFC34

(-1), 0XFFFFFFFFFFC34

X20 = 0XFFFFFFFF97AB2C34

## Branch Addressing

- Branch instructions specify
  - Opcode, two registers, target address
- Most branch targets are near branch?

S-bit

- Forward or backward
- SB format:



5-bit 3-bit

imm[12] 7-61 r

PC-relative addressing

-Target address = PC + immediate' × 2

The immediate represents the offset in halfwords

memory address of the target instruction

rs, rs,

of the branch ( of the lowers)



X- , X-,. Target address = PC of + immx2 44 40 + im mx2 01 = mm ; 60 0000 1010)2 beg Xo, Xo, (10) - 000 A Target address: 40 + 10 x 2 = 60 half words

Chapter 2 — Instructions: Language of the Computer — 99

## Unconditional Jump-and-Link Addressing

- Jump and link (jal) target uses 20-bit immediate for larger range
- UJ format: Only (jal) uses this format



- PC-relative addressing
  - Target address = PC + immediate × 2
- The immediate represents the offset in halfwords
- The unusual encoding in <u>SB</u> and <u>UJ</u> formats simplifies datapath design but complicates assembly

imm[20:1]

40 jal xo, 64 = 40 + Exit\*2 Exit: 24 = (12)10 64 Exit: - ...

52-b1+ 12-bit branch 151, 152, [immediate] PC, + immediate x 2 Target address = 64-bit more = PC + (Sign-extension) x 2 immediate) x 2 (64-bit) sign-expension 

#### **PC-relative Addressing**

- If absolute addresses were to fit in the 20-bit immediate field, then no program could be bigger than 2<sup>20</sup> bytes, which is far too small to be a realistic option today
- An alternative is to specify a register that would always be added to the branch/jal offset:
  - Allows the program to be as large as 2<sup>64</sup> bytes
  - PC contains the address of the current instruction
  - PC is the ideal choice

- The RISC-V architects wanted to support the possibility of instructions that are only 2 bytes long (i.e. the offset is in halfwords)
  - Maximum offset for branch is  $\underbrace{\pm 2K}_{halfwords} = \underbrace{\pm 4K}_{Bytes}$

Maximum offset for jal is  $\pm 512K$  halfwords =  $\pm 1M$  Bytes



#### Branch Offset in Machine Language

- Loop code from earlier example



Instructions: Language of the Computer — 104

8000 8 bre ×9, ×24,6) => Target address = 80012+ 8 0012 - 80024

# Long Jumps and Branching Far Away

- For long jumps, e.g., to 32-bit absolute address of a procedure laget address; 0x 0000 0000 3 F21
  - lui: load address[31:12] to temp register
  - jalr: add address[11:0] and jump to target

- assembler rewrites the code
  - Example



lui x5, 0x3 F21A

### RISC-V Addressing Summary



# **RISC-V Encoding Summary**

| imm | 20:13 |
|-----|-------|
|     |       |

| Name       |            | Field               |                   |         |        |               |                              | Comments                      |
|------------|------------|---------------------|-------------------|---------|--------|---------------|------------------------------|-------------------------------|
| (Field Siz | <b>æ</b> ) | 7 bits              | 5 bits            | 5 bits  | 3 bits | 5 bits        | 7 bits                       |                               |
| R-type     |            | funct7              | rs2               | rs1     | funct3 | rd            | opcode                       | Arithmetic instruction format |
| I-type     |            | immediate[11:0] rs1 |                   | funct3  | rd     | opcode        | Loads & immediate arithmetic |                               |
| S-type     |            | immed[11:5]         | rs2               | rs1     | funct3 | immed[4:0]    | opcode                       | Stores                        |
| SB-type    |            | immed[12,10:5]      | rs2               | rs1     | funct3 | immed[4:1,11] | opcode                       | Conditional branch format     |
| UJ-type    |            | imme                | ediate[20,10:1,11 | ,19:12] | 3      | rd            | opcode                       | Unconditional jump format     |
| U-type     |            | immediate[31:12]    |                   |         | \$     | rd            | opcode                       | Upper immediate format \      |
|            |            | Funct 6.            | shiffamout        | VSI     | fund   | by rd         | o prock                      | slli, svli                    |
|            |            |                     | A 10 %            |         |        |               |                              |                               |

6-bit

lui X5,0XF53B0 jmn

0xF53B0

imn rd opcod x F53 B0 00101

MORGAN KAUFMANN



### COMPUTER ORGANIZATION AND DESIGN

The Hardware/Software Interface



# **Chapter 4**

**The Processor** 

### Introduction

- CPU performance factors
  - Instruction count
    - Determined by ISA and compiler
  - CPI and Cycle time
    - Determined by CPU hardware
- We will examine two RISC-V implementations
- ∠ ① A simplified version (i.e. Single-Cycle implementation)
  - 2 Multi-Cycle implementation Appendix C
- A more <u>realistic</u> pipelined version
  - Simple subset, shows most aspects
    - Memory reference: 1d, sd
    - Arithmetic/logical: add, sub, and, or
    - Control transfer: beq



### Instruction Execution

- PC → instruction memory, fetch instruction
- Register numbers → register file, read registers
- Depending on instruction class
  - Use ALU to calculate
    - Arithmetic result R-Formt (Vs 1) operation (Vs2)
  - Memory address, for load/store Memory = ((s)) + (officially)

     Branch comparison ((s)) (vs2) => Zers Flag
  - Access data memory for load/store
    - C ← target address or PC + 4
      - Write Regult to "vo

#### **CPU Overview** Sd VS2, of Bet (V3)) Mem-y (VS1) + 597-7= (V52) (5134) Target Add Data (rs1) Register # ALU 4 Address PC Address Instruction Registers Register # Data Instruction (Ys2) memory Register # memory Data in 8-614 read vegistrs Instructionfetch Chapter 4 — The Processor — 4

# Multiplexers



### **Control**



# **Logic Design Basics**

- Information encoded in binary
  - Low voltage = 0, High voltage = 1
  - One wire per bit
  - Multi-bit data encoded on multi-wire buses
  - Combinational element >
    - Operate on data
    - Output is a function of input
- State (sequential) elements

- Store information: RF, Instructione & Data



### **Combinational Elements**

AND-gate

$$\mathbf{Y} = A \& B$$

Adder

$$\underline{Y} = A + B$$



Arithmetic/Logic Unit

Multiplexer

$$Y = S ? 11 : 10$$

$$10 \rightarrow M \\ u \\ x$$

$$S = S$$



# Sequential Elements

dock Rate = 1

- Register: stores data in a circuit
  - Uses a clock signal to determine when to update the stored value
  - Edge-triggered: update when Clk changes



# **Sequential Elements**

- Register with write control
  - Only updates on clock edge when write control input is 1
  - Used when stored value is required later





bey 15, 15 glabel 5 d 152, offsetiles

# Clocking Methodology



- read or written from storage elements while reading to
  - This is important because if a signal is to be read and written simultaneously, then the value read could correspond to the old value, new value, or a mix
- Most CPUs use edge-triggered clocking methodology ~ 'Q
  - This methodology allows us to read the contents of a state element at the beginning of the clock cycle, send the value through Combinational logic during the clock cycle, then write the output to the same state element or another state element at the end of the clock cycle

Longest Combinational logic delay determines clock period







### Building Datapath for Single-Cycle CPU

- Every instruction is executed in one cycle
- CPS = 1
- No datapath resource can be used more than once per instruction → any element needed more than
  - once must be duplicated

- Elements that process data and addresses in the CPU
  - Registers, ALUs, mux's, memories, ...
- We will build a RISC-V datapath incrementally
  - Refining the overview design



### Instruction Fetch: read instruction from instruction memory Add Read PC Increment by address 32-bil 4 for next instruction Instruction 64-bit Clle constant during the clock register Instruction read memory the value of PC

Chapter 4 — The Processor — 14

# **R-Format** Instructions

- Read two register operands
- Perform arithmetic/logical operation

Write register result



5-6+ 5-61 R-bornet imm 5- hormet [nstr [31:0]

> Instr[11:7] = "rd" number [nstr [19:15] =)"VSI" Instr [24:20]=" V52" read is harmless but write is harmful

# Register File



# Register File – Read Ports



Use two 64-bit 32-to-1 multiplexers whose control lines are the register numbers.



# Register File – Write Port



We need 5-to-32 decoder in addition to the Write signal to generate actual write signal

The register data is common to all registers



Reg

Wife

RegWrite = 1? For which instructions - addi - Load - jal - R-Pormat - jalv \_ (ul

instructions Rywrite = 0? - store - branch



### **Load/Store Instructions**

- Read register operands

  Calculate address using 12-bit offset

  Memory = (base register) + extraction (fig.4)

  Calculate address using 12-bit offset
- - Use ALU, but sign-extend offset
- Load: Read memory and update register
- Store: Write register value to memory





I muce late Generation 1) Extract immediate from Instruction instr[31: 20] > Instr [31:25], Instr [11:7] -extension Replicate Inst/[3] 52 times Chapter 4 — The Processor — 24



### **Branch Instructions**

- Read register operands RD1
- Compare operands (151) (152)
  - Use ALU, subtract and check Zero output
  - Taken Branch: Condition is true → Jump to branch target
  - Not-Taken Branch: Condition is false → Execute instruction next to branch
- Sign-extend displacement BTA: PC + sign-ext (imm)X2 Calculate target address

  - Shift left 1 place (halfword displacement)
  - Add to PC value



### **Branch Instructions**







Shiff by 1 to the left



# Composing the Elements

- First-cut data path does an instruction in one clock cycle
  - Each datapath element can only do one function at a time
  - Hence, we need separate instruction and data memories. Also we need separate Adders for (PC + 4) and BTA calculation beside the main ALU
- Use multiplexers where alternate data sources are used for different instructions

# R-Type/Load/Store Datapath





# Full Datapath of Single-Cycle CPU



### **ALU Control**

ALU used for

- Load/Store: F = add
- Branch: F = subtract (rs1) (rs2)
- R-type: F depends on opcode Ainvert, Bregate, Operation (1:0)

**ALU** control **Function** 0000 AND 0001 OR 0010 add 0110 subtract

0111



and 01



### **ALU Control**

- Assume 2-bit ALUOp derived from opcode
  - Combinational logic derives ALU control

| Instruction opcode | ALUOp | Operation        | Funct7<br>field | Funct3<br>field | Desired<br>ALU action | ALU control input |
|--------------------|-------|------------------|-----------------|-----------------|-----------------------|-------------------|
| ld                 | 00    | load doubleword  | XXXXXXX         | XXX             | add                   | 0010              |
| sd                 | 00    | store doubleword | XXXXXXX         | XXX             | add                   | 0010              |
| beq                | 01    | branch if equal  | XXXXXXX         | XXX             | subtract              | 0110              |
| R-type             | 10    | add              | 0000000         | 000             | add                   | 0010              |
| R-type             | 10    | sub              | 0100000         | 000             | subtract              | 0110              |
| R-type             | 10    | and              | 0000000         | 111             | AND                   | 0000              |
| R-type             | 10    | or 10            | 0000000         | 110             | OR                    | 0001              |



## **The Main Control Unit**

Control signals derived from instruction's opcode

| Name          |                |                 |       |        |               |        |
|---------------|----------------|-----------------|-------|--------|---------------|--------|
| (Bit position | 31:25          | 24:20           | 19:15 | 14:12  | 11:7          | 6:0    |
| _             |                |                 |       |        |               |        |
| (a) R-type    | funct7         | rs2             | rs1   | funct3 | rd            | opcode |
|               |                |                 |       |        |               |        |
| (b) I-type    | immediate      | immediate[11:0] |       | funct3 | rd            | opcode |
| _             |                |                 |       | •      |               |        |
| (c) S-type    | immed[11:5]    | rs2             | rs1   | funct3 | immed[4:0]    | opcode |
| _             |                |                 |       | •      |               |        |
| (d) SB-type   | immed[12,10:5] | rs2             | rs1   | funct3 | immed[4:1,11] | opcode |
| _             |                |                 |       |        |               |        |



# **Datapath With Control**



### R-Type Instruction\_







# Datapath With Control S d (2, offs d (151) (151) (151) (151)



### **Control Signals**

- 6 single-bit control signals plus 2-bit ALUop signal
  - Total of 8 control signals
- Asserted means that signal is logically high
- Deasserted means that signal is logically low

| Signal name    | Effect when deasserted                                                           | Effect when asserted                                                                                    |  |  |  |  |
|----------------|----------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|--|--|--|--|
| RegWrite       | None.                                                                            | The register on the Write register input is written with the value on the Write data input.             |  |  |  |  |
| ALUSrc         | The second ALU operand comes from the second register file output (Read data 2)  | The second ALU operand is the sign-extended, 12 bits of the instruction.                                |  |  |  |  |
| PCSrc<br>22eso | The PC is replaced by the output of the adder that computes the value of PC + 4. | The PC is replaced by the output of the adder that computes the branch target.                          |  |  |  |  |
| MemRead        | None.                                                                            | Data memory contents designated by the address input are put on the Read data output.                   |  |  |  |  |
| MemWrite       | None.                                                                            | Data memory contents designated by the address input are replaced by the value on the Write data input. |  |  |  |  |
| MemtoReg       | The value fed to the <u>register Write</u> data input comes from the ALU.        | The value fed to the register Write data input comes from the data memory.                              |  |  |  |  |



# Control Unit or co



- The input is the Opcode field (7 bits) from the instruction register
- The output is 8 control signals

Input or

Control signals can be set solely based on the opcode field, except
 PCSrc (i.e. PCSrc = Branch AND Zero)

|              | output  | Signal name | R-format | ld | sd | beq |
|--------------|---------|-------------|----------|----|----|-----|
|              | Inputs  | I[6]        | 0        | 0  | 0  | 1   |
|              |         | I[5]        | 1        | 0  | 1  | 1   |
|              |         | I[4]        | 1        | 0  | 0  | 0   |
|              |         | I[3]        | 0        | 0  | 0  | 0   |
|              |         | I[2]        | 0        | 0  | 0  | 0   |
| 1991         | 2       | l[1]        | 1        | 1  | 1  | 1   |
| MUJOR) MUJOS | `       | I[O]        | 1        | 1  | 1  | 1   |
| רו           | Outputs | ALUSrc -    | 0        | 1  | 1  | 0   |
| 1            |         | MemtoReg    | 0        | 1  | X  | X   |
| 00 (+)       |         | RegWrite    | 1        | 1  | 0  | 0   |
| 01(-)        |         | MemRead     | 0        | 1+ | 0  | 0   |
|              |         | MemWrite    | 0        | 0  | 1  | 0   |
| To bth       | e       | Branch      | 0        | 0  | 0  | 1   |
| -            |         | ALUOp1      | 1.*      | 0  | 0  | 0   |
|              |         | ALUOp0      | 0        | 0  | 0  | 17  |

### **Control Unit Design**



### **ALU Control**

- The ALU control has two inputs: Field of Control Aincet, Brugare,

  1. ALUOp (2 bits) from the control unit
  - 2. Funct3 and Funct7 fields from the instruction register

00

The ALU control has a 4-bit output

| Function     | Ainvert Operation [3] | Bnegate Operation [2] | Operation [1:0] |  |  |
|--------------|-----------------------|-----------------------|-----------------|--|--|
| →and         | 0                     | 0                     | 00)             |  |  |
| → or         | 0                     | 0                     | 01              |  |  |
| <b>→</b> add | 0                     | 0                     | 10              |  |  |
| <b>≯</b> sub | 0                     | 1                     | 10              |  |  |
| 21t          | 0                     | 1                     | 12              |  |  |



### **ALU Control**





### **Control Unit Design**



### Datapath With Control "addi"





# Control Unit Design "addi"





### ALU Control (New Instruction Example)

| AL     | JOp                            |       |       | Fu           | nct7 fie     | eld      |              |       | Funct3 field |                      |        |           |
|--------|--------------------------------|-------|-------|--------------|--------------|----------|--------------|-------|--------------|----------------------|--------|-----------|
| ALUOp1 | ALUOp0                         | I[31] | 1[30] | <b>I[29]</b> | <b>I[28]</b> | 1[27]    | <b>I[26]</b> | 1[25] | 1[14]        | I[13]                | 1[12]  | Operation |
| 0      | 0                              | X     | Х     | Х            | Х            | X        | Х            | X     | X            | X                    | X      | 0010      |
| 0      | 1                              | Х     | Х     | X            | Х            | Χ        | Х            | X     | Х            | X                    | Χ      | 0110      |
| 1      | 0                              | 0     | 0     | 0            | 0            | 0        | 0            | 0     | 0            | 0                    | 0      | 0010      |
| 1      | 0                              | 0     | 1     | 0            | 0            | 0        | 0            | 0     | 0            | 0                    | 0      | 0110      |
| 1      | 0                              | 0     | 0     | 0            | 0            | 0        | 0            | 0     | 1            | 1                    | 1      | 0000      |
| 1      | 0                              | 0     | 0     | 0            | 0            | 0        | 0            | 0     | 1            | 1                    | 0      | 0001      |
| T T    | 1)                             | XX    | X -   | - X          | -X           | <b>X</b> | ×            | X     | 6            |                      | للك    | 0001      |
|        | = [                            | 0 —   |       |              |              |          |              |       | Opera        | ition[3]             |        |           |
|        | I[30<br>ALUOp<br>ALUOp<br>I[14 | 01    |       |              |              |          |              |       | •            | ition[2]<br>ition[1] | _<br>} | tw        |
| A      | :1]ا<br>1[1] - ام ق            |       |       |              | )            |          | lop1_        |       | Opera        | ition[0]             |        | opeahi    |

### Performance Issues

- Longest delay determines clock period
  - Critical path: load instruction
  - Instruction memory → register file → ALU → ALU → Compare file
     description memory → register file
- Not feasible to vary period for different instructions
- Violates design principle
  - Making the common case fast
- We will improve performance by pipelining

### **Pipelining**

An implementation technique in which multiple instructions are overlapped in execution

C.f. Single-Cycle and Multi-Cycle implementations in which only a single instruction is executing at any time

Today, pipelining is nearly universal

# Pipelining Analogy 2 M + 155

- Pipelined laundry: overlapping execution
  - Parallelism improves performance





- Speedup = 8/3.5 = 2.3
- Non-stop:
  - Speedup
    - $= 2n/0.5n + 1.5 \approx 4$
    - = number of stages

pipelined laundry 14 2.3 times

Chapter 4 — The Processor — 52



D

D

### **RISC-V** Pipeline

5-stege pipeline

- Five stages, one step per stage
  - 1. IF: Instruction fetch from memory
  - 2. ID: Instruction decode & register read
  - 3. EX: Execute operation or calculate address
  - 4. MEM: Access memory operand
  - 5. WB: Write result back to register

- Stages can operate concurrently as long as separate resources are available for each stage
  - Pipeline design is based on Single-Cycle CPU design



# Pipeline Performance

Assume time for stages is

100ps for register read or write

■ 200ps for other stages ALU, Memory

Compare pipelined datapath with single-cycle CC pipeline = 200 PS datapath

| Instr    | Instr fetch | Register read | ALU op | Memory access | Register<br>write | Total time |
|----------|-------------|---------------|--------|---------------|-------------------|------------|
| Id       | 200ps       | 100 ps        | 200ps  | 200ps         | 100 ps            | 800ps      |
| sd       | 200ps       | 100 ps        | 200ps  | 200ps         | X                 | 700ps      |
| R-format | 200ps       | 100 ps        | 200ps  | $\times$      | 100 ps            | 600ps      |
| beq      | 200ps       | 100 ps        | 200ps  | ×             | X                 | 500ps      |



Local instructions => n instructions CPU time SC = 800 XN 1000 X n => Tor balanced Stages CPU time pipeline = ,5x200, + (n-1) x 200  $= 1000 + (N-1) \times 200$ = 800 + 200 XM 800×× - 800 - 4 800 + 200XX 4 dims fester than 5C Chapter 4 — The Proc

### Pipeline Speedup

- If all stages are balanced
  - i.e., all take the same time
  - Time between instructions pipelined
    - Time between instructions<sub>nonpipelined</sub>
       Number of stages
- If not balanced, speedup is less
- Speedup due to increased throughput
  - Latency (time for each instruction) does not

decrease



Non stop instructions

## Pipelining and ISA Design

- RISC-V ISA designed for pipelining
  - All instructions are 32-bits
- CISC
- Easier to fetch and decode in one cycle
- c.f. x86: 1- to 17-byte instructions
- Few and regular instruction formats
  - Can decode and read registers in one step
- Load/store addressing
  - Can calculate address in 3<sup>rd</sup> stage, access memory in 4<sup>th</sup> stage

### **Hazards**

- Situations that prevent starting the next instruction in the next cycle
- Structure hazards
  - A required resource is busy
- Data hazard
  - Need to wait for previous instruction to complete its data read/write
- Control hazard
  - Deciding on control action depends on previous instruction

### Structure Hazards: Structural Hazards

- Conflict for use of a resource
- multiple instructions trying to use the
- In RISC-V pipeline with a single memory resource
  - Load/store requires data access
  - Instruction fetch would have to stall for that cycle
    - Would cause a pipeline "bubble"
- Hence, pipelined datapaths require separate instruction/data memories
  - Or separate instruction/data caches
- RISC-V pipeline is designed carefully such that there are no <u>structure</u> hazards

### **Data Hazards**

write in cc5 read in cc3



- add rdx19, x0, x1
  sub x2, x19, x3
- Pipeline must be stalled (i.e. stopped)



ccl X5, X4

add x20, EMW

WAW: Writz After Write ] MAR: 1/4 Read Jin

In pipelining, all instructions go through the five stages to avoid structural Hazards. add Structual Hazard on RF write port

# Forwarding (aka Bypassing)

- Use result when it is computed
  - Don't wait for it to be stored in a register
  - Requires extra connections in the datapath
- Forwarding paths are valid only if the destination stage is later in time than the source stage
  - Ex: There can't be a path from the output of the memory stage to the input of the execution stage



new value of K5 is computed add (x5), x20, x21 Ld X6,0((X5)) F D E. M W
F D E M W memory address = 0 + (Xs) Which instructions can forward data & From which Stages? 1) R-type can forward from MEM or WB (2) Load can forward from WB. Destination Stage in forwarding

is always the Exstage Chapter 4—The Processor—66

### **Load-Use Data Hazard**

- Can't always avoid stalls by forwarding
  - If value not computed when needed

Can't forward backward in time! Program execution 200 1200 400 600 800 1000 1400 order Time Sub (in instructions) IF ID MEM **WB** EX `bubble<sub>r</sub> `bubble<sub>1</sub> `bubble<sub>r</sub> `bubble<sub>r</sub> `bubble<sub>r</sub> EX WB sub x4,(x1), x5 ID MEM

Chapter 4 — The Processor — 67

72(Xo) old values of K6 &X7 - Fwd - W-to- EX - rs, - Fwd \_ M-10-E-152

### Code Scheduling to Avoid Stalls

- Reorder code to avoid use of load result in the next instruction
- C code for a = b + e; c = b + f;
- Assume that a, b, c, e, f are in memory with offsets (24, 0, 32, 8, 16) from xQ.





### **Control Hazards**





- Branch determines flow of control
  - Fetching next instruction depends on branch outcome
     Pipeline can't always fetch correct instruction
  - - Still working on ID stage of branch



- In RISC-V pipeline
  - Need to compare registers and compute target early in the pipeline "Deede
  - Add hardware to do it in ID stage



### Stall on Branch

 Wait until branch outcome determined before fetching next instruction



#### **Branch Prediction**

- Longer pipelines can't readily determine branch outcome early
  - Stall penalty becomes unacceptable
- Predict outcome of branch
  - Only stall if prediction is wrong
- In RISC-V pipeline 9 ways
  - Can predict branches not taken
  - Fetch instruction after branch, with no delay

Branch predictor -> with 50% accuracy 100 branch -> 50 Cydes as lost

# Predictor Types

- Hways Taken Predictor Not Taken Predictor
- 2) Static Predictor
- Dynamic Predictor

#### More-Realistic Branch Prediction

- Static branch prediction
  - Based on typical branch behavior
  - Example: loop and if-statement branches
    - Predict backward branches taken
    - Predict forward branches not taken
- beg predic Not Talem

- Dynamic branch prediction
  - Hardware measures actual branch behavior
    - e.g., record recent history of each branch
  - Assume future behavior will continue the trend
    - When wrong, stall while re-fetching, and update history

will be executed predict as Not Take -times "not Taken" one-time Taken beg xo, xo, loop -> "predict Taken" n-times => Prediction is. prediction is wong and one Eycle strehapter 4 — The Processor — 75

" Predictas Branch

# **RISC-V Pipelined Datapath**





## Pipeline registers

- Need registers between stages
  - To hold information produced in previous cycle
- Stage registers must be wide enough to store all the data corresponding to the lines that go through them
  - IF/ID is 96-bit: 32-bit instruction and 64-bit for PC
  - ID/EX is <u>256-bit</u>: (rs1), (rs2), Sign-Extended value, and PC
  - EX/MEM is 193-bit and MEM/WB is 128-bit





## **Pipeline Operation**

- Cycle-by-cycle flow of instructions through the pipelined datapath
  - - Shows pipeline usage in a single cycle
    - Highlight resources used
  - c.f. "multi-clock-cycle" diagram
    - Graph of operation over time
- We'll look at "single-clock-cycle" diagrams for load & store



## IF for Load, Store, ...



## ID for Load, Store, ...





#### **EX** for Load

Memory address = (Vs,) TSign-ear Cimm)



#### **MEM for Load**



ld



Corrected Datapath for Load MEM/WB IF/ID Add Sun BTA Shift left 1 Read register 1 data 1 Read Read Address data 2 Data memory left to right
passes through
pipeline registers

MORGAN KAUFMANN

through pipelin regists

Chapter 4 — The Processor — 85

#### **EX for Store**





#### **MEM for Store**



#### not highlighted **WB** for Store sd Write-back IF/ID EX/MEM MEM/WB not highlighted Add Add Sum Shift left 1 MS 9ns Address Read Real register 1 data Read Reau register 2 Registers Ins ALU Instruction Read Read Address memory result] data Write data 2 Data Write data Not highlighted

# Multi-Cycle Pipeline Diagram

Form showing resource usage



# Multi-Cycle Pipeline Diagram

#### Traditional form

Program execution order (in instructions)

ld x10, 40(x1)

sub x11, x2, x3

add x12, x3, x4

ld x13, 48(x1)

add x14, x5, x6



# Single-Cycle Pipeline Diagram

State of pipeline in a given cycle (<<5)</p>



# Pipelined Control (Simplified)





# **Pipelined Control**

Control signals derived from instruction





#### **Pipelined Control** ID/EX. RegWile ID/EX. Mento Reg Instv 1 Instr3 nstrz EX/MENZS WILL Instis Menwih MEMMB. Rowsile Branch MEMMB. Manbles ALUSTE ALUON Branch Add Add Sum Shift → Address Read register 2 Instruction er ∠ Registers Read Read Address memory result data Data memory Write Instruction IDEX. RSI lmm Instruction ALU [30, 14-12] MEM MB. Rd instruction [11-7] EX/MEM·Rd IF/ID. ID/EX.Rd

cc3 | cc4 | cc5 cc6 cc7 cc8 cel ccz In ccs, E M MEM INB. Mentileg \*Ld \* add Sub the value of EX/MEM. MemPard? In CCY whatis the value of ID/EX. ManRed? In ccu answer = 0 is the value of Man Redd? In CC4 what Angwer = 0 INKCS, ID/EX. A LUSTE = ) EXIMEM. Reswrite = 1 In CCS, The Processor 🚣 95

#### **Data Hazards in ALU Instructions**

Consider this sequence:

```
sub x^2, x^3, x^3, x^3, x^3, x^3, x^3, x^3, x^3, x^4, x^2, x^3, x^3, x^4, x^2, x^3, x^3, x^4, x^2, x^3, x^3, x^4, x^2, x^3, x^4, x^4, x^2, x^2, x^4, x^4
```

- We can resolve hazards with forwarding
  - How do we detect when to forward?

# Dependencies & Forwarding



#### Detecting the Need to Forward

- Pass register numbers along pipeline
  - e.g., ID/EX.RegisterRs1 = register number for Rs1 sitting in ID/EX pipeline register
- ALU operand register numbers in EX stage are given by
  - ID/EX.RegisterRs1, ID/EX.RegisterRs2

    Data hazards when 

    New destination register Name of the destination register Name of the destination in Memory.
- Data hazards when
  - 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs1
  - 1b. EX/MEM.RegisterRd = ID/EX.RegisterRs2
  - 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs1
  - 2b. MEM/WB.RegisterRd = ID/EX.RegisterRs2





### **Detecting the Need to Forward**

- But only if forwarding instruction will write to a register!
  - EX/MEM.RegWrite, MEM/WB.RegWrite

And only if Rd for that instruction is not x0

EX/MEM.RegisterRd ≠ 0,
 MEM/WB.RegisterRd ≠ 0





# **Forwarding Conditions**

| Mux control          | Source  | Explanation                                                                    |
|----------------------|---------|--------------------------------------------------------------------------------|
| ForwardA = 00        | (D/EX)  | The <u>first ALU operand</u> comes from the <u>register file</u> .             |
| ForwardA = <u>10</u> | EX/MEM) | The <u>first ALU</u> operand is forwarded from the prior <u>ALU result.</u>    |
| ForwardA = 01        | MEM/WB  | The first ALU operand is forwarded from data memory or an earlier ALU result.  |
| ForwardB = 00        | ID/EX   | The second ALU operand comes from the register file.                           |
| ForwardB = 10        | EX/MEM  | The second ALU operand is forwarded from the prior ALU result.                 |
| ForwardB = 01        | MEM/WB  | The second ALU operand is forwarded from data memory or an earlier ALU result. |

#### **Forwarding Conditions** EX hazard if (EX/MEM.RegWrite) and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs1)) ForwardA = (10)if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs2)) orwardB = 10MEM hazard MEM JUB 10 00 VS) if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs1)) ForwardA = (01)if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs2)) ForwardB = 0 8(3) are False => Formal A = 00 1, 1 => Forward B = 00

#### **Double Data Hazard**

- Consider the sequence:
  - $\emptyset$  add(x),x1,x2
- add x1,x1,x3
- 3 add x1, x1, x4
- F D E M W M F D E M W
- Both hazards occur
  - Want to use the most recent
- Revise MEM hazard condition
  - Only fwd if EX hazard condition isn't true

### **Revised Forwarding Condition**

```
MEM hazard MEM / WB <
```

if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and not EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs1) and (MEM/WB.RegisterRd = ID/EX.RegisterRs1) ForwardA = 01

• if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and not EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs2))

and (MEM/WB.RegisterRd = ID/EX.RegisterRs2)) ForwardB = 01

# **Datapath with Forwarding**



(CCY /CCS/CC6) cc7 adog  $(x_5), o(x_0)$ sub Xq (X7), X23 in cc3: No bornarding in CC5: Fwd-fron-WB-to-VSz, Fwd-from-Mem-to in CCG: No horwarding in (C7: Fund- from MEAN Processor∙— 106

#### **Load-Use Hazard Detection**

- The Check when using instruction is decoded in D stage
- are given by
  - IF/ID.RegisterRs1, IF/ID RegisterRs2
- Load-use hazard when
  - ID/EX.MemRead and (ID/EX.RegisterRd ≠ 0) and (ID/EX.RegisterRd = IF/ID.RegisterRs1) or (ID/EX.RegisterRd = IF/ID.RegisterRs2))
- If detected, stall and insert bubble



# **How to Stall the Pipeline**

- to 0 Reswrite = Branch = Branc
  - EX, MEM and WB do nop (no-operation)
- Prevent update of PC and IF/ID register
  - Using instruction is decoded again
  - Following instruction is fetched again
  - 1-cycle stall allows MEM to read data for 1d
    - Can subsequently forward to EX stage





# **Load-Use Data Hazard**



branch Chapter 4 — The Processor — 110

# Datapath with Hazard Detection



In cc 5 => EX/MEM. Memwrite = 0 In C(5 =) Forward A = 01 Forward B = 00

# **Appendix A.5:**

# Constructing a Basic Arithmetic Logic Unit (ALU)

Dr. Gheith Abandah

[Adapted from the slides of Professor Mary Irwin (<a href="www.cse.psu.edu/~mji">www.cse.psu.edu/~mji</a>) which in turn Adapted from Computer Organization and Design,

Patterson & Hennessy]

#### **RISC-V Number Representations**

64-bit signed numbers (2's complement):

- Converting <64-bit values into 64-bit values</p>
  - copy the most significant bit (the sign bit) into the "empty" bits

• sign extend versus zero extend (1b vs. 1bu)

### **RISC-V Arithmetic Logic Unit (ALU)**

Must support the Arithmetic/Logic operations of the ISA

```
add, addi sub and, andi, or, ori, xor, xori beq, bne slt, slti, sltiu, sltu
```

- With special handling for
  - sign extend addi, andi, ori, xori
  - zero extend lbu
- □ RISC-V world is 64-bit wide → 64-bit wide ALU
- First, generate 1-bit ALU slice then replicate 64 times
- ALU is constructed from: AND, OR, inverters, MUXes



# **Review: 2's Complement Binary Representation**

Negate

$$-2^{\circ}$$
 -  $-(2^{3} - 1) =$ 

| 2'sc binary | decimal    |
|-------------|------------|
| 1000        | -8         |
| 1001        | -7         |
| 1010        | -6         |
| 1011        | -5         |
| 1100        | -4         |
| 1101        | -3         |
| 1110        | -2         |
| 1111        | -1         |
| 0000        | 0          |
| 0001        | + 1        |
| 0010        | + 2        |
| 0011        | <b>→</b> 3 |
| 0100        | + 3<br>+ 4 |
| 0101        | <b>4</b> 5 |
| 0110        | <b>+</b> 6 |
| 0111        | <b>→</b> 7 |

 $N \Rightarrow -N$   $N \Rightarrow -N + 1$ 

N: N invert

- N neg tion

□ Note: negate and

invert are different!

1011

1010

and add a 1

complement all the bits

#### **Binary Addition**



# Review: Half (2,2) Adder and Full (3,2) Adder



$$S = A \oplus B$$
 (odd parity function)

Cout = A&B (majority function)

$$S = \underline{A} \oplus \underline{B} \oplus \underline{Cin} \quad (odd parity function)$$

- ☐ Howacantwe) use it to build a 64-bit adder?
- How can we modify it easily to build an adder/subtractor?

# A 64-bit Ripple Carry Adder/Subtractor

add/sub

- Remember 2's complement is just
  - complement all the bits

control
$$(0=\text{add}, 1=\text{sub}) \longrightarrow B_0 \text{ if control} = 0,$$

$$B_0 \text{ if control} = 1$$

$$B_0 \text{ if control} = 1$$

add a 1 in the least significant bit

A 0111 
$$\rightarrow$$
 0111

B  $-$  0110  $\rightarrow$  + 1001

0001  $-$  1

 $\beta$  -  $\beta$  = A +  $\beta$  +  $\beta$  +  $\beta$  +  $\beta$  A -  $\beta$  = A +  $\beta$  +  $\beta$ 



#### Tailoring the ALU to the RISC-V ISA

- □ Need to support the logic operations (and, or, xor)
  - Bit wise operations (no carry operation involved)
  - Need a logic gate for each function, mux to choose the output
- Need to support the set-on-less-than instruction (slt)
  - Use subtraction to determine if (a − b) < 0 (implies a < b)</li>
  - Copy the sign bit into the low order bit of the result, set remaining result bits to 0
- Need to support test for equality (bne, beq)
  - Again use subtraction: (a b) = 0 implies a = b
  - Additional logic to "nor" all result bits together
- Immediates are sign extended outside the ALU with wiring (i.e., no logic needed)

#### **RISC-V ALU**

**CPE 232 MIPS Arithmetic** 







| and | × 10/ | Xv, | , X30 |
|-----|-------|-----|-------|

|             |         |           | 1  |
|-------------|---------|-----------|----|
| Function    | Binvert | Operation |    |
| ✓ and       | 0       | 00        | 0  |
| ✓ <u>or</u> | 0       | 01        | 6  |
| ✓ add       | 0       | 10        | O  |
| ✓ sub       | 1       | 10        | 0  |
| slt         | 1       | 11        | 0  |
| Nov         | 1       | 00        | 11 |

Result 4-6-1 MUX

 $A - B = A + \widehat{B} + 1$   $Nor rd) rs_1, rs_2$   $Not ((rs_1) or (rs_2))$   $not ((rs_1) and not (rs_2)$ 





## Set if less than (slt)





• If(rs1)>=(rs2)
$$\rightarrow$$
 rd = (0000000.....0000)<sub>two</sub>



- □ For ALU<sub>1</sub> to ALU<sub>63</sub>, connect the **Less** input to ground or slices most significant zero
- □ Use subtraction to determine if rs1 is less than rs2
  - rs1 -rs2 is negative  $\rightarrow$  adder/subtractor output in ALU<sub>63</sub> is 1  $_{\circ}$  • The adder/subtractor output of ALU<sub>63</sub> is called <u>Set</u> and is
  - connected to the Less input of ALU<sub>0</sub>
- □ Since we need a special ALU slice for the most significant digit to generate the Set output, we add the functionality needed to generate the overflow detection logic since it is associated with that bit too SIFy

#### **RISC-V ALU**

Most-significant bit

might aroverflow

| Function | Binvert | Operation |
|----------|---------|-----------|
| and      | 0       | 00        |
| or       | 0       | 01        |
| add      | 0       | 10        |
| sub      | 1       | 10        |
| slt      | 1       | 11        |



#### **Overflow Detection**

- Overflow: the result is too large to represent in 64 bits
- Overflow occurs when
  - adding two positives yields a negative
  - or, adding two negatives gives a positive



- or, subtract a negative from a positive gives a negative (+) (-)
- or, subtract a positive from a negative gives a positive
- □ On your own: Prove you can detect overflow by:
  - Carry into MSB XOR Carry out of MSB, ex for 4 bit signed numbers





# **Overflow Detection**

|                   |                  | P                     |                        |                    |               |
|-------------------|------------------|-----------------------|------------------------|--------------------|---------------|
| A <sub>63</sub>   | B <sub>63</sub>  | C <sub>63</sub> (Cin) | C <sub>64</sub> (Cout) |                    | Overflow (OV) |
| 0_                | <u>°</u> 0       | 0                     | 0                      | 0                  | 0             |
| <b>0</b>          | <b>(*)</b> 0     | 1                     | 0                      | <sup>-)</sup> 1    | 1 ~           |
| ( <del>4)</del> 0 | (-) <b>1</b>     | Ó                     | 0                      | 1                  | 0             |
| 0                 | 1                | 1                     | 1                      | 0                  | 0             |
| 1                 | 0                | 0                     | 0                      | 1                  | 0             |
| 1                 | 0                | 1                     | 1                      | 0                  | 0 7           |
| <b>[-</b> ]1      | (-) <b>1</b>     | 0                     | 1                      | ( <del>-t)</del> 0 | 1 ~           |
| H1                | <sup>(-)</sup> 1 | 1                     | 1                      | <b>H</b> 1         | 0             |



# RISC-VALU: Relation between Binver & Co

□ For sub, slt, and branch: Binvert = 1 and  $C_0 = 1$ 

-A -B - A +B+1

□ For add: Binvert = 0 and  $C_0 = 0$ 

□ For logical except NOR: Binvert = 0 and  $C_0 = X = 0$ 

□ For NOR: Binvert = 1 and  $C_0 = X = 1$ 

□ Based on above: Binvert and C<sub>0</sub> are merged together into

once signal called **Bnegate** 

| Ainvert | Bnegate | Operation |            | Delas       |
|---------|---------|-----------|------------|-------------|
| 0       | 0       | 00        | AND        |             |
| 0       | 0       | 01        | OR         | 41.13       |
| 0       | 0       | 10        | ADD        | , -       / |
| 0       | 1       | 10        | SUB/Branch | less (      |
| 0       | 1       | 11)       | SLT (Vs,   | ) <(52)     |
| 1       | 1       | 00        | NOR        |             |



# **RISC-V ALU**





#### **Set if less than correction**

- □ Previously we connected  $Less_0$  to Set which is only correct if there is no <u>overflow</u> " > 1↑ A 3
- □ From the truth table below, we deduce that:







**Appendix A.6:** 

Faster Addition: Carry Lookahead

Polylogical operation = 1 gate delay + Jelan of MUKes

Delay of add/sub & slt = delay of 64-bit Ripple Carry

add/sub & slt = delay of 64-bit Ripple Carry

add/sub & delay of MUKes

# Improving Addition Performance 🤰 `and , or

The ripple-carry adder is slow complex = 1 gate delays u-bit RCA = 8 gate delays sov, xnor



Delay of Whit RCA = 2 \* N

Delay of 64-bit RCA = 128 gate

July

# **Carry-Lookahead Adder**



# **Carry-Lookahead Adder**

Carry generate and carry propagate



| a <sub>i</sub> | b <sub>i</sub> | g <sub>i</sub> | p <sub>i</sub> |
|----------------|----------------|----------------|----------------|
| 0              | 0              | 0              | 0              |
| 0              | 1              | 0              | 1              |
| 1              | 0              | 0              | 1              |
| 1              | 1              | 1              | 1              |

| □ g <sub>i</sub> = a <sub>i</sub> . b <sub>i</sub> |
|----------------------------------------------------|
|----------------------------------------------------|





26

#### **Carry-Lookahead Adder**

Carry Equations: 
$$\frac{1}{2}$$
  $\frac{1}{2}$   $\frac{1}{2}$ 

### 4-bit Carry-Lookahead Adder



( Delay of 16-bit RCA = 32 gate



9-input AND gate

# Correct 16-bit CLA

HW: Write equation, of Gz, Pz, Gz, Pzg C16

$$C_{4} = G_{0} + P_{0} \cdot C_{0}$$

$$(3) G_{0}$$

$$(4)$$

$$C_{8} = G_{1} + P_{1} \cdot G_{0} + P_{1} \cdot P_{0} \cdot C_{0}$$

$$(3) G_{0}$$

$$(4)$$

$$(4)$$

$$(4)$$

$$(4)$$

$$(5)$$

$$(3) G_{0}$$

$$(1) P_{0}$$

$$(3) G_{1}$$

$$(2) P_{1}$$

$$(3) G_{0}$$

$$(3) G_{1}$$

$$(4)$$

$$(4)$$

$$(5)$$

$$(3) G_{0}$$

$$(4)$$

$$(4)$$

$$(5)$$

$$(3) G_{0}$$

$$(4)$$

$$(4)$$

$$(5)$$

$$(3) G_{0}$$

$$(4)$$

$$(4)$$

$$(4)$$

$$(5)$$

$$(5)$$

$$(3) G_{0}$$

$$(4)$$

$$(4)$$

$$(4)$$

$$(5)$$

$$(5)$$

$$(5)$$

$$(5)$$

$$(0) a_1 - (5)$$
  
 $(0) b_1 - (5)$ 

 $\begin{array}{c}
(6) & 0 & (6) \\
(6) & 0 & (6) \\
(6) & 0 & (6) \\
(6) & 0 & (6) \\
(6) & 0 & (6) \\
(7) & 0 & (8) \\
(8) & 0 & (8) \\
(8) & 0 & (8) \\
(9) & 0 & (8) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10) & 0 & (10) \\
(10)$ 

C6 = 95 + P5 94 + P5 R44

(1) P5 = D C6

(1) P5 = D

Delay of 16-bit CLA = 9 gate delays

" " RCA = 32 " "

16-bit CLA is 3.5 times Pastr then

16-bit RCA

**CPE 232 MIPS Arithmetic** 

33

Delay = 5 gah dekys 4-bit CLA => 1 = 9 gale delays 16-bit "=) For every additional "CLL" lavel, the delay of CLA is incremented by "4" Why? One need 2 gate delays to compute Gi, Pi (2) ve need 2 extra gabe delays to compute the group. carry ins ((y, (8, (12,34...)) **CPE 232 MIPS Arithmetic** 



# **Larger Carry-Lookahead Adders**

$$P_0 = p0p1p2p3$$

$$G_0 = g3 + g2p3 + g1p2p3$$
  
+  $g0p1p2p3$ 

16-bit CLA

