### IBM PowerPC 970 (a.k.a. G5)



Ref 1

David Benham and Yu-Chung Chen UIC – Department of Computer Science CS 466

## PPC 970FX overview

- 64-bit RISC
- 58 million transistors
- 512 KB of L2 cache and 96KB of L1 cache
- 90 nm process with a die size of 65 sq. mm
- Native 32-bit compatibility
- Maximum clock speed of 2.7 GHz
- SIMD instruction set (AltiVec)
- 42 watts @ 1.8 GHz (1.3 volts)
- Peak data bandwidth of 6.4 GB per second

## A picture is worth 2^10 words (approx.)

#### PowerPC 970FX



Ref 2

# A little history

- PowerPC processor line is a product of the AIM alliance formed in 1991. (Apple, IBM, and Motorola)
- PPC 601 (G1) 1993
- PPC 603 (G2) 1995
- PPC 750 (G3) 1997
- PPC 7400 (G4) 1999
- PPC 970 (G5) 2002
- AIM alliance dissolved in 2005

#### Processor



Ref 3



## Core details

- Pipeline depth ranges from 16 stages (integer) to 25 stages (vector)
- Large number of "in flight" instructions (in various stages of execution), with a theoretical limit of 215
- 512 KB L2 cache
- 96 KB L1 cache
  - 64 KB I-Cache
  - 32 KB D-Cache

## Core details continued

#### 10 execution units

- 2 load/store operations
- 2 fixed-point register-register operations
- 2 floating-point operations
- 1 branch operation
- 1 condition register operation
- 1 vector permute operation
- 1 vector ALU operation

Registers: 32 64-bit general-purpose registers, 32 64-bit floating-point registers, and 32 128-bit vector registers
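The register widths listed above determine how many data elements one vector instruction can touch. The following is a minimal sketch in C, not code from the IBM manual: the `vreg128` union and both helper functions are hypothetical names introduced only for illustration, and the byte count covers the architected registers named on this slide only.

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit vector register,
   viewed at the element widths a SIMD unit typically works with. */
typedef union {
    uint8_t  u8[16];   /* 16 x 8-bit elements  */
    uint16_t u16[8];   /*  8 x 16-bit elements */
    uint32_t u32[4];   /*  4 x 32-bit elements */
    float    f32[4];   /*  4 x single-precision floats */
} vreg128;

/* Number of elements of the given width that fit in one 128-bit register. */
int lanes_per_register(int element_bits) {
    return 128 / element_bits;
}

/* Bytes of architected register storage listed on the slide:
   32 GPRs and 32 FPRs at 8 bytes each, plus 32 vector registers at 16 bytes each. */
int register_file_bytes(void) {
    return 32 * 8 + 32 * 8 + 32 * 16;
}
```

For example, `lanes_per_register(32)` gives 4, which is why a single vector instruction can process four single-precision floats at once.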

## Pipeline



Ref 4

## Benchmarks

- SPEC2000
- BLAST Bioinformatics
- Amber / jac Structure biology
- CFD lab code

# SPEC CPU2000

- IBM eServer BladeCenter JS20 (PPC 970, 2.2 GHz)
  - SPECint2000: Base 986, Peak 1040
  - SPECfp2000: Base 1178, Peak 1241
- Dell PowerEdge 1750 (Xeon, 3.06 GHz)
  - SPECint2000: Base 1031, Peak 1067
  - SPECfp2000: Base 1030, Peak 1044

| System                    | Score |
|---------------------------|-------|
| **SPECfp_rate_base2000**  |       |
| Dual 2GHz PowerPC G5      | 15.7  |
| Dual 3.06GHz Xeon         | 11.1  |
| Single 3GHz Pentium 4     | 8.07  |
| **SPECint_rate_base2000** |       |
| Dual 2GHz PowerPC G5      |       |
| Dual 3.06GHz Xeon         | 16.7  |
| Single 3GHz Pentium 4     | 10.3  |

#### Apple's SPEC Results\*2

## BLAST





## BBSv3





## Amber/jac

#### **PME** Simulation

"jac": the Joint Amber/CHARMM DHFR benchmark. This is the protein DHFR, solvated with TIP3 water, in a periodic box. There are 23,558 total atoms, and PME is used with a direct space cutoff of 9 Å. This is the benchmark in the benchmarks/jac subdirectory of the Amber 7 distribution.

| node-name   | CPU             | OS      | compiler        | np | cu time-per-step |
|-------------|-----------------|---------|-----------------|----|------------------|
| PSSC Labs   | Dual 2G Opteron | Linux   | PathScale 1.0b3 | 1  | 0.700            |
| PSSC Labs   | Dual 2G Opteron | Linux   | PathScale 1.0b3 | 2  | 0.360            |
| PSSC Labs   | Dual 2G Opteron | Linux   | PGI 1.2.5       | 1  | 0.780            |
| PSSC Labs   | Dual 2G Opteron | Linux   | Intel F90 7.1   | 1  | 0.720            |
| Luo Lab     | Dual 2.8G Xeon  | Linux   | Intel F90 7.1   | 1  | 0.934            |
| Luo Lab     | Dual 2.8G Xeon  | Linux   | Intel F90 7.1   | 2  | 0.675            |
| Luo Lab     | Dual 2.8G Xeon  | Linux   | Intel F90 7.1   | 4  | 0.576 (HTT)      |
| Prof. E. M. | Dual 2G G5      | MacOS X | XLF 8.1b        | 1  | 0.777            |

Ref. 7
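One way to read the table is to compare the np = 1 and np = 2 rows for the same machine and compiler. The sketch below assumes that lower time-per-step is better and that rows sharing a node and compiler are directly comparable. The function names are hypothetical and were chosen for this illustration.

```c
/* Speedup of an np-process run over the single-process run:
   t1 is the time-per-step at np = 1, tn the time-per-step at np = n. */
double parallel_speedup(double t1, double tn) {
    return t1 / tn;
}

/* Parallel efficiency: the fraction of ideal (linear) scaling achieved. */
double parallel_efficiency(double t1, double tn, int np) {
    return parallel_speedup(t1, tn) / np;
}
```

With the table's values, the Opteron/PathScale pair (0.700 and 0.360) gives a speedup of about 1.94, while the Xeon pair (0.934 and 0.675) gives about 1.38.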

#### CFD code Prof. Sean Garrick, Dept. ME., Univ. of Minnesota

# Large (500MB)



## VMX

- PPC 970 = simplified Power4 + VMX
- a.k.a. Velocity Engine (Apple), AltiVec (Motorola)
- A vector processing add-on to the PowerPC RISC instruction set
- Single Instruction Multiple Data (SIMD)



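# Single Instruction Multiple Data (SIMD)

- SIMD vs. Instruction Level Parallelism (ILP)
- SIMD is parallel in data: one instruction operates on several data elements at once
- ILP is parallel in instructions: several independent instructions execute at once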
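To make the data-parallel idea concrete, here is a minimal sketch in portable C. It illustrates the concept and is not real AltiVec code: the `vec4` type and the `vec4_add` helper are hypothetical stand-ins for a 128-bit vector register holding four floats and for a single vector add instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a 128-bit vector register: four 32-bit floats. */
typedef struct { float lane[4]; } vec4;

/* Models one SIMD add: a single operation applied to all four lanes.
   On real vector hardware this would be one instruction. */
vec4 vec4_add(vec4 x, vec4 y) {
    vec4 r;
    for (int l = 0; l < 4; l++)
        r.lane[l] = x.lane[l] + y.lane[l];
    return r;
}

/* Scalar version: one add per loop iteration, n iterations in total. */
void add_scalar(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* SIMD-style version: one modelled vector add per four elements,
   so n/4 iterations. Assumes n is a multiple of 4. */
void add_simd(const float *a, const float *b, float *out, size_t n) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        vec4 x, y;
        for (int l = 0; l < 4; l++) {   /* load four elements from each input */
            x.lane[l] = a[i + l];
            y.lane[l] = b[i + l];
        }
        vec4 r = vec4_add(x, y);        /* one vector operation */
        for (int l = 0; l < 4; l++)     /* store four results */
            out[i + l] = r.lane[l];
    }
}
```

Both functions produce the same results. The difference is that `add_simd` issues a quarter as many add operations, which is the sense in which SIMD is "parallel in data".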






#### IBM PowerPC 970FX RISC Microprocessor





## Simple Code example

- 400 MHz G4, vector size: 1000
- Matrix addition (88 vs. 345 MFLOPS, 3.9X)
- Matrix rotation (300 vs. 472 MFLOPS, 1.6X)
- Matrix multiplication (84 vs. 384 MFLOPS, 4.6X)
- Apple's vector multiplication algorithm: up to 8X!
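The "X" factors quoted above are ratios of the two MFLOPS figures in each bullet. The sketch below assumes the first figure in each pair is the scalar result and the second is the vector result, which the slide does not state explicitly. The `speedup` function is a hypothetical helper.

```c
/* Speedup as the ratio of vector throughput to scalar throughput,
   both measured in MFLOPS. */
double speedup(double scalar_mflops, double vector_mflops) {
    return vector_mflops / scalar_mflops;
}
```

For instance, `speedup(88.0, 345.0)` is about 3.9 for matrix addition, and `speedup(84.0, 384.0)` is about 4.6 for matrix multiplication.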

# Applications

- Good for math, science and graphics manipulation
- Scientific array processing systems
- Multi-channel modems, echo cancelers, image and video processing systems
- Internet routers

## "Power Everywhere"

- POWER: Performance Optimization With Enhanced RISC
- 6 of the top 10 systems on the current TOP500 list are IBM RISC machines
- P.S. The current No. 5 is based on the PPC 970
- From high performance (supercomputers) to low power (embedded systems)

## "Power Everywhere"

- Power5
- Cell processor
- PowerPC 970FX
- IP telephony, Internet modems, routers, game consoles
- Mars rovers: 32-bit RISC running VxWorks
- Expect more to come: Power.org



## References

- 1) http://www.macvillage.de/pages/x\_magazin/ibmchips/Power970FX.jpg
- 2) www.anandtech.com/mac/showdoc.aspx?i=2436&p=2
- 3) IBM PowerPC 970FX RISC Microprocessor User's Manual
- 4) http://perso.wanadoo.fr/kakace/PowerPC/PPC970.html
- 5) http://www.spec.org/cpu2000/results/
- 6) http://www.apple.com/
- 7) http://apple.sysbio.info/~mjhsieh/archives/000295.html
- 8) http://www.xlr8yourmac.com/G5/G5\_fluid\_dynamics\_bench/G5\_fluid\_dynamics\_bench.html
- 9) http://arstechnica.com/articles/paedia/cpu/simd.ars
- 10) IBM PowerPC 970FX RISC Microprocessor User's Manual

#### Backup Slides









## **Execution Units**

- Vector Permute Unit: 1-stage execution
  - Manipulates vector elements quickly
- Vector ALU
  - Floating-point: 7-stage execution
  - Simple fixed-point: 1-stage execution
  - Complex fixed-point: 4-stage execution

## Matrix Multiplication



$$c_{ij} = a_{i1}b_{1j} + a_{i2}b_{2j} + \dots + a_{ik}b_{kj} + \dots + a_{in}b_{nj}$$

### In Scalar

    /* c[i][j] is assumed to start at zero; indices are 1-based as in the formula */
    for (i = 1; i <= n; i++)
        for (j = 1; j <= n; j++)
            for (k = 1; k <= n; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];

### In Vector

```
/* vector_madd(x, y, z) stands for a vector multiply-add: x*y + z on each element */
for (i=1; i<=n; i++) {
    for (j=1; j<=n; j++) {
        for (k=1; k<=n/4; k++) {
            c[j+(i-1)*n] = vector_madd(a[k+(i-1)*n],
                                       b[j+(k-1)*n],
                                       c[j+(i-1)*n]);
        }
    }
}
```
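The slide code is pseudocode, so a runnable approximation may help. The sketch below is one possible way to organize a four-wide multiply-add matrix product in portable C, and is not necessarily the scheme the slide intends. The `float4` type and the `float4_madd` and `matmul_float4` functions are hypothetical names that model a 128-bit register and a vector multiply-add. It assumes row-major storage, 0-based indices, and `n` divisible by 4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a 128-bit register holding four floats. */
typedef struct { float lane[4]; } float4;

/* Models a vector multiply-add: r = x * y + z, lane by lane. */
float4 float4_madd(float4 x, float4 y, float4 z) {
    float4 r;
    for (int l = 0; l < 4; l++)
        r.lane[l] = x.lane[l] * y.lane[l] + z.lane[l];
    return r;
}

/* c = a * b for n x n row-major matrices; n must be a multiple of 4.
   Four adjacent columns of c are accumulated together in one float4. */
void matmul_float4(const float *a, const float *b, float *c, size_t n) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j += 4) {
            float4 acc = {{0.0f, 0.0f, 0.0f, 0.0f}};
            for (size_t k = 0; k < n; k++) {
                float4 x, y;
                for (int l = 0; l < 4; l++) {
                    x.lane[l] = a[i * n + k];       /* a[i][k] copied to all lanes */
                    y.lane[l] = b[k * n + j + l];   /* b[k][j..j+3] */
                }
                acc = float4_madd(x, y, acc);       /* one modelled vector operation */
            }
            for (int l = 0; l < 4; l++)
                c[i * n + j + l] = acc.lane[l];     /* store c[i][j..j+3] */
        }
    }
}
```

Vectorizing across `j` rather than `k` keeps each lane's running sum independent, so no cross-lane reduction is needed at the end of the inner loop.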

## References Continued

- http://ascii24.com/news/i/topi/article/2005/07/07/656844-000.html





