Neil Franklin's Recording of Lecure from Hennessy

My own additions/comments for non-EEs (in particular my office mates I recorded and am writing this for) are in [italic inside square brackets].

Recording of Lecture of Prof John L. Hennessy, President, Stanford University

Directions and Challenges in High Performance Microprocessors

At 2003.03.27 16:15 in the ML E12 building at ETH Zuerich.

Intro by Speaker

He is here in Switzerland because getting an honorary doctorate from EPF Lausanne. Is a personal friend of the speaker. Was invited to do a detour to ETH Zuerich.

Actual Lecture

Microprocessors have been used and been growing in speed for 32 years (2^5, [citation:] "far more important to us engineers than 30 or 35").

Power has increased at about factor of 1.6/year. [32 years makes 3.4mio, makes a 1971 4004 equal to an 2003 3.4GHz Pentium 4 clocked down to 1kHz. That looks about right, with 4004 needing 8 750kHz clocks per simple 4bit instruction]

Clock speed has grown linear [3400/0.75=4500] while chip density has gone up logarithmic and more [55'000'000/2000=26'500]. Power increse is due 1/2 to clock speed increase and 1/2 to architecture improvements [which are enabled by using the density].

In 1990 everyone was talking about the memory wall [access speed of large DRAMs]. Since then ever refined caching techniques have solved this problem, speed is mainly dominated by processor internal improvements. [Memory still is limiting, it grows, but still limits]

Microprocessor architecture improvements:

1985 pipelining [roughly 386 generation]
1990 statical multiple issue [i860, IBM POWER (= original RS/6000)]
1995 scheduled dynamical multiple issue [Pentium class machines]

[Pipelining is when one instruction is started before the last is finished. Multiple issue (MI) is fetch/start multiple instructions in one clock cycle. In statical MI the instructions says which combination should run, in dynamical MI the processor figures out what it can run]

Trend is towards relaxing limits on what can be achieved in parallel, and so getting more chances at using parallelism.

Scheduling is nothing new, was already done in the IBM 360/91 [around 1970]. But it is very difficult to design. There were only 8 360/91s shipped, all with bugs in them, mainly interrupt synchronisation special cases.

Today with microprocessors we can implement such techniques. We have high level design methods which assure what is right at the top stays right after detailed design. Also we can afford 100-man design teams for microprocessors.

IA64 using VLIW is a step back to fixed/static issuing. Because supposedly simpler and so faster. But is less parallel and slower. [citation:] "is a bunch of nonsense" [static has tradition in Intel, see the i860, vs IBMs RS/6000 Power, i860 difficult to use, failled].

Organising issuing:

software           vs          hardware

fixed to one implementaiton    adapts flexibly
longer range of analyis        more lasting performance
simpler hardware               high complexity

Peak vs sustained power was 1.3 in 1984, is 3.0 in 2003, getting worse. Peak is what costs hardware, [sustained is what you get,] bad ratio here is what wastes/loses getting power from it.

Speculative execution allows to get results from not yet decided input. Later select the proper or recalculate.

But speculative costs time correcting and chip space and power for unused stuff. In 1990s speculative was regarded as always worth it, today only worth it if quite like to be used used.

Fundamental limit here will be power consumption [implies heat generation, getting that out of the chip is the real limit today].

Speed up needs parallel execution of code. Limit is given by parallelisablility.

ILP (Instruction Level Parallelity) is limited by storing intermediate steps, by limit in size of buffering. Grows massively if balancing load over many clock cycles.

Also registers get large and slow. Register ports 2 reads and 1 write per parallel running instruction, so 8 parallel is 16 reads and 8 writes at once [single port is 4 transistors storage and 2 for port = 6, 16+8 port gives 4 + (16+8)*2 = 52, so 9 times more].

Also add in the larger sets of registers needed for register rewriting [temporarily storing speculative calculation results until it has been selected which is the proper one, than use its register for further work].

Also scheduling itsself. Try and find all data dependancies in 8 instructions, in 1 clock cycle. And there are 8 more coming in the next cycle!

ILP theoretically can go up to about 150 in FP (Floating Point) code. Even gcc reaches 56 [that is compiling with gcc, not running the compiled code].

But discovering it is hard, as the parallel sequences are sometimes 100'000 instructions appart!

With reduced window size, discoverable ILP drops fast. Even getting 20 from gcc will be great.

Pentiums since Pentium Pro are really RISC machines, with an i86 translater before them [same also AMD since K5, earlier, then by Nexgen, AMD bought them up, made K6 from it].

Pentium Pro issues up to 3 RISC instructions per i86 instruction [think of i86->RISC like algorithm dependant video decompression in hardware].

i86 that will not fit in this, are run by an separate state machine [ Intel documents this as an microcode ROM], that generates larger sequences. Compilers should avoid them.

All the pipelining and optimising is done then at RISC level.

DEC tried to do the same thing at complex instruction level and failled [I assume this to be a referrence to VAX which failled the speed race, so DEC went over to Alpha].

Pipelining steps: from ordered opcodes, scramble to fit free execution units, work, reassemble in order for storing to registers and allowing further processing.

Due to non optimal hardware usage, processor spends some time waiting.

One large FP DIV instruction blocks up all the rest for a long time.

FP code calculations near 0% wait, but large data fetch waits. gcc is 35% branch waits, perl also 25% [were the 2 worst on chart of 6 or 7].

x86->RISC translation is 5% of waits in FP and 10% in gcc and perl [that is the maximal gain from switching from x86 to real RISC such as Alpha, and RISC will increase code fetch load and waits, due to larger code size, /bin of NetBSD/i86 -> NetBSD/MIPS is factor 1.7].

Pentium 4: 55mio transistors, 131mm^2, 82W
IBM Power 4 (2 proc on 1 chip): 184mio transistors, 267mm^2, 42W
Itanium: 221mio transistors, 421mm^2, 130W

Actual speed (SPEC): FP sometimes Itanium, sometimes Power 4 leads. Integer Pentium 4 beats both solidly [factor 2 visible on chart].

Efficiency: power/transistors and power/mm^2 Pentium 4 wins, power/W Power 4 wins.

Speculation costs transistors and uses up power on generating unused results.

More ILP costs ever more speculation, over-linear number of transistors. Together with clock more power usage gives an ever steeper ILP mountain. Parallelling it is an also ever steeper cache mountain.

In the past: easy path adding features
Short term: optimise combination of features
Future: more efficient parallelism than ILP. Not square growth from accelerating 1 instruction stream, but linear from running n streams

Can be natural in case of web server, one thread per user. [side remark:] "Apache is a cache killer".

Switch to explicite parallelism, treaded applications.

This is strongly dependant on the application. While natural for webserver, needs special code or compiler for others.

FP can be done efficiently, but FP can not drive the market today, too small volume.

Software for parallelism must become easier to make/use for small scale parallelism. Research in 1990s went into 300-1000 massively parallel processor systems. All research grants only for this spectacular stuff. Only small amount of these made, was a waste of intellectual capacity. Needs to be refocused on small scale 2/4/8/10 [yes, he did say 10, not 16 :-)] processor systems.

Q&A session

Q: [not recorded :-)]

A: Processor development is very expensive, $100 mio for Itanium. Finance comes today from volume, new stuff is driven by PCs and some embedded [embedded today means microcontrollers or ARM/PPC/MIPS or DSPs]. Does not believe that Sun will be able to keep its own architecture.

Q: [not recorded, but remembered: What about PC graphics, often larger than processor itsself]

A: Big wins are possible with exploiting special algorithms, such as PC graphics. But this only lasts so long a sustained market for selling ever more power in that algorithm. If general processors can catch up, the specialist dies. This may happen to DSPs as general embedded are getting very fast, what market for the DSPs?

Q: [not recorded]

A: Higher speeds are also shortening time left to do the work. Pentium 4 has 4*5 pipeline stages for work and 2 just for signals from one part of chip to other, without any work being done. Most critical is getting clock to all parts without too much skew [skew is the time scatter of clock arriving at different register bits. First ones change data and feed that onwards before others store theirs, if too larger delay, the late ones may already set to next/mixed result]

Low skew requires high power drivers to force it through. In Alpha 1/3 of all power goes into clock distribution. In Pentium 4 possibly even more than that [this is the reason for asynchronous logic reappearing, but only research at moment, no commercial significance yet].

Home | Informative Texts | Texts

This page is by Neil Franklin, last modification 2003.03.28