The Microarchitecture of the Pentium 4 Processor

Posted in System on November 17, 2006 by wsjoung

The Pentium 4 processor came out a while ago, but it is one of the most successful processors, so I believe it is still worth reading about.

What is the topic of this paper?
The microarchitecture of the Pentium 4 processor.

What is the problem being addressed by the work reported on in the paper?
Computing environments need more processing power, and many applications need to process multimedia data such as MPEG-2/4 streaming video or audio.

Why is this problem important?
General architectural improvements are not enough to keep up with the exponentially growing demand for computing power caused by multimedia data processing. A successful architecture specialized for multimedia could also be a good opportunity in the home computing market.

What is the research’s basic approach?
NetBurst microarchitecture
Intel introduced the NetBurst microarchitecture to balance and tune the many microarchitecture features that compete for processor die cost and for design and validation effort in a fast processor.
The NetBurst microarchitecture is composed of four sections.

– Front End
In the Front End, instructions are decoded into uops and stored in the trace cache, and uops are normally delivered for execution straight from the trace cache. Only on a trace cache miss are instructions fetched from the L2 cache and decoded. The front end is also guided by a branch predictor that directs where instruction fetching should go next in the trace cache, which reduces the misprediction rate. In addition, a hardware prefetcher keeps the L2 cache about 256 bytes ahead of the current data access locations.
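
To make the hit/miss behavior concrete, here is a minimal toy sketch of a decoded-uop cache. It is only an illustration under my own assumptions: the real trace cache stores traces of uops rather than one entry per address, and the names `ToyTraceCache`, `toy_decode`, and `run_loop`, the capacity, and the LRU replacement policy are all hypothetical and not taken from the paper.

```python
from collections import OrderedDict


class ToyTraceCache:
    """Toy model: caches already-decoded uops keyed by instruction address."""

    def __init__(self, capacity, decode):
        self.capacity = capacity      # max cached entries (hypothetical size)
        self.decode = decode          # slow path: fetch from L2 and decode
        self.entries = OrderedDict()  # address -> uops, oldest entry first
        self.hits = 0
        self.misses = 0

    def fetch(self, address):
        if address in self.entries:
            # Hit: deliver uops without touching the decoder.
            self.hits += 1
            self.entries.move_to_end(address)
            return self.entries[address]
        # Miss: decode once, then keep the uops for next time.
        self.misses += 1
        uops = self.decode(address)
        self.entries[address] = uops
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used (assumed policy)
        return uops


def toy_decode(address):
    # Hypothetical decoder: pretend every instruction becomes two uops.
    return (f"uop@{address:#x}.0", f"uop@{address:#x}.1")


def run_loop(cache, addresses, iterations):
    """Fetch the same loop body repeatedly and return (hits, misses)."""
    for _ in range(iterations):
        for address in addresses:
            cache.fetch(address)
    return cache.hits, cache.misses
```

For example, `run_loop(ToyTraceCache(8, toy_decode), [0x100, 0x104, 0x108], 10)` decodes each of the three instructions only once, on the first iteration, and serves the remaining fetches from the cache. That is the point of the design: hot loops skip the decoder.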

– Out-of-Order Execution Logic
This part of the machine reorders instructions so that they can execute as soon as their input operands are ready. It does this with allocation buffers, register renaming, and uop scheduling functions.
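
The idea of issuing uops as soon as their operands are ready can be sketched with a small toy scheduler. This is not how the Pentium 4 scheduler is built. It assumes registers are already renamed (every uop writes a unique destination), and the function name `schedule`, the uop tuple layout, and the latencies are hypothetical.

```python
def schedule(uops, ready_regs, issue_width=1):
    """Return {uop name: cycle issued} for a toy dataflow scheduler.

    Each uop is (name, dest, sources, latency). Assumes registers are
    already renamed, so every dest is unique (hypothetical simplification).
    """
    # Every source must come from ready_regs or an earlier uop,
    # otherwise the loop below could never finish.
    produced = set(ready_regs)
    for name, dest, sources, latency in uops:
        missing = [src for src in sources if src not in produced]
        if missing:
            raise ValueError(f"{name} reads unproduced registers: {missing}")
        produced.add(dest)

    ready_at = {reg: 0 for reg in ready_regs}  # register -> cycle its value is available
    waiting = list(uops)                        # kept in program order
    issued = {}
    cycle = 0
    while waiting:
        slots = issue_width
        for uop in list(waiting):
            if slots == 0:
                break
            name, dest, sources, latency = uop
            # Issue the oldest uop whose inputs are all available this cycle.
            if all(ready_at.get(src, float("inf")) <= cycle for src in sources):
                issued[name] = cycle
                ready_at[dest] = cycle + latency
                waiting.remove(uop)
                slots -= 1
        cycle += 1
    return issued


program = [
    ("load", "r1", ["r0"], 3),  # long-latency load
    ("add", "r2", ["r1"], 1),   # must wait for the load
    ("mul", "r3", ["r0"], 1),   # independent of the load
]
```

With `schedule(program, ["r0"])`, the `mul` issues in cycle 1 while the `add` is still waiting for the load result, even though `mul` comes later in program order.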

– Integer and Floating-Point Execution Units
The execution units are where instructions are actually executed, and there are several different kinds. There are separate 128-entry register files for the integer and the floating-point/SSE (Streaming SIMD Extensions) operations. Each register file also has a multi-clock bypass network that bypasses, or forwards, just-completed results to dependent uops.

– Memory Subsystem
The Memory Subsystem includes the L2 cache and the system bus. The L2 cache backs the L1 data cache and the trace cache, and the system bus is the I/O channel to main memory. The memory subsystem has high bandwidth (48 GB/s for the L2 cache when running at 1.5 GHz, and 3.2 GB/s for the system bus), which is critical for processing streaming data from memory.
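
The two bandwidth figures can be reproduced with simple arithmetic. The interface widths below are not stated in this post. As far as I recall from the Intel paper, the L2 interface is 256 bits (32 bytes) wide and transfers once per core clock, and the system bus is 64 bits (8 bytes) wide on a 100 MHz clock that is quad-pumped, so treat those inputs as assumptions. The function name is just for illustration.

```python
def bandwidth_gbytes_per_sec(bytes_per_transfer, transfers_per_sec):
    """Peak bandwidth in GB/s (10**9 bytes per second)."""
    return bytes_per_transfer * transfers_per_sec / 1e9


# L2 cache: assumed 32-byte interface, one transfer per core clock at 1.5 GHz.
l2_bandwidth = bandwidth_gbytes_per_sec(32, 1.5e9)

# System bus: assumed 8-byte data bus, 100 MHz clock with 4 transfers per clock.
bus_bandwidth = bandwidth_gbytes_per_sec(8, 4 * 100e6)

print(f"L2 cache:   {l2_bandwidth:.1f} GB/s")
print(f"System bus: {bus_bandwidth:.1f} GB/s")
```

Under those assumptions the numbers come out to the 48 GB/s and 3.2 GB/s quoted above, which also shows that the L2 cache has roughly 15 times the peak bandwidth of the system bus.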

Is this approach successful?
Yes. The performance comparison graph showed 15-20% higher integer performance and 30-70% higher floating-point and multimedia performance compared with the Pentium III processor.

Where should this research go next?
Intel should move to a 64-bit architecture next, with better caching algorithms.
For a better understanding, check these links.