Archive for Architecture

Tomasulo’s Algorithm

Posted in Algorithm on November 17, 2006 by wsjoung

An Efficient Algorithm for Exploiting Multiple Arithmetic Units, R. M. Tomasulo, IBM Journal, January 1967

The instructions and their execution plan look like this.

div Rg, Rb, Rc
add Rh, Rg, Rd (data hazard)
mul Rg, Re, Rf
add Ra, Rh, Rg (data hazard)

[Figure: execution plan for the four instructions]

We then need reservation stations (RS) for the 2nd and 4th instructions, because those instructions must wait for earlier results. The RS may look like this. The source operands of the 1st instruction are ready, so its result is tagged T1 for register Rg; the tag T1 is written into the register and the busy bit stays set until the result comes out. The first operand of the 2nd instruction, however, is not ready when that instruction is decoded: it has to wait for Rg, which the previous instruction tagged T1. We can tell this by looking up the register. If its busy bit is set, the register value is not available, so the instruction waits for the tagged result instead.

Ready  Tag/Contents  Ready  Tag/Contents  Result tag
Y      Rb            Y      Rc            T1
N      T1            Y      Rd            T2
Y      Re            Y      Rf            T3
N      T2            N      T3            T4
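
The tagging described above can be sketched in a few lines of Python. This is a minimal illustration and not Tomasulo's hardware design: the names `Operand`, `Entry`, `read`, `issue`, `format_row` and `PROGRAM` are all hypothetical, tags are assumed to be handed out as T1, T2, ... in program order, and every register is assumed to be free before the first instruction issues.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    ready: bool   # True if the value is available at issue time
    source: str   # register name when ready, otherwise the tag to wait for


@dataclass
class Entry:
    op: str
    left: Operand
    right: Operand
    tag: str      # tag under which this instruction's result is broadcast


def read(reg, pending):
    """A busy register yields the tag it waits on; a free one yields itself."""
    if reg in pending:
        return Operand(False, pending[reg])
    return Operand(True, reg)


def issue(program):
    """Issue instructions in program order, tagging results T1, T2, ..."""
    pending = {}  # busy registers: register name -> tag of its latest writer
    entries = []
    for number, (op, dest, src1, src2) in enumerate(program, start=1):
        left, right = read(src1, pending), read(src2, pending)
        tag = f"T{number}"
        pending[dest] = tag  # mark dest busy; a later writer replaces the tag
        entries.append(Entry(op, left, right, tag))
    return entries, pending


def flag(operand):
    return "Y" if operand.ready else "N"


def format_row(entry):
    return (f"{flag(entry.left)} {entry.left.source} "
            f"{flag(entry.right)} {entry.right.source} {entry.tag}")


# The four instructions from the example: (opcode, destination, source 1, source 2).
PROGRAM = [
    ("div", "Rg", "Rb", "Rc"),
    ("add", "Rh", "Rg", "Rd"),
    ("mul", "Rg", "Re", "Rf"),
    ("add", "Ra", "Rh", "Rg"),
]

entries, pending = issue(PROGRAM)
for entry in entries:
    print(format_row(entry))
```

Running the sketch prints the same four rows as the table. Note that the mul replaces the tag on Rg (T1 becomes T3), which is why the 4th instruction waits for T3 rather than T1.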

The RS entries and the registers watch the result bus for tagged results. In this example T1 arrives first, and the 2nd instruction is then ready to go. When the second tagged result (T3) arrives, the second operand of the 4th instruction becomes ready and the busy bit of register Rg is cleared (N).

Busy  Tag  Register
Y     T4   Ra
Y     T2   Rh
N     T3   Rg
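
The result-bus step can be sketched the same way. The code below is only an illustration under stated assumptions: `issued_state`, `broadcast` and `is_ready` are hypothetical names, the starting state is copied by hand from the tables in this post, and a "ready" operand keeps its tag string instead of receiving an actual value.

```python
def issued_state():
    """State right after all four instructions have been issued."""
    # Reservation stations keyed by result tag; each operand is
    # [ready, tag_or_contents].
    stations = {
        "T1": {"left": [True, "Rb"], "right": [True, "Rc"]},
        "T2": {"left": [False, "T1"], "right": [True, "Rd"]},
        "T3": {"left": [True, "Re"], "right": [True, "Rf"]},
        "T4": {"left": [False, "T2"], "right": [False, "T3"]},
    }
    # Register status: register name -> [busy, tag].
    registers = {"Ra": [True, "T4"], "Rh": [True, "T2"], "Rg": [True, "T3"]}
    return stations, registers


def broadcast(tag, stations, registers):
    """Put a tagged result on the bus; every matching waiter captures it."""
    for entry in stations.values():
        for operand in entry.values():
            if not operand[0] and operand[1] == tag:
                operand[0] = True   # the awaited value has arrived
    for status in registers.values():
        if status[0] and status[1] == tag:
            status[0] = False       # register is written; clear the busy bit


def is_ready(entry):
    """An instruction can start once both of its operands are ready."""
    return all(ready for ready, _ in entry.values())


stations, registers = issued_state()

broadcast("T1", stations, registers)   # the div result appears on the bus
print("2nd instruction ready:", is_ready(stations["T2"]))
print("Rg after T1:", registers["Rg"])  # still busy: Rg now waits for T3

broadcast("T3", stations, registers)   # the mul result appears on the bus
for name in ("Ra", "Rh", "Rg"):
    busy, tag = registers[name]
    print("Y" if busy else "N", tag, name)
```

After the two broadcasts the loop prints the register table shown above. The T1 broadcast leaves Rg busy because the register only accepts the tag of its latest writer (T3), so the older div result never overwrites the newer mul result.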

The Microarchitecture of the Pentium 4 Processor

Posted in System on November 17, 2006 by wsjoung

It has been a while since the Pentium 4 processor came out, but it is one of the most successful processors, so I believe the paper is still worth reading.

What is the topic of this paper?
The microarchitecture of the Pentium 4 processor

What is the problem being addressed by the work reported on in the paper?
Computing environments need more power, and many applications need to process multimedia data such as MPEG-2/4 streaming video or audio.

Why is this problem important?
General architectural improvements are not enough to keep up with the exponentially growing demand for computing power caused by multimedia data processing. An architecture that specializes successfully in multimedia could also be a good opportunity in the home computing market.

What is the research’s basic approach?
NetBurst microarchitecture
Intel introduced the NetBurst microarchitecture to balance and tune the many microarchitecture features that compete for processor die cost and for design and validation effort in a fast processor.
The NetBurst microarchitecture is composed of 4 sections.

– Front End
In the Front End, instructions are decoded into uops, which are stored in the trace cache and delivered from it for execution. Only on a trace cache miss are instructions fetched from the L2 cache and decoded. The front end is also guided by a branch predictor that directs where instruction fetching should go next in the trace cache, which reduces the misprediction rate. Thanks to the hardware prefetcher, the L2 cache stays 256 bytes ahead of the current data access locations.

– Out-of-Order Execution Logic
This part of the machine reorders instructions so that they can execute as soon as their input operands are ready. It does this through buffer allocation, register renaming, and uop scheduling.

– Integer and Floating-Point Execution Units
The execution units are where instructions are actually executed, and there are several different kinds. There are separate 128-entry register files for integer and for floating-point/SSE (Streaming SIMD Extensions) operations. Each register file also has a multi-clock bypass network that bypasses, or forwards, just-completed results.

– Memory Subsystem
The Memory Subsystem includes the L2 cache and the system bus. The L2 cache backs the L1 cache and the trace cache, and the system bus is the I/O channel to memory. The memory subsystem has high bandwidth (48 Gbytes per second for the L2 cache when running at 1.5 GHz, and 3.2 Gbytes per second for the system bus), which is critical for processing streaming data from memory.
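
To put the bandwidth figures in perspective, here is a small back-of-the-envelope calculation in Python. It is only a sketch: it assumes 1 Gbyte means 10^9 bytes, and the 64-bit (8-byte) bus width is an assumption brought in for illustration, since the summary above does not state it. All variable names are made up for this example.

```python
# Figures quoted in the summary above (1 Gbyte taken as 10**9 bytes).
CORE_CLOCK_HZ = 1_500_000_000          # 1.5 GHz core clock
L2_BYTES_PER_SEC = 48_000_000_000      # 48 Gbytes per second
BUS_BYTES_PER_SEC = 3_200_000_000      # 3.2 Gbytes per second

# Assumption (not stated above): the system bus moves 8 bytes per transfer.
BUS_WIDTH_BYTES = 8

# Bytes the L2 cache can deliver in one core clock cycle.
l2_bytes_per_clock = L2_BYTES_PER_SEC // CORE_CLOCK_HZ

# Transfers per second the bus needs at the assumed width.
bus_transfers_per_sec = BUS_BYTES_PER_SEC // BUS_WIDTH_BYTES

# How many times faster the L2 cache is than the system bus.
l2_to_bus_ratio = L2_BYTES_PER_SEC // BUS_BYTES_PER_SEC

print(l2_bytes_per_clock, "bytes per core clock from L2")
print(bus_transfers_per_sec // 1_000_000, "million bus transfers per second")
print(l2_to_bus_ratio, "x L2 bandwidth relative to the system bus")
```

Under these assumptions the L2 cache delivers 32 bytes per core clock and has 15 times the bandwidth of the system bus, which is why keeping streaming data in the cache ahead of its use matters so much.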

Is this approach successful?
Yes. The performance comparison graphs show 15-20% (integer) and 30-70% (floating-point and multimedia) higher performance than the Pentium III processor.

Where should this research go next?
Intel should move to a 64-bit architecture next, with better caching algorithms.
For a better understanding, check these links.