DarkPoe
February 25, 2006, 09:41
Well, with so many new processors coming to market these days, everyone ends up asking: why is this processor faster than that one if they run at the same clock speed? Or the other way around: why does this processor, slower than the other, perform better in some applications than the one with the higher clock? These and other questions are often tied to a part of the processor that matters more every day: the cache.
Below is a document written by CPUID (quite a while ago) that I think really helps answer these frequent questions. It will also help us understand why the cache in some processors is better than in others, plus a few extra details that will help us understand this very important part of the processor...
The L1 cache
The L1 code and L1 data caches of the K8 are very similar to those of the K7, which is logical given the similarities between the cores of these two CPUs. This large cache is very efficient, as the K7 showed in the past. It uses 2-way set associativity, which results in an organization in two 32KB blocks. The size of these blocks allows them to hold a wide range of data or code from the same memory area, but the low associativity tends to create conflicts during the caching phase.
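To make the conflict problem concrete, here is a minimal sketch of how an address could map to a set in a 2-way cache built from two 32KB blocks. The 64-byte line size and the simple modulo indexing are assumptions for illustration; the article does not give them.

```python
# Sketch of set indexing in a 2-way set-associative cache.
LINE_SIZE = 64           # bytes per cache line (assumption, not from the article)
CACHE_SIZE = 64 * 1024   # one L1 cache: two 32KB blocks
WAYS = 2

WAY_SIZE = CACHE_SIZE // WAYS      # 32KB per way
NUM_SETS = WAY_SIZE // LINE_SIZE   # number of sets, each holding WAYS lines


def set_index(address: int) -> int:
    """Return the set that a byte address maps to (simple modulo indexing)."""
    return (address // LINE_SIZE) % NUM_SETS


# Three addresses spaced exactly one way-size (32KB) apart land in the same
# set, so they compete for only two slots and keep evicting each other.
addresses = [0x10000, 0x10000 + WAY_SIZE, 0x10000 + 2 * WAY_SIZE]
sets = [set_index(a) for a in addresses]
print(NUM_SETS, sets)
```

With higher associativity, each set has more slots, so the same three addresses could coexist.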
The L2 cache
Once again, the L2 cache of the K8 shares a lot of features with that of the K7. Both use 16-way set associativity, which partially compensates for the low associativity of the L1.
The width of the bus between the core and the L2 cache increases from 64 bits on the K7 to 128 bits on the K8. On the K7, this bus was sized according to the specifications of the first Athlon with discrete cache, but that choice begins to show its limits with the latest on-chip full-speed caches. The increase to 128 bits should improve the L2 bandwidth; we'll check this in the bandwidth tests.
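As a rough back-of-the-envelope illustration of why the wider bus matters, the sketch below counts how many bus transfers one cache line needs at each width. The 64-byte line size is an assumption, and real transfer timing depends on details the article does not cover.

```python
# Idealized count of bus transfers needed to move one cache line.
LINE_SIZE_BYTES = 64  # assumed cache line size, not given in the article


def transfers_per_line(bus_width_bits: int) -> int:
    """Number of bus transfers needed to move one full cache line."""
    bus_width_bytes = bus_width_bits // 8
    return LINE_SIZE_BYTES // bus_width_bytes


k7_transfers = transfers_per_line(64)    # 64-bit core-to-L2 bus
k8_transfers = transfers_per_line(128)   # 128-bit core-to-L2 bus
print(f"K7 (64-bit bus):  {k7_transfers} transfers per line")
print(f"K8 (128-bit bus): {k8_transfers} transfers per line")
```

Under these assumptions, doubling the width halves the number of transfers per line, which is an upper bound on the gain rather than a measured result.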
The K8 also includes hardware prefetch logic, which fetches data from memory into the L2 cache while the memory bus is idle.
The K7 and K8 use an exclusive relationship between L1 and L2, whereas Intel uses an inclusive relationship. This choice has a lot of consequences for the overall cache architecture, which is why we'll now explain what these relationships consist of and what influence they have on performance.
Inclusive and exclusive caches
In order to understand the way a cache works, let's consider the case of a CPU that has one cache level. When a read request occurs, the CPU asks its cache for the requested data. If the cache does not contain the data, the CPU gets it from memory and at the same time copies it to its cache. Why ? Because the CPU assumes that if it needed this data once, it may need it again soon, and statistically this is quite likely. An x86 CPU contains a small number of registers, and a value just loaded from memory into a register won't stay there for more than a few clock cycles, because the register will quickly be needed by another instruction. Storing the data in the cache is a way to keep it close at hand.
With one cache level, a read request from the CPU has two possible outcomes :
If the data is in the cache, there is a cache success. This is obviously the most favorable case.
If the data is not in the cache, there is a cache miss. The next step then consists in getting the data from memory and copying it to the cache. This is the caching process, or cache fill. At this point, two cases may occur, depending on whether the cache is already full or not. If it is not full, a new cache line is filled.
Figure 1 : cache fill
http://www.cpuid.com/reviews/K8/cache1.gif
The situation becomes more complicated if the cache is full. The cache fill then needs an existing line to be replaced. To decide which line must be replaced, the CPU uses a replacement algorithm. The most common choice consists in replacing the line that was least recently used : this is the LRU algorithm.
Figure 2 : Eviction of a cache line
http://www.cpuid.com/reviews/K8/cacheanim1.gif
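The single-level behaviour described so far (success, miss, cache fill, LRU eviction) can be sketched in a few lines. This is only a toy model: it assumes a fully associative cache tracked by address, and the class and variable names are made up for illustration.

```python
from collections import OrderedDict


class SingleLevelCache:
    """Toy one-level cache with LRU replacement (fully associative)."""

    def __init__(self, num_lines: int):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # address -> data, least recently used first

    def read(self, address, memory) -> str:
        if address in self.lines:
            # Cache success: mark the line as most recently used.
            self.lines.move_to_end(address)
            return "success"
        # Cache miss: free a line if the cache is full, then fill from memory.
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # the evicted line is simply lost
        self.lines[address] = memory[address]
        return "miss"


memory = {addr: f"data@{addr}" for addr in range(8)}
cache = SingleLevelCache(num_lines=2)
results = [cache.read(addr, memory) for addr in (0, 1, 0, 2, 1)]
print(results)
print(list(cache.lines))
```

In this access pattern, line 1 is evicted to make room for line 2 and then has to be fetched from memory again, which is exactly the loss a second cache level tries to avoid.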
As the animation clearly shows, the evicted cache line is simply lost. The first aim of a second-level cache is to recover this line instead of deleting it. In other words, it plays the role of a garbage bin.
The addition of a 2nd level cache creates new possible states when a read request occurs :
the data is in the L1 : L1 success
the data is not in the L1 but is in the L2 : L1 miss, L2 success
the data is not in the L1 and not in the L2 : L1 and L2 misses
Let's now see how this works. As long as the L1 is not full, the caching phase is the same as in the single-level configuration :
Figure 3 : L1 cache fill
http://www.cpuid.com/reviews/K8/cache2.gif
As soon as the L1 is full, the L2 takes an active role : when a line is evicted from the L1, it is copied into the L2, and a new line coming from memory is copied into the freed line :
Figure 4 : L2 fill
http://www.cpuid.com/reviews/K8/cacheanim2.gif
From this moment, the L2 contains data and is able to answer a read request. If the requested data is not in the L1 but is in the L2, the line is once again copied into the L1. Why not leave it in the L2 only ? For the same reason as before : the CPU may need it again. So, a line must be freed in the L1 to receive the data from the L2. The LRU algorithm selects the candidate line, which is copied into the L2, and the requested line is copied from the L2 back into the L1.
Figure 5 : L1 miss, L2 success
http://www.cpuid.com/reviews/K8/cacheanim3.gif
Proceeding this way, we notice that a cache line never exists at the same time in the L1 and in the L2. This means that the L1 and the L2 do not contain the same data : any given line is exclusively in one cache level. This is the exclusive relationship.
The total cache size is consequently the sum of the sizes of the two cache levels. This method works whatever the sizes of the L1 and the L2 are ; the L2 can even be smaller than the L1.
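The exclusive scheme can be modelled with two LRU-ordered levels where a line moves between them but is never stored in both. The sketch below is a simplified illustration, assuming fully associative levels and tracking addresses only; the names are hypothetical.

```python
from collections import OrderedDict


class ExclusiveCache:
    """Toy exclusive L1/L2 pair: a line lives in at most one level."""

    def __init__(self, l1_lines: int, l2_lines: int):
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines
        self.l1 = OrderedDict()  # address -> True, least recently used first
        self.l2 = OrderedDict()

    def _evict_l1_to_l2(self):
        # The L1 victim is copied into the L2 instead of being lost.
        victim, _ = self.l1.popitem(last=False)
        if len(self.l2) >= self.l2_lines:
            self.l2.popitem(last=False)  # L2 is full too: its LRU line is lost
        self.l2[victim] = True

    def read(self, address) -> str:
        if address in self.l1:
            self.l1.move_to_end(address)
            return "L1 success"
        if address in self.l2:
            del self.l2[address]  # the line leaves the L2 ...
            outcome = "L2 success"
        else:
            outcome = "miss"      # ... or comes from memory
        if len(self.l1) >= self.l1_lines:
            self._evict_l1_to_l2()
        self.l1[address] = True   # either way, it ends up in the L1
        return outcome


cache = ExclusiveCache(l1_lines=2, l2_lines=2)
outcomes = [cache.read(addr) for addr in (0, 1, 2, 0, 3)]
print(outcomes)
print(list(cache.l1), list(cache.l2))
```

After this sequence, the two levels together hold four distinct lines, the sum of their capacities, and no address appears in both.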
The exclusive relationship allows a lot of flexibility, but has a drawback in performance. When an L2 success occurs, a line from the L1 must be copied to the L2 before the data can be fetched from the L2. This additional step takes a lot of clock cycles and increases the total time needed to get the data from the L2.
In order to speed up the process, exclusive caches very often use a victim buffer (VB), which is a very small and fast memory between L1 and L2. The line evicted from L1 is then copied into the VB rather than into the L2. At the same time, the L2 read request is started, so the L1-to-VB write is hidden by the L2 latency. Then, if by chance the next requested data is in the VB, getting it back from there is much quicker than getting it from the L2.
The VB is a good improvement to the exclusive relationship, but it is very limited by its small size (generally between 8 and 16 cache lines). Moreover, when the VB is full, it must be flushed into the L2, which is an additional step that costs some extra cycles.
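A victim buffer can be added to a toy exclusive model as a small extra level that receives L1 victims first. The sketch below assumes the VB flushes only its oldest line to the L2 when it is full, and it does not model latency at all; both the flush policy and the names are assumptions for illustration.

```python
from collections import OrderedDict


class ExclusiveCacheWithVB:
    """Toy exclusive cache with a victim buffer between L1 and L2."""

    def __init__(self, l1_lines: int, vb_lines: int, l2_lines: int):
        self.l1_lines = l1_lines
        self.vb_lines = vb_lines
        self.l2_lines = l2_lines
        self.l1 = OrderedDict()  # least recently used first
        self.vb = OrderedDict()  # oldest victim first
        self.l2 = OrderedDict()

    def _evict_from_l1(self):
        victim, _ = self.l1.popitem(last=False)
        if len(self.vb) >= self.vb_lines:
            # VB full: flush its oldest line into the L2 (assumed policy).
            flushed, _ = self.vb.popitem(last=False)
            if len(self.l2) >= self.l2_lines:
                self.l2.popitem(last=False)
            self.l2[flushed] = True
        self.vb[victim] = True  # the L1 victim goes to the VB, not the L2

    def read(self, address) -> str:
        if address in self.l1:
            self.l1.move_to_end(address)
            return "L1 success"
        if address in self.vb:
            del self.vb[address]
            outcome = "VB success"
        elif address in self.l2:
            del self.l2[address]
            outcome = "L2 success"
        else:
            outcome = "miss"
        if len(self.l1) >= self.l1_lines:
            self._evict_from_l1()
        self.l1[address] = True
        return outcome


cache = ExclusiveCacheWithVB(l1_lines=2, vb_lines=1, l2_lines=4)
outcomes = [cache.read(addr) for addr in (0, 1, 2, 0, 3, 1)]
print(outcomes)
```

The fourth read finds its line in the VB without touching the L2, while the last read shows the less lucky case where the line has already been flushed to the L2.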
In fact, to avoid this additional write to the L2 on an L2 success, the write should be done beforehand. How can that be ? Well, this line comes from the L1, so it was written to the L1 at some earlier point. If the line is copied into the L2 at that same time, it will already be in the L2 !
In this configuration, data is fetched from memory and copied into both the L1 and the L2. The caching step therefore needs two writes instead of one.
Figure 6 : Caching
http://www.cpuid.com/reviews/K8/cache3.gif
Once the L1 is full and the requested data is in neither the L1 nor the L2, a new line is copied into both levels. This deletes a line in the L1, but there is no need to save it to the L2 because it is already there. So, the total number of writes is the same as in the previous configuration.
From this point, the L2 cache contains data that is not in the L1.
Figure 7 : L1 and L2 miss
http://www.cpuid.com/reviews/K8/cacheanim4.gif
If the requested data is not in the L1 but is found in the L2 (L2 success), the only operation needed is to copy a line from L2 to L1. So, only one write is needed instead of two in the previous configuration.
Figure 8 : L1 miss, L2 success
http://www.cpuid.com/reviews/K8/cache4.gif
In this configuration, all the lines of the L1 are duplicated in the L2 ; in other words, an image of the L1 is included in the L2. This is the inclusive relationship.
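For comparison, here is a toy model of the inclusive scheme, where every fill from memory writes both levels and an L1 victim is simply dropped. It assumes fully associative LRU levels, and it removes a line from the L1 when the L2 evicts it so that inclusion always holds; that last rule is an assumption of this sketch, not something the article describes.

```python
from collections import OrderedDict


class InclusiveCache:
    """Toy inclusive L1/L2 pair: every L1 line also has a copy in the L2."""

    def __init__(self, l1_lines: int, l2_lines: int):
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines
        self.l1 = OrderedDict()  # least recently used first
        self.l2 = OrderedDict()

    def read(self, address) -> str:
        if address in self.l1:
            self.l1.move_to_end(address)
            return "L1 success"
        if address in self.l2:
            self.l2.move_to_end(address)
            outcome = "L2 success"  # only one write needed: L2 -> L1
        else:
            if len(self.l2) >= self.l2_lines:
                victim, _ = self.l2.popitem(last=False)
                self.l1.pop(victim, None)  # keep inclusion (assumed rule)
            self.l2[address] = True        # a fill writes the L2 ...
            outcome = "miss"
        if len(self.l1) >= self.l1_lines:
            self.l1.popitem(last=False)    # dropped: a copy is already in L2
        self.l1[address] = True            # ... and the L1
        return outcome


cache = InclusiveCache(l1_lines=2, l2_lines=4)
outcomes = [cache.read(addr) for addr in (0, 1, 2, 0, 3)]
print(outcomes)
print(list(cache.l1), list(cache.l2))
```

Here the two levels together hold only as many distinct lines as the L2 alone, since the L1 contents are duplicated.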
An inclusive cache avoids one write on an L2 success, which makes it faster than an exclusive cache for this step. In practice, an inclusive L2 cache is faster than an exclusive one. On the other hand, the duplication of the L1 in the L2 reduces the "useful" size of the L2 cache by the L1 size. That means :
the L2 size must be greater than the L1 size, and the efficiency of the L2 depends on this size difference.
the total "useful" cache size is : L1 size + L2 size - L1 size, that is to say : L2 size.
Advantages and drawbacks of each method
This table summarizes the pluses and minuses of an exclusive cache :
+
No constraint on the L2 size.
Total cache size is the sum of the sub-level sizes.
-
L2 performance.
From this, we can guess what an exclusive cache should look like :
A large L1 cache. This is possible because there is no constraint on the L2 cache size. Moreover, a large L1 reduces the number of accesses to the L2.
A victim buffer to improve performance.
AMD chose an exclusive relationship for the first time on the Thunderbird. The CPU architecture fits this choice, with a large L1 cache and an 8-entry victim buffer.
This choice allowed AMD to build CPUs with an L2 cache size from 64 to 512KB with the same core, and even the Duron, which has a 64KB L2 cache, provides very good performance. On the other hand, increasing the L2 size does not provide a big jump in performance.
In comparison, an inclusive cache provides :
+
L2 performance.
-
Constraint on the L1/L2 size ratio.
Total cache size.
This table is exactly the opposite of the exclusive one. Indeed, the only advantage of an inclusive cache lies in its performance, but the improvement requires certain conditions to be met.
It is far from easy to sketch what an inclusive cache should look like. The constraint on the L1/L2 size ratio requires the L1 to be small, but a small size reduces its success rate, and consequently its performance. On the other hand, if it is too big, the ratio will be too large for good L2 performance. In a word : headache.
Intel chose an inclusive relationship with the Pentium Pro, which is the first CPU that includes an on-chip L2 cache. This choice was kept for the whole CPU line following the PPro. That's why no Intel CPU has a very large L1 cache. The biggest size was reached on the Pentium M, which includes a 64KB L1 and a 1MB L2.
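The "useful size" rule from the previous section can be applied to the configurations mentioned in the article. In the sketch below, the 128KB total L1 for the K7 parts is an assumption (code plus data caches of 64KB each, by analogy with the K8 description above); the L2 sizes and the Pentium M figures are the ones quoted in the text.

```python
def useful_size_kb(l1_kb: int, l2_kb: int, relationship: str) -> int:
    """Total amount of distinct data the two cache levels can hold, in KB."""
    if relationship == "exclusive":
        return l1_kb + l2_kb  # no duplication between levels
    if relationship == "inclusive":
        return l2_kb          # L1 + L2 - L1: the L1 is duplicated in the L2
    raise ValueError(f"unknown relationship: {relationship}")


# (name, L1 KB, L2 KB, relationship); K7 L1 size of 128KB is an assumption.
configs = [
    ("Duron, 64KB L2", 128, 64, "exclusive"),
    ("Athlon, 512KB L2", 128, 512, "exclusive"),
    ("Pentium M, 1MB L2", 64, 1024, "inclusive"),
]

sizes = {name: useful_size_kb(l1, l2, rel) for name, l1, l2, rel in configs}
for name, size in sizes.items():
    print(f"{name}: {size}KB useful")
```

The Duron case illustrates the point made about exclusive caches: the L2 is smaller than the L1, yet every kilobyte of it still adds to the total.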
The Pentium 4 was introduced with a very small 8KB L1 data cache. This choice was made for two reasons : the Pentium 4 was the first CPU designed with an integrated full-speed L2 cache (excepting the PPro, but the Pentium II and III began with a discrete L2 cache), so the CPU architecture was designed knowing that the small L1 could be backed by a large and fast L2 ; moreover, a very small L1 can be very fast, and the Pentium 4 L1 cache has the lowest latency ever seen, at 2 clock cycles.
The constraints of an inclusive L2 are hardly compatible with commercial considerations. In fact, it is very hard to build a CPU line with such constraints. Intel released the Celeron P4 as a budget CPU, but its 128KB L2 cache completely breaks performance. The Celeron P4 does not fit the constraints of the inclusive relationship, and the result is catastrophic. On the other hand, an inclusive relationship can be very efficient, as the Pentium M shows.
Conclusion
The choice of a cache architecture is a very important step in the design of a CPU, as it determines not only the performance but also how the product line can evolve toward the low end and the high end.
The exclusive relationship is the most flexible, as it allows a lot of different configurations while keeping a good level of performance. The drawback is that performance does not increase very much with the L2 size. The inclusive relationship can only be chosen for performance reasons, knowing for example that increasing the L2 will create a performance boost. However, the constraints of this mode are very strict, and not respecting them can have the opposite result and break performance.
I hope you like it :)
alexander_coras
February 26, 2006, 01:45
Good contribution, DarkPoe. I'll just add that cache is memory integrated into the processor, so it has the data directly at hand. The more of it the better, just like the system RAM itself, which is why the more cache capacity a processor has, the higher its price.
A brief explanation on my part, for those who said that Semprons or Celerons perform the same as Athlons or Pentium 4s.
[CP] Chechito LAN.CO
February 26, 2006, 06:57
Good article. I read it some time ago and it is quite enriching.