
Computer Architecture 5COM1057: The Memory Subsystem, Part 2

1. Aims

The aims of this unit are to introduce the basic memory hierarchy of a computer system:
• for you to have an understanding of how to quantify cache memory performance,
• for you to have an insight into the complexities of improving cache memory performance.

1.1. Learning Outcomes

At the end of this unit, you should be able to:
• understand some of the fundamental engineering principles of processor system design,
• show how the processor interacts with its immediate memory subsystem.

1.2. Reading

These notes and the lecture slides provide all information required of the typical Computer Architecture student. If you wish to read around the topic, there are many books on Computer Architecture provided by Information Hertfordshire.

1.3. Introduction

As a recap from parts 1 and 2 of the Memory Subsystem, we have considered where a block is placed in the cache, how a block is found in the cache (data access), which block (if any) should be replaced on a cache miss, and what happens on a write to memory. We also need to consider the performance of our cache, whether we need to introduce additional caches (buffers) into our memory hierarchy, and whether such additional hardware will actually deliver a performance benefit or instead degrade performance. This means we must undertake many optimisation studies and allow for trade-offs.


2. A Unified Cache or a Split-level Cache?

We should consider whether a unified cache, which supplies both instructions and data, will service the needs of our client, or whether our client needs the separation of instructions and data into an I-cache and a D-cache. If we use a unified cache we have to consider that load and store instructions will cause a bottleneck: if a load and a store instruction are executed simultaneously then a 'collision' will occur whilst both attempt to access the unified cache. This is known as a Structural Hazard, which we consider in more detail when we study pipelining. A split-level cache will overcome the Structural Hazard problem, but it will incur additional overheads and costs in the cache design, which may not suit our clients' needs. What we would achieve is a separation of data memory from instruction memory. This also means we would have to consider the memory model of the lower level of the memory hierarchy, which in the case we are looking at now would be main memory. It is worth noting that about 25% of instructions in general purpose computing are loads and stores (data accesses); in other words, 1 in 4 instructions could be a load or a store. Such a high frequency of data memory accesses suggests that a split cache may be beneficial for performance, but, as ever, such additional hardware design complexity brings overheads which will restrict any potential performance gain.

2.1. Separate the Cache

In a split cache the CPU knows whether it is issuing an instruction address or a data address. Therefore, there can be separate ports for both: one for instruction address access and one for datum address access. Hence, bandwidth between the memory hierarchy and the CPU is doubled. This is something which we can use to our advantage, as we will see when we study pipelining. A major advantage of splitting the cache is that the data cache and the instruction cache can be optimised separately. This means that the two caches could have different sizes (capacities), different block sizes and different configurations (direct mapped or set-associative). If the I-cache and the D-cache are both set-associative then their associativities may also differ.


2.1.1. Miss rate comparison between I-caches and D-caches

Studies have shown that the miss rate of I-caches tends to be lower than that of D-caches. Therefore, using the miss rate as a (single) metric suggests that better performance will be achieved by separation of caches as opposed to a unified cache. On the down side, splitting a unified cache into an I-cache and a D-cache fixes the cache space devoted to each type. This means that capacity and collisions may be a problem unless careful optimisation studies are carried out beforehand. Careful optimisation studies will increase the development costs, which may not service our clients' needs.

2.1.2. Average miss rate of a split cache

To determine the average miss rate with split caches we need to know the percentage of memory references to each cache (the I-cache and the D-cache). We mentioned earlier that in general purpose programming 25% of cache references could be data references (loads and stores), and hence 75% of general purpose cache references will be instructions, which is a ratio of 1 : 3. Using the miss rate as a single metric to judge cache performance is not a good idea; it is better to use the average memory access time (AMAT), where:

AMAT = Hit Time + Miss Rate * Miss Penalty (in clock cycles)
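As a quick illustration, here is a minimal sketch of the AMAT calculation and the weighted average miss rate for a split cache. All the figures (miss rates, hit time, miss penalty) are invented for illustration; only the 75% / 25% instruction/data mix comes from the text above.

def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time in clock cycles."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical split-cache figures (not from the notes).
i_miss_rate = 0.01   # I-caches tend to have the lower miss rate
d_miss_rate = 0.06
hit_time = 1         # clock cycles
miss_penalty = 50    # clock cycles to reach the lower level of the hierarchy

# Average miss rate weighted by the 75% instruction / 25% data mix.
avg_miss_rate = 0.75 * i_miss_rate + 0.25 * d_miss_rate

print(f"Average miss rate : {avg_miss_rate:.4f}")
print(f"AMAT (split cache): {amat(hit_time, avg_miss_rate, miss_penalty):.2f} cc")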

2.2. Cache performance metrics

AMAT tells us that when optimising a cache we should attempt to:
• reduce the miss rate,
• reduce the miss penalty,
• reduce the hit time.

2.2.1. Reduce the miss rate

To reduce the miss rate we have to consider the 3Cs problem:
• Compulsory misses (cold starts),
• Capacity misses,
• Collisions (conflicts).


Compulsory misses occur on the very first access to a cache block. The block cannot yet be in the cache and the valid bit will not be set. The block must, therefore, be brought into the cache from a lower level of the memory hierarchy. Compulsory misses are also known as cold start misses or first reference misses.

Capacity misses mean that more memory blocks are referenced than the cache can hold. This means that cache blocks have to be replaced by blocks from a lower level of the memory hierarchy. A cache block that has been replaced might be referenced again, resulting in yet more replacements, in particular in nested loops.

Collisions occur in direct mapped caches and also in set-associative caches (but to a lesser degree). A collision occurs if too many memory blocks of a lower level of the hierarchy map to the same cache block or set of cache blocks. As with capacity misses, a cache block can be replaced and later retrieved from a lower level of the hierarchy. These are also known as conflict misses or interference misses.

2.2.2. Reduce the miss rate considerations

We could increase the cache line size by having more than one word in the cache line (known as multi-word blocks), which would reduce the number of compulsory misses. This is a very attractive design consideration as it takes advantage of the principle of spatial locality. However, the downside is that a larger block size increases the miss penalty. Also, for a fixed cache size, increasing the block size reduces the number of blocks in the cache and therefore increases the number of capacity misses. Finally, a larger block size may also increase the number of collisions. We have to be careful of the trade-offs when undertaking optimisation studies: the reduction in compulsory misses from a larger block size must be weighed against the increase in capacity misses and collisions. Hence, there is no benefit in reducing the miss rate if the AMAT is increased due to an increase in the miss penalty. In a set-associative cache we could increase the associativity, but with increasing associativity the hit time increases due to the additional hardware complexity (the amount of combinational logic in the mux).
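To make collision misses concrete, here is a minimal sketch (not taken from the notes) of a direct-mapped cache model; the trace and the cache size are invented for illustration. Two blocks that map to the same cache line keep evicting each other, even though the cache is far from full.

def simulate_direct_mapped(block_addresses, num_lines):
    lines = [None] * num_lines          # one stored block (tag) per cache line
    misses = 0
    for block in block_addresses:
        index = block % num_lines       # direct-mapped placement
        if lines[index] != block:       # compulsory, capacity or conflict miss
            misses += 1
            lines[index] = block        # evict whatever was in this line
    return misses

# Blocks 0 and 8 both map to line 0 of an 8-line cache:
trace = [0, 8, 0, 8, 0, 8]
print(simulate_direct_mapped(trace, num_lines=8))
# 6 misses: after the two compulsory misses, every access is a collision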


In fact, increasing the associativity may lead to a hit time of multiple clock cycles, thereby reducing the performance of the cache. This means that there is a trade-off between the reduced miss rate and the impact on the clock cycle time, which must be considered when undertaking optimisation studies.
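As a companion to the direct-mapped sketch above, the following hypothetical model adds set-associativity with LRU replacement (again, the trace and sizes are invented). On the same trace the collisions disappear, but every access now has to search all the ways of a set, which corresponds to the extra comparator and mux logic that lengthens the hit time.

from collections import OrderedDict

def simulate_set_associative(block_addresses, num_sets, ways):
    sets = [OrderedDict() for _ in range(num_sets)]   # each set keeps its blocks in LRU order
    misses = 0
    for block in block_addresses:
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)                      # hit: refresh LRU position
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)                 # evict the least recently used block
            s[block] = True
    return misses

# Same 8 lines of storage as before, organised as 4 sets of 2 ways:
trace = [0, 8, 0, 8, 0, 8]
print(simulate_set_associative(trace, num_sets=4, ways=2))
# 2 misses: only the compulsory ones remain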

2.2.3. Reduce the miss rate and miss penalty considerations

As design engineers we could provide additional hardware in the form of what is known as a Stream Buffer. The Stream Buffer is a way to prefetch instructions and data by fetching two consecutive memory blocks from a lower level of the hierarchy on a cache miss. The first memory block is loaded into the cache as usual and the second memory block is placed in the Stream Buffer. You should note that there can be separate data and instruction Stream Buffers. A Stream Buffer may be a small capacity cache, which could be fully associative. The idea of the Stream Buffer is that it can be accessed more quickly than main memory, i.e. it has a low hit time. If there is a hit in the Stream Buffer, then that block can be used to service the processor request and it can then be copied into the cache at cache speed. Following the principle of spatial locality, the next memory block is also prefetched from main memory and added to the Stream Buffer. We are hence predicting which memory block the CPU will call for soon and having that memory block ready for execution in a faster form of memory than main memory. This is good so long as the principle of spatial locality is upheld. However, when a branch instruction is encountered the next memory reference may not follow spatial locality and may instead follow temporal locality, in which case we have prefetched the wrong instruction, and we may have prefetched the wrong datum. This impacts on the miss penalty and will therefore increase the average memory access time.

What we can now do as design engineers is to add yet another buffer (cache) known as a Victim Cache. The idea of the Victim Cache is to reduce the miss rate without having an impact on the miss penalty or the clock cycle time. A Victim Cache is a small fully associative cache that is placed between the main cache and its refill path. The Victim Cache only holds cache blocks that have recently been discarded from the main cache as a result of a cache miss. On a main cache miss the Victim Cache is checked for the cache block (this check can be done in parallel). Since the requested cache block is still held in the Victim Cache rather than having to be fetched from main memory, it can be returned to the main cache at cache speed rather than main memory speed, and hence the miss penalty is not impacted. Victim Caches are only used for instructions and not data.

2.2.4. Our new hierarchy

Figure 1 is a block diagrammatic representation of our new memory hierarchy for instructions (and not data). The solid arcs show the processor requesting an address and the dashed arcs show the responses from the appropriate buffers (caches).

[Figure 1 block diagram: CPU connected to the I-cache, the Victim Cache and the Stream Buffer, all backed by the lower level of the memory hierarchy, e.g. main memory.]

Figure 1: An instruction memory hierarchy with a Stream Buffer and a Victim Cache

We can see that the processor makes a memory reference request which, in this example, is passed in parallel to the I-cache, the Victim Cache and the lower level of the memory hierarchy. If the I-cache can service the request (a hit) then the requested instruction is returned to the processor at cache speed. If there is a miss in the I-cache then the request could be serviced by the Victim Cache. In this diagrammatic representation the Victim Cache services the I-cache, which then services the processor; hence there is some additional penalty, namely the miss penalty of the Victim Cache. When updating the I-cache, the Victim Cache would perform a fully associative write and hence any additional delay would be minimal when compared with the miss penalty of accessing main memory. On a cache miss, the lower level of the memory hierarchy would service that request and hence a miss penalty would be incurred. However, the next sequential instruction would also be read into the Stream Buffer, and so long as spatial locality is followed this block is now ready to be read into the main cache, minimising the miss penalty. The Stream Buffer would be fully associative and hence this read would be extremely fast.
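The fetch path in Figure 1 can be sketched in a few lines. This is a minimal illustration under invented assumptions (the dictionaries, the main_memory helper and the block numbering are all hypothetical, eviction into the Victim Cache is not modelled, and the lookups are shown sequentially for clarity even though the notes describe them happening in parallel): try the I-cache, then the Victim Cache, then the Stream Buffer, and only then main memory, prefetching the next sequential block on the way.

def main_memory(block):
    """Stand-in for a slow main-memory read."""
    return f"contents of block {block}"

icache, victim_cache, stream_buffer = {}, {}, {}

def fetch(block):
    if block in icache:                       # hit at cache speed
        return icache[block]
    if block in victim_cache:                 # recently discarded block recovered cheaply
        icache[block] = victim_cache.pop(block)
        return icache[block]
    if block in stream_buffer:                # prefetch paid off (spatial locality)
        icache[block] = stream_buffer.pop(block)
    else:                                     # full miss penalty: go to main memory
        icache[block] = main_memory(block)
    stream_buffer[block + 1] = main_memory(block + 1)   # prefetch next sequential block
    return icache[block]

print(fetch(10))   # miss -> main memory, block 11 prefetched
print(fetch(11))   # serviced from the Stream Buffer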

2.3. Making writes faster

We know that a write occurs when a store instruction is encountered. We have also already looked at the write policies of write through, write back, and write back with a dirty bit. In the previous sections we have seen that the introduction of additional buffers (caches), the Victim Cache and the Stream Buffer, can minimise the impact of the miss penalty and also reduce the miss rate. We could therefore consider the addition of yet another buffer, the Write Buffer, to minimise the impact of the write (update) penalty.

2.3.1. The Write Buffer

The Write Buffer is a small FIFO buffer that is placed between the main D-cache and the lower level of the memory hierarchy, e.g. main memory. It decouples writes to main memory from the main D-cache.

Case 1: The D-cache is write through

To initiate a memory write, the D-cache loads (copies) the address and datum into the Write Buffer. Memory writes to the lower level of the hierarchy then proceed independently of the CPU and at the speed of the lower level of the memory subsystem (main memory). Hence the impact of the miss penalty is reduced. This is a very attractive design engineering decision as the miss penalty has a high impact under a write through policy.


However, there is a problem with the use of a Write Buffer. As a result of a read miss, the CPU will attempt to read data by passing the request to the lower level of the hierarchy (e.g. a main memory location). But the Write Buffer may still hold a pending update to that location which has not yet reached the lower level of the hierarchy. We can therefore say that Write Buffers complicate memory accesses in that they may hold the updated value that is needed on a read miss. To overcome this problem, since the Write Buffer is FIFO, we could stall memory reads until the write buffer is empty, or we could permit memory reads to overtake writes that are held in the write buffer (out-of-order reads), in which case an associative address lookup is required in the write buffer.

Case 2: The D-cache is write back

Suppose a read miss replaces a dirty memory block. We would then copy the dirty block to the write buffer, then read from the lower level of the memory hierarchy (main memory), and only then write the dirty block to that memory, instead of writing the dirty block to the lower level of memory first and then reading from it. The benefit of this approach is that the CPU read would probably finish earlier and, therefore, a read stall would be unnecessary. Similarly to the write-through policy, if there is a read miss the processor can either stall until the write buffer is empty or do an associative lookup for buffer collisions. Figure 2 shows the CPU interaction with a memory hierarchy including a (FIFO) Write Buffer.


[Figure 2 block diagram: CPU connected to the D-cache, with a FIFO Write Buffer between the D-cache and the lower level of the memory hierarchy, e.g. main memory.]

Figure 2: The memory hierarchy with a FIFO Write Buffer
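The write-buffer behaviour described above can be sketched as follows. This is a minimal illustration under invented assumptions (the addresses, the drain_one helper and the dictionary standing in for main memory are all hypothetical): writes are queued in FIFO order, main memory drains them at its own pace, and a read miss searches the buffer associatively so that it never returns stale data.

from collections import deque

write_buffer = deque()        # FIFO of (address, datum) pairs awaiting main memory
main_memory = {}

def buffered_write(address, datum):
    """The D-cache hands the write to the buffer; main memory is updated later."""
    write_buffer.append((address, datum))

def drain_one():
    """Main memory retires the oldest queued write at its own pace."""
    if write_buffer:
        address, datum = write_buffer.popleft()
        main_memory[address] = datum

def read_miss(address):
    """Associative lookup: the newest matching entry in the buffer wins."""
    for addr, datum in reversed(write_buffer):
        if addr == address:
            return datum
    return main_memory.get(address)

buffered_write(0x100, 42)
print(read_miss(0x100))    # 42, taken from the write buffer before it has drained
drain_one()
print(main_memory[0x100])  # 42, now retired to main memory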

3. Multi-level Caches

We'll assume a memory hierarchy with two cache levels. The level 1 cache would be close to the CPU (probably on-chip) and small enough to keep up with the CPU speed. The level 2 cache would be much larger in capacity and probably off-chip, but it would still be fast enough to significantly reduce the impact of the miss penalty. We need to consider the overall AMAT for all levels of cache in the hierarchy. In doing so we must consider the AMAT of each cache; the miss penalty of the level 1 cache is effectively the AMAT of the level 2 cache:

AMAT = Hit-time_L1 + Miss-rate_L1 * Miss-penalty_L1

Miss-penalty_L1 = Hit-time_L2 + Miss-rate_L2 * Miss-penalty_L2

Hence the overall AMAT for a two-level cache would be:


AMAT = Hit-time_L1 + Miss-rate_L1 * (Hit-time_L2 + Miss-rate_L2 * Miss-penalty_L2)

If there were a third level, we would have to consider the miss penalty of the L2 cache:

Miss-penalty_L2 = Hit-time_L3 + Miss-rate_L3 * Miss-penalty_L3

Our AMAT for a three-level cache structure would then be:

AMAT = Hit-time_L1 + Miss-rate_L1 * (Hit-time_L2 + Miss-rate_L2 * (Hit-time_L3 + Miss-rate_L3 * Miss-penalty_L3))

and so on for more cache levels. We can see that although hits in a particular cache level reduce the impact of the miss penalty, there comes a point where adding further cache levels increases the miss penalty. Hence, again, careful optimisation studies must be conducted to find the optimal miss penalty reduction for our clients' purposes.

What we are looking at here is what is known as a local miss rate, where the local miss rate is the miss rate for a particular cache level. Hence there is a local miss rate for the level 1 cache, the level 2 cache, the level 3 cache and so on. For a specific cache level, the miss rate is the number of misses in that cache divided by the total number of memory accesses to that specific cache. We also have to consider the global miss rate, which is the overall miss rate for all levels of cache in the hierarchy. For a two-level cache structure the global miss rate is the number of misses in the level 2 cache divided by the total number of memory accesses generated by the CPU, i.e. Miss-rate_L1 * Miss-rate_L2.
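A minimal sketch of this recurrence follows, with invented per-level figures. Each level contributes its hit time plus its local miss rate times the penalty of the next level down, and the last level's penalty is the main memory access time.

def multilevel_amat(levels, memory_access_time):
    """levels: list of (hit_time, local_miss_rate) pairs from L1 outwards."""
    penalty = memory_access_time
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty

# Hypothetical two-level figures: L1 (1 cc, 5% local misses), L2 (10 cc, 20% local misses).
levels = [(1, 0.05), (10, 0.20)]
print(multilevel_amat(levels, memory_access_time=100))  # 1 + 0.05*(10 + 0.2*100) = 2.5 cc

# Global miss rate for the two-level structure: the product of the local miss rates.
print(0.05 * 0.20)   # 0.01, i.e. 1% of CPU references go all the way to main memory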

4. Improving Cache Memory Performance

Improving cache memory performance is not easy and requires a great deal of optimisation study. We should attempt to reduce the hit time by keeping the capacity of our caches as small as possible and keeping the design of our caches as simple as possible.
