L1 Cache Controller

From NaplesPU Documentation
Revision as of 16:48, 21 June 2019 by Mirko (talk | contribs) (Stall Protocol ROMs)

The L1 cache controller manages the L1 cache. In particular, it handles only coherence information (such as the block state), since the L1 data cache is managed by the load/store unit.

[Figure: LDST_CC]

The component is composed of 4 stages:

  • stage 1: schedules a pending request to issue (from local core or network);
  • stage 2: contains coherence cache and MSHR;
  • stage 3: processes a request according to the coherence protocol;
  • stage 4: prepares coherence request/response to be sent on the network.

All these stages are represented in the LDST_CC figure. The component is pipelined so that the controller can serve multiple requests at the same time.

Assumptions

The design has been driven by these assumptions:

  • the cache controller can serve a request only when the previous request on the same address has been completely served;
  • transactions involve only memory blocks;
  • two different requests that map to the same set cannot be processed at the same time;
  • only requests from the local core (load, store, replacement) can allocate MSHR entries;
  • information about a cache block in a non-stable state is stored in the MSHR; otherwise it is stored in the L1 cache.

Two requests on the same block cannot be issued back to back: because of pipelining, the first request modifies its MSHR entry only two clock cycles after being issued (in stage 3), so a request issued right after it could read a stale entry.

Stage 1

Stage 1 is responsible for issuing requests to the controller. A request can be a load miss, store miss, flush or replacement request from the local core, or a forwarded coherence request or a response from the network interface.

MSHR Signals

To find out whether a request for the same block has already been issued and is still pending, the tag and set of each type of request are provided to the MSHR. The MSHR response for a class of request is considered valid if and only if its hit signal is asserted. Note that the MSHR is in stage 2. Here is the code for the load miss signals:

// Signals to MSHR
assign cc1_mshr_lookup_tag[MSHR_LOOKUP_PORT_LOAD ]   = ci_load_request_address.tag;
assign cc1_mshr_lookup_set[MSHR_LOOKUP_PORT_LOAD ]   = ci_load_request_address.index;

// Signals from MSHR
assign load_mshr_hit                                 = cc2_mshr_lookup_hit[MSHR_LOOKUP_PORT_LOAD ];
assign load_mshr_index                               = cc2_mshr_lookup_index[MSHR_LOOKUP_PORT_LOAD ];
assign load_mshr                                     = cc2_mshr_lookup_entry_info[MSHR_LOOKUP_PORT_LOAD ];

Stall Protocol ROMs

To comply with the coherence protocol, all incoming requests for blocks that are in a non-stable state have to be stalled. This task is performed by a series of protocol ROMs (one for each request type) whose output signal, when asserted, stalls the issue of the corresponding request, e.g. when a block is in state SM_A and a Fwd_GetS, Fwd_GetM, recall, flush, store or replacement request for the same block is received. To drive this signal, the protocol ROM needs the type of the request and the current state of the block. Here is the stall logic:

stall_protocol_rom load_stall_protocol_rom (
  .current_request ( load            ),
  .current_state   ( load_mshr.state ),
  .pr_output_stall ( stall_load      )
);
stall_protocol_rom store_stall_protocol_rom (
  .current_request ( store            ),
  .current_state   ( store_mshr.state ),
  .pr_output_stall ( stall_store      )
);
stall_protocol_rom flush_stall_protocol_rom (
  .current_request ( flush            ),
  .current_state   ( flush_mshr.state ),
  .pr_output_stall ( stall_flush      )
);
stall_protocol_rom replacement_stall_protocol_rom (
  .current_request ( replacement            ),
  .current_state   ( replacement_mshr.state ),
  .pr_output_stall ( stall_replacement      )
);
stall_protocol_rom forwarded_stall_protocol_rom (
  .current_request ( fwd_2_creq( ni_forwarded_request.packet_type ) ),
  .current_state   ( forwarded_request_mshr.state                   ),
  .pr_output_stall ( stall_forwarded_request                        )
);
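
As a rough illustration of what a stall protocol ROM computes, the following Python sketch models it as a table lookup. It is not the NaplesPU RTL: the table name and the string encoding of states and requests are assumptions, and only the SM_A row quoted in the text is filled in.

```python
# Behavioral sketch of a stall protocol ROM (not the NaplesPU RTL).
# Only the SM_A row quoted in the text is present; a complete table would
# list every non-stable state of the protocol.
STALL_TABLE = {
    "SM_A": {"Fwd_GetS", "Fwd_GetM", "recall", "flush", "store", "replacement"},
}

def stall_protocol_rom(current_request: str, current_state: str) -> bool:
    """Return True when the request must wait for the block to change state."""
    return current_request in STALL_TABLE.get(current_state, set())

print(stall_protocol_rom("store", "SM_A"))     # True
print(stall_protocol_rom("Fwd_GetS", "M"))     # False: M is a stable state
```

One ROM instance per request type, as in the excerpt above, corresponds to calling this function once per request class with the state returned by that class's MSHR lookup.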

Note that response messages are never stalled by the coherence protocol; they are held back only if a pending request with the same set index is already in the pipeline:

assign can_issue_response                                      = ni_response_valid &
	!(
		( cc2_pending_valid && ( ni_response.memory_address.index == cc2_pending_address.index ) ) ||
		( cc3_pending_valid && ( ni_response.memory_address.index == cc3_pending_address.index ) )
	);

Request Issue Signals

A generic request can be issued when all of the following hold:

  • one of these MSHR conditions is met:
      • the MSHR has no entry for the same block;
      • the MSHR has an entry for the same block, but it is not valid;
      • the MSHR has a valid entry for the same block, and the request is not stalled by the protocol ROM (see stall signals);
  • later stages are not serving a request on the same set (see assumptions);
  • the network interface is available.

Here is the condition for a load miss:

assign can_issue_load = ci_load_request_valid &&
       ( !load_mshr_hit ||
            ( load_mshr_hit && !load_mshr.valid ) ||
            ( load_mshr_hit && load_mshr.valid && load_mshr.address.tag == ci_load_request_address.tag && !stall_load ) ) &&
       !( ( cc2_pending_valid && ( ci_load_request_address.index == cc2_pending_address.index ) ) ||
            ( cc3_pending_valid && ( ci_load_request_address.index == cc3_pending_address.index ) ) ) &&
       ni_request_network_available;
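
The issue condition can also be read as a small Boolean function. The sketch below is a behavioral model in Python, not the RTL: the `MshrLookup` class and the convention of passing `None` for an empty pipeline stage are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class MshrLookup:
    hit: bool    # an MSHR entry matched the lookup
    valid: bool  # the matching entry holds a pending transaction
    tag: int     # tag stored in the matching entry

def can_issue_load(req_valid, req_tag, req_index, mshr, stall_load,
                   cc2_pending, cc3_pending, network_available):
    """cc2_pending / cc3_pending: set index in flight in that stage, or None."""
    # Logically equivalent to the three OR-ed MSHR terms of the RTL expression.
    mshr_ok = (not mshr.hit
               or not mshr.valid
               or (mshr.tag == req_tag and not stall_load))
    set_busy = req_index in (cc2_pending, cc3_pending)
    return req_valid and mshr_ok and not set_busy and network_available

idle = MshrLookup(hit=False, valid=False, tag=0)
print(can_issue_load(True, 0x1A, 3, idle, False, None, None, True))  # True
print(can_issue_load(True, 0x1A, 3, idle, False, 3, None, True))     # False: set 3 is in stage 2
```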

Response messages need no MSHR conditions, because they never wait for subsequent events and are never stalled. The same goes for flush requests, although these can be stalled by their stall protocol ROM. Note that, unlike the directory controller's MSHR, there is no check on MSHR occupancy. Since each thread can issue only one request at a time and only threads can allocate an MSHR entry, sizing the MSHR to the number of threads guarantees that it is never full, which makes an occupancy check unnecessary.

Finally, a replacement request can be pre-allocated in the MSHR (see MSHR Update Logic). For this request to be issued before any other request on the same block, an additional condition on waiting_for_eviction is added:

assign can_issue_replacement = 
...
( !replacement_mshr_hit ||
    ( replacement_mshr_hit && !replacement_mshr.valid) ||
    ( replacement_mshr_hit && replacement_mshr.valid && ( !stall_replacement || replacement_mshr.waiting_for_eviction ) ) ) 
...

Requests Scheduler

Once the issue conditions have been evaluated, two or more requests may be ready at the same time, so a scheduler is needed. Every request type has a fixed priority, in this order (highest first):

  1. flush
  2. replacement
  3. store miss
  4. coherence response
  5. coherence forwarded request
  6. load miss

Once a request type has been scheduled, this block drives the output signals for the second stage accordingly.
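
A fixed-priority scheduler of this kind can be sketched in a few lines of Python. This is a behavioral model only; the request-type names are labels chosen for this example, not identifiers from the RTL.

```python
from typing import Dict, Optional

# Highest priority first, following the list above.
PRIORITY = ["flush", "replacement", "store_miss",
            "coherence_response", "forwarded_request", "load_miss"]

def schedule(can_issue: Dict[str, bool]) -> Optional[str]:
    """Return the highest-priority request type that is ready, or None."""
    for kind in PRIORITY:
        if can_issue.get(kind, False):
            return kind
    return None

print(schedule({"load_miss": True, "store_miss": True}))  # store_miss
print(schedule({}))                                       # None
```

With fixed priorities, a load miss waits for as long as any higher-priority request type keeps being ready.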

Stage 2

Stage 2 is responsible for managing the L1 coherence cache and the MSHR, and for forwarding signals from Stage 1 to Stage 3. It contains the L1 coherence cache (the L1 data cache is in the load/store unit) and all the related logic for managing cache hits and block replacement. Blocks are replaced with an LRU (Least Recently Used) policy.
This module receives signals from stage 3 to update the MSHR and the coherence cache once a request has been processed, and from the load/store unit to update the LRU state every time the core accesses a block.

Hit/miss logic

The cache lookup is handled in an unusual way, so it is described in more detail here.

The lookup is split into two phases, performed by:

  1. load/store unit;
  2. cache controller (stage 2).

The load/store unit performs the first phase using only the set of the request: it returns the tags of all the ways in that set, together with their privilege bits. This first phase takes place while the request is in cache controller stage 1. The second phase is performed by cache controller stage 2 using only the tag of the request, searching the array provided by the load/store unit. If a block has the same tag and is valid (validity is checked through the privilege bits), a hit has occurred and the way index of that block is provided to stage 3, which uses it to update the coherence data of that block.
If no block has the same tag as the request, there is no hit, and stage 3 takes the way index provided by the LRU unit in order to replace that block (see replacement logic).

...
// Second phase lookup
// Result of this lookup is an array one-hot codified 
assign snoop_tag_way_oh[dcache_way] = ( ldst_snoop_tag[dcache_way] == cc1_request_address.tag ) & ( ldst_snoop_privileges[dcache_way].can_read | ldst_snoop_privileges[dcache_way].can_write );
...

assign snoop_tag_hit       = |snoop_tag_way_oh;

Note that when a request arrives in stage 2 its way index in the data cache is not yet known (the hit/miss logic is computing it at the same time). The coherence cache is therefore looked up with the set of the request only, and forwards to stage 3 an array of coherence data for that set (one entry per way). Stage 3 knows which way to use, because by then the hit/miss logic has provided the way index.

The lookup has been split into two separate phases in order to reduce the latency of the entire process.
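
The two lookup phases can be modeled as two small functions. The Python sketch below is a behavioral illustration under assumed data structures (a list of sets, each a list of `(tag, can_read, can_write)` tuples); it does not reflect how the RTL stores tags.

```python
from typing import List, Optional, Tuple

# Each way of a set holds (tag, can_read, can_write).
Way = Tuple[int, bool, bool]

def first_phase(tag_cache: List[List[Way]], set_index: int) -> List[Way]:
    """Load/store unit: select all the ways of the requested set."""
    return tag_cache[set_index]

def second_phase(ways: List[Way], req_tag: int) -> Tuple[bool, Optional[int]]:
    """Cache controller stage 2: compare tags and check the privilege bits."""
    one_hot = [tag == req_tag and (can_read or can_write)
               for tag, can_read, can_write in ways]
    hit = any(one_hot)
    return hit, (one_hot.index(True) if hit else None)

tag_cache = [[(0x10, True, False), (0x22, False, False)],   # set 0
             [(0x10, False, False), (0x31, True, True)]]    # set 1
print(second_phase(first_phase(tag_cache, 1), 0x31))  # (True, 1)
print(second_phase(first_phase(tag_cache, 0), 0x22))  # (False, None): tag matches but block is invalid
```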

MSHR

The Miss Status Handling Register (MSHR) holds the data of cache lines whose coherence transactions are pending, that is, cache blocks in a non-stable state. Since only one request per thread can be issued, the MSHR has as many entries as there are hardware threads.

An MSHR entry comprises these fields:

Valid | Address | Thread ID | Wakeup Thread | State | Waiting For Eviction | Ack Count | Data

  • Valid: the entry holds valid data
  • Address: memory address of the entry
  • Thread ID: id of the requesting hardware thread
  • Wakeup Thread: wake up the thread when the transaction is over
  • State: current coherence state
  • Waiting For Eviction: asserted for replacement requests
  • Ack Count: number of acks still to be received
  • Data: data associated with the request

Note that entry's Data part is stored in a separate memory in order to ease the lookup process.

Implementation details

Because the MSHR has to serve lookups from stage 1 (see lookup signals) and entry updates from stage 3 (see update signals) at the same time, it has one read port and one write port.

Write port

A write policy defines the order between writes and reads. The policy is set through a parameter named WRITE_FIRST (the code compares it against the string "TRUE").
This module is instantiated with WRITE_FIRST set to false, which means that the MSHR serves read operations before write operations: a write issued by stage 3 becomes visible to lookups one clock cycle later, because a register delays the update. Here is the code for the write port:

// This logic is generated for each MSHR entry
generate
  genvar mshr_id;
    for ( mshr_id = 0; mshr_id < `MSHR_SIZE; mshr_id++ ) begin : mshr_entries
 
      ...
      // Write policy
      if (WRITE_FIRST == "TRUE")
         // If true writes are serviced immediately
         assign data_updated[mshr_id] = (enable && update_this_index) ? update_entry : data[mshr_id];
      else
         assign data_updated[mshr_id] = data[mshr_id];

      ...
      
      // Data entries (set of registers)
      always_ff @(posedge clk, posedge reset) begin
         if (reset)
            data[mshr_id] <= 0;
         else if (enable && update_this_index)
            data[mshr_id] <= update_entry;
      end

    end
endgenerate

Read port

The read port implements a simple hit/miss logic for requests coming from stage 1 (see lookup signals). The write policy determines which data this logic reads: if WRITE_FIRST is true, the lookup is made on the data just updated by the write logic; otherwise the lookup is made on the data as it was before the update, which is the case when reads have priority over writes (WRITE_FIRST is false). Here is the code for the lookup logic:

// This logic is generated for each MSHR entry
...
generate
  for ( i = 0; i < `MSHR_SIZE; i++ ) begin : lookup_logic

      // data_updated[] data are set according to write policy
      assign hit_map[i] = ( data_updated[i].address.index == index ) && data_updated[i].valid;

   end
 endgenerate
...

assign hit        = |hit_map;
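
The effect of the write policy on a lookup that happens in the same cycle as an update can be illustrated with a cycle-level Python model. This is a sketch, not the RTL: the `Mshr` class, its `cycle` method and the dictionary-based entries are assumptions, and only the `valid` and `index` fields used by the hit logic above are modeled.

```python
class Mshr:
    """Behavioral sketch: entries are dicts with 'valid' and 'index' keys."""

    def __init__(self, size: int, write_first: bool):
        self.write_first = write_first
        self.data = [{"valid": False, "index": 0} for _ in range(size)]

    def cycle(self, lookup_index, update=None):
        """Model one clock cycle; update is (entry_id, new_entry) or None.
        Returns the hit signal seen by stage 1 during this cycle."""
        visible = list(self.data)
        if update is not None and self.write_first:
            visible[update[0]] = update[1]      # bypass: the read sees the new value
        hit = any(e["valid"] and e["index"] == lookup_index for e in visible)
        if update is not None:
            self.data[update[0]] = update[1]    # registers latch at the clock edge
        return hit

new_entry = {"valid": True, "index": 5}
read_first = Mshr(size=4, write_first=False)
print(read_first.cycle(5, update=(0, new_entry)))   # False: the write is visible next cycle
print(read_first.cycle(5))                          # True
write_first = Mshr(size=4, write_first=True)
print(write_first.cycle(5, update=(0, new_entry)))  # True
```

The one-cycle delay of the read-first configuration is part of the reason why two requests on the same block cannot be issued back to back.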

Stage 3

Stage 3 is responsible for the actual execution of requests. Once a request has been processed, this stage issues signals to the units in the previous stages so that they update their data.
In particular, this stage drives the datapath to perform these functions:

  • block replacement evaluation;
  • MSHR update;
  • cache memory (both data and coherence info) update;
  • preparing outgoing coherence messages.

Current State Selector

Before a request is processed by the coherence protocol, the correct source of the cache block state has to be chosen. The state can be retrieved from:

  • the MSHR, when the block is in a non-stable state (a transaction is pending);
  • the coherence data cache, when the block is in a stable state and present in the L1 cache.

If the block is found in neither of them, it must be in state I, because it has never been read or modified.
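
The selection can be sketched as a simple priority choice. In the Python model below, the argument names and the precedence of the MSHR over the coherence cache are assumptions made for illustration; they follow from the rule that information about blocks in a non-stable state is kept in the MSHR.

```python
def select_current_state(mshr_hit, mshr_valid, mshr_state,
                         cache_hit, cache_state):
    """Pick the source of the current coherence state of a block."""
    if mshr_hit and mshr_valid:
        return mshr_state   # pending transaction: the non-stable state is in the MSHR
    if cache_hit:
        return cache_state  # stable state stored in the coherence cache
    return "I"              # block never read or modified

print(select_current_state(True, True, "SM_A", True, "S"))   # SM_A
print(select_current_state(False, False, None, True, "S"))   # S
print(select_current_state(False, False, None, False, None)) # I
```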

Protocol ROM

This module implements the coherence protocol as represented in the figure below. The protocol is implemented as a separate ROM to ease further optimizations or changes to the protocol. It takes as input the current state and the request type, and computes the next actions.

[Figure: MSI_CC]

The coherence protocol is MSI with some changes due to the directory's inclusivity. In particular, a new type of forwarded request, recall, has been added; it is sent by the directory controller when a block has to be evicted from the L2 cache. A recall is followed by a writeback response to the memory controller only when the block is in state M. Note that a writeback response is also sent to the directory controller, as a sort of acknowledgement.

Furthermore, another type of request, flush (not shown in the figure), has been added; it simply sends updated data to memory. It also generates a writeback response, but this response is directed only to the memory controller and does not change the coherence state of the block, whereas a writeback response to a recall invalidates the block. (TODO: explain why this flush message was added.)

The table above describes a baseline protocol that explains the main logic behind the Protocol ROM. Further optimizations, such as the uncoherent state, are omitted from it; they are described in detail in MSI Protocol.

Replacement Logic

A cache block replacement is required when a new block has to be stored in a full L1 cache set, so a block already in the L1 cache has to be chosen as the victim. The replaced block must be valid for a replacement to be issued; its validity is given by the privilege bits associated with it. These privilege bits (one set per way) come from Stage 2, which in turn receives them from the load/store unit. The index of the way to replace is provided by the LRU unit in Stage 2, which selects the least recently used way.

replaced_way_valid                     = cc2_request_snoop_privileges[cc2_request_lru_way_idx].can_read | cc2_request_snoop_privileges[cc2_request_lru_way_idx].can_write;

Afterwards, the address of the replaced way has to be computed. Its tag is provided by the tag cache of the load/store unit (through Stage 2), while the index is taken from the request (the tag cache doesn't contain index data). Coherence data are provided by the coherence data cache in Stage 2. The offset is not needed because the entire cache block is replaced, so it is set to all 0s.

replaced_way_address.tag               = cc2_request_snoop_tag[cc2_request_lru_way_idx];
replaced_way_address.index             = cc2_request_address.index;
replaced_way_address.offset            = {`DCACHE_OFFSET_LENGTH{1'b0}};

replaced_way_state                     = cc2_request_coherence_states[cc2_request_lru_way_idx];

Finally, a replacement request has to be issued if:

  • the protocol ROM requested a cache update;
  • the requested block is not present in the L1 cache (so the update request must be a block allocation);
  • the replaced block is valid.

do_replacement                        = pr_output.write_data_on_cache && !cc2_request_snoop_hit && replaced_way_valid;
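
Put together, the replacement logic can be summarized by the following Python sketch. It is a behavioral model with assumed names: `Address`, `replacement` and the per-way lists stand in for the RTL signals, and a single Boolean per way replaces the `can_read | can_write` privilege check.

```python
from typing import List, NamedTuple

class Address(NamedTuple):
    tag: int
    index: int
    offset: int

def replacement(write_data_on_cache: bool, snoop_hit: bool, lru_way: int,
                snoop_tags: List[int], snoop_valid: List[bool], req: Address):
    """Return (do_replacement, address of the victim block)."""
    # Tag from the tag cache, index from the request, offset cleared.
    victim = Address(tag=snoop_tags[lru_way], index=req.index, offset=0)
    do_replacement = (write_data_on_cache and not snoop_hit
                      and snoop_valid[lru_way])
    return do_replacement, victim

req = Address(tag=0x40, index=7, offset=12)
print(replacement(True, False, 1, [0x10, 0x2B], [True, True], req))
# (True, Address(tag=43, index=7, offset=0))
```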

MSHR Update Logic

The MSHR can be updated in three different ways:

  • entry allocation;
  • entry deallocation;
  • entry update.

The MSHR stores the data of cache lines whose coherence transactions are pending, that is, cache lines in a non-stable state. An entry is therefore allocated every time the state of a cache line moves to a non-stable state. Conversely, an entry is deallocated when the state of a cache line becomes stable. Finally, an entry is updated when something in the MSHR line has to change but the state of the cache line is still non-stable. Each condition is represented by a signal asserted by the protocol ROM.

cc3_update_mshr_en            = ( pr_output.allocate_mshr_entry || pr_output.update_mshr_entry || pr_output.deallocate_mshr_entry );

The MSHR update data depends on whether a replacement is required, that is, on whether do_replacement is asserted. When there is no replacement, an MSHR entry is allocated or updated according to the signals from the protocol ROM. When there is a replacement, the data from the Replacement Logic is used to pre-allocate an MSHR entry; without this pre-allocation, the data computed by the Replacement Logic about the replaced way would be lost. For Stage 1 to know that the entry is only a pre-allocation (the MSHR normally stores data of cache blocks whose coherence transactions are already pending, while in this case the transaction has not started yet), the waiting_for_eviction bit is asserted (see Request Issue Signals).

Note that if the update is an entry allocation, the index of an empty entry is provided directly by the MSHR (through Stage 2). At this point an empty MSHR entry is guaranteed to exist, because the MSHR has as many entries as hardware threads and each thread can have only one request in flight (see Request Issue Signals). If the operation is an update or a deallocation, the index is the one obtained in Stage 1 (and forwarded through Stage 2), where the MSHR is queried (see MSHR Signals) for the index of the entry associated with the current request.

cc3_update_mshr_index      = cc2_request_mshr_hit ? cc2_request_mshr_index : cc2_request_mshr_empty_index;

Cache Update Logic

Both the data cache and the coherence cache can be updated after a coherence transaction has been processed. The data cache update depends on whether a replacement occurs: in that case the CC_REPLACEMENT command is issued to the load/store unit, which ensures that the load/store unit prepares the block for eviction. Otherwise the cache block itself has to be updated: if the update concerns only the privileges, the CC_UPDATE_INFO command is issued; if both the new block and its privileges have to be written to the L1 cache, the CC_UPDATE_INFO_DATA command is issued.

// Data cache signals
assign cc3_update_ldst_command          = do_replacement ? CC_REPLACEMENT : ( pr_output.write_data_on_cache ? CC_UPDATE_INFO_DATA : CC_UPDATE_INFO );
assign cc3_update_ldst_way              = cc2_request_snoop_hit ? cc2_request_snoop_way_idx : cc2_request_lru_way_idx;
...
 
// Coherence cache signals
assign cc3_update_coherence_state_index = cc2_request_address.index;
assign cc3_update_coherence_state_way   = cc2_request_snoop_hit ? cc2_request_snoop_way_idx : cc2_request_lru_way_idx;
assign cc3_update_coherence_state_entry = pr_output.next_state;

Note the selection of the cache way to update: if the block is already present in the L1 cache, its way index is used; otherwise the least recently used way index is used, which is the case of a replacement.

The data cache is updated every time the privileges of a block already present in the L1 cache have to change, or when a new block is received and has to be written along with its privileges.

...
cc3_update_ldst_valid         = ( pr_output.update_privileges && cc2_request_snoop_hit ) || pr_output.write_data_on_cache;
...

The coherence cache is updated when the block state becomes stable (for pending requests, the MSHR stores this data) and there is a cache hit. If there is no cache hit, the coherence cache is updated only when the protocol ROM also requests that a new data block be written to the L1 cache.

...
cc3_update_coherence_state_en = ( pr_output.next_state_is_stable && cc2_request_snoop_hit ) || ( pr_output.next_state_is_stable && !cc2_request_snoop_hit && pr_output.write_data_on_cache );
...
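
The cache update decisions described in this section can be collected in one function. The Python sketch below is a behavioral model, not the RTL: the function name and the dictionary keys are assumptions, while the command names and the Boolean conditions mirror the excerpts above.

```python
def cache_update(do_replacement, write_data_on_cache, update_privileges,
                 next_state_is_stable, snoop_hit, snoop_way, lru_way):
    """Return the update signals that stage 3 drives toward the caches."""
    if do_replacement:
        command = "CC_REPLACEMENT"
    elif write_data_on_cache:
        command = "CC_UPDATE_INFO_DATA"
    else:
        command = "CC_UPDATE_INFO"
    return {
        "command": command,
        # Hit: update the block in place; miss: use the LRU way.
        "way": snoop_way if snoop_hit else lru_way,
        "ldst_valid": (update_privileges and snoop_hit) or write_data_on_cache,
        # Same as (stable && hit) || (stable && !hit && write_data_on_cache).
        "coherence_en": next_state_is_stable and (snoop_hit or write_data_on_cache),
    }

# A miss whose data arrives and fills an invalid way (no replacement needed).
print(cache_update(False, True, False, True, False, 0, 2))
# {'command': 'CC_UPDATE_INFO_DATA', 'way': 2, 'ldst_valid': True, 'coherence_en': True}
```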

Furthermore, if the request is a forwarded coherence request, the L1 cache data is forwarded to the message generator in stage 4 so that it can be sent to the requestor.

assign cc3_snoop_data_valid             = cc2_request_valid && pr_output.send_data_from_cache;
assign cc3_snoop_data_set               = cc2_request_address.index;
assign cc3_snoop_data_way               = cc2_request_snoop_way_idx;

Stage 4

This stage generates a correct request/response message for the network interface whenever a message is issued from the third stage.

See Also

Coherence