Load/Store unit

[Image: Load-Store_unit.jpg]

The Load/Store unit executes the load and store operations of the core. It provides access to the L1 N-way set-associative data cache, which reduces memory access latency, and is structured as a pipeline divided into three stages (explained in detail below).

On the core side, it interfaces with the Operand Fetch and Writeback stages; on the bus side, it communicates with the cache controller, which updates information (namely tags and privileges) and data. The module also sends a signal to the instruction buffer unit to stop a thread when a miss occurs.

The Load/Store unit does not store any information about the coherence protocol in use (such as stable states), but it does keep track of privileges for all cached addresses. Each cache line has two privileges, can read and can write, which determine cache hits and misses and are updated only by the Cache Controller.

Finally, note that this unit does not manage addresses that refer to the I/O address space: whenever an issued request belongs to the non-coherent memory space, it is forwarded directly to the Cache Controller, which bypasses the coherence protocol, sends the request to the memory controller, and reports the data back to the third stage.
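As a minimal sketch of the non-coherent space check described above (request_address and the IO_MAP_BASE_ADDR/IO_MAP_END_ADDR macros are hypothetical names, not taken from the actual RTL), the forwarding decision might look like:

// Hypothetical sketch: requests falling in the I/O (non-coherent) address
// range skip the coherence lookup and go straight to the Cache Controller.
assign is_io_mapped = ( request_address >= `IO_MAP_BASE_ADDR ) &&
                      ( request_address <  `IO_MAP_END_ADDR );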

Stage 1

This stage contains a queue for each thread; load/store instructions coming from the Operand Fetch are enqueued there while waiting to be scheduled. Before a request is queued, it is decoded to determine the type of operation (word, half-word, or byte); afterwards, the control logic checks the alignment of the incoming request. Alignment is performed for byte, half-word, and word operations, both scalar and vector. In the vector case, the data is also compressed so that it can be written consecutively: for example, a vec16i8 - which has one significant byte every 4 bytes - is compressed into 16 consecutive bytes. The instruction is then enqueued into the queue of its thread, but only if it is valid or a flush request, and only if no rollback is occurring:

// Enqueue if the instruction is valid or a flush, it targets this thread, and no rollback is in flight
assign fifo_enqueue_en  = ( instruction_valid | is_flush ) &&  opf_inst_scheduled.thread_id == thread_id_t'( thread_idx ) && !ldst1_rollback_en;

The instruction, scalar or vector, is then loaded into the FIFO appropriately aligned:

if (is_halfword_op ) begin
  fifo_input.store_value = halfword_aligned_scalar_data; 
  fifo_input.store_mask  = halfword_aligned_scalar_mask;
  ...
end
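
For the vector case, the vec16i8 compaction mentioned above could be sketched as follows; this is only an illustration of the idea, and the signal names (vector_data_in, compressed_store_value) are hypothetical:

// Each 32-bit lane of the incoming vector carries one significant byte;
// the 16 bytes are packed so they can be written consecutively.
genvar lane;
generate
  for ( lane = 0; lane < 16; lane++ ) begin : VEC16I8_COMPRESS
    assign compressed_store_value[lane*8 +: 8] = vector_data_in[lane*32 +: 8];
  end
endgenerate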


Finally, one pending instruction per thread is forwarded in parallel to the next stage. Whenever the next stage can execute the instruction provided by the i-th thread, it asserts the i-th bit of the ldst2_dequeue_instruction mask to notify stage 1 that the instruction has been consumed; in this way, the instruction at the head of each FIFO stalls whenever the next stage is busy. Moreover, stage 1 is equipped with a recycle buffer: if a cache miss occurs in Stage 3, the instruction and its operands are saved in this buffer. The output of this buffer competes with the newly issued load/store instructions, and recycled instructions have higher priority than the other operations.


Stage 2

This stage manages information (tags and privileges), while data is handled by the next stage. This information is updated by the cache controller according to the implemented coherence protocol (MSI by default).

This stage receives as many parallel requests from the core as there are threads, and selects one pending request from the set coming from the previous stage. The thread is chosen by a round-robin arbiter: once the winner thread index is obtained, if that thread has both a valid recycled request and a valid FIFO request, the former is served first because it has higher priority. This choice is mandatory to respect the scheduling order.

always_comb begin
    if ( ldst1_recycle_valid[ldst1_fifo_winner_id] ) begin
        ... // serve the recycled request first
    end else begin
        ... // serve the normal FIFO request
    end
end

Note that if the cache controller sends a request (asserting the cc_update_ldst_valid signal), the arbiter gives it maximum priority and takes no instruction from the FIFOs. Priority is managed by these signals:

ldst1_fifo_requestor      = ( ldst1_valid | ldst1_recycle_valid ) & {`THREAD_NUMB{~cc_update_ldst_valid}} & ~sleeping_thread_mask_next;
ldst1_request_valid       = |ldst1_fifo_requestor;  // asserted if there is a valid request
tag_sram_read1_address    = ( cc_update_ldst_valid ) ? cc_update_ldst_address.index : ldst1_fifo_request.address.index;
next_request              = ( cc_update_ldst_valid ) ? cc_update_request            : ldst1_fifo_request;
ldst2_valid               = ldst1_request_valid;

As stated above, this stage manages tags and privileges. The tag caches are accessed in parallel; for this reason, they are equipped with two read ports and one write port. The second read port is used exclusively by the Cache Controller to manage the coherence protocol (through a dedicated bus called cc_snoop_request). Writes are performed by the core or by the Cache Controller. Requests from the core (load or store) always read tags and privileges at this stage; in case of a store hit, the write requested by the core is finalized in the next stage.
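
A minimal sketch of how the two tag read ports might be driven; the first port's logic is taken from the arbitration code above, while the second port's names (tag_sram_read2_address, cc_snoop_request_address) are hypothetical:

// Port 1: core requests and cache controller updates (see arbitration above).
assign tag_sram_read1_address = ( cc_update_ldst_valid ) ? cc_update_ldst_address.index : ldst1_fifo_request.address.index;
// Port 2 (hypothetical names): reserved for coherence snoops from the Cache Controller.
assign tag_sram_read2_address = cc_snoop_request_address.index;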

The cache controller can send different commands to the Load/Store unit. In the case of CC_INSTRUCTION, stage 2 receives an instruction that could not be served previously. In this case the cache controller provides the new privileges, tags, and data, so this stage bypasses them to the next one in order to complete the operation (whether a load or a store). In detail, the ldst2_valid signal is asserted and ldst2_instruction is propagated; the new address is passed on ldst2_address and the data on ldst2_store_value, together with the related masks, tags, and privileges.

In the case of CC_UPDATE_INFO, stage 2 receives a command to update the information (tags and privileges) of a given set. Nothing is propagated and no valid signal is asserted towards the next stage, since the data cache does not need to be updated. The privileges and the cache tag are updated with the values provided by the cache controller, which also indicates which way must be updated, as it is responsible for the pseudo-LRU replacement.

In the case of CC_UPDATE_INFO_DATA, stage 2 receives an update command, but this time the data cache must be updated as well. Stage 2 propagates the address information (used to identify the index of the set), the way, and the value provided by the cache controller. As in the previous case, the privileges and the cache tag are updated with the values provided by the cache controller. The notification is made by asserting ldst2_update_valid, and stage 3 does not propagate anything to the Writeback stage. In detail, only ldst2_update_valid is asserted; the way is passed on ldst2_update_way, the new address on ldst2_address, and the data on ldst2_store_value.

In the case of CC_EVICT, stage 2 must notify the next stage that a line has to be replaced. The notification is made by asserting ldst2_evict_valid and providing the complete address of the line to be replaced (which has the same index as the new one), the way to use, and the data. In detail, both ldst2_evict_valid and ldst2_update_valid are asserted, the data is passed on ldst2_store_value and the way on ldst2_update_way, while the old address is composed by stage 3 using the old tag and the index of the new address.
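
One way the unit can decode these commands into the update and evict notifications, sketched with the command names used elsewhere in the design (here eviction is encoded as CC_REPLACEMENT; take the exact enum values as indicative):

// An eviction implies a data update as well, so CC_REPLACEMENT sets both flags.
assign cc_command_is_update_data = cc_update_ldst_valid & ( cc_update_ldst_command == CC_UPDATE_INFO_DATA | cc_update_ldst_command == CC_REPLACEMENT );
assign cc_command_is_evict       = cc_update_ldst_valid & ( cc_update_ldst_command == CC_REPLACEMENT );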

The last responsibility of this stage is thread wake-up. A thread that caused a cache miss is put to sleep (via the ldst3_thread_sleep signal); when the corresponding cache transaction completes, the cache controller asserts cc_wakeup to restart it:

assign sleeping_thread_mask_next = ( sleeping_thread_mask | ldst3_thread_sleep ) & ( ~thread_wakeup_mask );  // put missing threads to sleep, clear woken ones
assign thread_wakeup_mask        = thread_wakeup_oh & {`THREAD_NUMB{cc_wakeup}};                             // one-hot wakeup, gated by cc_wakeup

Stage 3

This stage receives instructions, data update requests, and replacement requests as input. Before generating the outputs, the control logic has to determine the type of request received (instruction, data update, or replacement). Moreover, in the case of an instruction, it has to check whether there is a hit or a miss, and whether there is a load_miss or a store_miss (which occur when the necessary privileges are missing):

assign is_instruction = ldst2_valid & ~ldst2_is_flush;
assign is_replacement = ldst2_update_valid && ldst2_evict_valid;
assign is_update = ldst2_update_valid && !ldst2_evict_valid;
assign is_flush = ldst2_valid & ldst2_is_flush & is_hit & ~cr_ctrl_cache_wt;
assign is_store = is_instruction && !ldst2_instruction.is_load; 
assign is_load = is_instruction && ldst2_instruction.is_load;

A cache hit is asserted if we have read or write privileges on the address, and if the tag of the requested address matches one of the entries in the tag array (one per dcache_way) read from the tag cache using the set index passed from Stage 2:

for ( dcache_way = 0; dcache_way < `DCACHE_WAY; dcache_way++ ) begin : HIT_MISS_CHECK
    assign way_matched_oh[dcache_way] = ( ( ldst2_tag_read[dcache_way] == ldst2_address.tag &&
        ( ldst2_privileges_read[dcache_way].can_write || ldst2_privileges_read[dcache_way].can_read ) ) && ldst2_valid );
end

assign is_hit = |way_matched_oh;

Then, it must be verified whether we have the permissions needed to operate on the block for which there was a hit. For example, if there is a hit for an address and the request is a store, we must have write permission on that block, otherwise a store miss is raised:

assign is_store_hit = is_store && ldst2_privileges_read[way_matched_idx].can_write && is_hit;
assign is_store_miss = is_store && ~is_store_hit;
assign is_load_hit  = is_load && ldst2_privileges_read[way_matched_idx].can_read && is_hit;
assign is_load_miss = is_load && ~is_load_hit;

Therefore, in case of a store hit, the data is saved in the data cache (and nothing is sent outwards); in case of a load hit, the ldst3_valid signal is asserted and the data is sent to the Writeback module. In case of a store or load miss, ldst3_miss is asserted towards the cache controller. Furthermore, in case of a miss, the thread must be put to sleep:

ldst3_thread_sleep[thread_idx] = ( is_load_miss || is_store_miss ) && ldst2_instruction.thread_id == thread_id_t'( thread_idx );
ldst3_miss                     = is_load_miss || is_store_miss;

In case of an update operation, the data cache line identified by the pair {ldst2_address.index, ldst2_update_way} is updated with the new data ldst2_store_value coming from the previous stage.

Finally, for replacement operations, similarly to the update case, the cache line addressed by {ldst2_address.index, ldst2_update_way} is updated with the new value ldst2_store_value, and its old content is sent to the cache controller through the ldst3_cache_line output while the ldst3_evict signal is asserted. In detail, the old address is sent on ldst3_address, the old cache line content on ldst3_cache_line, and ldst3_evict is asserted.
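
The enabling of the data cache ports in the update and replacement cases can be sketched as follows (elided branches omitted; an update only writes the line, while a replacement also reads the old content so it can be evicted):

always_comb begin
    ...
    if ( is_update ) begin
        data_sram_read_enable  = 1'b0;
        data_sram_write_enable = {( `DCACHE_WIDTH/8 ){1'b1}};
        ...
    end else if ( is_replacement ) begin
        data_sram_read_enable  = 1'b1;   // read the old line for eviction
        data_sram_write_enable = {( `DCACHE_WIDTH/8 ){1'b1}};
        ...
    end
end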

Note that there is a second read port on the data cache, used by the cache controller for snoop requests (the reads the cache controller issues to fetch data, a sort of fast lane). This port is WRITE FIRST, so the cache controller receives the latest version of the data even when a store operation is being performed on that cache line in the same cycle.
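
As a minimal illustration of the WRITE FIRST behaviour (all names here are hypothetical, not the unit's actual signals): when a write and a read target the same address in the same cycle, the read returns the freshly written value.

// Hypothetical write-first read port: a simultaneous write to the same
// address is forwarded to the reader instead of the stale memory content.
always_ff @( posedge clk ) begin
    if ( write_enable && ( write_address == read_address ) )
        read_data <= write_data;            // forward the new value
    else
        read_data <= memory[read_address];  // normal read
end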