Core

The core is based on a RISC in-order pipeline, and its control unit is intentionally kept lightweight. The architecture masks memory and operation latencies by relying heavily on hardware multithreading. By keeping the control logic light, the core can devote most of its resources to accelerating computation in highly data-parallel kernels. In the nu+ hardware multithreaded architecture, each hardware thread has its own PC, register file, and control registers. The number of threads is user-configurable. A nu+ hardware thread is equivalent to a wavefront in AMD terminology and a CUDA warp in NVIDIA terminology. The processor uses a deep pipeline to improve clock speed.

nu+ microarchitecture

All threads share the same compute units. Execution pipelines are organized in hardware vector lanes (as in vector processors, each operator is replicated N times). Each thread can perform a SIMD operation on independent data, with data organized in a vector register file. The core supports a high-throughput non-coherent scratchpad memory, or SPM (corresponding to shared memory in NVIDIA terminology). The SPM is divided into a parameterized number of banks based on a user-configurable mapping function. The memory controller resolves bank collisions at run time, ensuring correct execution of SPM accesses from concurrent threads. The SPM is kept non-coherent because coherence mechanisms incur high latency and are not strictly necessary for many applications.
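
As an illustration of such a mapping function, the sketch below uses a simple word-interleaved scheme in which the low-order word-address bits select the bank. The module, its parameter names, and the mapping itself are hypothetical; the actual nu+ function is user-configurable.

// Hypothetical word-interleaved SPM bank mapping (not the actual nu+
// configurable function): the low-order word-address bits pick the bank,
// the remaining bits index the row inside that bank.
module spm_bank_mapping #(
    parameter ADDR_WIDTH = 32,
    parameter BANK_NUMB  = 16,   // assumed power of two
    parameter WORD_BYTES = 4
) (
    input  logic [ADDR_WIDTH - 1 : 0]          address,
    output logic [$clog2( BANK_NUMB ) - 1 : 0] bank_id,
    output logic [ADDR_WIDTH - 1 : 0]          bank_row
);

    localparam OFFSET_BITS = $clog2( WORD_BYTES );
    localparam BANK_BITS   = $clog2( BANK_NUMB );

    // Low-order word-address bits select the bank...
    assign bank_id  = address[OFFSET_BITS +: BANK_BITS];
    // ...the remaining bits select the row inside the bank.
    assign bank_row = address >> ( OFFSET_BITS + BANK_BITS );

endmodule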

Instruction fetch stage

The Instruction Fetch stage schedules the next thread PC from the eligible thread pool, which is handled by the Thread Controller. Available threads are scheduled in a round-robin fashion. Furthermore, during the boot phase, the Thread Controller can initialize each thread PC through a dedicated interface.

Instruction Fetch Stage

The instruction cache is set-associative and is organized in two stages. Once an eligible thread is selected, Instruction Fetch reads its PC and determines whether the requested instruction cache line is already in the instruction cache memory. In the first stage, each way has a memory bank containing tag values and valid bits for the cache sets. This stage reads the way memories in parallel and passes their outputs to the second stage, since the tag memory has one cycle of latency. The second stage compares the way tags read in the previous stage: if one matches, it is a cache hit, and this stage issues the address to the instruction cache data memory. If a miss occurs, an instruction memory transaction is issued to the Network Interface and the thread is blocked until the instruction line is retrieved from main memory.

Finally, this module handles PC restoration in case of rollback. When a rollback occurs and the rollback signals are set by the Rollback Handler stage, the Instruction Fetch module overwrites the PC of the thread that issued the rollback.

Thread & PC selection

A thread is selected from the pool of eligible ones, indicated by an external signal coming from the Thread Controller unit. An internal round-robin arbiter then picks among the eligible threads in a fair manner. A different thread can be elected at each clock cycle, so nu+ can be classified as a fine-grained multithreaded architecture.
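
A rotating-priority arbiter of this kind can be sketched as follows; the module and its port names are illustrative, not the actual nu+ implementation.

// Illustrative round-robin arbiter: one-hot grant among the eligible
// threads, with rotating priority so the last granted thread gets the
// lowest priority on the next cycle. Not the actual nu+ module.
module round_robin_arbiter #(
    parameter THREAD_NUMB = 8
) (
    input  logic                       clk,
    input  logic                       reset,
    input  logic [THREAD_NUMB - 1 : 0] request,   // eligible threads
    output logic [THREAD_NUMB - 1 : 0] grant      // one-hot elected thread
);

    logic [THREAD_NUMB - 1 : 0] pointer;          // one-hot highest-priority position

    // Double the request vector so the search can wrap past the pointer,
    // then isolate the first request at or above the pointer position.
    logic [2 * THREAD_NUMB - 1 : 0] double_req, double_grant;
    assign double_req   = {request, request};
    assign double_grant = double_req & ~( double_req - {{THREAD_NUMB{1'b0}}, pointer} );
    assign grant        = double_grant[THREAD_NUMB - 1 : 0] | double_grant[2 * THREAD_NUMB - 1 : THREAD_NUMB];

    always_ff @( posedge clk ) begin
        if ( reset )
            pointer <= {{( THREAD_NUMB - 1 ){1'b0}}, 1'b1};
        else if ( |grant )   // rotate: next cycle starts just after the winner
            pointer <= {grant[THREAD_NUMB - 2 : 0], grant[THREAD_NUMB - 1]};
    end

endmodule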

The elected thread ID selects the corresponding PC, which is updated on the basis of thread-related events, in priority order: a new job, a rollback, an instruction cache miss, or normal execution. If none of these events applies, the valid signal is deasserted.

if ( tc_job_valid && thread_id == tc_job_thread_id ) // new job
   next_pc[thread_id] <= tc_job_pc;
else if ( rollback_valid[thread_id] ) // rollback
   next_pc[thread_id] <= rollback_pc_value[thread_id];
else if ( stage1_miss[thread_id] && stage1_thread_scheduled_id == thread_id ) // Inst miss
   next_pc[thread_id] <= next_pc[thread_id] - address_t'( 3'd4 );
else if ( thread_scheduled_bitmap[thread_id] )  // Normal execution
   next_pc[thread_id] <= next_pc[thread_id] + address_t'( 3'd4 );

Cache LRU

The pseudo-LRU replacement policy works in the way described at this link, page 13.

The hit interface is enabled when a way has to be moved to the MRU position, i.e. when a hit occurs.

The update interface is enabled to request the LRU way in order to fill it. This happens when a new instruction cache line arrives from memory. No replacement is needed because the instruction memory area is non-coherent.
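
For reference, the sketch below shows a generic tree-based pseudo-LRU for a 4-way set with the two interfaces just described; it is an illustration of the policy, not the nu+ source.

// Generic tree pseudo-LRU for a 4-way set (illustrative, not the nu+
// source). Three bits form a binary tree: tree[0] points to the LRU
// half, tree[1]/tree[2] to the LRU way inside the lower/upper half.
module tree_plru_4way (
    input  logic         clk,
    input  logic         reset,
    input  logic         hit_en,    // hit interface: promote a way to MRU
    input  logic [1 : 0] hit_way,
    output logic [1 : 0] lru_way    // update interface: way to fill next
);

    logic [2 : 0] tree;

    always_ff @( posedge clk ) begin
        if ( reset )
            tree <= '0;
        else if ( hit_en ) begin
            tree[0] <= ~hit_way[1];       // point away from the hit half
            if ( hit_way[1] )
                tree[2] <= ~hit_way[0];   // point away inside the upper half
            else
                tree[1] <= ~hit_way[0];   // point away inside the lower half
        end
    end

    // Follow the tree bits to reach the least recently used way.
    assign lru_way[1] = tree[0];
    assign lru_way[0] = tree[0] ? tree[2] : tree[1];

endmodule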

Tag and Data instruction cache

The tag and data caches are accessed in the same way, using the same input. The caches are read-enabled only if the current instruction is valid. The result is validated only if there is a hit.

A line is registered as valid when an instruction line arrives from memory for that way; line_valid is registered per way as well. Note the cut-through in the validity check: if the incoming instruction line refers to the same address as the currently selected one, the valid output signal is asserted instantly.

if ( tc_valid_update_cache && way_lru == way )
   line_valid[tc_addr_update.index] <= 1;
if ( tc_valid_update_cache && way_lru == way && tc_addr_update.index == icache_address_selected.index)
   line_valid_selected[way]         <= 1; //cut-through
else
   line_valid_selected[way]         <= line_valid[icache_address_selected.index] & instruction_valid;

In order to keep the signals of the currently scheduled thread aligned with the data available at the cache outputs, the following signals must be registered: PC, instruction_valid, line_valid, address, and thread_id.

Hit/miss detection

There is a hit only if the line is valid and the tags match. The hit signal is per way (not per thread).

for ( way = 0; way < `ICACHE_WAY; way++ ) 
    assign hit_miss[way] = tag_read_data[way] == stage1_icache_address.tag && line_valid_selected[way];

The miss is registered for the currently scheduled thread:

for ( thread_id = 0; thread_id < `THREAD_NUMB; thread_id++ )
    assign stage1_miss[thread_id] = ( stage1_thread_scheduled_id == thread_id ) ? ~|hit_miss & stage1_instruction_valid : 1'b0;

Output logic

The cache miss signal is asserted only when the current instruction is valid, there is no rollback, and there is a miss for the currently scheduled thread.

if_cache_miss = stage1_instruction_valid & ~rollback_valid[stage1_thread_scheduled_id] & stage1_miss[stage1_thread_scheduled_id];

The instruction valid signal is asserted only when the current instruction is valid, there is no rollback, and there is no miss for the currently scheduled thread.

if_valid = stage1_instruction_valid & ~rollback_valid[stage1_thread_scheduled_id] & ~stage1_miss[stage1_thread_scheduled_id];

Decode stage

The Decode stage decodes the fetched instruction coming from Instruction Fetch and produces the control signals for the datapath directly from the instruction bits. The dec_instr output helps the execution and control modules manage the issued instruction and is propagated through each pipeline stage. Instruction types are presented in the ISA section.

The goal of this stage is to fill all the fields of the dec_instr signal using just the fetched instruction if_inst_scheduled. The if_inst_scheduled signal is composed of an opcode and a body. The instruction body can be one of the following mutually exclusive types:

typedef union packed {
   RR_instruction_body_t RR_body;
   RI_instruction_body_t RI_body;
   MVI_instruction_body_t MVI_body;
   MEM_instruction_body_t MEM_body;
   MPOLI_instruction_body_t MPOLI_body;
   JBA_instruction_body_t JBA_body;
   JRA_instruction_body_t JRA_body;
   CTR_instruction_body_t CTR_body;
} instruction_body_t;

A large combinational case construct differentiates the assignment of the dec_instr fields based on the opcode field. This can be seen, and implemented, as a large ROM.
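
A minimal sketch of such a decoder follows; the opcode values and the dec_instr fields shown are hypothetical stand-ins for the actual nu+ encodings.

// Illustrative decoder skeleton (opcodes and fields are hypothetical,
// not the actual nu+ encodings): one combinational case statement fills
// the decoded control signals, acting like a ROM indexed by the opcode.
typedef struct packed {
    logic is_int;
    logic is_memory_access;
    logic is_branch;
    logic is_source1_immediate;
} dec_instr_t;

module decode_sketch (
    input  logic [7 : 0] opcode,
    output dec_instr_t   dec_instr
);

    always_comb begin
        dec_instr = '0;                       // safe defaults
        case ( opcode )
            8'h10 : begin                     // hypothetical RR arithmetic
                dec_instr.is_int = 1'b1;
            end
            8'h11 : begin                     // hypothetical RI arithmetic
                dec_instr.is_int               = 1'b1;
                dec_instr.is_source1_immediate = 1'b1;
            end
            8'h20 : begin                     // hypothetical load
                dec_instr.is_memory_access = 1'b1;
            end
            8'h30 : begin                     // hypothetical branch
                dec_instr.is_branch = 1'b1;
            end
            default : ;                       // unknown: keep defaults
        endcase
    end

endmodule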

Instruction scheduler stage

In this stage, fetched instructions are stored in per-thread FIFOs. The Dynamic Scheduler checks for data hazards and determines which thread can be issued to Operand Fetch; this is done through a light scoreboarding system, where each thread has its own scoreboard. There is no structural hazard check here; that is done in the Writeback stage.
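
The sketch below shows what such a per-thread hazard check might look like; the bit-per-register scoreboard encoding is an assumption made for illustration.

// Illustrative per-thread scoreboard check (the bit-per-register
// encoding is an assumption, not the actual nu+ structure): a thread
// can issue only if none of the registers its next instruction reads
// or writes is still pending in its scoreboard.
module scoreboard_check #(
    parameter THREAD_NUMB     = 8,
    parameter REGISTER_NUMBER = 64
) (
    input  logic [REGISTER_NUMBER - 1 : 0] scoreboard [THREAD_NUMB], // pending registers
    input  logic [REGISTER_NUMBER - 1 : 0] inst_regs  [THREAD_NUMB], // src/dest mask of next instruction
    output logic [THREAD_NUMB - 1 : 0]     can_issue
);

    genvar thread_id;
    generate
        for ( thread_id = 0; thread_id < THREAD_NUMB; thread_id++ ) begin : HAZARD_CHECK
            // Data hazard if any needed register is still pending.
            assign can_issue[thread_id] = ~|( scoreboard[thread_id] & inst_regs[thread_id] );
        end
    endgenerate

endmodule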

Operand fetch stage

Operand Fetch prepares the operands for the Execution pipeline (the opf_ output signals). As mentioned before, the nu+ core supports SIMD operations, and for this purpose it has two register files: a scalar register file (SRF) and a vector register file (VRF). An SRF register is `REGISTER_SIZE bits wide (default 32); a VRF register holds one scalar register per hardware lane (`REGISTER_SIZE x `HW_LANE, default 32 bits x 16 hardware lanes). SRF and VRF have the same number of registers (`REGISTER_NUMBER, defined in nuplus_define.sv).
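
Under those definitions, the two register files can be pictured as in the following simplified sketch (not the actual nu+ RTL).

// Simplified picture of the two register files (illustrative, not the
// actual nu+ RTL): an SRF entry is one scalar word, a VRF entry is one
// scalar word per hardware lane.
`define REGISTER_SIZE   32
`define HW_LANE         16
`define REGISTER_NUMBER 64

typedef logic [`REGISTER_SIZE - 1 : 0]     scal_reg_size_t;  // one SRF entry
typedef scal_reg_size_t [`HW_LANE - 1 : 0] hw_lane_t;        // one VRF entry: a scalar per lane

module register_files_sketch (
    input  logic                                      clk,
    input  logic [$clog2( `REGISTER_NUMBER ) - 1 : 0] read_addr,
    output scal_reg_size_t                            srf_out,
    output hw_lane_t                                  vrf_out
);

    scal_reg_size_t srf [`REGISTER_NUMBER];  // scalar register file
    hw_lane_t       vrf [`REGISTER_NUMBER];  // vector register file

    always_ff @( posedge clk ) begin
        srf_out <= srf[read_addr];           // registered read port
        vrf_out <= vrf[read_addr];
    end

endmodule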

The register read interface receives the scheduled instruction as input, and the result is passed to the second stage, which properly computes operands 0 and 1. Regarding operand 0: in case of a memory access, it holds the effective memory address, obtained by adding the immediate offset to the base address; otherwise, it contains the read value.

if ( next_issue_inst_scheduled.is_memory_access ) 
   opf_fecthed_op0_buff <= {`HW_LANE{rd_out0_scalar + scal_reg_size_t'( next_issue_inst_scheduled.immediate )}};
else
   opf_fecthed_op0_buff <= {`HW_LANE{rd_out0_scalar}};

Regarding operand 1: if the current instruction has an immediate, it is replicated on each vector element of operand 1. Otherwise, operand 1 holds the value read from the required register file.

if ( next_issue_inst_scheduled.is_source1_immediate )
   opf_fecthed_op1_buff <= {`HW_LANE{next_issue_inst_scheduled.immediate}};
else
   opf_fecthed_op1_buff <= {`HW_LANE{rd_out1_scalar}};

Note that the PC is not stored in a real register, but is moved stage by stage inside the fetched operation. This is why it cannot be read directly from the register files.

The register write interface writes the data specified in the signals coming from the Writeback stage. In case of a vector operation, wb_result_hw_lane_mask determines which lanes are affected by the current operation.

for ( lane_id = 0; lane_id < `HW_LANE; lane_id ++ ) begin : LANE_WRITE_EN
   assign write_en_byte[lane_id] = wb_result.wb_result_write_byte_enable & {( `BYTE_PER_REGISTER ){wb_result.wb_result_hw_lane_mask[lane_id] & wr_en_vector}};
end

Each thread has its own register file; this is achieved by allocating a larger SRAM (`REGISTER_NUMBER x `THREAD_NUMB entries).
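
In other words, the thread ID can be concatenated with the register index to address the shared SRAM, as in this hypothetical sketch.

// Hypothetical addressing of the per-thread register file: one SRAM of
// `REGISTER_NUMBER x `THREAD_NUMB entries indexed by {thread_id,
// register_index}, so every thread sees a private register file.
`define REGISTER_NUMBER 64
`define THREAD_NUMB     8

module per_thread_rf_address (
    input  logic [$clog2( `THREAD_NUMB ) - 1 : 0]                    thread_id,
    input  logic [$clog2( `REGISTER_NUMBER ) - 1 : 0]                register_index,
    output logic [$clog2( `REGISTER_NUMBER * `THREAD_NUMB ) - 1 : 0] sram_address
);

    // The thread ID forms the upper address bits, giving each thread a
    // private window of `REGISTER_NUMBER entries in the shared SRAM.
    assign sram_address = {thread_id, register_index};

endmodule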

When a masked instruction is issued, register `MASK_REG (default: scalar register $60) is stored in opf_fecthed_mask. When source 1 is an immediate, its value is replicated on each vector element. Memory access and branch operations require a base address; in both cases the Decode module maps the base address onto source 0.

Integer Arithmetic & Logic unit

This is the main execution pipe. It executes jumps, arithmetic and logic operations, control register accesses, and moves. For further details about the admitted operations, see the ISA page.

To determine the correct operation, the operation kind and the pipe selection fields of the decoded instruction are used. For example:

assign is_jmpsr = opf_inst_scheduled.pipe_sel == PIPE_BRANCH & opf_inst_scheduled.is_int & opf_inst_scheduled.is_branch;

Vector operations are executed in parallel using the concept of lanes: each lane performs a scalar operation, and the final vector result vec_result is composed by chaining all the scalar intermediate results from the lanes.

If an operation is scalar, only the first lane of vec_result is meaningful. If the lane results have to be further modified, a specific signal is redirected to the output. For example:

if( is_compare )
   int_result[0] <= cmp_result;
else if (is_shuffle)
   int_result    <= shuffle_result;
else
   int_result    <= vec_result;

Control register

This module returns four parameters: TILE_ID, CORE_ID, THREAD_ID, and GLOBAL_ID. The CORE_ID and GLOBAL_ID parameters are significant only if further nu+ cores are supported in the same tile. For now, the control register simply returns the information carried in the scheduled instruction (in other words, the IDs do not come from an internal register).
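
A minimal sketch of such a read path follows; the selector encoding, the GLOBAL_ID packing, and all names are assumptions made for illustration.

// Illustrative control-register read path (field and parameter names
// are assumptions): the IDs come from static parameters and the
// scheduled instruction, not from an internal register.
module control_register_sketch #(
    parameter TILE_ID     = 0,
    parameter CORE_ID     = 0,
    parameter THREAD_NUMB = 8
) (
    input  logic [1 : 0]                         cr_sel,    // which ID to read
    input  logic [$clog2( THREAD_NUMB ) - 1 : 0] thread_id, // from the scheduled instruction
    output logic [31 : 0]                        cr_result
);

    always_comb
        case ( cr_sel )
            2'd0 : cr_result = TILE_ID;
            2'd1 : cr_result = CORE_ID;
            2'd2 : cr_result = 32'( thread_id );
            2'd3 : cr_result = ( TILE_ID << 16 ) | ( CORE_ID << 8 ) | 32'( thread_id ); // hypothetical GLOBAL_ID packing
        endcase

endmodule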

Scratchpad unit

This unit is described in the dedicated scratchpad page.

Load/Store unit

This unit is described in the dedicated load/store subsection inside the coherence section.

Floating point unit

A multistage floating point unit; it supports all basic FP operations according to the IEEE 754-2008 standard.

Barrier unit

This unit is described in the dedicated synchronization section.

Branch unit

This module handles conditional and unconditional jumps and restores scoreboards when a jump is taken. It signals to the Rollback Handler whether a jump must be taken or not. The base address or condition is stored in opf_fecthed_op0[0], while the immediate is stored in opf_fecthed_op1[0].

Nu+ supports two jump instruction formats:

  • JRA: Jump Relative Address is an unconditional jump instruction; it takes an immediate, and the core always jumps to the PC + immediate location. E.g. jmp -12 -> BC will jump to the PC-12 memory location (3 instructions back).
  • JBA: Jump Base Address can be a conditional or unconditional jump; it takes a register and an immediate as input. In case of a conditional jump, the input register holds the jump condition; if the condition is satisfied, BC will jump to the PC + immediate location. E.g. branch_eqz s4, -12 -> BC will jump to the PC-12 location if register s4 equals zero.

In case of an unconditional jump, the input register holds the effective address where to jump. E.g. jmp s4 -> BC will jump to the memory location stored in s4.

Note these two signals:

assign bc_rollback_enable  = jump & opf_inst_scheduled.is_branch & opf_valid;
assign bc_rollback_valid   = opf_valid && opf_inst_scheduled.pipe_sel == PIPE_BRANCH && ~bc_rollback_enable;

The enable is asserted only if a jump is taken. The valid is asserted whenever a branch operation is executed, regardless of its outcome.

Writeback stage

The Writeback stage writes the results of the various operations from the execution pipes into the internal registers. The execution pipelines have different lengths, so instructions issued in different cycles can arrive at Writeback in the same cycle. Furthermore, due to collisions, a load/store to the scratchpad memory can have a variable latency, unknown at compile time, which can result in an unpredictable structural hazard at Writeback.

Writeback Request FIFOs

The Writeback module resolves collisions on itself on the fly using a set of EX_PIPES queues: each queue stores the corresponding pipe's result, and results are scheduled for writeback in a round-robin manner. A queue stores all the information needed for a writeback operation.

Note that each writeback_request_fifo has its almost_full_threashold reduced by 4, which equals the worst-case distance from the first stage of the Operand Fetch stage. This avoids losing operations in flight.
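
The sketch below illustrates the idea with assumed names: a result queue whose almost-full threshold leaves room for the worst-case number of requests already in flight.

// Illustrative result queue with an "almost full" threshold (names are
// assumptions, not the nu+ RTL). The threshold is the depth minus the
// worst-case number of requests in flight between the first Operand
// Fetch stage and Writeback, so no result is ever dropped.
module wb_request_fifo_sketch #(
    parameter WIDTH                = 32,
    parameter DEPTH                = 8,   // assumed power of two
    parameter IN_FLIGHT_WORST_CASE = 4
) (
    input  logic                 clk,
    input  logic                 reset,
    input  logic                 push,
    input  logic [WIDTH - 1 : 0] push_data,
    input  logic                 pop,
    output logic [WIDTH - 1 : 0] pop_data,
    output logic                 empty,
    output logic                 almost_full   // back-pressure toward issue
);

    logic [WIDTH - 1 : 0]         mem [DEPTH];
    logic [$clog2( DEPTH ) - 1 : 0] rd_ptr, wr_ptr;
    logic [$clog2( DEPTH ) : 0]     count;

    assign empty       = ( count == 0 );
    // Raise back-pressure while the in-flight requests can still fit.
    assign almost_full = ( count >= DEPTH - IN_FLIGHT_WORST_CASE );
    assign pop_data    = mem[rd_ptr];

    always_ff @( posedge clk ) begin
        if ( reset ) begin
            rd_ptr <= '0;
            wr_ptr <= '0;
            count  <= '0;
        end else begin
            if ( push ) begin
                mem[wr_ptr] <= push_data;
                wr_ptr      <= wr_ptr + 1'b1;
            end
            if ( pop && !empty )
                rd_ptr <= rd_ptr + 1'b1;
            count <= count + ( push ? 1'b1 : 1'b0 ) - ( ( pop && !empty ) ? 1'b1 : 1'b0 );
        end
    end

endmodule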

Result Composer

Depending on the operation, two tasks have to be performed for each execution pipe (when needed):

  • to create a byte-grained register mask, in order to avoid writing undesired bytes (e.g. a load_8 operation writes only the first byte, not the whole 4-byte register);
  • to compose the final result, moving the bytes into the right positions and performing the 8-bit/16-bit and 32-bit load sign extensions as well.

For example:

assign byte_data_mem_s[j] = {{( `REGISTER_SIZE - 8 ){word_data_mem[j][7]} }, word_data_mem[j][7 : 0]};
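
The byte-grained mask of the first task could look like the following sketch (assumed names, 32-bit registers).

// Illustrative byte-grained write mask (names are assumptions): a
// load_8 enables only the destination byte, a load_16 two bytes, and a
// load_32 the whole 4-byte register.
module byte_mask_sketch (
    input  logic [1 : 0] load_size,   // 0: load_8, 1: load_16, 2: load_32
    output logic [3 : 0] write_byte_enable
);

    always_comb
        case ( load_size )
            2'd0    : write_byte_enable = 4'b0001;  // load_8: first byte only
            2'd1    : write_byte_enable = 4'b0011;  // load_16: low half-word
            default : write_byte_enable = 4'b1111;  // load_32: full register
        endcase

endmodule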

Rollback handler

The Rollback Handler restores the PCs and scoreboards of the threads that issued a rollback. In case of a jump or a trap, the Branch module in the Execution pipeline issues a rollback request to this stage, passing to the Rollback Handler the ID of the thread that issued the rollback, the old scoreboard, and the PC to restore.

Furthermore, the Rollback Handler flushes all issued requests from that thread still in the pipeline. It uses a clear_bitmap for each thread in the following way:

  • each time a rollback is issued, the clear_bitmap starts from scratch;
if ( rollback_valid[thread_id] )
   clear_bitmap[thread_id] <= scoreboard_t'( 1'b0 );
  • each time an operation is issued, it is recorded in the scoreboard_temp mask:
scoreboard_temp  = clear_bitmap[thread_id] & ~( scoreboard_clear_int & {`SCOREBOARD_LENGTH{bc_rollback_valid}} ) | ( scoreboard_set_issue & {`SCOREBOARD_LENGTH{is_instruction_valid}} );
  • when a rollback is issued, the scoreboard_temp mask has recorded all the operations issued up to that point, except the jump operation itself; this mask goes to the Instruction Scheduler to clear the corresponding entries.

Note that the threads are completely independent.

Thread controller

The Thread Controller handles the eligible thread pool. This module blocks threads that cannot proceed due to cache misses or scoreboarding. Dually, it handles thread wake-up when the blocking conditions no longer hold.

Furthermore, the Thread Controller interfaces the core instruction cache with the higher levels of the memory hierarchy. Instruction miss requests are forwarded directly to the memory controller through the network-on-chip.

The third task performed is to accept jobs from the host interface and redirect them to the core.
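
The blocking/wake-up bookkeeping can be sketched as follows; the bitmap representation and the signal names are assumptions.

// Illustrative eligible-thread bookkeeping (bitmaps and names are
// assumptions): a thread leaves the eligible pool on a blocking event
// such as an instruction miss, and re-enters it when the condition
// clears (e.g. the missing line arrives from memory).
module thread_pool_sketch #(
    parameter THREAD_NUMB = 8
) (
    input  logic                       clk,
    input  logic                       reset,
    input  logic [THREAD_NUMB - 1 : 0] block,     // blocking event per thread
    input  logic [THREAD_NUMB - 1 : 0] wakeup,    // blocking condition cleared
    output logic [THREAD_NUMB - 1 : 0] eligible
);

    logic [THREAD_NUMB - 1 : 0] sleeping;

    always_ff @( posedge clk ) begin
        if ( reset )
            sleeping <= '0;
        else
            sleeping <= ( sleeping | block ) & ~wakeup;
    end

    assign eligible = ~sleeping;

endmodule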

Thread controller

Note: a load/store miss blocks the corresponding thread until the data is gathered from main memory, through the ib_fifo_full signal.