ISA


Register File

The NPU register file is composed of a scalar register file and a vector register file, each containing 64 registers.

The scalar register file has 64 registers. The first 58 are general purpose registers, while the remaining 6 are special purpose registers. Each scalar register can store up to 32 bits of data.


ScalarRegFile new.png


The vector register file has 64 general purpose registers. Each vector register can store up to 512 bits of data, organized as 16 elements of 32 bits each.

VectorRegFile new.png


Finally, there is a Control Register composed of several sub-registers. Some sub-registers are shared among all threads, while those marked 'Thread' have a separate instance per thread.

Register | Read/Write | Shared/Thread | Description | ID
TILE_ID | Read | Shared | Tile ID | 0
CORE_ID | Read | Shared | Core ID | 1
THREAD_ID | Read | Thread | Thread ID | 2
GLOBAL_ID | Read | Thread | Global ID, obtained by merging the previous IDs as follows: TILE_ID, CORE_ID, THREAD_ID | 3
GCOUNTER_LOW | Read | Shared | Low part of the global counter register, which counts processor cycles since reset | 4
GCOUNTER_HIGH | Read | Shared | High part of the global counter register, which counts processor cycles since reset | 5
THREAD_EN | Read | Shared | Thread enabled mask, 1 bit per thread | 6
MISS_DATA | Read | Shared | Count of L1 data cache misses | 7
MISS_INSTR | Read | Shared | Count of L1 instruction cache misses | 8
PC | Read | Thread | Current PC | 9
TRAP_REASON | Read | Thread | Trap Cause (see below) | 10
THREAD_STATUS | Read/Write | Thread | Thread Status (see below) | 11
ARGC | Read/Write | Shared | The number of strings pointed to by argv | 12
ARGV | Read/Write | Shared | The address of the command line arguments passed to main() | 13
THREAD_NUMB | Read | Shared | The total number of hardware threads | 14
THREAD_MISS_CC | Read | Thread | The per-thread clock cycles spent idle due to memory operations | 15
KERNEL_WORK | Read | Thread | The per-thread kernel clock cycles | 16
CPU_CTRL_REG | Read/Write | Shared | CPU mode register. At the moment, only the write policy used by the cache controller is implemented: 0 for write-back, 1 for write-through | 17
UNCOHERENCE_MAP | Read/Write | Shared | Addresses the non-coherent table in the control register, which stores information about the non-coherent memory regions. The user can define non-coherent regions by addressing this special purpose register | 19
DEBUG_BASE_ADDR | Read/Write | Shared | Debug registers base address. The NPU is equipped with 16 debug registers. Reading DEBUG_BASE_ADDR returns the value of the first debug register, DEBUG_BASE_ADDR+1 the second, and so on | 20
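
The two halves of the global counter can be combined into a single cycle count. The following is a minimal sketch, not the project's own code: it assumes each sub-register is 32 bits wide (the table does not state the width), and `read_cr` is a hypothetical Python callable standing in for the read_cr instruction.

```python
# Sub-register IDs taken from the control register table.
GCOUNTER_LOW = 4
GCOUNTER_HIGH = 5

def read_cycle_counter(read_cr):
    """Combine GCOUNTER_HIGH and GCOUNTER_LOW into one 64-bit value.

    `read_cr` is a hypothetical callable mapping a sub-register ID to its
    value; 32-bit sub-registers are an assumption.
    """
    while True:
        high = read_cr(GCOUNTER_HIGH)
        low = read_cr(GCOUNTER_LOW)
        # Re-read the high half: if it changed, the low half wrapped
        # between the two reads and the pair must be sampled again.
        if read_cr(GCOUNTER_HIGH) == high:
            return (high << 32) | (low & 0xFFFFFFFF)
```

Re-reading the high half is a common guard when a wide counter is exposed through two narrower reads; whether the hardware needs it is not stated on this page.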

Trap Cause: currently, only misaligned memory accesses can raise a trap:

  1. SPM_ADDR_MISALIGN: Misaligned memory access in the SPM unit.
  2. LDST_ADDR_MISALIGN: Misaligned memory access in the LDST unit.

Thread Status: each thread can be in one of the following states:

  1. THREAD_IDLE (Value = 0): each thread starts in this state after reset.
  2. RUNNING (Value = 1): the thread is running a kernel.
  3. END_MODE (Value = 2): the thread switches to this mode when the issued kernel is completed.
  4. TRAPPED (Value = 3): the thread is in trap mode. Currently, when a trap occurs, the thread jumps into an infinite loop.
  5. WAITING_BARRIER (Value = 4): the thread is waiting for a synchronization event.
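
The status values listed above map naturally onto an enumeration. This is an illustrative sketch for host-side tooling: `read_cr` is a hypothetical callable standing in for the read_cr instruction, and 11 is the THREAD_STATUS ID from the control register table.

```python
from enum import IntEnum

class ThreadStatus(IntEnum):
    """Thread states and their encoded values, as listed above."""
    THREAD_IDLE = 0
    RUNNING = 1
    END_MODE = 2
    TRAPPED = 3
    WAITING_BARRIER = 4

THREAD_STATUS_ID = 11  # THREAD_STATUS sub-register ID

def describe_thread(read_cr):
    """Return the symbolic name of the status read through `read_cr`."""
    return ThreadStatus(read_cr(THREAD_STATUS_ID)).name
```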


Data Types

The following table summarizes the data types available on the NPU core. The Type column lists the C/C++ type names, the LLVM Type column lists the type names used in LLVM, and the Register column shows the kind of register in which a value of that type is stored.

The highlighted types (those with an empty Notes column) are the ones the architecture natively supports, given the width of the register files. The other types are extended to a natively supported type when stored in a register; their advantage lies in a more efficient use of system memory.

Type | LLVM Type | Register | Notes
bool | i1 | scalar (32 bits) | It is expanded to 32 bits
char | i8 | scalar (32 bits) | It is expanded to 32 bits
short | i16 | scalar (32 bits) | It is expanded to 32 bits
int | i32 | scalar (32 bits) |
float | f32 | scalar (32 bits) |
vec16i8, vec16u8 | v16i8 | vector (16 x 32 bits) | It is expanded to a 32-bit vector
vec16i16, vec16u16 | v16i16 | vector (16 x 32 bits) | It is expanded to a 32-bit vector
vec16i32, vec16u32 | v16i32 | vector (16 x 32 bits) |
vec16f32 | v16f32 | vector (16 x 32 bits) |
vec8i8, vec8u8 | v8i8 | vector (8 x 32 bits) | It is expanded to a 32-bit vector
vec8i16, vec8u16 | v8i16 | vector (8 x 32 bits) | It is expanded to a 32-bit vector
vec8i32, vec8u32 | v8i32 | vector (8 x 32 bits) | It is expanded to a 32-bit vector
vec8f32 | v8f32 | vector (16 x 32 bits) | It is considered as a 16-element vector
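
The expansion mentioned in the Notes column can be illustrated with a short sketch. It assumes signed types are sign-extended and unsigned types are zero-extended, which is consistent with the sext8/sext16 and load32_s8/load32_u8 instructions but is not stated explicitly for every type; the function names are illustrative only.

```python
def sign_extend(value, bits):
    """Sign-extend the low `bits` bits of `value` to a 32-bit pattern."""
    value &= (1 << bits) - 1          # keep only the narrow value
    sign = 1 << (bits - 1)            # position of the sign bit
    return ((value ^ sign) - sign) & 0xFFFFFFFF

def zero_extend(value, bits):
    """Zero-extend the low `bits` bits of `value` to 32 bits."""
    return value & ((1 << bits) - 1)

def expand_v16i8(raw):
    """Expand 16 signed bytes into 16 lanes of 32 bits each."""
    if len(raw) != 16:
        raise ValueError("a v16i8 value has exactly 16 elements")
    return [sign_extend(b, 8) for b in raw]
```

For instance, the signed byte 0x80 (-128) becomes 0xFFFFFF80 in a register, while the same byte treated as unsigned stays 0x00000080.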

Instructions Format

The NaplesPU instructions have a fixed length of 32 bits. They are grouped in six types:

  • The R type includes the logical and arithmetic operations and memory operations.
  • The I type includes the logical and arithmetic operations between a register operand and an immediate operand.
  • The MOVEI type includes the load operations of an immediate operand in a register.
  • The C type is used for control operations and for synchronization instructions.
  • The J type includes jump instructions.
  • The M type includes the instructions used to access memory.

ISA

R type instructions

This is the format of the R-type instruction encoded in machine code.

RR format.png

  • RR (Register to Register) has a destination register and two source registers.
  • RI (Register Immediate) has a destination register, one source register, and an immediate encoded in the instruction word.

The fields of the R-type instruction are:

  • opcode (B29-24) is short for "operation code". The opcode is a binary encoding for the instruction. For R-type instructions, it is only 6 bits.
  • rd (B23-18) is the destination register
  • rs0 (B17-12) is the first source register.
  • rs1 (B11-6) is the second source register.
  • bit l (B4) is used in case of "long" operations, i.e. operations that require long integers or double precision numbers. If the operation requires 64-bit registers l=1, otherwise l=0.
  • bits fmt (B3-1) specify whether each operand is a scalar or a vector (one bit per register in the format). B3 refers to rd, B2 to rs0 and B1 to rs1. For instance, if the destination register should contain a vector, B3=1, otherwise B3=0.
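
The field positions listed above can be turned into a small encoder. This is a sketch under stated assumptions: bits B31-30, B5 and B0 are not described in the field list, so they are left at zero here, and the function name is illustrative.

```python
def encode_r_type(opcode, rd, rs0, rs1, long_op=False, fmt=0b000):
    """Pack R-type fields into a 32-bit instruction word.

    fmt is a 3-bit value whose bits map to B3 (rd), B2 (rs0) and B1 (rs1).
    Bits 31-30, 5 and 0 are left at zero, which is an assumption.
    """
    fields = (("opcode", opcode, 6), ("rd", rd, 6), ("rs0", rs0, 6),
              ("rs1", rs1, 6), ("fmt", fmt, 3))
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
    return ((opcode << 24) | (rd << 18) | (rs0 << 12) | (rs1 << 6)
            | (int(long_op) << 4) | (fmt << 1))

# add (opcode 4) on three vector registers: rd=1, rs0=2, rs1=3
word = encode_r_type(4, 1, 2, 3, fmt=0b111)
```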

The R-type instructions are:

Mnemonic | Opcode | Meaning | Operation
or | 1 | or | Rd = Ra | Rb
and | 2 | and | Rd = Ra & Rb
xor | 3 | xor | Rd = Ra ^ Rb
add | 4 | addition | Rd = Ra + Rb
sub | 5 | subtraction | Rd = Ra - Rb
mullo | 6 | low result of the multiplication | Rd = Ra * Rb
mulhi | 7 | high result of the multiplication | Rd = Ra * Rb
mulhu | 8 | unsigned high result of the multiplication | Rd = Ra * Rb
ashr | 9 | arithmetic shift right | Rd = Ra >> Rb (sign-extending)
shr | 10 | shift right | Rd = Ra >> Rb
shl | 11 | shift left | Rd = Ra << Rb
clz | 12 | count leading zeros |
ctz | 13 | count trailing zeros |
shuffle | 24 | vector shuffle | Rd[i] = Ra[Rb[i]]
getlane | 25 | get lane from vector | Rd = Ra[Rb]
move | 32 | move register | Rd = Ra
fadd | 33 | floating point addition | Rd = Ra + Rb
fsub | 34 | floating point subtraction | Rd = Ra - Rb
fmul | 35 | floating point multiplication | Rd = Ra * Rb
fdiv | 36 | floating point division | Rd = Ra / Rb
sext8 | 43 | sign extend 8 bits |
sext16 | 44 | sign extend 16 bits |
sext32 | 45 | sign extend 32 bits |
i32tof32 | 48 | cast integer to float |
f32toi32 | 49 | cast float to integer |
cmpeq | 14 | compare equal | Rd = Ra == Rb
cmpne | 15 | compare not equal | Rd = Ra != Rb
cmpgt | 16 | compare greater than | Rd = Ra > Rb
cmpge | 17 | compare greater or equal | Rd = Ra >= Rb
cmplt | 18 | compare less than | Rd = Ra < Rb
cmple | 19 | compare less or equal | Rd = Ra <= Rb
cmpugt | 20 | unsigned compare greater than | Rd = Ra > Rb
cmpuge | 21 | unsigned compare greater or equal | Rd = Ra >= Rb
cmpult | 22 | unsigned compare less than | Rd = Ra < Rb
cmpule | 23 | unsigned compare less or equal | Rd = Ra <= Rb
cmpfeq | 37 | floating point compare equal | Rd = Ra == Rb
cmpfne | 38 | floating point compare not equal | Rd = Ra != Rb
cmpfgt | 39 | floating point compare greater than | Rd = Ra > Rb
cmpfge | 40 | floating point compare greater or equal | Rd = Ra >= Rb
cmpflt | 41 | floating point compare less than | Rd = Ra < Rb
cmpfle | 42 | floating point compare less or equal | Rd = Ra <= Rb
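
Several of the operations above share the same formula or have none, so a reference model helps to tell them apart. The sketch below reflects the usual meaning of these operations on 32-bit values and is not taken from the hardware sources. The results of clz and ctz for a zero input, and the handling of out-of-range lane indices in shuffle, are assumptions.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def mullo(a, b):
    """Low 32 bits of the product (same for signed and unsigned)."""
    return (a * b) & MASK32

def mulhi(a, b):
    """High 32 bits of the signed 64-bit product."""
    return ((to_signed(a) * to_signed(b)) >> 32) & MASK32

def mulhu(a, b):
    """High 32 bits of the unsigned 64-bit product."""
    return (((a & MASK32) * (b & MASK32)) >> 32) & MASK32

def clz(a):
    """Count leading zeros; returns 32 for a == 0 (assumption)."""
    return 32 - (a & MASK32).bit_length()

def ctz(a):
    """Count trailing zeros; returns 32 for a == 0 (assumption)."""
    a &= MASK32
    return 32 if a == 0 else (a & -a).bit_length() - 1

def shuffle(ra, rb):
    """Rd[i] = Ra[Rb[i]]; indices are reduced modulo 16 (assumption)."""
    return [ra[idx % 16] for idx in rb]

def getlane(ra, rb):
    """Rd = Ra[Rb]; the index is reduced modulo 16 (assumption)."""
    return ra[rb % 16]
```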

I type instructions

This is the format of the I-type instruction encoded in machine code.

I instr.png

The fields of the I-type instruction are:

  • opcode (B28-24) is short for "operation code". The opcode is a binary encoding for the instruction. For I-type instructions, it is only 5 bits.
  • rd (B23-18) is the destination register.
  • rs (B17-12) is the source register.
  • imm (B11-3) is the 9-bit immediate.
  • fmt (B2-1) bits specify whether each operand is a scalar or a vector (one bit per register in the format). B2 refers to rd and B1 refers to rs.
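
Going in the other direction, the same field positions allow an instruction word to be decoded. This sketch returns the immediate as a raw 9-bit value, because the text does not say whether it is signed; the function and key names are illustrative.

```python
def decode_i_type(word):
    """Split a 32-bit I-type word into its documented fields."""
    return {
        "opcode": (word >> 24) & 0x1F,          # B28-24
        "rd": (word >> 18) & 0x3F,              # B23-18
        "rs": (word >> 12) & 0x3F,              # B17-12
        "imm": (word >> 3) & 0x1FF,             # B11-3, raw 9-bit value
        "rd_is_vector": bool((word >> 2) & 1),  # fmt bit B2
        "rs_is_vector": bool((word >> 1) & 1),  # fmt bit B1
    }

# addi (opcode 4) with rd=5, rs=6, imm=100 and a vector destination
fields = decode_i_type((4 << 24) | (5 << 18) | (6 << 12) | (100 << 3) | (1 << 2))
```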

The I-type instructions are:

Mnemonic | Opcode | Meaning | Operation
ori | 1 | or | Rd = Ra | Imm
andi | 2 | and | Rd = Ra & Imm
xori | 3 | xor | Rd = Ra ^ Imm
addi | 4 | addition | Rd = Ra + Imm
subi | 5 | subtraction | Rd = Ra - Imm
mulli | 6 | multiplication | Rd = Ra * Imm
mulhi | 7 | high multiply | Rd = Ra * Imm
mulhui | 8 | high multiply unsigned | Rd = Ra * Imm
ashri | 9 | arithmetic shift right | Rd = Ra >> Imm (sign-extending)
shri | 10 | shift right | Rd = Ra >> Imm
shli | 11 | shift left | Rd = Ra << Imm
getlane | 25 | get lane from vector | Rd = Ra[Imm]

MOVEI type instructions

MVI (Move Immediate) has a destination register and a 16-bit immediate encoded in the instruction word. This is the format of the MOVEI-type instruction encoded in machine code.

MOVEI format.png

The fields of the MOVEI-type instruction are:

  • opcode (B26-24) is short for "operation code". The opcode is a binary encoding for the instruction. For MOVEI-type instructions, it is only 3 bits.
  • rd (B23-18) is the destination register
  • imm (B17-2) is the 16-bit immediate.
  • fmt (B1) is used to specify if the destination register contains a vector or a scalar.

The MOVEI-type instructions are:

Mnemonic | Opcode | Meaning | Operation
moveil | 0 | move the 16 least significant bits | Rd = Ra & 0xFFFF
moveih | 1 | move the 16 most significant bits | Rd = (Ra >> 16) & 0xFFFF
movei | 2 | move the 16 least significant bits with zero extension | Rd = (Rd ^ Rd) & (Ra & 0xFFFF)
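
Because the immediate is only 16 bits wide, a 32-bit constant has to be handled as two halves. The sketch below reads Ra in the table as the 32-bit value being loaded, which is an interpretation rather than something the page states. It computes the two halves with the same masks and packs one of them into a MOVEI word using the documented field positions. Bits not covered by the field list are left at zero, and all function names are hypothetical.

```python
def split_constant(value):
    """Return (low16, high16) using the masks shown in the table."""
    value &= 0xFFFFFFFF
    return value & 0xFFFF, (value >> 16) & 0xFFFF

def encode_movei(opcode, rd, imm16, vector=False):
    """Pack MOVEI fields: opcode B26-24, rd B23-18, imm B17-2, fmt B1."""
    if not 0 <= opcode < 8:
        raise ValueError("opcode must fit in 3 bits")
    if not 0 <= rd < 64:
        raise ValueError("rd must fit in 6 bits")
    if not 0 <= imm16 <= 0xFFFF:
        raise ValueError("immediate must fit in 16 bits")
    return (opcode << 24) | (rd << 18) | (imm16 << 2) | (int(vector) << 1)

low, high = split_constant(0xDEADBEEF)
word = encode_movei(1, 3, high)  # moveih (opcode 1) into register 3
```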

C type instructions

This is the format of the C-type instruction encoded in machine code.

C format.png

The fields of the C-type instruction are:

  • opcode (B26-24) is short for "operation code". The opcode is a binary encoding for the instruction. For C-type instructions, it is only 3 bits.
  • rs0 (B23-18) is the first source register.
  • rs1 (B17-12) is the second source register.

The C-type instructions are:

Mnemonic | Opcode | Meaning
barrier_core | 0 | Memory Barrier - ensures that all explicit data memory transfers issued before the barrier complete before any explicit data memory transaction issued after the barrier starts. Register rs0 contains the barrier identification number (BID), which can be any number greater than 0. Different memory barriers require different BIDs. rs1 contains the number of threads that should synchronize.
flush | 2 | Flush a cache line to the main memory.
read_cr | 3 | Read a sub-register of the control register.
write_cr | 4 | Write into a sub-register of the control register.
dcache_inv | 5 | Invalidate the L1 cache line that holds the input address.

J type instructions

This is the format of the J-type instruction encoded in machine code.

J format.png

The fields of the J-type instruction are:

  • opcode (B26-24) is short for "operation code". The opcode is a binary encoding for the instruction. For J-type instructions, it is only 3 bits.
  • rcond/rd (B23-18) is the condition/destination register.
  • offset (B17-0) is the offset address.

The J-type instructions are:

Mnemonic | Opcode | Meaning | Operation
jmp | 0 | Jump - unconditionally jumps to a specified location. | PC=rd or PC=PC+offset
jmpsr | 1 | Jump to Subroutine - unconditionally jumps to a specified location and stores the return address in the RA register. | RA=PC+4 PC=rd or RA=PC+4 PC=PC+offset
jret | 3 | Return from Subroutine - unconditionally returns from a subroutine, loading the return address from the RA register. | PC=RA
beqz | 5 | Conditional Branch, Branch if Equal to Zero - branches to PC+offset if the content of the condition register is equal to zero. | if(rcond==0) PC=PC+offset else PC=PC+4
bnez | 6 | Conditional Branch, Branch if Not Equal to Zero - branches to PC+offset if the content of the condition register is not equal to zero. | if(rcond!=0) PC=PC+offset else PC=PC+4
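
The Operation column above can be expressed as a next-PC function. This sketch covers only the PC-relative forms. It assumes the 18-bit offset is a two's complement byte offset, so that backward branches are possible, which the page does not state. The function names are illustrative.

```python
def sign_extend18(offset):
    """Interpret an 18-bit field as two's complement (assumption)."""
    offset &= 0x3FFFF
    return offset - (1 << 18) if offset & (1 << 17) else offset

def next_pc(mnemonic, pc, offset=0, rcond=0):
    """Return the PC that follows a PC-relative J-type instruction."""
    target = (pc + sign_extend18(offset)) & 0xFFFFFFFF
    fallthrough = (pc + 4) & 0xFFFFFFFF
    if mnemonic == "jmp":
        return target
    if mnemonic == "beqz":
        return target if rcond == 0 else fallthrough
    if mnemonic == "bnez":
        return target if rcond != 0 else fallthrough
    raise ValueError(f"unsupported mnemonic: {mnemonic}")
```

A not-taken branch always falls through to PC+4, since every instruction is 32 bits long.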

M type instructions

This is the format of the M-type instruction encoded in machine code.

M format new.png

The fields of the M-type instruction are:

  • opcode (B29-24) is short for "operation code". The opcode is a binary encoding for the instruction. For M-type instructions, it is only 6 bits.
  • rd/rs (B23-18) is the destination or source register
  • rbase (B17-12) is the base address register.
  • offset (B11-3) is the offset address.
  • bit l (B2) not used. Reserved for 64-bit extension.
  • bit s (B1) is used to specify if a certain load/store memory operation goes to the scratchpad memory or not. For instance, in case of a load/store from/to the scratchpad memory, B1=1, otherwise B1=0.

The typical M-type instructions are loads and stores. In both cases, the memory address is calculated as base register + immediate offset, i.e. rbase + offset. For a load, rd = [rbase + offset]; similarly, for a store, [rbase + offset] = rs. All M-type instructions can target either the main memory or the scratchpad memory; instructions that operate on the scratchpad memory carry the _scratchpad suffix. For example, load32_s8 targets the main memory, while load32_s8_scratchpad loads from the on-chip scratchpad.
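
The address calculation and the difference between the signed and unsigned byte loads can be sketched over a plain byte array. Little-endian byte order is an assumption, as is treating the offset as an already-decoded integer; the page does not specify either. The Python function names mirror the mnemonics only for readability.

```python
def effective_address(rbase, offset):
    """Address used by every M-type instruction: rbase + offset."""
    return (rbase + offset) & 0xFFFFFFFF

def load32_s8(mem, rbase, offset):
    """Load one byte and sign-extend it to 32 bits."""
    ea = effective_address(rbase, offset)
    return int.from_bytes(mem[ea:ea + 1], "little", signed=True) & 0xFFFFFFFF

def load32_u8(mem, rbase, offset):
    """Load one byte and zero-extend it to 32 bits."""
    ea = effective_address(rbase, offset)
    return mem[ea]

def load32(mem, rbase, offset):
    """Load a 32-bit word (little-endian order is an assumption)."""
    ea = effective_address(rbase, offset)
    return int.from_bytes(mem[ea:ea + 4], "little")

def store32(mem, rbase, offset, rs):
    """Store a 32-bit word at the effective address."""
    ea = effective_address(rbase, offset)
    mem[ea:ea + 4] = (rs & 0xFFFFFFFF).to_bytes(4, "little")

memory = bytearray([0x80, 0x01, 0x00, 0x00, 0, 0, 0, 0])
store32(memory, 0, 4, 0xCAFEBABE)
```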

The M-type instructions can be classified into scalar and vector instructions. Both kinds are listed below:

Mnemonic | Opcode | Meaning | Operation
load32_s8 | 0 | load memory byte [7:0] with sign extension into a 32-bit register | Rd = [Rbase + Offset]
load32_s16 | 1 | load memory half word [15:0] with sign extension into a 32-bit register | Rd = [Rbase + Offset]
load32 | 2 | load memory word into a 32-bit register | Rd = [Rbase + Offset]
load32_u8 | 4 | load memory byte [7:0] with zero extension into a 32-bit register | Rd = [Rbase + Offset]
load32_u16 | 5 | load memory half word [15:0] with zero extension into a 32-bit register | Rd = [Rbase + Offset]
load_v16i8 | 7 | load 16 bytes [127:0] with sign extension into a 512-bit register | Rd = [Rbase + Offset]
load_v16i16 | 8 | load 16 half words [255:0] with sign extension | Rd = [Rbase + Offset]
load_v16i32 | 9 | load 16 words | Rd = [Rbase + Offset]
load_v16u8 | 11 | load 16 bytes [127:0] with no sign extension | Rd = [Rbase + Offset]
load_v16u16 | 12 | load 16 half words [255:0] with no sign extension | Rd = [Rbase + Offset]
load_v8u32 | 13 | load 8 words [255:0] with no sign extension | Rd = [Rbase + Offset]
loadg32 | 16 | gather load - load 16 words from 16 different memory addresses (only for scratchpad) | Rd[i] = [Rbase[i]]
store32_8 | 32 | store 1 byte into the effective address | [Rbase + Offset] = Rs
store32_16 | 33 | store 2 bytes into the effective address | [Rbase + Offset] = Rs
store32 | 34 | store 1 word into the effective address | [Rbase + Offset] = Rs
store_v16i8 | 36 | store 16 bytes from a vector register (data fetched from register bits [487:480,...,39:32,7:0]) into the effective address location | [Rbase + Offset] = Rs
store_v16i16 | 37 | store 16 half words from a vector register (data fetched from register bits [495:480,...,47:32,15:0]) into the effective address location | [Rbase + Offset] = Rs
store_v16i32 | 38 | store 16 words from a vector register into the effective address location | [Rbase + Offset] = Rs
stores32 | 42 | scatter store - store 16 words into 16 different addresses (only for scratchpad) | [Rbase[i]] = Rs[i]
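
The bit ranges quoted for store_v16i8 and store_v16i16 mean that the low byte or low half word of each 32-bit lane is written to memory. The sketch below models a vector register as a 512-bit Python integer with lane i at bits [32i+31:32i]. Placing lane 0 at the lowest address and using little-endian half words are assumptions, and the function names are illustrative.

```python
def lanes(reg512):
    """Split a 512-bit register value into its 16 lanes of 32 bits."""
    return [(reg512 >> (32 * i)) & 0xFFFFFFFF for i in range(16)]

def store_v16i8_bytes(reg512):
    """Bytes written by store_v16i8: bits [7:0] of every lane."""
    return bytes(lane & 0xFF for lane in lanes(reg512))

def store_v16i16_bytes(reg512):
    """Bytes written by store_v16i16: bits [15:0] of every lane."""
    return b"".join((lane & 0xFFFF).to_bytes(2, "little")
                    for lane in lanes(reg512))

# Lane i holds 0x100 + i, so only the low byte i reaches memory.
reg = sum((0x100 + i) << (32 * i) for i in range(16))
packed = store_v16i8_bytes(reg)
```

The narrow stores therefore write 16 or 32 bytes, compared with the 64 bytes written by store_v16i32.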