Difference between revisions of "ISA"

From NaplesPU Documentation

Revision as of 18:23, 31 December 2018

TODO: revalidate (dcacheinv is definitely missing)

Register File

The nu+ register file is composed of a scalar register file and a vector register file, each containing 64 registers.

The scalar register file has 64 registers. The first 58 are general purpose registers, while the remaining 6 are special purpose registers. Each scalar register can store up to 32 bits of data. However, the nu+ architecture also supports 64-bit data, which is stored in a pair of contiguous registers.
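
The pairing of two contiguous 32-bit registers for a 64-bit value can be sketched as follows. This is only an illustration: the helper names and the register indices are hypothetical, and which register of the pair holds the low half is an assumption, not something stated above.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def split64(value):
    """Split a 64-bit value into (low, high) 32-bit halves."""
    value &= MASK64
    return value & MASK32, (value >> 32) & MASK32


def join64(low, high):
    """Rebuild a 64-bit value from two 32-bit halves."""
    return ((high & MASK32) << 32) | (low & MASK32)


# Hypothetical model of the scalar register file: 64 registers of 32 bits.
regs = [0] * 64

# Assumption: the low half goes in the lower-numbered register of the pair.
regs[10], regs[11] = split64(0x1122334455667788)
```

Here `join64(regs[10], regs[11])` gives back the original 64-bit value.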


[Figure: ScalarRegFile.png]


The vector register file has 64 general purpose registers. Each vector register can store up to 512 bits of data, organized as 16 x 32-bit or 8 x 64-bit elements.
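
The two views of a 512-bit vector register (16 lanes of 32 bits or 8 lanes of 64 bits) can be illustrated with a small sketch. The function names are hypothetical, and placing lane 0 in the least significant bits is an assumption made for the example.

```python
VECTOR_BITS = 512


def pack_lanes(lanes, lane_bits):
    """Pack a list of lane values into one 512-bit integer (lane 0 lowest, assumed)."""
    mask = (1 << lane_bits) - 1
    vector = 0
    for i, lane in enumerate(lanes):
        vector |= (lane & mask) << (i * lane_bits)
    return vector


def unpack_lanes(vector, lane_bits):
    """Split a 512-bit integer into lanes of the given width."""
    mask = (1 << lane_bits) - 1
    count = VECTOR_BITS // lane_bits
    return [(vector >> (i * lane_bits)) & mask for i in range(count)]


v = pack_lanes(list(range(16)), 32)   # 16 x 32-bit view
as_64 = unpack_lanes(v, 64)           # same bits, 8 x 64-bit view
```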

[Figure: VectorRegFile.png]


Finally, there is a Control Register that is composed of several sub-registers. Some sub-registers are shared among all threads, while those marked 'Thread' have a separate instance per thread.

Register       ID  Read/Write  Shared/Thread  Description
TILE_ID        0   Read        Shared         Tile ID
CORE_ID        1   Read        Shared         Core ID
THREAD_ID      2   Read        Thread         Thread ID
GLOBAL_ID      3   Read        Thread         Global ID, obtained by merging the previous IDs as follows: TILE_ID, CORE_ID, THREAD_ID
GCOUNTER_LOW   4   Read        Shared         Low part of the global counter register, which counts processor cycles since reset
GCOUNTER_HIGH  5   Read        Shared         High part of the global counter register, which counts processor cycles since reset
THREAD_EN      6   Read        Shared         Thread enable mask, 1 bit per thread
MISS_DATA      7   Read        Shared         Count of L1 data cache misses
MISS_INSTR     8   Read        Shared         Count of L1 instruction cache misses
PC             9   Read        Thread         Current PC
TRAP_REASON    10  Read        Thread         Trap cause (see below)
THREAD_STATUS  11  Read/Write  Thread         Thread status (see below)
ARGC           12  Read/Write  Shared         The number of strings pointed to by argv
ARGV           13  Read/Write  Shared         The address of the command line arguments passed to main()
THREAD_NUMB    14  Read        Shared         The total number of hardware threads
THREAD_MISS    15  Read        Thread         The per-thread number of data cache misses
CACHE_MODE     16  Read/Write  Shared         The write policy used by the cache controller: 0 for write-back, 1 for write-through
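
As an illustration of how the sub-register IDs above might be used by host-side tooling or a simulator, the sketch below mirrors the table as an enumeration. The `read_cr` argument is a hypothetical callable standing in for the read_cr instruction, and the assumptions that each counter half is 32 bits wide and that bit i of THREAD_EN refers to thread i are not stated in the table.

```python
from enum import IntEnum


class ControlRegister(IntEnum):
    """Sub-register IDs, copied from the table above."""
    TILE_ID = 0
    CORE_ID = 1
    THREAD_ID = 2
    GLOBAL_ID = 3
    GCOUNTER_LOW = 4
    GCOUNTER_HIGH = 5
    THREAD_EN = 6
    MISS_DATA = 7
    MISS_INSTR = 8
    PC = 9
    TRAP_REASON = 10
    THREAD_STATUS = 11
    ARGC = 12
    ARGV = 13
    THREAD_NUMB = 14
    THREAD_MISS = 15
    CACHE_MODE = 16


# Sub-registers listed as Read/Write in the table.
WRITABLE = {
    ControlRegister.THREAD_STATUS,
    ControlRegister.ARGC,
    ControlRegister.ARGV,
    ControlRegister.CACHE_MODE,
}


def global_counter(read_cr):
    """Combine the two counter halves (assumed 32 bits each) into one value."""
    low = read_cr(ControlRegister.GCOUNTER_LOW) & 0xFFFFFFFF
    high = read_cr(ControlRegister.GCOUNTER_HIGH) & 0xFFFFFFFF
    return (high << 32) | low


def enabled_threads(thread_en_mask):
    """List the thread indices whose bit is set (bit i = thread i, assumed)."""
    return [i for i in range(thread_en_mask.bit_length())
            if (thread_en_mask >> i) & 1]
```

Note that reading the two halves separately is not atomic, so a real reader may need to re-read the high part to detect a carry between the two accesses.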

Trap Cause: currently, only misaligned memory accesses can raise a trap:

  1. SPM_ADDR_MISALIGN: Misaligned memory access in the SPM unit.
  2. LDST_ADDR_MISALIGN: Misaligned memory access in the LDST unit.

Thread Status: each thread can be in one of the following states:

  1. THREAD_IDLE (Value = 0): each thread starts in this state after reset.
  2. RUNNING (Value = 1): the thread is running a kernel.
  3. END_MODE (Value = 2): the thread switches to this mode when the issued kernel is completed.
  4. TRAPPED (Value = 3): the thread is in trap mode. Currently, when a trap occurs, the thread jumps into an infinite loop.
  5. WAITING_BARRIER (Value = 4): the thread is waiting for a synchronization event.
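
The state values listed above map naturally onto an enumeration. The sketch below is a hypothetical helper for decoding a raw THREAD_STATUS value; the class and function names are illustrative and not part of nu+.

```python
from enum import IntEnum


class ThreadStatus(IntEnum):
    """Thread states with the values listed above."""
    THREAD_IDLE = 0
    RUNNING = 1
    END_MODE = 2
    TRAPPED = 3
    WAITING_BARRIER = 4


def describe_status(raw_value):
    """Return the state name for a raw THREAD_STATUS value, or flag it as unknown."""
    try:
        return ThreadStatus(raw_value).name
    except ValueError:
        return "UNKNOWN({})".format(raw_value)
```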


Data Types

The following table summarizes the data types available in nu+. The Type column lists the C/C++ type names, the LLVM Type column lists the type names used in LLVM, and the Register column shows the kind of register in which a value of that type is stored.

The types highlighted in the original table (the rows with no entry in the Notes column) are those the architecture natively supports, given the width of the register files. The others are obtained through extension, so that they can be handled as the supported ones. Their advantage is a more efficient use of system memory.

Type                LLVM Type  Register               Notes
bool                i1         scalar (32 bits)       Expanded to 32 bits
char                i8         scalar (32 bits)       Expanded to 32 bits
short               i16        scalar (32 bits)       Expanded to 32 bits
int                 i32        scalar (32 bits)
float               f32        scalar (32 bits)
long long int       i64        scalar (64 bits)
double              f64        scalar (64 bits)
vec16i8, vec16u8    v16i8      vector (16 x 32 bits)  Expanded to a vector of 32-bit elements
vec16i16, vec16u16  v16i16     vector (16 x 32 bits)  Expanded to a vector of 32-bit elements
vec16i32, vec16u32  v16i32     vector (16 x 32 bits)
vec16f32            v16f32     vector (16 x 32 bits)
vec8i8, vec8u8      v8i8       vector (8 x 64 bits)   Expanded to a vector of 64-bit elements
vec8i16, vec8u16    v8i16      vector (8 x 64 bits)   Expanded to a vector of 64-bit elements
vec8i32, vec8u32    v8i32      vector (8 x 64 bits)   Expanded to a vector of 64-bit elements
vec8f32             v8f32      vector (16 x 32 bits)  Treated as a 16-element vector
vec8i64, vec8u64    v8i64      vector (8 x 64 bits)
vec8f64             v8f64      vector (8 x 64 bits)
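
The expansion of narrow types to the register width can be illustrated with the two usual extension rules. This is a generic sketch: the function names are hypothetical, and whether a given type is sign-extended or zero-extended is decided by its signedness in the compiler, which the table does not spell out.

```python
def zero_extend(value, from_bits):
    """Keep the low from_bits bits; the upper bits of the register become 0."""
    return value & ((1 << from_bits) - 1)


def sign_extend(value, from_bits, to_bits=32):
    """Replicate the sign bit of a from_bits value up to to_bits."""
    value &= (1 << from_bits) - 1
    sign_bit = 1 << (from_bits - 1)
    extended = (value ^ sign_bit) - sign_bit      # two's complement trick
    return extended & ((1 << to_bits) - 1)


# A signed char holding -1 (0xFF) fills a 32-bit scalar register with ones,
# while the same byte read as unsigned stays 0x000000FF.
signed_char = sign_extend(0xFF, 8)
unsigned_char = zero_extend(0xFF, 8)
```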

Instructions Format

The nu+ instructions have a fixed length of 32 bits. They are grouped in six types:

  • The R type includes the logical and arithmetic operations and memory operations.
  • The I type includes the logical and arithmetic operations between a register operand and an immediate operand.
  • The MOVEI type includes the load operations of an immediate operand in a register.
  • The C type is used for control operations and for synchronization instructions.
  • The J type includes jump instructions.
  • The M type includes the instructions used to access memory.

ISA

R type instructions

  • RR (Register to Register) has a destination register and two source registers.
  • RI (Register Immediate) has a destination register, one source register, and an immediate encoded in the instruction word.

Mnemonic  Opcode  Meaning                                  Operation
or        1       or                                       Rd = Ra | Rb
and       2       and                                      Rd = Ra & Rb
xor       3       xor                                      Rd = Ra ^ Rb
add       4       addition                                 Rd = Ra + Rb
sub       5       subtraction                              Rd = Ra - Rb
mull      6       multiplication                           Rd = Ra * Rb
mulhs     7       high multiply                            Rd = Ra * Rb
mulhu     8       high multiply unsigned                   Rd = Ra * Rb
ashr      9       arithmetic shift right                   Rd = Ra >> Rb
shr       10      shift right                              Rd = Ra >> Rb
shl       11      shift left                               Rd = Ra << Rb
clz       12      count leading zeros
ctz       13      count trailing zeros
shuffle   24      vector shuffle                           Rd[i] = Ra[Rb[i]]
getlane   25      get lane from vector                     Rd = Ra[Rb]
move      32      move register                            Rd = Ra
fadd      33      floating point addition                  Rd = Ra + Rb
fsub      34      floating point subtraction               Rd = Ra - Rb
fmul      35      floating point multiplication            Rd = Ra * Rb
fdiv      36      floating point division                  Rd = Ra / Rb
sext8     43      sign extend 8 bits
sext16    44      sign extend 16 bits
sext32    45      sign extend 32 bits
f32tof64  46      cast float to double
f64tof32  47      cast double to float
i32tof32  48      cast integer to float
f32toi32  49      cast float to integer
cmpeq     14      compare equal                            Rd = Ra == Rb
cmpne     15      compare not equal                        Rd = Ra != Rb
cmpgt     16      compare greater than                     Rd = Ra > Rb
cmpge     17      compare greater or equal                 Rd = Ra >= Rb
cmplt     18      compare less than                        Rd = Ra < Rb
cmple     19      compare less or equal                    Rd = Ra <= Rb
cmpugt    20      unsigned compare greater than            Rd = Ra > Rb
cmpuge    21      unsigned compare greater or equal        Rd = Ra >= Rb
cmpult    22      unsigned compare less than               Rd = Ra < Rb
cmpule    23      unsigned compare less or equal           Rd = Ra <= Rb
cmpfeq    37      floating point compare equal             Rd = Ra == Rb
cmpfne    38      floating point compare not equal         Rd = Ra != Rb
cmpfgt    39      floating point compare greater than      Rd = Ra > Rb
cmpfge    40      floating point compare greater or equal  Rd = Ra >= Rb
cmpflt    41      floating point compare less than         Rd = Ra < Rb
cmpfle    42      floating point compare less or equal     Rd = Ra <= Rb
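
Several rows above share the same Operation text even though they behave differently (for example ashr and shr, or cmplt and cmpult). The sketch below is a reference model of a few of these operations under stated assumptions, not the hardware definition: operands are taken to be 32 bits wide, the high multiplies are assumed to return the upper 32 bits of the 64-bit product, shift amounts are assumed to be below 32, comparisons are assumed to produce 1 or 0, and clz/ctz of zero are assumed to return 32.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mulhu(a, b):
    """Upper 32 bits of the unsigned 64-bit product (assumed semantics)."""
    return ((a & MASK32) * (b & MASK32)) >> 32


def mulhs(a, b):
    """Upper 32 bits of the signed 64-bit product (assumed semantics)."""
    return ((to_signed(a) * to_signed(b)) >> 32) & MASK32


def shr(a, n):
    """Logical shift right: zeros enter from the left."""
    return (a & MASK32) >> n


def ashr(a, n):
    """Arithmetic shift right: the sign bit is replicated."""
    return (to_signed(a) >> n) & MASK32


def clz(a):
    return 32 - (a & MASK32).bit_length()


def ctz(a):
    a &= MASK32
    return 32 if a == 0 else (a & -a).bit_length() - 1


def shuffle(ra, rb):
    """Rd[i] = Ra[Rb[i]] for vectors modelled as Python lists."""
    return [ra[i] for i in rb]


def cmplt(a, b):
    return int(to_signed(a) < to_signed(b))


def cmpult(a, b):
    return int((a & MASK32) < (b & MASK32))
```

For example, with the bit pattern 0x80000000, `shr(x, 4)` gives 0x08000000 while `ashr(x, 4)` gives 0xF8000000.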

I type instructions

Mnemonic  Opcode  Meaning                 Operation
ori       1       or                      Rd = Ra | Imm
andi      2       and                     Rd = Ra & Imm
xori      3       xor                     Rd = Ra ^ Imm
addi      4       addition                Rd = Ra + Imm
subi      5       subtraction             Rd = Ra - Imm
mulli     6       multiplication          Rd = Ra * Imm
mulhi     7       high multiply           Rd = Ra * Imm
mulhui    8       high multiply unsigned  Rd = Ra * Imm
ashri     9       arithmetic shift right  Rd = Ra >> Imm
shri      10      shift right             Rd = Ra >> Imm
shli      11      shift left              Rd = Ra << Imm
getlane   25      get lane from vector    Rd = Ra[Imm]

MOVEI type instructions

MVI (Move Immediate) has a destination register and a 16-bit immediate encoded in the instruction.


Mnemonic  Opcode  Meaning                                                Operation
moveil    0       move the 16 least significant bits                     Rd = Ra & 0xFFFF
moveih    1       move the 16 most significant bits                      Rd = (Ra >> 16) & 0xFFFF
movei     2       move the 16 least significant bits with zero extension  Rd = (Rd ^ Rd) | (Ra & 0xFFFF)
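
The masks in the Operation column split a 32-bit value into two 16-bit halves, which suggests how a full 32-bit constant might be loaded with two 16-bit immediates. The sketch below only illustrates that arithmetic. The function names are hypothetical, and the recombination step assumes that the low half lands in bits 15..0 and the high half in bits 31..16 of the destination, which the table does not state explicitly.

```python
def low_half(value):
    """The 16 least significant bits, as in the moveil row."""
    return value & 0xFFFF


def high_half(value):
    """The 16 most significant bits, as in the moveih row."""
    return (value >> 16) & 0xFFFF


def rebuild(low, high):
    """Assumed recombination of the two halves into a 32-bit value."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


constant = 0xDEADBEEF
halves = (low_half(constant), high_half(constant))
```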

C type instructions

Mnemonic      Opcode  Meaning
barrier_core  0       Memory barrier: ensures that all explicit data memory transfers issued before the barrier are completed before any explicit data memory transaction issued after the barrier starts. Register rs0 contains the barrier identification number (BID), which can be any number greater than 0. Different memory barriers require different BIDs. Register rs1 contains the number of threads that should synchronize.
flush         2       Flush a cache line to main memory.
read_cr       3       Read a sub-register of the control register.
write_cr      4       Write into a sub-register of the control register.

J type instructions

Mnemonic  Opcode  Operation                                              Meaning
jmp       0       PC = rd  or  PC = PC + offset                          Jump: unconditionally jumps to a specified location.
jmpsr     1       RA = PC + 4, PC = rd  or  RA = PC + 4, PC = PC + addr  Jump to subroutine: unconditionally jumps to a specified location and stores the return address in the RA register.
jret      3       PC = RA                                                Return from subroutine: unconditionally returns from a subroutine, loading the return address from the RA register.
beqz      5       if (rcond == 0) PC = PC + offset else PC = PC + 4      Branch if equal to zero: branches to PC + offset if the content of the condition register is equal to zero.
bnez      6       if (rcond != 0) PC = PC + offset else PC = PC + 4      Branch if not equal to zero: branches to PC + offset if the content of the condition register is not equal to zero.
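
The Operation column above can be read as a next-PC function. The sketch below models only the PC-relative forms; the function names and keyword arguments are hypothetical, and wrapping the result to 32 bits is an assumption about the PC width.

```python
MASK32 = 0xFFFFFFFF
INSTRUCTION_BYTES = 4   # instructions have a fixed length of 32 bits


def next_pc(mnemonic, pc, offset=0, rcond=0, ra=0):
    """Return the PC after executing a J type instruction (PC-relative forms)."""
    if mnemonic == "jmp":
        target = pc + offset
    elif mnemonic == "jret":
        target = ra
    elif mnemonic == "beqz":
        target = pc + offset if rcond == 0 else pc + INSTRUCTION_BYTES
    elif mnemonic == "bnez":
        target = pc + offset if rcond != 0 else pc + INSTRUCTION_BYTES
    else:
        raise ValueError("not a J type mnemonic handled here: " + mnemonic)
    return target & MASK32


def jmpsr(pc, offset):
    """Return (RA, PC) after a PC-relative jump to subroutine."""
    return (pc + INSTRUCTION_BYTES) & MASK32, (pc + offset) & MASK32
```

A call followed by a return therefore resumes at the instruction after the jmpsr: `next_pc("jret", target, ra=ra)` with `ra, target = jmpsr(0x100, 0x50)` yields 0x104.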

M type instructions

MEM (Memory Instruction) has a destination/source field: for a load, the first register is the destination register; for a store, the first register contains the value to store. In both cases, the base address and the immediate follow. The sum of the base address and the immediate gives the effective memory address.

Mnemonic      Opcode  Meaning                                                 Operation
loadXD_s8     0       load 1 byte with sign extension                         Rd = [Rbase + Offset]
loadXD_s16    1       load 2 bytes with sign extension                        Rd = [Rbase + Offset]
load32D       2       load 1 word                                             Rd = [Rbase + Offset]
loadXD_u8     4       load 1 byte with zero extension                         Rd = [Rbase + Offset]
loadXD_u16    5       load 2 bytes with zero extension                        Rd = [Rbase + Offset]
load64D_s32   2       load 1 word sign-extended to 1 double-word              Rd = [Rbase + Offset]
load64D_u32   6       load 1 word zero-extended to 1 double-word              Rd = [Rbase + Offset]
load64D       3       load 1 double-word                                      Rd = [Rbase + Offset]
loadD_vYi8    7       load a vector of Y bytes with sign extension            Rd = [Rbase + Offset]
loadD_vYi16   8       load a vector of Y 2-byte elements with sign extension  Rd = [Rbase + Offset]
loadD_vYi32   9       load a vector of Y words with sign extension            Rd = [Rbase + Offset]
loadD_v8i64   10      load a vector of 8 double-words                         Rd = [Rbase + Offset]
loadD_vYu8    11      load a vector of Y bytes with zero extension            Rd = [Rbase + Offset]
loadD_vYu16   12      load a vector of Y 2-byte elements with zero extension  Rd = [Rbase + Offset]
loadD_vYu32   13      load a vector of Y words with zero extension            Rd = [Rbase + Offset]
loadD_g_32    16      load 16 words from different memory addresses           Rd[i] = [Rbase[i]]
storeXD_8     32      store 1 byte                                            [Rbase + Offset] = Rs
storeXD_16    33      store 2 bytes                                           [Rbase + Offset] = Rs
store32D      34      store 1 word                                            [Rbase + Offset] = Rs
store64D_32   34      store 1 word                                            [Rbase + Offset] = Rs
store64D      35      store 1 double-word                                     [Rbase + Offset] = Rs
storeD_vYi8   32      store Y bytes                                           [Rbase + Offset] = Rs
storeD_vYi16  33      store Y 2-byte elements                                 [Rbase + Offset] = Rs
storeD_vYi32  34      store Y words                                           [Rbase + Offset] = Rs
storeD_v8i64  35      store 8 double-words                                    [Rbase + Offset] = Rs
storeD_s_32   42      store 16 words to different memory addresses            [Rbase[i]] = Rs[i]
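
The addressing rule (effective address = base + offset) and the difference between sign-extending, zero-extending, and gather/scatter accesses can be sketched over a byte array. This is a minimal model, not the LDST unit: the function names are hypothetical, little-endian byte order is an assumption, and alignment checks (the source of the traps described earlier) are left out.

```python
def load(memory, base, offset, size, signed):
    """Load size bytes at base + offset and extend the value to 32 bits."""
    addr = base + offset
    raw = int.from_bytes(memory[addr:addr + size], "little", signed=signed)
    return raw & 0xFFFFFFFF


def store(memory, base, offset, size, value):
    """Store the low size bytes of value at base + offset."""
    addr = base + offset
    truncated = value & ((1 << (8 * size)) - 1)
    memory[addr:addr + size] = truncated.to_bytes(size, "little")


def gather32(memory, addresses):
    """Rd[i] = [Rbase[i]]: one word per lane, each from its own address."""
    return [load(memory, a, 0, 4, False) for a in addresses]


def scatter32(memory, addresses, values):
    """[Rbase[i]] = Rs[i]: one word per lane, each to its own address."""
    for addr, value in zip(addresses, values):
        store(memory, addr, 0, 4, value)


mem = bytearray(64)
store(mem, 8, 4, 1, 0x80)                 # one byte at address 12
as_signed = load(mem, 8, 4, 1, True)      # sign-extending byte load
as_unsigned = load(mem, 8, 4, 1, False)   # zero-extending byte load
```

Here `as_signed` is 0xFFFFFF80 and `as_unsigned` is 0x00000080, which is the only difference between the _s8 and _u8 variants in this model.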