NaplesPU LLVM Documentation

From NaplesPU Documentation
Revision as of 15:37, 21 June 2019 by Francesco (talk | contribs) (Francesco moved page Nu+ LLVM Documentation to NaplesPU LLVM Documentation)

The main task of the backend is to generate NPU assembly code from the LLVM IR produced by the Clang frontend. It also provides the object representation of the classes needed to build the assembler and the disassembler. The NPU backend lives in the NaplesPU folder under the "compiler/lib/Target" directory. It contains several files, each implementing a specific class of the LLVM framework.

An LLVM backend is built from two types of files, C++ and TableGen source files. Refer to the TableGen section below for a detailed explanation of the latter.

Required reading

Before working on LLVM, you should be familiar with the following concepts. In particular:

  1. Basic Blocks
  2. SSA (Static Single Assignment) form
  3. AST (Abstract Syntax Tree)
  4. DAG (Directed Acyclic Graph)

In addition to general aspects about compilers, it is recommended to review the following topics:

  1. LLVM architecture
  2. LLVM Intermediate Representation

For further information, see the textbooks Getting Started with LLVM Core Libraries and LLVM Cookbook.

See also this article to get an overview of the main CodeGenerator phases.

TableGen

TableGen is a record-oriented language used to describe target-specific information. It was written by the LLVM team to simplify backend development and to avoid code redundancy. For example, when using TableGen, if some feature of the target-specific register file changes, you do not need to modify every file where the register appears; you only need to modify the .td file that contains its definition. Currently, TableGen is used to define instruction formats, instructions, registers, pattern-matching DAGs, instruction selection matching order, calling conventions, and target platform properties.

For further information, check the TableGen Documentation
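
For instance, the following minimal fragment illustrates the TableGen record syntax. This is a hypothetical sketch, not taken from the NaplesPU sources: a class declares a parameterized record template, and def instantiates a concrete record from it.

class ExampleInsn<bits<6> opc, string mnemonic> {
  bits<6> Opcode = opc;      // encoding field
  string AsmName = mnemonic; // printed mnemonic
}
def ADD_EXAMPLE : ExampleInsn<0b000100, "add">;

The llvm-tblgen tool expands such records into C++ tables that are then consumed by the backend sources.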

Backend Description

This section shows how the backend support for NPU is implemented within LLVM.

Target Definition

The target-specific information is described in TableGen files. The custom target is defined by creating a new NaplesPU.td file, in which the target itself is described. This file implements the target-independent interfaces provided by Target.td, using the class inheritance mechanism.

The code below is the Target class definition that should be implemented in NaplesPU.td.

class Target {
  InstrInfo InstructionSet;
  list<AsmParser> AssemblyParsers = [DefaultAsmParser];
  list<AsmParserVariant> AssemblyParserVariants = [DefaultAsmParserVariant];
  list<AsmWriter> AssemblyWriters = [DefaultAsmWriter];
}

This file should also include the other defined .td target-related files. The target definition is done as follows:

def : Processor<"naplespu", NoItineraries, []>;

where Processor is a class defined in Target.td: class Processor<string n, ProcessorItineraries pi, list<SubtargetFeature> f>

where:

  • n is the chipset name, used in the command line option -mcpu to determine the appropriate chip.
  • pi is the processor itinerary, as described in the theoretical description of the LLVM instruction scheduling phase. NoItineraries means that no itinerary is defined.
  • f is a list of target features.
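
As an illustration, a target feature could be declared and attached to the processor as follows. This is a hypothetical sketch: the FeatureVector name and its fields are illustrative, and NaplesPU currently passes an empty feature list.

def FeatureVector : SubtargetFeature<"vector", "HasVector", "true",
                                     "Enable 512-bit vector instructions">;
def : Processor<"naplespu", NoItineraries, [FeatureVector]>;

Each feature maps to a boolean attribute of the Subtarget class and can be toggled from the command line via -mattr.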

Registers Definition

The target-specific registers are defined in NaplesPURegisterInfo.td. LLVM provides two ways to define a register, both of them declared in Target.td. The first one is used to define a simple scalar register and follows the declaration below:

class Register<string n, list<string> altNames = []>

where n is the register name, while altNames is a list of register alternative names.

The second method to define a register is to inherit from the class declared below:

class RegisterWithSubRegs<string n, list<Register> subregs>

This second way is used when it is necessary to define a register that is a collection of sub-registers. There is also a third way to define registers: defining a super-register, that is, a pseudo-register resulting from the combination of other sub-registers. It is useful when there is no architectural support for larger registers:

class RegisterTuples<list<SubRegIndex> Indices, list<dag> Regs>
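
For instance, RegisterTuples could pair consecutive scalar registers into 64-bit pseudo-registers, as in the hypothetical sketch below; the sub-register indices and the SPairs name are illustrative and are not part of the NaplesPU sources.

def sub_lo : SubRegIndex<32>;
def sub_hi : SubRegIndex<32, 32>;
def SPairs : RegisterTuples<[sub_lo, sub_hi],
                            [(decimate (sequence "S%u", 0, 57), 2),
                             (decimate (sequence "S%u", 1, 57), 2)]>;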

For example, the following code declares the NaplesPU general-purpose register class:

class NaplesPUGPRReg<string n> : Register <n>;

At this point, registers can be instantiated as follows:

foreach i = 0-57 in {
    def S#i : NaplesPUGPRReg<"s"#i>, DwarfRegNum<[i]>;
 }
 ...
def SP_REG : NaplesPUGPRReg<"sp">, DwarfRegNum<[61]>; //stack pointer
...
foreach i = 0-63 in {
    def V#i : NaplesPUGPRReg<"v"#i>, DwarfRegNum<[!add(i, 64)]>;
}

The instantiation reflects the custom target hardware architecture, with a set of scalar registers and a set of vector ones. Each register also inherits from the DwarfRegNum class, which assigns it an incremental number. This is useful for the internal identification of registers, consistent with the DWARF standard.

Now that the registers are defined, they must be grouped into register classes, which govern how they are allocated.

LLVM provides the RegisterClass defined as below:

class RegisterClass<string namespace, list<ValueType> regTypes, int alignment, dag regList>

in which:

  • namespace is the namespace associated with the class;
  • regTypes is a list of ValueType values indicating the types of variables that can be allocated to registers of this class;
  • alignment is the memory alignment associated with the registers in the RegisterClass;
  • regList is the list of registers that belong to the class. Since the parameter type is dag, TableGen provides operators (such as add and sequence) to define the set of registers.

For example, to match our target features it is necessary to define two classes:

  • GPR32, the abstraction of 32-bit wide scalar registers;
  • VR512W, the abstraction of 512-bit wide vector registers.

def GPR32 : RegisterClass<"NaplesPU", [i32, f32, i64, f64], 32, (add (sequence "S%u", 0, 57),
  TR_REG, MR_REG, FP_REG, SP_REG, RA_REG, PC_REG)>;
def VR512W : RegisterClass<"NaplesPU", [v16i32, v16f32, v16i8, v16i16], 512, (sequence "V%u", 0, 63)>;

Calling Convention

This section shows how to define the calling convention, that is how parameters are passed to sub-functions, and how the return value is sent back to the caller.

The calling convention is defined in the NaplesPUCallingConv.td by using the classes defined in the TargetCallingConv.td file.

For our purposes, it is necessary to define the calling convention in terms of the registers used to pass arguments to the callee. LLVM provides the CallingConv class, defined below:

class CallingConv<list<CCAction> actions>

This class requires a list of CCAction. TargetCallingConv.td contains a set of CCAction-derived classes that are used to define the sub-function calling behaviour.

The calling convention for the custom target device uses the first eight registers, and then the stack for the remaining parameters. This means that 32-bit variables are passed to the callee in registers Si, where i = 0..7. The same scheme is adopted for vector variables.

The calling convention should also take care of passing types that are not natively supported by the target. In this case, the adopted solution is type promotion. The mechanism is simple: the unsupported type is promoted to a supported one. It can be easily implemented by using only the CCAction classes provided by LLVM. With this approach, i1, i8 and i16 are promoted to i32, while v16i8 and v16i16 are promoted to v16i32.

def CC_NaplesPU : CallingConv<[
  CCIfType<[i1, i8, i16], CCPromoteToType<i32>>,
  CCIfType<[v16i8, v16i16], CCPromoteToType<v16i32>>,
  CCIfNotVarArg<CCIfType<[i32, f32], CCAssignToReg<[S0, S1, S2, S3, S4, S5, S6, S7]>>>,
  CCIfNotVarArg<CCIfType<[v16i32, v16f32], CCAssignToReg<[V0, V1, V2, V3, V4, V5, V6, V7]>>>,
  CCIfType<[i32, f32], CCAssignToStack<4, 4>>,
  CCIfType<[v16i32, v16f32], CCAssignToStack<64, 64>>
]>;

The calling convention for NPU in terms of the result-returning mechanism passes return values in the first six scalar registers.

If the return type does not correspond to any native target type, the solution again is type promotion.

def RetCC_NaplesPU32 : CallingConv<[
  CCIfType<[i1, i8, i16], CCPromoteToType<i32>>,
  CCIfType<[i32, f32], CCAssignToReg<[S0, S1, S2, S3, S4, S5]>>
]>;

LLVM also provides a mechanism to define the registers that the callee must save before starting the function execution. The mechanism consists of defining an instance of the class CalleeSavedRegs:

class CalleeSavedRegs<dag saves>

where saves is the list of registers to be saved.

For example, in NPU, the callee saved registers are defined as follows:

def NaplesPUCSR : CalleeSavedRegs<(add (sequence "S%u", 50, 57), MR_REG, FP_REG, RA_REG, (sequence "V%u", 56, 63))>;

ISA Support

In order to define the LLVM support for NPU, the next step is to handle the target-specific code generation by implementing its ISA (Instruction Set Architecture) in LLVM. To describe it, TableGen provides the Instruction class, located in Target.td.

class Instruction {
  string Namespace = "";
  dag OutOperandList;       
  dag InOperandList;     
  string AsmString = ""; 
  list<dag> Pattern;
  list<Register> Uses = []; 
  list<Register> Defs = [];
  list<Predicate> Predicates = [];
  int Size = 0;
  bit isReturn     = 0;     
  bit isBranch     = 0;    
  bit isPseudo     = 0;    
  bit isCodeGenOnly = 0;
  bit isAsmParserOnly = 0;
  ...
}

As shown above, the Instruction class provides many member variables to describe an instruction. The following are the subset used for our purposes:

  • OutOperandList is the set of output operands of the instruction, of dag type.
  • InOperandList is the set of input operands of the instruction, of dag type.
  • AsmString is the .s format string used to print the instruction.
  • isReturn marks the instruction as a return instruction.
  • isBranch marks the instruction as a branch, that is, one that breaks the sequential control flow.
  • isPseudo marks the instruction as a pseudo-instruction, i.e. an instruction for which there is no corresponding machine instruction.
  • isCodeGenOnly, isAsmParserOnly and usesCustomInserter also define pseudo-instructions, but with different behaviour. The first describes a pseudo-instruction used for CodeGen modeling purposes, while the second describes a pseudo-instruction used by the assembler parser. The third indicates that the pseudo-instruction needs to be manually lowered to machine instructions.
  • Uses and Defs are lists of registers that are respectively read and written by the instruction.
  • Predicates are logical expressions that must be checked in the Instruction Selection phase.
  • Pattern is a list of dags, each of which describes a DAG pattern used by the instruction selector to turn a set of SelectionDAG nodes into the corresponding target-specific instruction.

In order to handle the complexity of the target platform ISA, a hierarchical schema of classes has been adopted, each class representing a specific instruction format.

First of all, a generic format for the target-specific instructions is defined:

class InstNaplesPU<dag outs, dag ins, string asmstr, list<dag> pattern>
          : Instruction {
  field bits<32> Inst;
  let Namespace = "NaplesPU";
  let Size = 4;
  dag OutOperandList = outs;
  dag InOperandList = ins;
  let AsmString   = asmstr;
  let Pattern = pattern;
}


All instruction classes inherit from the one above, specifying what the instruction bits represent. Since the target supports different instruction formats, a different class is defined for each of them. Each class is then defined in terms of other sub-classes. For example, it is useful to analyze the R format, the instruction format used to handle logical, arithmetic and memory operations. Within this instruction type it is useful to distinguish among different sub-types, by structuring a hierarchy.

This picture shows how the classes are inherited from the FR class.

In this way, we define a set of classes implementing the hierarchy described above, from the root, which represents the most generic R-type instruction, down to the leaves. Since the instruction format requires a set of FMT bits, it is also necessary to define a class to handle them:

class Fmt<bit val> {
  bit Value = val;
}
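
The concrete Fmt values used below, such as Fmt_S for scalar operands, are then simple instances of this class. The following is a sketch consistent with that usage; the actual bit assignments should be checked against the NaplesPU sources.

def Fmt_S : Fmt<0>; // scalar operand
def Fmt_V : Fmt<1>; // vector operand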

The root class is defined as the most general class of the hierarchy. For this reason, the second source operand is not defined. The derived class, implementing the two-operand R-type instruction, will properly set the related field.

class FR<dag outs, dag ins, string asmstr, list<dag> pattern, bits<6> opcode, Fmt fmt2, Fmt fmt1, Fmt fmt0>
   : InstNaplesPU<outs, ins, asmstr, pattern> {
  bits <6> dst;
  bits <6> src0;
  let Inst{31-30} = 0;
  let Inst{29-24} = opcode;
  let Inst{23-18} = dst;
  let Inst{17-12} = src0;
  let Inst{5} = 0; //unused
  let Inst{4} = 0; //No 64-bit support
  let Inst{3} = fmt2.Value;
  let Inst{2} = fmt1.Value;
  let Inst{1} = fmt0.Value;
  let Inst{0} = 1;
  let Uses=[MR_REG];
}

class FR_TwoOp<dag outs, dag ins, string asmstr, list<dag> pattern, bits<6> opcode, Fmt fmt2, Fmt fmt1, Fmt fmt0>
   : FR<outs, ins, asmstr, pattern, opcode, fmt2, fmt1, fmt0> {
  bits <6> src1;
  let Inst{11-6} = src1;
  
}


class FR_OneOp<dag outs, dag ins, string asmstr, list<dag> pattern, bits<6> opcode, Fmt fmt2, Fmt fmt1, Fmt fmt0>
   : FR_TwoOp<outs, ins, asmstr, pattern, opcode, fmt2, fmt1, fmt0> {
   let Inst{11-6} = 0;
}

The code above shows the implementation of the class hierarchy illustrated above. All the other instruction formats are implemented using the same method.

However, the class construct is not a flexible way to express the commonalities among more than two definition instances. As a solution, LLVM provides the multiclass construct. For example, if an arithmetic instruction is designed to work both with an immediate value and with two registers, the class construct forces an implementation like the one below.

class rrinst<int opc, string asmstr>
  : inst<opc, !strconcat(asmstr, " $dst, $src1, $src2"),
         (ops GPR:$dst, GPR:$src1, GPR:$src2)>;
class riinst<int opc, string asmstr>
  : inst<opc, !strconcat(asmstr, " $dst, $src1, $src2"),
         (ops GPR:$dst, GPR:$src1, Imm:$src2)>;
def ADD_rr : rrinst<0b111, "add">;
def ADD_ri : riinst<0b111, "add">;
...

Using the class construct in this context quickly becomes unfeasible. By contrast, the code below shows how much cleaner the approach using the multiclass construct is.

multiclass ri_inst<int opc, string asmstr> {
  def _rr : inst<opc, !strconcat(asmstr, " $dst, $src1, $src2"),
                 (ops GPR:$dst, GPR:$src1, GPR:$src2)>;
  def _ri : inst<opc, !strconcat(asmstr, " $dst, $src1, $src2"),
                 (ops GPR:$dst, GPR:$src1, Imm:$src2)>;
}
defm ADD : ri_inst<0b111, "add">;
...

The multiclass construct is used to handle different register types for arithmetic operations. Instead of defining several classes, one for each combination of scalar and vector registers as source operands, the multiclass construct covers all the variants with a single definition.

multiclass FArithInt_TwoOp<string operator, SDNode OpNode, bits<6> opcode> {
  def SSS_32 : FR_TwoOp<
    (outs GPR32:$dst),
    (ins GPR32:$src0, GPR32:$src1),
    operator # "_i32 $dst, $src0, $src1",
    [(set i32:$dst, (OpNode i32:$src0, i32:$src1))],
    opcode,
    Fmt_S,
    Fmt_S,
    Fmt_S>;
  ...
}

Pattern Recognition

The Instruction Selection phase replaces IR instructions with target-specific ones. This pass works in two ways, implemented in the Select() method of the SelectionDAGISel class: the first tries to find a matching instruction by looking at the target description provided by the compiled TableGen files, while the second uses custom matching code defined by the user.

Consequently, it is useful to define instruction patterns in order to help the Select method to find a proper match. TableGen defines a set of classes that are used to declare a pattern.

For example, an important concept implemented here is the Pattern Fragment, a reusable chunk of DAG that matches specific nodes. LLVM provides a few implementations of it, starting from PatFrag, the most general class, down to OutPatFrag, PatLeaf and ImmLeaf, which inherit from it.

The PatFrag definition is shown below:

class PatFrag<dag ops, dag frag, code pred = [{}], SDNodeXForm xform = NOOP_SDNodeXForm> : SDPatternOperator

where:

  • ops represents the input operands;
  • frag is the fragment to be matched;
  • pred is the predicate to be satisfied in order to apply the transformation;
  • xform is the transformation to be applied if all the previous points are fulfilled.

In NaplesPU, the PatFrag class is used to handle store operations in terms of the address space they refer to. For our purposes, a store targets global memory if and only if a condition on the address-space identifier is satisfied. Otherwise, the memory operation refers to the scratchpad address space.

def MemStore : PatFrag<(ops node:$val, node:$ptr),
                       (store node:$val, node:$ptr), [{
               if(cast<StoreSDNode>(N)->getAddressSpace() != 77)
                  return !cast<StoreSDNode>(N)->isTruncatingStore();
               else
                  return false;}]>;
def ScratchpadStore : PatFrag<(ops node:$val, node:$ptr),
                              (store node:$val, node:$ptr), [{
               if(cast<StoreSDNode>(N)->getAddressSpace() == 77)
                 return !cast<StoreSDNode>(N)->isTruncatingStore();
               else
                 return false;}]>;

PatLeaf is a PatFrag sub-class without the transformation and operand parameters, used, for example, to match immediate values.

def simm16 : PatLeaf<(imm), [{ return isInt<16>(N->getSExtValue()); }]>;
def simm9 : PatLeaf<(imm), [{ return isInt<9>(N->getSExtValue()); }]>;

The patterns above recognise whether an immediate is 9-bit or 16-bit wide. LLVM also provides a class used to define more complex patterns than the ones explained above. It is typically used to handle addressing modes. The described behaviour is implemented in the class ComplexPattern:

class ComplexPattern<ValueType ty, int numops, string fn, list<SDNode> roots = [], list<SDNodeProperty> props = []>

where:

  • ty is the type associated to the pattern;
  • numops is the number of operands returned by the function;
  • fn is the name of the function;
  • roots is the list of possible root nodes of the sub-DAG to match;
  • props is a list of possible predicates to match.

As mentioned above, this kind of pattern is usually used to handle addressing modes. In our case, the pattern is used to evaluate an address as base plus offset. For this reason, it is necessary to define a proper selection function in the NaplesPUISelDAGToDAG.cpp file.

The following line shows how to define a ComplexPattern instance.

def ADDRri : ComplexPattern<iPTR, 2, "SelectADDRri", [frameindex], []>;

The pattern above calls the method SelectADDRri defined in the NaplesPUISelDAGToDAG.cpp when an SDNode is of iPTR type.
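
A memory operand built on this complex pattern can then appear in instruction definitions. The following is a hedged sketch: the MEMri operand and the LW32_EXAMPLE instruction are illustrative names, not the actual NaplesPU definitions.

def MEMri : Operand<iPTR> {
  let MIOperandInfo = (ops GPR32, i32imm);
}
def LW32_EXAMPLE : InstNaplesPU<
  (outs GPR32:$dst), (ins MEMri:$addr), "lw32 $dst, $addr",
  [(set i32:$dst, (load ADDRri:$addr))]>;

When the selector encounters a matching load node, it invokes SelectADDRri to decompose the address into the base register and the immediate offset.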

Instructions definition

This section describes how to use the formats defined in NaplesPUInstrFormats.td to instantiate the target-specific instructions. Because of the instruction formats definition described before, the target-specific ISA can be realized by simply instantiating the formats themselves.

For instance, the ADD instruction is of the arithmetic format, that is implemented by using the multiclass construct. As a consequence, the instantiation is:

defm ADD : FArithInt_TwoOp<"add", add, 4>;

The other arithmetic instructions are implemented in a similar way as above. Thanks to the multiclass construct a single definition instantiates scalar and vector variants of the ADD instruction.

Instruction definition is located in NaplesPUInstrInfo.td.

There are some cases in which the TableGen-generated output is not enough to produce a legalized DAG composed only of machine instruction nodes. In these cases it can be useful to define a pattern that helps the instruction selection algorithm to find a match. Beyond the PatFrag class, LLVM provides an explicit way to force a pattern match through the class Pat, defined as follows:

class Pat<dag pattern, dag result> : Pattern<pattern, [result]>;

For instance, consider the frontend intrinsic int_npu_vector_mixf32. It is converted into a target-specific node through the Pat class, as shown below.

def : Pat<(int_npu_vector_mixf32 v16f32:$src0, v16f32:$oldvalue), (MOVE_VV_M_32 v16f32:$src0, v16f32:$oldvalue)>;

As a result of the pattern application, the considered node is transformed into MOVE_VV_M_32, a target-specific node.

Even though the use of Pat is an easy-to-apply solution to help the instruction selector, its power is limited to simple substitutions. More complex behaviours have to be managed manually in the lowering pass.

Instruction Selection Lowering

Because the TableGen description is not enough to fully replace target-independent nodes, LLVM provides a C++ file in which to implement the lowering pass, namely NaplesPUISelLowering.cpp. For our purposes, the lowering mainly concerns the pseudo-instructions defined in NaplesPUInstrInfo.td. Recall that a pseudo-instruction is a special instruction that does not map to any machine code. For example, it is helpful to analyze how the load of a 32-bit immediate value is managed. The 32-bit instruction format allows a move operation with an immediate operand that is at most 16 bits wide. To handle wider immediate operands it is necessary to find a solution. Looking at NaplesPUInstrInfo.td, the mentioned operation, named LoadI32, is defined as a pseudo-instruction with the usesCustomInserter bit set, as shown below.

let usesCustomInserter = 1 in {
  def LoadI32 : Pseudo<
    (outs GPR32:$dst),
    (ins GPR32:$val),
    [(set i32:$dst, (i32 imm:$val))]>;
 ...

The LLVM APIs are designed to deal with DAG nodes. To handle the LoadI32 operation, the MOVEIH and MOVEIL instructions are used. For this reason, the 32-bit value is split into two 16-bit sub-values that can be used to build the machine instruction nodes.

MachineBasicBlock *
NaplesPUTargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI,
                                                 MachineBasicBlock *BB) const {
  switch (MI.getOpcode()) {
  case NaplesPU::LoadI32:
  	return EmitLoadI32(&MI, BB);
  ...

The LoadI32 is split into two machine instructions, one using MOVEIH and the other using MOVEIL, as the code below shows.

MachineBasicBlock* EmitLoadI32(...) const {
  DebugLoc DL = MI->getDebugLoc();
  const TargetInstrInfo *TII = Subtarget.getInstrInfo();
  int64_t ImmOp = MI->getOperand(1).getImm();
  if ((ImmOp & 0xFFFF0000) != 0)
    InsertLoad32Immediate(...);
  ...
  MI->eraseFromParent();
  return BB;
}

void InsertLoad32Immediate(...) {
  BuildMI(*BB, *MI, *DL, TII->get(NaplesPU::MOVEIHSI))
      .addReg(DestReg, RegState::Define)
      .addImm((Immediate >> 16) & 0xFFFF);
  BuildMI(*BB, *MI, *DL, TII->get(NaplesPU::MOVEILSI))
      .addReg(DestReg)
      .addImm(Immediate & 0xFFFF);
}

Frame Lowering

This section focuses on the definition of information about the stack frame layout on the target. For this purpose, LLVM provides a target-independent interface, defined in TargetFrameLowering, that has to be implemented for the target platform. It holds several pieces of information such as the direction of stack growth, the known stack alignment on entry to each function, and the offset to the local area.

The custom platform has a downward-growing stack, 64-bit aligned. The specific implementation of the TargetFrameLowering interface is in the file NaplesPUFrameLowering.cpp, and it includes the implementation of the following functions:

  • emitPrologue and emitEpilogue add the prologue and epilogue code to a function, in terms of stack management.
  • eliminateCallFramePseudoInstr replaces the call-frame pseudo-instructions with target-specific instructions.
  • hasReservedCallFrame is a boolean function used to check whether the prologue inserter should reserve stack space.
  • hasFP is a boolean function that returns true if the specified function should have a dedicated frame pointer register.
  • determineCalleeSaves determines which of the CalleeSavedRegs should actually be saved.

Code Emission Support

The code emission phase starts with the AsmPrinter pass. The core of this pass is the EmitInstruction function, which takes a MachineInstr object as input and transforms it into an MCInst instance. This transformation is done by the Lower method, defined in NaplesPUMCInstLower.cpp. The generated MCInst instances are sent to the MCStreamer, which can operate in two modes: generating the target-specific binary code through the MCObjectStreamer object, or generating the assembly code through the MCAsmStreamer object. The target-specific behaviour is implemented in NaplesPUInstPrinter.cpp for assembly code generation and in NaplesPUMCCodeEmitter.cpp for binary code generation.