Extending NaplesPU for OpenCL support

This section describes how OpenCL C kernel compilation is supported in the NaplesPU toolchain. Since Clang natively supports IR generation for OpenCL C kernels, our work focused on adapting the generated code to the NaplesPU core. Because the NaplesPU start routine jumps to the main label after initializing all necessary resources, a custom optimization pass is needed to auto-generate the main function and to set up the calling frame for the kernel function. OpenCL also requires the implementation of custom libraries that extend the libc set.

LLVM IR Optimization

An IR optimization pass can be of two different types:

  • Analysis: it does not modify the intermediate representation;
  • Transform: it produces a modified IR as output.

LLVM provides an API to create a new optimization pass. The API is organized hierarchically around the Pass class, from which the following classes inherit:

  • ModulePass: the most general pass. It operates on the whole module and on the functions it is composed of.
  • FunctionPass: it operates on each function defined in the module (external declarations excluded), in no particular order. Functions cannot be deleted.
  • BasicBlockPass: it operates on individual basic blocks. A basic block can be modified but not removed.

Given the classification above, a ModulePass is the most suitable choice for our purpose.

To write a custom module pass, it is necessary to define a new class that inherits from ModulePass and overrides the runOnModule() method, which contains the code executed for each module during LLVM optimization. If the pass modifies the IR, runOnModule() returns true; otherwise it returns false. As previously discussed, the custom pass should create the main function, retrieve the kernel arguments from the argv array, and then call the kernel function with the right parameters.
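
A minimal skeleton of such a pass might look as follows. It is only a sketch: the class name CreateMainPass is illustrative (the actual NaplesPU pass may be named differently), and it assumes the legacy pass manager used by the snippets later in this section.

#include "llvm/Pass.h"
#include "llvm/IR/Module.h"

using namespace llvm;

namespace {
// Illustrative skeleton of the wrapper-generating pass.
struct CreateMainPass : public ModulePass {
  static char ID;
  CreateMainPass() : ModulePass(ID) {}

  // Builds the @main wrapper; returns true when the IR has been modified.
  bool runOnModule(Module &M) override;
};
} // end anonymous namespace

char CreateMainPass::ID = 0;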

The generated code should have the form shown below.

; Function Attrs: nounwind
define void @main(i32, i8** nocapture readonly) local_unnamed_addr #0 {
  %3 = bitcast i8** %1 to i32*
  %4 = load i32, i32* %3, align 4
  ...
  tail call void @kernel_f(i32 %4, ...) #2
  ret void
}

Consequently, the ModulePass is structured around the following steps (a consolidated sketch of the resulting runOnModule() body follows the list):

  • Definition of the main function signature:
FunctionType *main_type = 
TypeBuilder<void(int, char**), false>::get(ctx);
Function *func = 
cast<Function>(M.getOrInsertFunction("main", main_type));
  • Creation of an entry labeled basic block:
BasicBlock *block = BasicBlock::Create(ctx, "entry", func, 0);
builder.SetInsertPoint(block);
  • For each kernel argument, create an aligned load instruction to retrieve the argument from the argv array and add it to the kernel arguments:
argument_list[counter] = builder.CreateLoad(arg.getType(),
	builder.CreateBitCast(builder.CreateGEP(&argv, 
	builder.getInt32(index)), 
	arg.getType()->getPointerTo()));
argument_list[counter]->setAlignment(4);
kernel_arguments.push_back(argument_list[counter++]);
  • Kernel function call:
builder.CreateCall(kernel_function, kernel_arguments);
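
Putting the four steps together, a possible runOnModule() body for the skeleton sketched earlier is shown below. It assumes a pre-LLVM 9 tree (where TypeBuilder and the Constant*-returning getOrInsertFunction used above are still available), that the kernel is found by name, and that each kernel argument occupies one argv slot; the kernel name and the starting offset into argv are illustrative (the generated code shown later in this section starts from a different slot).

// In addition to the headers included in the skeleton above:
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/TypeBuilder.h"
#include <vector>

bool CreateMainPass::runOnModule(Module &M) {
  LLVMContext &ctx = M.getContext();
  IRBuilder<> builder(ctx);

  // Kernel lookup: a known name is assumed here; the real pass may identify
  // the kernel through metadata or a command-line option.
  Function *kernel_function = M.getFunction("kernel_function");
  if (!kernel_function)
    return false;                       // nothing to do, IR unchanged

  // 1. main(int, char**) signature
  FunctionType *main_type =
      TypeBuilder<void(int, char **), false>::get(ctx);
  Function *func =
      cast<Function>(M.getOrInsertFunction("main", main_type));

  // 2. entry basic block
  BasicBlock *block = BasicBlock::Create(ctx, "entry", func, 0);
  builder.SetInsertPoint(block);

  // argv is the second formal parameter of main
  Function::arg_iterator args = func->arg_begin();
  ++args;                               // skip argc
  Value *argv = &*args;                 // char **argv

  // 3. one aligned load per kernel argument
  std::vector<Value *> kernel_arguments;
  unsigned index = 0;
  for (Argument &arg : kernel_function->args()) {
    Value *slot = builder.CreateGEP(argv, builder.getInt32(index++));
    Value *typed = builder.CreateBitCast(slot, arg.getType()->getPointerTo());
    LoadInst *load = builder.CreateLoad(arg.getType(), typed);
    load->setAlignment(4);
    kernel_arguments.push_back(load);
  }

  // 4. call the kernel and return
  builder.CreateCall(kernel_function, kernel_arguments);
  builder.CreateRetVoid();
  return true;                          // the IR has been modified
}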

Once the pass is defined, the next step consists in registering it with the PassManager so that it is added to the IR optimization pipeline. LLVM provides a set of macros used to initialize the pass:

INITIALIZE_PASS_BEGIN(...)
INITIALIZE_PASS_END(...)

and to define the pass dependencies:

INITIALIZE_PASS_DEPENDENCY(...)

The pass initialization must be linked through the InitializePasses.h header file, in which the following function has to be declared:

void initialize<PassName>Pass(PassRegistry &);

In addition, the specific ModulePass constructor must be declared in the LinkAllPasses.h header file.
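
Putting these pieces together, the registration might look as follows; the pass name CreateMainPass, the command-line string and the description are illustrative, not the actual NaplesPU ones.

// In the pass source file:
INITIALIZE_PASS_BEGIN(CreateMainPass, "create-main",
                      "Generate the main() wrapper for the OpenCL kernel", false, false)
INITIALIZE_PASS_END(CreateMainPass, "create-main",
                    "Generate the main() wrapper for the OpenCL kernel", false, false)

// In include/llvm/InitializePasses.h (inside namespace llvm):
void initializeCreateMainPassPass(PassRegistry &);

// In include/llvm/LinkAllPasses.h (inside the ForcePassLinking constructor),
// assuming a createCreateMainPass() factory function is exposed:
(void) llvm::createCreateMainPass();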

By following these steps, the optimization pass is automatically chained into the opt tool pipeline. It is also useful to have the pass executed automatically in every clang compilation. To obtain this behaviour, the code generation step has to be modified: clang provides the EmitAssemblyHelper::CreatePasses method to add a pass to the PassManager.
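
A minimal sketch of that hook is shown below; it assumes the legacy pass manager and the hypothetical createCreateMainPass() factory function used above, so the actual NaplesPU integration may differ.

// Inside EmitAssemblyHelper::CreatePasses (clang/lib/CodeGen/BackendUtil.cpp),
// where MPM is the per-module legacy::PassManager:
MPM.add(createCreateMainPass());   // hypothetical factory for our ModulePass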

As an example, consider the following trivial OpenCL kernel:

__kernel void kernel_function(unsigned param,
  __global unsigned* out)
{

 unsigned int i = 0;
 *out = i + param;

}

The optimization pass creates the main function as follows:

define void @main(i32, i8** nocapture readonly) local_unnamed_addr #1 {
entry:
  %2 = getelementptr i8*, i8** %1, i64 6
  %3 = bitcast i8** %2 to i32*
  %4 = load i32, i32* %3, align 4
  %5 = getelementptr i8*, i8** %1, i64 7
  %6 = bitcast i8** %5 to i32**
  %7 = load i32*, i32** %6, align 4
  tail call void @kernel_function(i32 %4, i32* %7)
  ret void
}

As a result, the NaplesPU processor is now capable of running OpenCL kernels.

Library Extensions

OpenCL provides a set of vector types that explicitly support lock-step execution. Since NaplesPU is internally organized as a set of hardware lanes, the natively supported vector types are the following:

  • char16, uchar16 are respectively mapped on vec16i8 and vec16u8;
  • short16, ushort16 are respectively mapped on vec16i16 and vec16u16;
  • int16, uint16 are respectively mapped on vec16i32 and vec16u32;
  • float16 is mapped on vec16f32.


OpenCL support for these vector types is provided by adapting the compiler standard library. The adaptation involves modifying the stdint.h header by adding the lines below.

typedef char char16 __attribute__((ext_vector_type(16)));
typedef unsigned char uchar16 __attribute__((ext_vector_type(16)));
typedef short short16 __attribute__((ext_vector_type(16)));
typedef unsigned short ushort16 __attribute__((ext_vector_type(16)));
typedef int int16 __attribute__((ext_vector_type(16)));
typedef unsigned int uint16 __attribute__((ext_vector_type(16)));
typedef float float16 __attribute__((ext_vector_type(16)));
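
With these typedefs in place, arithmetic on the OpenCL vector types is expressed element-wise and mapped onto the 16 hardware lanes. A minimal illustration follows; the function name vadd16 is ours, not part of the library.

/* Element-wise addition of two int16 vectors: with the ext_vector_type
   typedefs above, the operation is lowered to a vector add on vec16i32. */
int16 vadd16(int16 a, int16 b)
{
 return a + b;
}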

OpenCL supports a set of builtins used by work-items to retrieve information about the enqueued kernel, such as the work-item local and global IDs, the index space sizes, and so on. The implementation of these functions relies on the target builtins to read information stored in platform registers. For instance, consider the following code.

#define WORK_DIM 1

uint get_work_dim () {
 return WORK_DIM; 
}

uint get_local_id(uint dimindx){
 return __builtin_NaplesPU_read_control_reg(LOCAL_ID);
}

uint get_global_id(uint dimindx){
 return __builtin_NaplesPU_read_control_reg(GLOBAL_ID);
}

uint get_group_id(uint dimindx){
 return __builtin_NaplesPU_read_control_reg(CORE_ID);
}

As shown, the work-item builtins are implemented on top of NaplesPU builtins that read the control register holding the desired value. Note also that the WORK_DIM macro is statically defined as 1 because of hardware constraints.
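
As a usage illustration of these builtins, a kernel can use get_global_id() to address a distinct element of a global buffer for each work-item; the kernel name fill_ids is ours, for illustration only.

__kernel void fill_ids(__global unsigned* out)
{
 /* Each work-item writes its own global ID; only dimension 0 is
    meaningful, since WORK_DIM is fixed to 1. */
 unsigned id = get_global_id(0);
 out[id] = id;
}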