OpenCL

The OpenCL support for the nu+ architecture is made through pocl.

How to install vanilla pocl

  1. Download the pocl sources following the link.
  2. In order to build pocl, you need the following support libraries and tools:
    • Latest released version of LLVM & Clang
    • GNU make
    • libtool dlopen wrapper files (e.g. libltdl3-dev in Debian)
    • pthread (should be installed by default)
    • hwloc v1.0 or newer (e.g. libhwloc-dev)
    • pkg-config
    • cmake
    • libclang-3.8-dev if you are using Ubuntu 16.04 LTS
    On Ubuntu 16.04 LTS you can install all of them with the following command in a terminal:
    sudo apt-get install llvm clang libltdl3-dev libhwloc-dev pkg-config libclang-3.8-dev make cmake
  3. Build and install
    cd <directory-with-pocl-sources>
    mkdir build
    cd build
    cmake [-D<option>=<value> ...] ..
    make && make install

Using pocl

To compile an OpenCL host program with pocl, execute:

gcc example1.c -o example `pkg-config --libs --cflags pocl`

See [1] and [2] for further information.
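
For reference, a minimal example1.c that queries the first available platform and device might look like the following sketch (error handling omitted):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
      cl_platform_id platform;
      cl_device_id device;
      char name[128];

      /* Query the first platform (pocl) and its first device. */
      clGetPlatformIDs(1, &platform, NULL);
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
      clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
      printf("Device: %s\n", name);
      return 0;
    }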

Modify pocl

Adding a new device class in pocl

  • Create a directory for the new device class in "lib/CL/devices". In this case the "nuplus" folder is created.
  • Create at least the files newdevice.c and newdevice.h, in this case "nuplus.c" and "nuplus.h" (see the sketch after this list).
  • Create a CMakeLists.txt file in the device folder, listing the files created before and the device name.
     if(MSVC)
     set_source_files_properties( nuplus.h nuplus.c PROPERTIES LANGUAGE CXX )
     endif(MSVC)
     add_library("pocl-devices-nuplus" OBJECT nuplus.h nuplus.c)
  • Modify the "lib/CL/devices.c" file, including the new header file for the device and adding the init function to the pocl_devices_init_ops vector.
    #include "nuplus/nuplus.h"
    
     ...
    
     static init_device_ops pocl_devices_init_ops[] = {
       pocl_pthread_init_device_ops,
       pocl_basic_init_device_ops,
       pocl_nuplus_init_device_ops,
     #if defined(TCE_AVAILABLE)
       pocl_ttasim_init_device_ops,
     #endif
     #if defined(BUILD_HSA)
       pocl_hsa_init_device_ops,
     #endif
     };
    
     ...


  • Modify the "lib/CL/devices/CMakeLists.txt" adding the new device subdirectory name.
     ...
    
     add_subdirectory("nuplus")
    
     ...
  • Modify the "CMakeLists.txt" in the pocl root directory adding the new device name to the "OCL_DRIVERS".
     ...
    
     set(OCL_DRIVERS "basic pthreads nuplus")
    
     ...
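
As an illustration, the init function declared in nuplus.h and defined in nuplus.c follows the same pattern as the other pocl devices: it fills a pocl_device_ops structure with the device callbacks. A minimal sketch (the exact fields of pocl_device_ops vary between pocl versions, and every pocl_nuplus_* callback referenced here must be implemented in nuplus.c):

    #include "nuplus.h"

    void
    pocl_nuplus_init_device_ops(struct pocl_device_ops *ops)
    {
      ops->device_name = "nuplus";
      /* Wire up the device-operation callbacks implemented in nuplus.c. */
      ops->probe = pocl_nuplus_probe;
      ops->init  = pocl_nuplus_init;
      ops->read  = pocl_nuplus_read;
      ops->write = pocl_nuplus_write;
      ops->run   = pocl_nuplus_run;
      /* ... remaining operations ... */
    }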

Build pocl with a custom LLVM

Prerequisites

Make sure that LLVM is built with all targets and that its default target triple is set to the host. For the nu+ toolchain, the CMakeLists.txt in the LLVM root folder must be modified as reported below:

  1. Delete the code
    set(LLVM_TARGETS_TO_BUILD "NuPlus"
        CACHE STRING "Semicolon-separated list of targets to build, or \"all\".")
  2. Change
    set(LLVM_DEFAULT_TARGET_TRIPLE "nuplus-none-none" CACHE STRING
    to
    set(LLVM_DEFAULT_TARGET_TRIPLE "${LLVM_HOST_TRIPLE}" CACHE STRING

Building pocl with the nu+ compiler

  1. Build and install the nu+ compiler
  2. Build and install pocl
    cd <directory-with-pocl-sources>
    mkdir build
    cd build
    cmake -DWITH_LLVM_CONFIG=/usr/local/llvm-nuplus/bin/llvm-config ..
    make && make install

OpenCL support implementation

To enable OpenCL support, a new pocl device has been added. The device implements the device operations interface provided by pocl. The operations are: query devices, manage memory, transfer data, generate machine code, and manage execution.

Query Devices

To determine the number of available devices, the nu+ device-layer scans for FPGA devices connected to the host through USB. This allows multiple accelerators to be added to a system, each used by a different program.
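
A sketch of how the corresponding probe operation might count devices, assuming a hypothetical nuplus_scan_devices() driver call that returns the number of FPGA boards found on USB:

    /* Report how many nu+ FPGA boards are reachable over USB.
       nuplus_scan_devices() is a hypothetical driver call used here
       purely for illustration. */
    unsigned int
    pocl_nuplus_probe(struct pocl_device_ops *ops)
    {
      int count = nuplus_scan_devices();
      return count > 0 ? (unsigned int)count : 0;
    }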

Manage Memory

Memory on the accelerator is managed by the device-layer on the host. As a kernel instance cannot dynamically allocate memory, the amount of memory that needs to be reserved is known before an instance is executed. The nu+ device-layer uses the Bufalloc memory allocator included in pocl. This allocator is designed for typical OpenCL workloads and uses a simple first-fit algorithm to find free space. As only one OpenCL program can use the device at a time, the entire device memory is available to the allocator.
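
For illustration, setting up and using Bufalloc might look as follows (names taken from pocl's bufalloc.h; exact signatures may differ between pocl versions, and DEVICE_MEM_BASE/DEVICE_MEM_SIZE stand in for the accelerator's memory range):

    #include "bufalloc.h"

    static memory_region_t region;

    static void
    setup_allocator(void)
    {
      /* Hand the entire device memory to the allocator. */
      init_mem_region(&region, DEVICE_MEM_BASE, DEVICE_MEM_SIZE);
    }

    static chunk_info_t *
    device_malloc(size_t size)
    {
      /* First-fit search over the region; returns NULL when full. */
      return alloc_buffer_from_region(&region, size);
    }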

Transfer Data

Data transfer is done using the nu+ driver. Read and write requests to the device memory are performed using nuplus_read and nuplus_write. To set the location in memory where the data should be written, the nuplus_lseek function is used.
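
A sketch of a host-to-device copy built on these calls (the nuplus_* prototypes are an assumption here, modeled on POSIX lseek/write):

    /* Copy a host buffer to a given device memory address. */
    static void
    copy_to_device(int dev, uint32_t dev_addr, const void *buf, size_t size)
    {
      nuplus_lseek(dev, dev_addr, SEEK_SET); /* select target address */
      nuplus_write(dev, buf, size);          /* host -> device copy   */
    }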

Work-group functions

From the perspective of the programmer, all kernel instances in a work-group are executed concurrently. Instances in the same work-group can communicate with each other using local variables and wait on each other using barriers. A seemingly straightforward implementation would execute all kernel instances in parallel. This becomes complicated and expensive when there are more instances in a work-group than the number of compute units, as that would require scheduling and context-switching. In pocl, all kernel instances in a work-group are therefore combined into a single instance, called a work-group function. pocl creates loops over all work-group items in all "parallel regions", i.e. the regions between barriers. One of the main benefits of this technique is that a device can simply run work-group functions in parallel without having to implement shared memory and barriers, significantly simplifying the design of an accelerator. A large drawback is that a new work-group function needs to be compiled for every work-group size. As the work-group size is only known when the kernel is enqueued, first-time execution of a kernel with a not-yet-used configuration may take considerably longer than expected.
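
As a conceptual illustration (not pocl's actual generated code), consider a kernel with one barrier. For a fixed work-group size N, pocl turns it into a work-group function with one loop per parallel region; GLOBAL_ID() stands for the global-identifier computation described under Execution Context below:

    /* Original kernel: every instance stores one element, then all
       instances synchronize before reading a neighbour's element. */
    __kernel void reverse(__global int *a, __local int *tmp)
    {
      int lid = get_local_id(0);
      tmp[lid] = a[get_global_id(0)];
      barrier(CLK_LOCAL_MEM_FENCE);
      a[get_global_id(0)] = tmp[get_local_size(0) - 1 - lid];
    }

    /* Conceptual work-group function for work-group size N: the barrier
       disappears and each parallel region becomes a loop over all
       work-items in the group. */
    void reverse_workgroup(int *a, int *tmp, struct pocl_context *ctx)
    {
      for (int lid = 0; lid < N; ++lid)  /* region before the barrier */
        tmp[lid] = a[GLOBAL_ID(ctx, lid)];
      for (int lid = 0; lid < N; ++lid)  /* region after the barrier  */
        a[GLOBAL_ID(ctx, lid)] = tmp[N - 1 - lid];
    }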

Kernel signature transformation

A kernel instance has two extra types of input in addition to its explicit arguments: variables declared in the local address space and the execution context. In pocl, these inputs are passed to the kernel function as extra arguments. There are two types of variables that can be declared in the local address space.

If a kernel argument is a pointer to a local variable, all kernel instances in the same work-group should get a pointer to the same variable. Variables declared on the stack and marked as local should also be shared among instances in the same work-group. pocl adds an extra pass over the LLVM Intermediate Representation (IR) to extract local variables and add them as arguments to the kernel function.

The execution context describes the configuration of the index space and the index of the kernel instance. The kernel can access this information using the work-item functions: get_global_id, get_global_size, get_group_id, get_local_id, get_local_size, get_num_groups, and get_work_dim. pocl adds a pointer to a pocl context structure as the final argument to every kernel.

This structure encodes the execution context and is used to implement the work-item functions.
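
As a conceptual example of both transformations, consider a hypothetical kernel named scale:

    /* As written by the programmer: one explicit buffer argument, one
       scalar argument, and a stack-declared local variable. */
    __kernel void scale(__global float *out, float factor)
    {
      __local float cache[64];
      /* ... */
    }

    /* After pocl's transformations (conceptual): the local variable is
       hoisted into an extra argument, and a pointer to the context
       structure is appended as the final argument. */
    void scale(float *out, float factor,
               float *cache,              /* hoisted __local variable */
               struct pocl_context *ctx); /* execution context        */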

Argument List

Before execution of the kernel, the arguments and local variables need to be copied to the device. To accomplish this, a region of memory is allocated to store these arguments. This region is called the argument list. The argument list consists of three parts: a list of pointers to the original arguments, a list of pointers to local variables declared in the kernel, and a buffer where the actual arguments are stored (the argument buffer).
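
A sketch of this layout (illustrative; the exact encoding is defined by the nu+ device-layer):

    /*  Argument list in device memory:
     *
     *    +--------------------+
     *    | argument pointers  |  one per explicit kernel argument,
     *    |                    |  pointing into the argument buffer
     *    +--------------------+
     *    | local pointers     |  one per local variable declared
     *    |                    |  in the kernel
     *    +--------------------+
     *    | argument buffer    |  the argument values themselves
     *    +--------------------+
     */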

Execution Context

The execution context consists of the following variables:

  • work_dim: The number of dimensions of the index space. A value between 1 and 3.
  • num_groups: The number of work-groups of this execution, for each dimension.
  • group_id: The identifier of the currently executing work-group function.
  • global_offset: The offset at which the global identifiers start.

When the work-group functions are generated, the work-group size is fixed and the work-group local identifier is stored in a variable. The global identifier for a dimension 0 ≤ d < 3 is then given by

global_id[d] = group_id[d] · local_size[d] + local_id[d] + global_offset[d],

where local_size[d] is the work-group size along dimension d.
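
A sketch of the corresponding structure and computation (field names follow the list above; the authoritative definition lives in pocl's headers):

    /* Execution context passed to every work-group function. */
    struct pocl_context {
      unsigned work_dim;          /* dimensions of the index space, 1..3 */
      unsigned num_groups[3];     /* work-groups per dimension           */
      unsigned group_id[3];       /* id of this work-group function      */
      unsigned global_offset[3];  /* start of the global identifiers     */
    };

    /* Global identifier along dimension d; local_size is the work-group
       size fixed when the work-group function was generated. */
    static inline unsigned
    global_id(const struct pocl_context *ctx, unsigned d,
              unsigned local_id, unsigned local_size)
    {
      return ctx->group_id[d] * local_size + local_id
             + ctx->global_offset[d];
    }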

On-device Setup

Before the kernel function can be called, the correct arguments need to be loaded. This is done by the work-group loader, called workgroup_fast in the case of a heterogeneous host-device setup. This loader reads the kernel function arguments from the argument list and calls the kernel.

As the amount and type of arguments differs between kernels, a special work-group loader is generated for each kernel.
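
Conceptually, the loader generated for the hypothetical scale kernel sketched earlier would unpack its arguments from the argument list and pass the execution context along:

    /* Conceptual generated loader for the 'scale' kernel: unpack the
       arguments from the argument list, then call the kernel with the
       execution context appended. */
    void
    workgroup_fast(void **args, struct pocl_context *ctx)
    {
      scale(*(float **)args[0],   /* __global out          */
            *(float *)args[1],    /* factor                */
            *(float **)args[2],   /* hoisted __local cache */
            ctx);
    }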

The entry point of the program loaded onto the device is the start procedure. This small procedure, written in assembly, initializes the stack pointer, loads the addresses of the argument list and the execution context, and calls the work-group loader. When the work-group loader returns, execution is halted, allowing the host to detect that the kernel is done. In the current implementation of the device-layer, the addresses of the argument list and the execution context are stored in the first eight bytes of the kernel binary.

Generate Machine Code

Generating executable machine code for the nu+ architecture consists of two main phases: code generation and linking. The first step generates machine code from the LLVM IR, the second sets all global memory references to the correct value.

In OpenCL, the kernel source code is passed to the framework using the clCreateProgramWithSource function. This source code is built upon a call to the clBuildProgram function; at this point, pocl uses Clang to compile the source code to LLVM IR. Extraction of local variables and transformation of the arguments is done when the clCreateKernel function is called. The machine code can only be generated when clEnqueueNDRangeKernel is executed, as only at that moment is the work-group size known and the work-group function generated.

To create machine code from the work-group function, first the LLVM nu+ back-end is executed. This will generate assembly code from the LLVM IR in which the work-group function is defined. Next, the assembly is turned into an object file by the assembler. The "start.s" file containing the entry point for the device is also assembled into an object file. Both object files are then linked together to create the machine code.

At the previous link step, the location of the program in memory is not yet known. This location is required because the LLVM nu+ back-end currently does not support position-independent code generation and the nu+ system does not have a memory management unit. The location of the machine code in memory is only known just before execution, when it is allocated. Before the code is transferred to the memory of the accelerator, the linker is run a final time to set the memory location correctly.

Manage Execution

Before execution of kernel instances can start, the device needs to be prepared. Preparation consists of the following steps:

  1. allocating memory for the argument list and the execution context
  2. transferring the argument list to device memory
  3. allocating memory for the kernel machine code
  4. linking the kernel machine code to its final location
  5. setting the memory addresses of the argument list and execution context at the start of the kernel binary
  6. transferring kernel binary to device memory

The execution of the nu+ system is managed through the control functions provided by the driver. After transferring the kernel binary to device memory, a structure with system information (such as the program counter for each thread, the cores to activate, etc.) has to be built. Once this structure is available, the nuplus_boot function is used to boot the nu+ system and nuplus_run to start the execution. To check whether execution has finished, the nuplus_get_done function is polled, as sketched below.
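
A sketch of the launch sequence (the driver calls are those named above, but their exact prototypes and the layout of the boot-information structure are assumptions):

    /* Boot the nu+ system, start execution, and wait for completion. */
    struct nuplus_boot_info info;
    /* fill in per-thread program counters, cores to activate, ... */

    nuplus_boot(dev, &info);        /* boot the nu+ system            */
    nuplus_run(dev);                /* start execution                */
    while (!nuplus_get_done(dev))   /* poll until the kernel finishes */
      ;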