Difference between revisions of "Programming Model"

From NaplesPU Documentation
Jump to: navigation, search
(Matrix Multiplication Multithreaded Example)
(Matrix Multiplication Multithreaded Example)
Line 21: Line 21:
 
  #define CORE_ID    __builtin_nuplus_read_control_reg(0)
 
  #define CORE_ID    __builtin_nuplus_read_control_reg(0)
 
  #define THREAD_ID  __builtin_nuplus_read_control_reg(2)
 
  #define THREAD_ID  __builtin_nuplus_read_control_reg(2)
 +
 +
In this way, each thread in each core computes a portion of the output matrix concurrently to the others.
 +
 +
The final result
 +
 +
The full code for the main function is attached below:
 +
 +
int main(){
 +
    init_matrix(A);
 +
    init_matrix(B);
 +
 +
    matrix_mult(A, B, C, CORE_ID, THREAD_ID);
 +
    __builtin_nuplus_barrier(CORE_ID + 1, THREAD_NUMB - 1);
 +
 +
    if (THREAD_ID == 0 && CORE_ID == 0) {
 +
      for (int i = 0; i < N*N; i += 64 / sizeof(int)) {
 +
        __builtin_nuplus_flush((int) &C[i / N][i % N]);
 +
      }
 +
      __builtin_nuplus_write_control_reg(N*N, 12); // For cosimulation purpose
 +
    }
 +
 +
  return (int)&C;
 +
}
  
 
== OpenCL support for Nu+ ==
 
== OpenCL support for Nu+ ==

Revision as of 16:16, 3 June 2019

Nu+ Programming Model

Matrix Multiplication Multithreaded Example

The following code shows a multithread version of the matrix multiplication kernel running on a nu+ core. The code spreads the output computation among all available threads. Since each output element computation is independent of others, the thread parallelism is achieved dividing the outer cycle of the function. Output matrix row calculations are equally spanned across all cores and further spanned across threads. For each thread, the function first calculates the portion of the output matrix to compute at the core level: N / CORE_NUMB, with N the dimension of the matrix and CORE_NUMB the number of nu+ core in the system. Then, each thread starts computing the outer loop with start_loop = (core_id * N / CORE_NUMB) + thread_id; the portion of output matrix to compute is multiplied to the running core ID (core_id) and added to the running thread ID (thread_id), and each iteration increments by the number of threads in the core THREAD_NUMB.

void matrix_mult(const int a[N][N], const int b[N][N], int mult[N][N], int core_id, int thread_id) {
 int start_loop = (core_id * N / CORE_NUMB) + thread_id;
 int end_loop = N / CORE_NUMB * (core_id + 1);
 
 for (int i = start_loop; i < end_loop; i += THREAD_NUMB){
   for (int j = 0; j < N; j++)
     for (int k = 0; k < N; k++)
       mult[i][j] += a[i][k] * b[k][j];
 }
}

Parameters core_id and thread_id are passed by the main function, and fetched from nu+ control registerr through builtins:

#define CORE_ID    __builtin_nuplus_read_control_reg(0)
#define THREAD_ID  __builtin_nuplus_read_control_reg(2)

In this way, each thread in each core computes a portion of the output matrix concurrently to the others.

The final result

The full code for the main function is attached below:

int main(){
   init_matrix(A);
   init_matrix(B);

   matrix_mult(A, B, C, CORE_ID, THREAD_ID);
   __builtin_nuplus_barrier(CORE_ID + 1, THREAD_NUMB - 1);
   if (THREAD_ID == 0 && CORE_ID == 0) {
     for (int i = 0; i < N*N; i += 64 / sizeof(int)) {
       __builtin_nuplus_flush((int) &C[i / N][i % N]);
     }
     __builtin_nuplus_write_control_reg(N*N, 12); // For cosimulation purpose
   }

  return (int)&C;
}

OpenCL support for Nu+

OpenCL defines a platform as a set of computing devices on which the host is connected to. Each device is further divided into several compute units (CUs), each of them defined as a collection of processing elements (PEs). Recall that the target platform is architecturally designed around a single core, structured in terms of a set of at most eight hardware threads. Each hardware threads competes with each other to have access to sixteen hardware lanes, to execute both scalar and vector operations on 32-bit wide integer or floating-point operands.

Plat model.png

The computing device abstraction is physically mapped on the nu+ many-core architecture. The nu+ many-core can be configured in terms of the number of cores. Each nu+ core maps on the OpenCL Compute Unit. Internally, the nu+ core is composed of hardware threads, each of them represents the abstraction of the OpenCL processing element.

Execution Model Matching

From the execution model point of view, OpenCL relies on an N-dimensional index space, where each point represents a kernel instance execution. Since the physical kernel instance execution is done by the hardware threads, the OpenCL work-item is mapped onto a nu+ single hardware-thread. Consequently, a work-group is defined as a set of hardware threads, and all work-items in a work-group executes on a single computing unit.

Execution model.png

Memory Model Matching

OpenCL partitions the memory in four different spaces:

  • the global and constant spaces: elements in those spaces are accessible by all work-items in all work-group.
  • the local space: visible only to work-items within a work-group.
  • the private space: visible to only a single work-item.

The target-platform is provided of a DDR memory, that is the device memory in OpenCL nomenclature. Consequently, variables are physically mapped on this memory. The compiler itself, by looking at the address-space qualifier, verifies the OpenCL constraints are satisfied.

Each nu+ core is also equipped with a Scratchpad Memory, an on-chip non-coherent memory portion exclusive to each core. This memory is compliant with the OpenCL local memory features

Memory mod.png

Finally, each hardware thread within the nu+ core has a private stack. This memory section is private to each hardware thread, that is the OpenCL work-item, and cannot be addressed by others. As a result, each stack acts as the OpenCL private memory.

Programming Model Matching

OpenCL supports two kinds of programming models, data- and task-parallel. A data-parallel model requires that each point of the OpenCL index space executes a kernel instance. Since each point represents a work-item and these are mapped on hardware threads, the data-parallel requirements are correctly satisfied. Please note that the implemented model is a relaxed version, without requiring a strictly one-to-one mapping on data.

A task-parallel programming model requires kernel instances are independently executed in any point of the index space. In this case, each work-item is not constrained to execute the same kernel instance of others. The compiler frontend defines a set of builtins that may be used for this purpose. Moreover, each nu + core is built of 16 hardware lanes, useful to realise a lock-step execution. Consequently, OpenCL support is realised to allow the usage of vector types. As a result, the vector execution is supported for the following data types:

  • charn, ucharn, are respectively mapped on vec16i8 and vec16u8, where n=16. Other values of n are not supported.
  • shortn, ushortn, are respectively mapped on vec16i16 and vec16i32, where n=16. Other values of n are not supported.
  • intn, uintn, are respectively mapped on vec16i32 and vec16u32, where n=16. Other values of n are not supported.
  • floatn, mapped on vec16f32, where n=16. Other values of n are not supported.

OpenCL Runtime Design

OpenCL APIs are a set of functions meant to coordinate and manage devices, those provide support for running applications and monitoring their execution. These APIs also provides a way to retrieve device-related information.

The following Figure depicts the UML Class diagram for the OpenCL Runtime as defined in the OpenCL specification. Grey-filled boxes represent not available features due to the absence of hardware support.

Uml.png

The custom OpenCL runtime relies on two main abstractions:

  • Low-level abstractions, not entirely hardware dependent, provide device-host communication support.
  • High-level abstractions, according to OpenCL APIs, administrate the life cycle of a kernel running onto the device.

OpenCL Example

The following code shows a vectorial matrix multiplication in OpenCL running on the nu+ device.

#include <opencl_stdlib.h>
#define WORK_DIM 4

__kernel void kernel_function(__global int16 *A, __global int16 *B, __global int16 *C, int rows, int cols)
{
   __private uint32_t threadId = get_local_id(0); 

   uint32_t nT = WORK_DIM; // number of threads
   uint32_t nL = 16;       // number of lanes

   uint32_t N = rows;
   uint32_t nC = N / nL;
   uint32_t ndivnT = N / nT;
   uint32_t tIdndivnT = threadId * ndivnT;
   uint32_t tIdndivnTnC = tIdndivnT * nC;
   for (uint32_t i = 0; i < ndivnT * nC; i++)
   {
       uint32_t col = (tIdndivnT + i) % nC;
       C[tIdndivnTnC + i] = 0;
       for (uint32_t j = 0; j < nC; j++)
       {
           for (uint32_t k = 0; k < nL; k++)
           {
               C[tIdndivnTnC + i] += A[tIdndivnTnC + i - col + j][k] * B[(nC * k) + (j * N) + col];
           }
       }
   }
}