Applications

Complex Product Kernel

A kernel computing the complex product has been implemented to support cross verification, data path evaluation and related analyses.

Scalar Registers

The first program computed a complex product of the form (a + jb)*(c + jd). The real and imaginary parts of the operands are stored in separate variables, for a total of four integer variables (two real parts and two imaginary parts), and the result is stored in two further variables (real and imaginary part). In the first implementation of the kernel, these variables were declared inside main. This choice produced no observable result during kernel emulation, because the values of the source operands were never loaded into the scalar registers of the processor: the variables are local, so they are allocated in the stack area, which is limited in size.

int main () { 
  int ReA=3; // real part of the first number
  int ImA=4; // imaginary part of the first number
  int ReB=5; // real part of the second number
  int ImB=6; // imaginary part of the second number
  int Re=0;
  int Im=0;
  Re= (ReA*ReB)-(ImA*ImB);
  Im= (ReA*ImB)+(ReB*ImA);
  return 0;
}

To overcome this limitation, the variables have been declared globally. The source operand values are then loaded correctly into the scalar registers of the processor, and the correct result is computed and written to memory.

int ReA=3; 
int ImA=4; 
int ReB=5; 
int ImB=6; 
int Re=0;
int Im=0;

int main(){
  Re= (ReA*ReB)-(ImA*ImB);
  Im= (ReA*ImB)+(ReB*ImA);
  return 0;
}
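
As a host-side reference for the cross verification, the expected result for these operand values can be computed with a plain C function. This is only a sketch: complex_product is a hypothetical helper, not part of the original kernel. With ReA=3, ImA=4, ReB=5 and ImB=6, the real part is 15 - 24 = -9 and the imaginary part is 18 + 20 = 38.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the original kernel): computes
 * (re_a + j*im_a) * (re_b + j*im_b) and writes the two parts of the
 * result through the output pointers. */
void complex_product(int32_t re_a, int32_t im_a,
                     int32_t re_b, int32_t im_b,
                     int32_t *re, int32_t *im)
{
    *re = (re_a * re_b) - (im_a * im_b);  /* real part: ac - bd */
    *im = (re_a * im_b) + (re_b * im_a);  /* imaginary part: ad + bc */
}
```

Calling complex_product(3, 4, 5, 6, &re, &im) gives the values that the kernel is expected to write to memory for Re and Im.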

In this last implementation, the program counter (PC) had the value $104 (hexadecimal) once all source operands had been loaded into the scalar registers. The second and last store (two stores are required: one for the real part and one for the imaginary part) occurred at PC value $134. The kernel therefore used 13 instructions from the loading of the operands to the writing of all results to memory ($134 - $104 = $30 = 48 bytes; 48/4 + 1 = 13). To try to reduce the number of instructions, the source operands were then declared in different ways, to understand how modifying the data path could affect code optimization. First, source and result operands have been expressed as arrays of two elements (one element holds the real part and the other the imaginary part).

int A[2]={3,4};
int B[2]={5,6};
int Ris[2];

int main(){
  Ris[0]= (A[0]*B[0])-(A[1]*B[1]);
  Ris[1]= (A[0]*B[1])+(B[0]*A[1]);
  return 0;
}

In this case, the PC had the value $F4 (hexadecimal) once all source operands had been loaded into the scalar registers, and the second and last store occurred at PC value $11C, that is, 6 instructions earlier than in the previous case ($134 - $11C = $18 = 24 bytes; 24/4 = 6). The reason is that, when the operands are expressed as arrays, the effective address of each element does not have to be computed from scratch every time (an operation that takes 3 instructions): each element is reached through its offset from the array base (only 1 instruction). The advantage of expressing the operands in array form rather than as separate scalar variables is therefore clear in this case.
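
The program counter arithmetic used in this section can be sketched in C as follows. The function names are hypothetical, and the 4-byte instruction width is an assumption taken from the division by 4 in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: every instruction is 4 bytes wide, as implied by the
 * division by 4 in the text. */
#define INSTR_BYTES 4u

/* Number of instructions from first_pc to last_pc, both included. */
uint32_t instructions_in_window(uint32_t first_pc, uint32_t last_pc)
{
    assert(last_pc >= first_pc);
    return (last_pc - first_pc) / INSTR_BYTES + 1u;
}

/* How many instructions earlier new_pc is reached compared with old_pc. */
uint32_t instructions_saved(uint32_t old_pc, uint32_t new_pc)
{
    assert(old_pc >= new_pc);
    return (old_pc - new_pc) / INSTR_BYTES;
}
```

With the values reported above, instructions_in_window(0x104, 0x134) returns 13 for the separate variables, and instructions_saved(0x134, 0x11C) returns 6 for the position of the last store in the array version.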

Vector Registers

For further tests, the source operands have been declared as vec16i32, so that they use vector registers rather than scalar registers. The first program computed a complex product of vectors in which the real and imaginary parts are stored in separate vectors, for a total of four vectors of 16 elements each (two for the real parts and two for the imaginary parts). The result is stored in two vectors (real and imaginary parts).

#include <stdint.h>
#include <string.h>
vec16i32 ReA = {3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3};
vec16i32 ImA = {4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4};
vec16i32 ReB = {5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5};
vec16i32 ImB = {6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6};
vec16i32 ReR;
vec16i32 ImR;
int main(){
  ReR= (ReA*ReB)-(ImA*ImB);
  ImR= (ReA*ImB)+(ReB*ImA);
  return 0;
}

In this case, the PC had the value $104 (hexadecimal) once all source operands had been loaded into the vector registers, and the second and last store occurred at PC value $134. Then, the source operands have been expressed as two arrays of two 16-element vectors each (one vector for the real parts and one vector for the imaginary parts). The result is stored in an array of two vectors.

#include <stdint.h>
#include <string.h>
vec16i32 A[2] = {{3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3},{4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4}};
vec16i32 B[2] = {{5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5},{6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6}};
vec16i32 Ris[2];

int main(){
  Ris[0]= (A[0]*B[0])-(A[1]*B[1]);
  Ris[1]= (A[0]*B[1])+(B[0]*A[1]);
  return 0;
}

In this last case, the PC had the value $F4 (hexadecimal) once all source operands had been loaded into the vector registers, and the second and last store occurred at PC value $11C, 6 instructions earlier than in the previous case. The same advantage observed with the scalar registers is obtained here, thanks to the instructions saved by computing the operand addresses through offsets. It is therefore preferable to express the operands in matrix form (arrays of vectors) rather than as separate vectors.
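
The offset-based addressing can be illustrated on a host machine with a stand-in type. In this sketch, vec16_t is a hypothetical replacement for vec16i32 (assumed here to hold 16 lanes of 32 bits, as the name suggests), and both functions are illustrative names. The point is only that the elements of an array are contiguous, so the position of element i is a fixed byte offset from the array base.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for vec16i32: 16 lanes of 32 bits.
 * The real type is provided by the processor toolchain. */
typedef struct { int32_t lane[16]; } vec16_t;

/* Byte offset of element `index` from the base of an array of vectors.
 * For a constant index this value is known at compile time. */
size_t vector_offset(size_t index)
{
    return index * sizeof(vec16_t);
}

/* The same offset measured on an actual array, to show that the
 * elements are stored contiguously. */
size_t measured_offset(const vec16_t *base, size_t index)
{
    return (size_t)((const char *)&base[index] - (const char *)base);
}
```

Under these assumptions, the second vector of a two-element array sits 64 bytes after the first, so one base address plus a constant offset is enough to reach it.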

Threads

Finally, matrices of 16 vectors have been used as source operands in order to perform a larger number of operations. In this way, a subset of the operations can also be assigned to each thread, parallelizing the execution. Two versions have been implemented, listed below in this order: the first uses 4 matrices of 16 vectors (two matrices contain the real parts and two contain the imaginary parts); the second uses 2 matrices of 32 vectors (in each, the first 16 vectors contain the real parts of an operand and the last 16 vectors contain its imaginary parts).

#include <stdint.h>
#include <string.h>

vec16i32 Re_A[16] = {
{64,90,89,87,83,80,75,70,64,57,50,43,36,25,18,9},
{64,87,75,57,36,9,-18,-43,-64,-80,-89,-90,-83,-70,-50,-25},
{64,80,50,9,-36,-70,-89,-87,-64,-25,18,57,83,90,75,43},
{64,70,18,-43,-83,-87,-50,9,64,90,75,25,-36,-80,-89,-57},
{64,57,-18,-80,-83,-25,50,90,64,-9,-75,-87,-36,43,89,70},
{64,43,-50,-90,-36,57,89,25,-64,-87,-18,70,83,9,-75,-80},
{64,25,-75,-70,36,90,18,-80,-64,43,89,9,-83,-57,50,87},
{64,9,-89,-25,83,43,-75,-57,64,70,-50,-80,36,87,-18,-90},
{64,-9,-89,25,83,-43,-75,57,64,-70,-50,80,36,-87,-18,90},
{64,-25,-75,70,36,-90,18,80,-64,-43,89,-9,-83,57,50,-87},
{64,-43,-50,90,-36,-57,89,-25,-64,87,-18,-70,83,-9,-75,80},
{64,-57,-18,80,-83,25,50,-90,64,9,-75,87,-36,-43,89,-70},
{64,-70,18,43,-83,87,-50,-9,64,-90,75,-25,-36,80,-89,57},
{64,-80,50,-9,-36,70,-89,87,-64,25,18,-57,83,-90,75,-43},
{64,-87,75,-57,36,-9,-18,43,-64,80,-89,90,-83,70,-50,25},
{64,-90,89,-87,83,-80,75,-70,64,-57,50,-43,36,-25,18,-9}
};

vec16i32 Im_A[16] = {
{64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64},
{90,87,80,70,57,43,25,9,-9,-25,-43,-57,-70,-80,-87,-90},
{89,75,50,18,-18,-50,-75,-89,-89,-75,-50,-18,18,50,75,89},
{87,57,9,-43,-80,-90,-70,-25,25,70,90,80,43,-9,-57,-87},
{83,36,-36,-83,-83,-36,36,83,83,36,-36,-83,-83,-36,36,83},
{80,9,-70,-87,-25,57,90,43,-43,-90,-57,25,87,70,-9,-80},
{75,-18,-89,-50,50,89,18,-75,-75,18,89,50,-50,-89,-18,75},
{70,-43,-87,9,90,25,-80,-57,57,80,-25,-90,-9,87,43,-70},
{64,-64,-64,64,64,-64,-64,64,64,-64,-64,64,64,-64,-64,64},
{57,-80,-25,90,-9,-87,43,70,-70,-43,87,9,-90,25,80,-57},
{50,-89,18,75,-75,-18,89,-50,-50,89,-18,-75,75,18,-89,50},
{43,-90,57,25,-87,70,9,-80,80,-9,-70,87,-25,-57,90,-43},
{36,-83,83,-36,-36,83,-83,36,36,-83,83,-36,-36,83,-83,36},
{25,-70,90,-80,43,9,-57,87,-87,57,-9,-43,80,-90,70,-25},
{18,-50,75,-89,89,-75,50,-18,-18,50,-75,89,-89,75,-50,18},
{9,-25,43,-57,70,-80,87,-90,90,-87,80,-70,57,-43,25,-9}
};

vec16i32 Re_B[16] = {
{64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64},
{90,87,80,70,57,43,25,9,-9,-25,-43,-57,-70,-80,-87,-90},
{89,75,50,18,-18,-50,-75,-89,-89,-75,-50,-18,18,50,75,89},
{87,57,9,-43,-80,-90,-70,-25,25,70,90,80,43,-9,-57,-87},
{83,36,-36,-83,-83,-36,36,83,83,36,-36,-83,-83,-36,36,83},
{80,9,-70,-87,-25,57,90,43,-43,-90,-57,25,87,70,-9,-80},
{75,-18,-89,-50,50,89,18,-75,-75,18,89,50,-50,-89,-18,75},
{70,-43,-87,9,90,25,-80,-57,57,80,-25,-90,-9,87,43,-70},
{64,-64,-64,64,64,-64,-64,64,64,-64,-64,64,64,-64,-64,64},
{57,-80,-25,90,-9,-87,43,70,-70,-43,87,9,-90,25,80,-57},
{50,-89,18,75,-75,-18,89,-50,-50,89,-18,-75,75,18,-89,50},
{43,-90,57,25,-87,70,9,-80,80,-9,-70,87,-25,-57,90,-43},
{36,-83,83,-36,-36,83,-83,36,36,-83,83,-36,-36,83,-83,36},
{25,-70,90,-80,43,9,-57,87,-87,57,-9,-43,80,-90,70,-25},
{18,-50,75,-89,89,-75,50,-18,-18,50,-75,89,-89,75,-50,18},
{9,-25,43,-57,70,-80,87,-90,90,-87,80,-70,57,-43,25,-9}
};

vec16i32 Im_B[16] = {
{64,90,89,87,83,80,75,70,64,57,50,43,36,25,18,9},
{64,87,75,57,36,9,-18,-43,-64,-80,-89,-90,-83,-70,-50,-25},
{64,80,50,9,-36,-70,-89,-87,-64,-25,18,57,83,90,75,43},
{64,70,18,-43,-83,-87,-50,9,64,90,75,25,-36,-80,-89,-57},
{64,57,-18,-80,-83,-25,50,90,64,-9,-75,-87,-36,43,89,70},
{64,43,-50,-90,-36,57,89,25,-64,-87,-18,70,83,9,-75,-80},
{64,25,-75,-70,36,90,18,-80,-64,43,89,9,-83,-57,50,87},
{64,9,-89,-25,83,43,-75,-57,64,70,-50,-80,36,87,-18,-90},
{64,-9,-89,25,83,-43,-75,57,64,-70,-50,80,36,-87,-18,90},
{64,-25,-75,70,36,-90,18,80,-64,-43,89,-9,-83,57,50,-87},
{64,-43,-50,90,-36,-57,89,-25,-64,87,-18,-70,83,-9,-75,80},
{64,-57,-18,80,-83,25,50,-90,64,9,-75,87,-36,-43,89,-70},
{64,-70,18,43,-83,87,-50,-9,64,-90,75,-25,-36,80,-89,57},
{64,-80,50,-9,-36,70,-89,87,-64,25,18,-57,83,-90,75,-43},
{64,-87,75,-57,36,-9,-18,43,-64,80,-89,90,-83,70,-50,25},
{64,-90,89,-87,83,-80,75,-70,64,-57,50,-43,36,-25,18,-9}
};
vec16i32 Re[16];
vec16i32 Im[16];

int main(){
	static int N=16;
	for (int i=0; i<N; i++){
		Re[i]=Re_A[i]*Re_B[i]-Im_A[i]*Im_B[i];
		Im[i]=Re_A[i]*Im_B[i]+Im_A[i]*Re_B[i];
	}

  return 0;
}

#include <stdint.h>
#include <string.h>

vec16i32 A[32] = {
	// Re_A
{64,90,89,87,83,80,75,70,64,57,50,43,36,25,18,9},
{64,87,75,57,36,9,-18,-43,-64,-80,-89,-90,-83,-70,-50,-25},
{64,80,50,9,-36,-70,-89,-87,-64,-25,18,57,83,90,75,43},
{64,70,18,-43,-83,-87,-50,9,64,90,75,25,-36,-80,-89,-57},
{64,57,-18,-80,-83,-25,50,90,64,-9,-75,-87,-36,43,89,70},
{64,43,-50,-90,-36,57,89,25,-64,-87,-18,70,83,9,-75,-80},
{64,25,-75,-70,36,90,18,-80,-64,43,89,9,-83,-57,50,87},
{64,9,-89,-25,83,43,-75,-57,64,70,-50,-80,36,87,-18,-90},
{64,-9,-89,25,83,-43,-75,57,64,-70,-50,80,36,-87,-18,90},
{64,-25,-75,70,36,-90,18,80,-64,-43,89,-9,-83,57,50,-87},
{64,-43,-50,90,-36,-57,89,-25,-64,87,-18,-70,83,-9,-75,80},
{64,-57,-18,80,-83,25,50,-90,64,9,-75,87,-36,-43,89,-70},
{64,-70,18,43,-83,87,-50,-9,64,-90,75,-25,-36,80,-89,57},
{64,-80,50,-9,-36,70,-89,87,-64,25,18,-57,83,-90,75,-43},
{64,-87,75,-57,36,-9,-18,43,-64,80,-89,90,-83,70,-50,25},
{64,-90,89,-87,83,-80,75,-70,64,-57,50,-43,36,-25,18,-9},
	// Im_A
{64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64},
{90,87,80,70,57,43,25,9,-9,-25,-43,-57,-70,-80,-87,-90},
{89,75,50,18,-18,-50,-75,-89,-89,-75,-50,-18,18,50,75,89},
{87,57,9,-43,-80,-90,-70,-25,25,70,90,80,43,-9,-57,-87},
{83,36,-36,-83,-83,-36,36,83,83,36,-36,-83,-83,-36,36,83},
{80,9,-70,-87,-25,57,90,43,-43,-90,-57,25,87,70,-9,-80},
{75,-18,-89,-50,50,89,18,-75,-75,18,89,50,-50,-89,-18,75},
{70,-43,-87,9,90,25,-80,-57,57,80,-25,-90,-9,87,43,-70},
{64,-64,-64,64,64,-64,-64,64,64,-64,-64,64,64,-64,-64,64},
{57,-80,-25,90,-9,-87,43,70,-70,-43,87,9,-90,25,80,-57},
{50,-89,18,75,-75,-18,89,-50,-50,89,-18,-75,75,18,-89,50},
{43,-90,57,25,-87,70,9,-80,80,-9,-70,87,-25,-57,90,-43},
{36,-83,83,-36,-36,83,-83,36,36,-83,83,-36,-36,83,-83,36},
{25,-70,90,-80,43,9,-57,87,-87,57,-9,-43,80,-90,70,-25},
{18,-50,75,-89,89,-75,50,-18,-18,50,-75,89,-89,75,-50,18},
{9,-25,43,-57,70,-80,87,-90,90,-87,80,-70,57,-43,25,-9}
};


vec16i32 B[32] = {
	//Re_B
{64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64},
{90,87,80,70,57,43,25,9,-9,-25,-43,-57,-70,-80,-87,-90},
{89,75,50,18,-18,-50,-75,-89,-89,-75,-50,-18,18,50,75,89},
{87,57,9,-43,-80,-90,-70,-25,25,70,90,80,43,-9,-57,-87},
{83,36,-36,-83,-83,-36,36,83,83,36,-36,-83,-83,-36,36,83},
{80,9,-70,-87,-25,57,90,43,-43,-90,-57,25,87,70,-9,-80},
{75,-18,-89,-50,50,89,18,-75,-75,18,89,50,-50,-89,-18,75},
{70,-43,-87,9,90,25,-80,-57,57,80,-25,-90,-9,87,43,-70},
{64,-64,-64,64,64,-64,-64,64,64,-64,-64,64,64,-64,-64,64},
{57,-80,-25,90,-9,-87,43,70,-70,-43,87,9,-90,25,80,-57},
{50,-89,18,75,-75,-18,89,-50,-50,89,-18,-75,75,18,-89,50},
{43,-90,57,25,-87,70,9,-80,80,-9,-70,87,-25,-57,90,-43},
{36,-83,83,-36,-36,83,-83,36,36,-83,83,-36,-36,83,-83,36},
{25,-70,90,-80,43,9,-57,87,-87,57,-9,-43,80,-90,70,-25},
{18,-50,75,-89,89,-75,50,-18,-18,50,-75,89,-89,75,-50,18},
{9,-25,43,-57,70,-80,87,-90,90,-87,80,-70,57,-43,25,-9},
	// Im_B
{64,90,89,87,83,80,75,70,64,57,50,43,36,25,18,9},
{64,87,75,57,36,9,-18,-43,-64,-80,-89,-90,-83,-70,-50,-25},
{64,80,50,9,-36,-70,-89,-87,-64,-25,18,57,83,90,75,43},
{64,70,18,-43,-83,-87,-50,9,64,90,75,25,-36,-80,-89,-57},
{64,57,-18,-80,-83,-25,50,90,64,-9,-75,-87,-36,43,89,70},
{64,43,-50,-90,-36,57,89,25,-64,-87,-18,70,83,9,-75,-80},
{64,25,-75,-70,36,90,18,-80,-64,43,89,9,-83,-57,50,87},
{64,9,-89,-25,83,43,-75,-57,64,70,-50,-80,36,87,-18,-90},
{64,-9,-89,25,83,-43,-75,57,64,-70,-50,80,36,-87,-18,90},
{64,-25,-75,70,36,-90,18,80,-64,-43,89,-9,-83,57,50,-87},
{64,-43,-50,90,-36,-57,89,-25,-64,87,-18,-70,83,-9,-75,80},
{64,-57,-18,80,-83,25,50,-90,64,9,-75,87,-36,-43,89,-70},
{64,-70,18,43,-83,87,-50,-9,64,-90,75,-25,-36,80,-89,57},
{64,-80,50,-9,-36,70,-89,87,-64,25,18,-57,83,-90,75,-43},
{64,-87,75,-57,36,-9,-18,43,-64,80,-89,90,-83,70,-50,25},
{64,-90,89,-87,83,-80,75,-70,64,-57,50,-43,36,-25,18,-9}
};

vec16i32 Ris[32];

int main(){
	static int N=16;
	for (int i=0; i<N; i++){
		Ris[i]=A[i]*B[i]-A[i+N]*B[i+N];
		Ris[i+N]=A[i]*B[i+N]+A[i+N]*B[i];
	}
  return 0;
}
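
The two layouts can be compared on a host machine with the sketch below, which emulates the element-wise vector operations with explicit loops. The type vec16_t, the constants and the function names are hypothetical stand-ins and not part of the original kernels. The code only shows that both layouts describe the same computation, with the packed layout needing the extra index term for every imaginary part.

```c
#include <stdint.h>

#define NUM_VECTORS 16  /* complex vectors per operand, N in the listings */
#define LANES 16        /* 32-bit elements per vector */

/* Hypothetical host-side stand-in for vec16i32. */
typedef struct { int32_t lane[LANES]; } vec16_t;

/* Split layout: real and imaginary parts in separate arrays. */
void product_split(const vec16_t *re_a, const vec16_t *im_a,
                   const vec16_t *re_b, const vec16_t *im_b,
                   vec16_t *re, vec16_t *im)
{
    for (int i = 0; i < NUM_VECTORS; i++) {
        for (int k = 0; k < LANES; k++) {
            re[i].lane[k] = re_a[i].lane[k] * re_b[i].lane[k]
                          - im_a[i].lane[k] * im_b[i].lane[k];
            im[i].lane[k] = re_a[i].lane[k] * im_b[i].lane[k]
                          + im_a[i].lane[k] * re_b[i].lane[k];
        }
    }
}

/* Packed layout: vectors 0..15 hold the real parts and vectors 16..31
 * the imaginary parts, so each imaginary access adds NUM_VECTORS. */
void product_packed(const vec16_t *a, const vec16_t *b, vec16_t *ris)
{
    const int n = NUM_VECTORS;
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < LANES; k++) {
            ris[i].lane[k]     = a[i].lane[k] * b[i].lane[k]
                               - a[i + n].lane[k] * b[i + n].lane[k];
            ris[i + n].lane[k] = a[i].lane[k] * b[i + n].lane[k]
                               + a[i + n].lane[k] * b[i].lane[k];
        }
    }
}
```

Calling product_split with pointers into the two halves of the packed arrays should produce the same values as product_packed, which is a convenient host-side check before running either version on the emulator.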

In this case, the gain in instructions saved by computing the addresses through offsets is not visible, because additional instructions are needed to compute the indices. Here it is therefore more convenient to express the source operands as 4 separate matrices rather than as 2 matrices of double size. Both versions were run with one, two and four threads. In both, the number of instructions executed by each thread decreases in proportion to the number of threads (for example, with two threads it is roughly halved). As for the cross verification step, the correct result was written to memory in all the tested versions. The four-thread main functions of the two versions are listed below in the same order.

int main(){
	static int N=16;
	int threadId = __builtin_nuplus_read_control_reg(2); // thread ID, read from control register 2
	int numThread = 4; // number of threads sharing the N vectors

	for (int i=0; i<N/numThread; i++){
		Re[((N/numThread)*threadId)+i]=Re_A[((N/numThread)*threadId)+i]*Re_B[((N/numThread)*threadId)+i]-Im_A[((N/numThread)*threadId)+i]*Im_B[((N/numThread)*threadId)+i];
		Im[((N/numThread)*threadId)+i]=Re_A[((N/numThread)*threadId)+i]*Im_B[((N/numThread)*threadId)+i]+Im_A[((N/numThread)*threadId)+i]*Re_B[((N/numThread)*threadId)+i];
	}

  return 0;
}

int main(){
	static int N=16;
	int threadId = __builtin_nuplus_read_control_reg(2); // thread ID, read from control register 2
	int numThread = 4; // number of threads sharing the N vectors

	for (int i=0; i<N/numThread; i++){
		Ris[((N/numThread)*threadId)+i]=A[((N/numThread)*threadId)+i]*B[((N/numThread)*threadId)+i]-A[((N/numThread)*threadId)+i+N]*B[((N/numThread)*threadId)+i+N];
		Ris[((N/numThread)*threadId)+i+N]=A[((N/numThread)*threadId)+i]*B[((N/numThread)*threadId)+i+N]+A[((N/numThread)*threadId)+i+N]*B[((N/numThread)*threadId)+i];
	}
  return 0;
}
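
The index expression ((N/numThread)*threadId)+i used in both listings splits the N vectors into equal contiguous blocks, one per thread. A minimal host-side sketch of this partitioning follows. The function names are hypothetical, and the sketch assumes that N is a multiple of the number of threads, as it is for 16 vectors with 1, 2 or 4 threads.

```c
#include <assert.h>

/* Number of vectors handled by each thread.
 * Assumption: n is a multiple of num_threads. */
int thread_row_count(int n, int num_threads)
{
    assert(num_threads > 0 && n % num_threads == 0);
    return n / num_threads;
}

/* Index of the first vector handled by thread `thread_id`. */
int thread_first_row(int n, int num_threads, int thread_id)
{
    assert(thread_id >= 0 && thread_id < num_threads);
    return thread_row_count(n, num_threads) * thread_id;
}
```

With 16 vectors and four threads, thread 0 handles vectors 0 to 3 and thread 3 handles vectors 12 to 15. With two threads each block has 8 vectors, which matches the per-thread work being roughly halved.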
