# Step 1

## Introduction

Step 1 is the first tutorial that will walk you through how to compute an outplace Fourier Transform in parallel. The goal is not just to compute FFTs, but to introduce different functions of the library, and explain some of the concepts. In this step you will get familiar with AccFFT routines and learn a few things about computing distributed, parallel FFTs.

The goal is to compute forward FFT of an array of size $N_0\times N_1\times N_2$ which is distributed in parallel to $P=P_0\times P_1$ processes.

$$\frac{N_0}{P_0}\times \frac{N_1}{P_1}\times N_2$$

That is each process will get $\frac{N_0}{P_0}\times \frac{N_1}{P_1}$ pencils of size $N_2$. This is shown below schematically:

To compute a 3D distributed FFT, AccFFT first computes a batched FFT along the last dimension and then performs a global transpose so that each process gets the second dimension locally. A similar operation is performed again to get the first dimension locally. For this reason the output of forward transform would be distributed as follows:

$$N_0\times \frac{N_1}{P_0}\times \frac{N_2}{P_1}$$

Note that the memory access layout is the same for both input and output. The only difference is that the input array is distributed along the first and second dimension, while the output array is distributed along the second and third. Some of the libraries do change the memory layout as well, but AccFFT does not do that. This makes it very easy for the user to use the same functions for both the input and output arrays, without any changes.

It is very important that you understand how the data is distributed among processes. If the above is not clear for you, please read the AccFFT paper.

## Code Explanation

So first of all, we need to include mpi and accfft header files. The makefile inside step1 folder will provide the location for the “< accfft.h >”, if the library is installed correctly.

Then we have the following function declarations: initialize function initializes the double array data, with the formula provided by testcase function. As you can see, the latter gets the X, Y, Z coordinates and returns the value of a Gaussian centered at $\pi$. That will be the test function that will be used here. There is also a check_err function, which will compute the error in our computations and the main part of the code which is the step1 function.

NaN! Why do you think the if condition at line 11 is needed?

Now let’s take a look at step1 function. First, we get the number of processes and the id of each process among all participating mpi ranks. Then we create a 2d Cartesian grid, named c_comm, using these mpi ranks. The grid sizes of c_comm can be controlled with c_dims. If you do specify a correct grid size with c_dims, the accfft_create_comm function will use those values. Otherwise, it will print an error message and corrects your c_dims and creates c_comm. Now, by correct I mean that the following condition should be met:

$$c\_dims[0]*c\_dims[1]=nprocs$$

This is rather obvious, but most of the time people forget to set it correctly. As you can see, here we do not set it to any value, and we let the library itself to choose them. Next is the declaration for the data and data_hat arrays. data will store the input function, and the FFT of data will be written to data_hat array. The rest of the declarations are for measuring some timing statistics.

Now note that we have an input array that is distributed among a 2d grid of processes. We need to know which block corresponds to each process. This can be found by calling accfft_local_size_dft_r2c function. The inputs to the function is the array size, n, and the 2d grid communicator, c_comm. With this info, it gives you isize, istart, osize, and ostart of the processor calling it. The i refers to input and o refers to output. So isize is the size of the block belonging to the MPI process (the size of the pencil in Figure 1). Similarly, osize is the output size that this process gets after the FFT is performed (the size of the pencil in Figure 2). The istart and ostart arrays provide you with the exact starting location of the block that this process owns. So for example, consider a 256x256x256 input with 4 processors that are decomposed into a 2x2 grid. Process zero will get a block of size 128x128x256 for the isize. The istart for process zero will be (0,0,0), and for process one, it will be (0,64,0).

Now the story is the same for the osize, but with a small little difference. For real-to-complex transforms, the output is a little larger than the input. This is because of the Nyquist frequency. This means that a real-to-complex FFT of a $N_0\times N_1\times N_2$ array will be $N_0\times N_1\times (N_2/2+1)$ complex numbers. So for an array of size 256x256x256, the FFT will be an array of 256x256x130 complex numbers. Now, according to Figure 2 this array is distributed in the second and third dimensions. So for the above 4 processor example, process zero will get osize of (256,128,65) complex numbers.

To make your life easier, the accfft_local_size_dft_r2c computes how much allocation is required for the output array. You can also compute the allocation size of the input array using isize.

Always use this approach to allocate the output array as the required size may not be the same as the input array. Moreover, for strange processor grid cases the allocation size might be larger than what you get from the osize array. So be safe, fasten your seat belt and use the alloc_max!

Next we initialize the library with accfft_init function. The function gets an integer to set the number of OpenMP threads. In general, for small array sizes you will not get a noticeable speedup with OpenMP. So just set it to one is ok. However, for very large transform sizes, you may benefit from it. Next we setup a plan using accfft_plan_dft_3d_r2c_c2r funciton. This function will setup different FFT plans, global transpose functions, etc. As you can see, the function gets a flag as its last argument. The flag is used to test different routines to optimize performance. You can use the following options:

1. ACCFFT_ESTIMATE
2. ACCFFT_MEASURE
3. ACCFFT_PATIENT

Note that the third option will considerably increase the setup time. The default option is ACCFFT_MEASURE

To simplify everyone’s life (and save memory) you do not need to create a separate plan for complex-to-real transform. The only exception is the rare case where you have to apply the FFT plan to an array with a different allocation layout. This would never be the case if you use accfft_alloc functions. But if you have to do it, then you can simply create two plans using the same function

After all of this we call the initialize function, which sets the input array to a Gaussian centered at $\pi$.

Next we perform the FFT transform! This is simply done by calling accfft_execute_r2c function. As you know, if we perform and IFFT transform on the output, we should get the original array with a scaling. So let’s do that and check whether we get the original array or not. We can perform the IFFT transform using the accfft_execute_c2r function. We write the output to data2 array.

Now the difference between data and data2 array must be zero if everything is computed correctly. However, as mentioned in the previous article, there will be a scaling required for data2.

As you can see, we need to compute the global difference between the two arrays, so an MPI reduction is performed to compute the max error. You could have also checked the error using check_err(data2,n,c_comm) command. It calls the test function to get the analytical form of the solution and compares it to data2. The rest of the code is straight forward. We perform similar reductions to compute the maximum time that each section took and print them to stdout.

PCOUT is the same as std::cout, with the difference that only the root process will write the output.

## GPU Code

The GPU version of the code can be found in step1_gpu.cpp file. The implementation is very similar to the CPU, except that most of the function names need to be appended by _gpu. For example, note that the code in using accfft_gpu.h instead of accfft.h. Or that the FFT plan type is accfft_plan_gpu instad of accfft_plan. That’s it on the user side! Everything else is taken care in the background!

The step1_gpu code first allocates and initializes data on the cpu, and writes it into data_cpu. Then this data is transferred to a GPU buffer called data. The data transfer is not necessary, and you can simply use a cuda kernel to initialize the GPU array (look at initialize_gpu function in kernels.cu). After this step, the forward and backward FFT operations are performed on GPU. Finally, the result is transferred back to the CPU to check the error (Obviously this can be done on the GPU as well.)

Note that the design of the library allows you to use both the CPU and GPU functionalities simultaneously. For example, you can run some FFTs on CPU and some other FFTs on GPU. You only need to include both accfft.h and accfft_gpu.h, and use the corresponding functions.

## Results

You can make and run the code as follows:

Whether you need to use ibrun to mpirun to spawn mpi processes depends on your system. The np option sets the number of processes to 4 and the last three inputs set the size of the array. The executable prints the following when run on one node of Maverick:

The ERROR is because we did not set c_dims and the library corrects it for us and sets it to 2x2. Running the same configuration on one K40 GPU of Maverick gives the following result:

Note that here I am using only one GPU, but still the problem is divided into four smaller ones. You can try running it on four seperate GPUs on different nodes and compare the timings. I also suggest testing different problem sizes, and comparing CPU and GPU times. In general, the GPU will be faster for large problem sizes.