Step 2

Introduction

Step 2 is the second tutorial that will walk you through how to compute an inplace real-to-complex (R2C) FFT. There is a subtlety here, in that the result of R2C FFT requires a larger memory than its input. As a result, the user needs to allocate additional memory for the input so that the extra data would fit in the memory. However, that is not all; the user has to use a special padded access pattern in the spatial domain. First an explanation of why there is a memory overhead:

The complex result of a R2C FFT is conjugate symmetric. What this mean is that half of the output data is equal to the conjugate of the other half. So it is really not required to compute those frequencies. This property saves half of the computations and requires half memory. The exact computational cost of computing all the frequencies of a power 2 FFT is $5N\log N$, and using this property it would reduce to $2.5N\log N$.

So where is the memory padding? The devil is in the Nyquist frequency! The output of an $\small N_0\times N_1\times N_2$ R2C FFT would be $\small N_0\times N_1\times (N_2/2+1)$ complex numbers. That means you need to allocate $\small N_0\times N_1\times (N_2/2+1) \times 2\times sizeof(double)$ memory, which is larger than $\small (N_0\times N_1\times N_2)\times sizeof(double)$.

Note that if we had not used the conjugate symmetry property you had to allocate $(N_0\times N_1\times N_2)\times 2\times sizeof(double)$, which is almost twice the above.

AccFFT gives you the right amount of memory to allocate, by calling accfft_local_size_dft_r2c function. So this is already done, however the user must use a special format for initializing the data. The additional allocated memory is stored as if the input matrix was of size $N_0\times N_1\times \bf{(N_2+2)}$. The last dimension is still $N_2$, but when one wants to access elements in memory this padding must be respected.

For example, to access/initialize the (i,j,k) element of the data in the spatial domain (i.e. before doing the R2C transform), one should access $(i\times N_1 + j)\times (N_2+2) +k$.

As you would see in Step 3, there is no such issue for inplace complex-to-complex transforms. That is because, there is no conjugate property for that case!

Code Explanation

The code for this step is very similar to step1, so I would just point out the important differences.

First note that the allocation size is alloc_max which is returned by calling accfft_local_size_dft_r2c function:

This ensures that the data on each process has enough memory to store the output of R2C transform. Moreover, note that the library would create an inplace transform if the second and third arguments of accfft_plan_dft_3d_r2c point to the same memory address (Otherwise an outplace transform would be created).

Finally, note that initialize and check_err functions are a little different than step1. The padded version of $N_2$ is used to read/write data into the spatial array.

Results

You can make and run the code as follows:

On my machine, I get an output like this:

Again, the ERROR is because we were lazy and did not set c_dims. The library corrects it for us and sets it to an appropriate value.

I suggest you run the program with different sizes and different number of MPI processes, and compare the timings with that of step 1.

GPU Code

The GPU version of the code can be found in step2_gpu.cpp file. Note that in this step, I use a GPU kernel to initialize the data instead of first initializing it on CPU and copying it over.