Computational Science Asked on October 23, 2021
In my simulations I use dense matrix-vector multiplications and 2D FFT transformations quite often, for matrix sizes of 8k x 8k and up. Hence, I assume that using a GPU would be beneficial for speeding up my code.
The problem, though, is that my development PC does not have an external GPU and does not support adding one. Buying a new development PC that can host a GPU is currently not possible. Therefore, my current approach is to use ArrayFire, which lets me switch the backend depending on the library I link against: I can write and run code on the CPU for testing, while still being able to switch to CUDA/OpenCL in production.
Nevertheless, I was wondering if there are other, maybe better alternatives? I looked at Kokkos, for example, but there I would have to write my own wrapper for FFTs.
Or should I rather switch to a completely different approach for solving those problems?
Edit: The code is written in C++, so I'd like to avoid having to switch to other languages.
ArrayFire has a C++ API as well as a Python API. You can switch between several backends including CPU, CUDA, and OpenCL. It will also handle memory movement and kernel fusion for you. An example:
```cpp
/*******************************************************
 * Copyright (c) 2014, ArrayFire
 * All rights reserved.
 *
 * This file is distributed under 3-clause BSD license.
 * The complete license agreement can be obtained at:
 * http://arrayfire.com/licenses/BSD-3-Clause
 ********************************************************/
#include <arrayfire.h>
#include <math.h>
#include <stdio.h>
#include <cstdlib>

using namespace af;

// create a small wrapper to benchmark
static array A;  // populated before each timing

static void fn() {
    array B = fft2(A);  // 2D FFT
    B.eval();           // ensure evaluated
}

int main(int argc, char** argv) {
    try {
        setBackend(AF_BACKEND_CPU);
        // setBackend(AF_BACKEND_CUDA); // Choose one!
        info();

        printf("Benchmark N-by-N 2D fft\n");
        for (int M = 7; M <= 12; M++) {
            int N = (1 << M);
            printf("%4d x %4d: ", N, N);

            A = randu(N, N);
            double time   = timeit(fn);  // time in seconds
            double gflops = 10.0 * N * N * M / (time * 1e9);

            printf(" %4.0f Gflops\n", gflops);
            fflush(stdout);
        }
    } catch (af::exception& e) { fprintf(stderr, "%s\n", e.what()); }

    return 0;
}
```
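Since the question also involves dense matrix-vector products, here is a minimal sketch showing that the same backend switch covers that case as well; the 8192 size and the use of af::matmul are illustrative assumptions, not part of the original answer:

```cpp
#include <arrayfire.h>
#include <cstdio>

int main() {
    try {
        // Same idea as above: pick the backend at runtime.
        af::setBackend(AF_BACKEND_CPU);  // or AF_BACKEND_CUDA / AF_BACKEND_OPENCL
        af::info();

        const int N = 8192;              // 8k x 8k, as in the question
        af::array A = af::randu(N, N);   // dense matrix
        af::array x = af::randu(N);      // vector

        af::array y = af::matmul(A, x);  // dense matrix-vector product
        y.eval();                        // force evaluation

        printf("y has %lld elements\n", static_cast<long long>(y.elements()));
    } catch (af::exception& e) {
        fprintf(stderr, "%s\n", e.what());
    }
    return 0;
}
```

If you link against the unified backend library (libaf) rather than a specific one (libafcpu, libafcuda, libafopencl), setBackend selects among whichever backends are installed at run time.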
Answered by Richard on October 23, 2021
One way to do this is to use Julia. Julia's CUDAnative.jl allows for automated recompilation of fairly general code to GPUs using the LLVM PTX backend. It works on standard Julia code, so types, dispatches, etc. are all fine: in most cases you shouldn't have to alter your code from the original to make it work. This has been demonstrated to be performance-competitive with, and many times even better than, the original CUDA compiler, so it's a nice high-level but fast environment to work in.
On top of that, there are many layers of abstraction one could use. GPUifyLoops.jl has some extra tooling for when you just want to compile scalar loops, and KernelAbstractions.jl is the next iteration of this. If you just want array primitives, you can make use of CuArrays.jl, which will make use of BLAS and all of that. It also tends to perform very well in comparison to other high-level GPU libraries, since it's not just calling preconstructed kernels but rather uses the codegen tools: for example, `E .= A .* B .+ C .+ sin.(D)` will generate and compile a single non-allocating GPU kernel (as sketched below) instead of calling 5 kernels and allocating temporaries, as things like CuPy, PyTorch, etc. do (since those call pre-written CUDA code on binary operators). Julia also has a stack like this for AMD GPUs with AMDGPUnative.jl and ROCArrays.jl.
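A minimal, self-contained sketch of that fused broadcast, assuming CuArrays.jl as named in the answer (the array size is an arbitrary choice for illustration):

```julia
using CuArrays  # the package named in the answer; its successor is CUDA.jl

N = 1024  # arbitrary size for illustration
A = cu(rand(Float32, N, N)); B = cu(rand(Float32, N, N))
C = cu(rand(Float32, N, N)); D = cu(rand(Float32, N, N))
E = similar(A)

# The whole right-hand side fuses into a single GPU kernel;
# no temporary arrays are allocated.
E .= A .* B .+ C .+ sin.(D)
```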
You can also use ArrayFire.jl if you want to stick to ArrayFire, but it doesn't perform codegen, so it might not be as performant in all cases.
The nice thing about the Julia stack is that tools in Julia are generally compatible with each other, which means you can take these arrays and use them in other existing codes! For example, if you make your initial condition to DifferentialEquations.jl a `CuArray`, then the whole differential equation solver recompiles to perform each of its actions on the GPU. Thus, in many cases (i.e. cases where you weren't directly using scalar indexing), moving to the GPU is simply a matter of calling `cu(x)` on the input to a function.
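As a concrete sketch of that composability (the toy ODE, size, and solver choice here are my own illustrative assumptions, not from the original answer):

```julia
using DifferentialEquations, CuArrays

# A toy linear ODE u' = -u; the only GPU-specific change is calling `cu`
# on the initial condition. The solver's internal operations then run on the GPU.
f(u, p, t) = -u
u0 = cu(rand(Float32, 1000))
prob = ODEProblem(f, u0, (0.0f0, 1.0f0))
sol = solve(prob, Tsit5())
```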
Now, the thing to be more generally worried about with OpenCL is the performance of the kernels. In 2020 there are still some major advantages to using CUDA: cuBLAS(Xt) is quite well optimized, and cuDNN really doesn't have a substitute. What this means is that even if a card is rated fast, it won't necessarily be as fast as the latest NVIDIA card with CUDA, simply because if all of the time is spent in a convolution kernel (i.e. a convolutional neural network), then cuDNN can give a flat 10x speedup over current alternatives; that is why people stick with CUDA, not necessarily the hardware. That said, at this point other BLAS implementations are okay (not on par, but okay), so using OpenCL to do a bunch of matmuls is fine (and you can dig up loads of performance numbers on this).
Answered by Chris Rackauckas on October 23, 2021