Set gpu tpb #736
…. Yes, it's confusing. Yes, the OpenMP ARB know.
…they would be easy.
|
Rudimentary testing done with:

    #include <cstdio>
    #include "quest.h"

    int main (void)
    {
        const int NQUBITS = 24;
        const int TPB = 32;

        initQuESTEnv();
        reportQuESTEnv();

        std::printf("Initial number of threads per block: %d\n", getQuESTGpuThreadsPerBlock());
        setQuESTGpuThreadsPerBlock(TPB);
        std::printf("New number of threads per block: %d\n", getQuESTGpuThreadsPerBlock());

        Qureg qureg = createForcedQureg(NQUBITS);

        std::printf("Initialising Qureg.\n");
        initPlusState(qureg);
        reportQureg(qureg);

        std::printf("Applying Quantum Fourier Transform.\n");
        applyFullQuantumFourierTransform(qureg, false);
        reportQureg(qureg);

        destroyQureg(qureg);
        finalizeQuESTEnv();
        return 0;
    }
|
Why would |
|
Is there an advantage to users having to set this as a runtime hyperparameter? My (mostly undeveloped) belief is we can use occupancy tools (alluded to here) to automate this. I definitely shy away from giving users a greater onus to optimise for their settings (like other prolific software), which the v4 overhaul was supposed to avoid (via e.g. the autodeployer). Note too that the kernels so far are very primitive: each thread handles the updating of the minimum possible number of amplitudes (often just one!). I quite like that because it's very readable and simple (great for an open-source scientific project), but it is an obvious site for optimisation.
It's true that it will never be anywhere as big as the quantities |
|
Hi Tyson, I just noticed the value fixed at 128 and had a feeling that it was large. I just wanted a handle so I could write a benchmark and easily automate performance tuning ourselves. I have not played with the occupancy tools, but I should take a proper look as this might solve it automatically. My other concern is that the optimal sizes differ between Nvidia and AMD due to hardware differences, so we might not be able to rely on occupancy tuning in all cases unless it becomes available on all platforms.
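A benchmark of the kind described here could be as simple as a scan over candidate block sizes. The sketch below is only illustrative: `scanThreadsPerBlock` and its `timeWorkload` callback are hypothetical names, not part of QuEST. In practice the callback would call `setQuESTGpuThreadsPerBlock()` and then time a fixed circuit.

```cpp
#include <functional>
#include <limits>
#include <vector>

// Hypothetical helper (not part of QuEST): tries each candidate block size
// and returns the one for which timeWorkload reports the smallest duration.
// timeWorkload(tpb) is assumed to set the block size, run a fixed workload,
// and return the elapsed time in seconds.
int scanThreadsPerBlock(const std::vector<int>& candidates,
                        const std::function<double(int)>& timeWorkload) {
    int best = -1;  // returned unchanged if candidates is empty
    double bestTime = std::numeric_limits<double>::infinity();
    for (int tpb : candidates) {
        double seconds = timeWorkload(tpb);
        if (seconds < bestTime) {
            bestTime = seconds;
            best = tpb;
        }
    }
    return best;
}
```

Passing the timing routine as a callback keeps the scan independent of the backend, so the same loop could be reused on Nvidia and AMD hardware.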
|
|
||
| qindex numThreads = qureg.numAmpsPerNode / powerOf2(qubits.size()); | ||
| qindex numBlocks = getNumBlocks(numThreads); | ||
| const int NUM_THREADS_PER_BLOCK = gpu_getNumThreadsPerBlock(); |
If we opt for this, why is NUM_THREADS_PER_BLOCK capitalised like a constant? It's runtime
I agree the capitalisation here is bad.
It's const in scope 😉 apologies, accidentally following my own style guide there rather than the QuEST one. I'll %s/NUM_THREADS_PER_BLOCK/numThreadsPerBlock/g it.
I guess it's very GPU specific! I think For illustration, the next smallest size is Of course, newer GPUs support more active blocks per SM (even when the max active threads per SM is unchanged). E.g. CC=8 supports up to 32 active blocks per SM, so we could shrink to Certainly seems prudent to consult a CUDA runtime API, if that doesn't hurt our AMD compatibility! |
|
Apologies, probably won't get to look at this again this week, but very happy to set this value programmatically if it can be done! As it's architecture dependent, we definitely do need a way to adjust it, and ideally both at runtime and compile time. At compile time, so kindly HPC support teams can compile and maintain a tuned version, and at runtime, so they can scan through values without having to recompile in between. I'll have a chat with James about approaches later this week! I 100% agree that we don't really want unknowing users messing around with this. I think something like an |
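One way to offer both knobs is a compile-time default that seeds a runtime-adjustable value. In this sketch the macro name `DEFAULT_GPU_THREADS_PER_BLOCK` is an assumption, not an existing QuEST build option; the getter and setter names follow the internal functions in this PR's diff.

```cpp
// Hypothetical macro name; a build system could override it with, e.g.,
// -DDEFAULT_GPU_THREADS_PER_BLOCK=64 to ship a tuned binary.
#ifndef DEFAULT_GPU_THREADS_PER_BLOCK
#define DEFAULT_GPU_THREADS_PER_BLOCK 128
#endif

// The runtime value starts at the compile-time default...
static int global_numThreadsPerBlock = DEFAULT_GPU_THREADS_PER_BLOCK;

int gpu_getNumThreadsPerBlock() {
    return global_numThreadsPerBlock;
}

// ...and can be changed between runs without recompiling.
void gpu_setNumThreadsPerBlock(int numThreadsPerBlock) {
    global_numThreadsPerBlock = numThreadsPerBlock;
}
```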
|
Fair enough - you've convinced me! Being able to adjust at runtime is of course extremely helpful during development of a user-friendlier adaptive system anyhow. I like the sound of |
| } | ||
|
|
||
| void setQuESTGpuThreadsPerBlock(const int NEW_TPB) { | ||
| // just rely on the internal function to throw an error if there's no GPU support compiled |
TODO: validate this is a multiple of 32 (and is positive, etc etc)
Doc to user: HIP warpsize is 64!
Maybe a better alternative: add gpu_isHipCompiled() in gpu_config.cpp, right under gpu_isCuQuantumCompiled(), as:

    bool gpu_isHipCompiled() {
        #if COMPILE_CUDA && defined(__HIP__)
            return true;
        #endif
        return false;
    }

Then we can validate explicitly that, when GPU-accelerated and on HIP, the arg must be a multiple of 64, else of 32. This means a multiple of 32 is required even when not GPU-accelerated, so we make the error message:

    The number of threads per block must be a multiple of 32 (or on AMD GPUs, 64)
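The rule proposed above could be checked along these lines. This is a minimal sketch, not an existing QuEST validator: `isValidNumThreadsPerBlock` is a hypothetical name, and the upper bound of 1024 is an assumption based on the per-block thread limit of current CUDA devices.

```cpp
// Warp (CUDA) and wavefront (HIP) widths as quoted in this thread.
constexpr int CUDA_WARP_SIZE = 32;
constexpr int HIP_WARP_SIZE = 64;

// Assumed upper bound: current CUDA devices cap a block at 1024 threads.
constexpr int MAX_THREADS_PER_BLOCK = 1024;

// Hypothetical check: true when numThreadsPerBlock is a positive multiple
// of the relevant warp size and does not exceed the assumed maximum.
bool isValidNumThreadsPerBlock(int numThreadsPerBlock, bool isHip) {
    int warpSize = isHip ? HIP_WARP_SIZE : CUDA_WARP_SIZE;
    return numThreadsPerBlock > 0
        && numThreadsPerBlock <= MAX_THREADS_PER_BLOCK
        && numThreadsPerBlock % warpSize == 0;
}
```

Passing `isHip = false` when no GPU backend is compiled gives the behaviour described in the comment, where a multiple of 32 is still required.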
|
Should validate TPB is multiple of 32! |
|
|
||
|
|
||
| int getQuESTGpuThreadsPerBlock() { | ||
| QuESTEnv env = getQuESTEnv(); |
Note getQuESTEnv() is an API function with its own validation, and shouldn't be called internally like this since if its validation throws, it will claim the user called getQuESTEnv(), when they actually called getQuESTGpuThreadsPerBlock().
Should therefore first call
validate_envIsInit(__func__);
and subsequently use globalEnvPtr directly.
(I see getEnvironmentString() calls getQuESTEnv() for some reason, when it too should just use globalEnvPtr)
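To illustrate why the caller's name matters, here is a self-contained sketch of the suggested pattern. Everything in it is a stand-in: the `Env` struct, the throwing behaviour, and the hard-coded 128 are assumptions for illustration, and QuEST's real validation reports errors through its own handler.

```cpp
#include <stdexcept>
#include <string>

// Stand-ins for QuEST internals, for illustration only.
struct Env { bool isGpuAccelerated; };
static Env* globalEnvPtr = nullptr;  // null until the environment is initialised

// Raises an error naming whichever API function the user actually called.
void validate_envIsInit(const char* caller) {
    if (globalEnvPtr == nullptr)
        throw std::runtime_error(std::string(caller) + ": QuEST environment is not initialised");
}

int getQuESTGpuThreadsPerBlock() {
    validate_envIsInit(__func__);  // __func__ is "getQuESTGpuThreadsPerBlock"
    return globalEnvPtr->isGpuAccelerated ? 128 : 0;
}
```

Because `__func__` is evaluated inside the API function itself, the error message names the function the user called rather than an inner helper such as `getQuESTEnv()`.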
| * This is somehow probably the best pre-existing place for this. It only really applies to GPU, because for | ||
| * OpenMP the user can just export OMP_NUM_THREADS or call omp_set_num_threads. | ||
| */ | ||
| int getQuESTGpuThreadsPerBlock(); |
Should we include Num somewhere, e.g.
getNumQuESTGpuThreadsPerBlock
getQuESTNumGpuThreadsPerBlock
I've so far tried to avoid abbreviating where feasible.
| * OpenMP the user can just export OMP_NUM_THREADS or call omp_set_num_threads. | ||
| */ | ||
| int getQuESTGpuThreadsPerBlock(); | ||
| void setQuESTGpuThreadsPerBlock(const int NEW_TPB); |
(Same Num consideration as for getQuESTGpuThreadsPerBlock)
| } | ||
|
|
||
| void setQuESTGpuThreadsPerBlock(const int NEW_TPB) { | ||
| // just rely on the internal function to throw an error if there's no GPU support compiled |
Should also call
validate_envIsInit(__func__);
|
|
||
| int getQuESTGpuThreadsPerBlock() { | ||
| QuESTEnv env = getQuESTEnv(); | ||
| return env.isGpuAccelerated? gpu_getNumThreadsPerBlock() : 0; |
Hmm, I think this is a pitfall. If setQuESTGpuThreadsPerBlock() is permitted in non-GPU mode (and I really believe it should be, for healthy platform agnosticism), then the user would always get back 0 in lieu of what they had just passed to the setter. Maybe we should just always return gpu_getNumThreadsPerBlock().
The situation is slightly different to the GPU cache (fetchable by getGpuCacheSize() and clearable via clearGpuCache()), because that offers no setter. Users can always safely call both in non-GPU accelerated mode, and the former will return 0 (which is always correct).
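The pitfall can be reproduced with a small mock-up. All names below are hypothetical stand-ins for the QuEST internals, with the environment flag fixed to CPU-only mode.

```cpp
// Stand-ins for QuEST internals, for illustration only.
static bool isGpuAccelerated = false;        // pretend we are running on CPU
static int global_numThreadsPerBlock = 128;

void setThreadsPerBlock(int numThreadsPerBlock) {
    global_numThreadsPerBlock = numThreadsPerBlock;
}

// Variant from the diff: reports 0 whenever the GPU is not in use.
int getThreadsPerBlock_zeroWhenNoGpu() {
    return isGpuAccelerated ? global_numThreadsPerBlock : 0;
}

// Variant suggested in the comment: always reports the stored value.
int getThreadsPerBlock_alwaysStored() {
    return global_numThreadsPerBlock;
}
```

After `setThreadsPerBlock(64)` in this CPU-only mock-up, the first getter returns 0 while the second returns 64, which is the asymmetry the comment warns about.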
| return env.isGpuAccelerated? gpu_getNumThreadsPerBlock() : 0; | ||
| } | ||
|
|
||
| void setQuESTGpuThreadsPerBlock(const int NEW_TPB) { |
NEW_TPB -> numThreadsPerBlock or numTPB, etc
| #include "quest/src/gpu/cuda_to_hip.hpp" | ||
| #endif | ||
|
|
||
| int numThreadsPerBlock = 128; |
Should give this a global_ prefix, like here (like I failed to do for hasGpuBeenBound, oops!)
Further, given it's not accessed anywhere outside gpu_(g|s)etNumThreadsPerBlock(), I would move this definition to the ENVIRONMENT MANAGEMENT section, just before gpu_getNumThreadsPerBlock().
| error_gpuQueriedButGpuNotCompiled(); | ||
| #endif | ||
| return; | ||
| } |
If we permit users to call the corresponding API functions when GPU acceleration is not enabled, then these guards can be removed entirely. I think that's fair/natural, because we certainly shouldn't introduce an API difference between compiling but not running with GPU acceleration.
I would also comment this exception. So this could become:

    int gpu_getNumThreadsPerBlock() {
        // permitted even when GPU backend not compiled
        return global_numThreadsPerBlock;
    }

    void gpu_setNumThreadsPerBlock(const int newThreadsPerBlock) {
        // permitted even when GPU backend not compiled
        global_numThreadsPerBlock = newThreadsPerBlock;
    }
|
||
|
|
||
| __host__ qindex getNumBlocks(qindex numThreads) { | ||
| __host__ qindex getNumBlocks(qindex numThreads, const int numThreadsPerBlock) { |
I would remove the const qualifier since it's asymmetric and a bit counterproductive, because it makes the reader wonder why numThreads isn't const too.
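For context, a two-argument `getNumBlocks` without the qualifier might look like the sketch below. It assumes the helper rounds up so that every thread index is covered, and it uses `long long` as a stand-in for QuEST's `qindex`; neither detail is confirmed by the diff shown here.

```cpp
// Assumed stand-in for QuEST's index type.
using qindex = long long;

// Ceiling division: enough blocks that every thread index is covered even
// when numThreads is not a multiple of numThreadsPerBlock.
qindex getNumBlocks(qindex numThreads, int numThreadsPerBlock) {
    return (numThreads + numThreadsPerBlock - 1) / numThreadsPerBlock;
}
```

With the 24-qubit test case from this PR and one thread per amplitude, 128 threads per block would give 131072 blocks under this formula.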
|
|
||
| qindex numThreads = qureg.numAmpsPerNode / powerOf2(qubits.size()); | ||
| qindex numBlocks = getNumBlocks(numThreads); | ||
| const int numThreadsPerBlock = gpu_getNumThreadsPerBlock(); |
I think the const qualifiers before numThreadsPerBlock in gpu_subroutines.cpp are superfluous and only cause reader confusion (they might incorrectly infer that only the 2nd arg of <<< needs to be const, or something). Context already makes the constness obvious.
|
Rather than attempt to post a thumbs up on each comment while on train WiFi, I'll just comment thanks @TysonRayJones here! I'm hoping to give this branch some proper attention on Thursday/Friday. |
Creating a facility for users to runtime set threads per block for tuning the GPU implementation. NOTE: only applies to kernels that are not handled by Thrust, which does its own thing. Resolves #735.
I considered and rejected the idea of creating a symmetric interface for the CPU for users who don't know OMP_NUM_THREADS or omp_set_num_threads() exist, but that's much riskier as the point of truth is external (in the OpenMP runtime).
TODO:
Should gpu_getNumThreadsPerBlock return a qindex? Probably.
QuEST/quest/src/gpu/gpu_subroutines.cpp
Line 453 in b7d4a29