Hanlding cuda error messages
http://cuda-programming.blogspot.com/2013/01/handling-cuda-error-messages.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
void cuda_safe(cudaError_t error, char* message) {
if (error != cudaSuccess) {
printf("ERROR: %s : %i\n", message, error);
exit(-1);
}
}
void Check_CUDA_Error(const char* message) {
cudaError_t error = cudaGetLastError();
if (error != cudaSuccess) {
printf("ERROR: %s : %s\n", message, cudaGetLastErrorString(error));
exit(-1);
}
}
int main() {
...
cuda_safe(cudaMalloc(...), "cudaMalloc");
...
cuda_safe(cudaFree(...), "cudaFree");
...
Check_CUDA_Error("Kernel Launch Failed!");
return 0;
}
常见错误:
error: a host function call can not be configured.
Invalid Configuration Argument.
This error means that the number of registers available on the multiprocessor is being exceeded. Reduce the number of threads per block to solve the problem.
Too Many Resources Requested for Launch.
Unspecified launch failure.
H100相对以前的架构(比如A100),引入了Cluster。
Thread—>Block—>Grid ==> Thread—>Block—>Cluster—>Grid.
before:
Block与Block之间进行数据交换是通过Global Mem。
after
Block与Block之间进行数据交换是通过Shared Mem。
cudaMemcpyToSymbol
拷贝方式的不同是由目的内存申请的方式决定的。
申请的是device内存,cudaMemcpyToSymbol拷贝就是从host拷贝到global memory。
申请的是constant内存,cudaMemcpyToSymbol拷贝就是从host拷贝到constant memory。
__device__ float g_damp_x[MAX_DIM];
__constant__ float c_damp_y[MAX_DIM];
针对half类型数据的Math函数在cuda_fp16.h
里。比如:
1
__device__ __half hsqrt(const __half a)
#pragma unroll 4
grid_sampler_grad_impl.cu
_Pragma("unroll")
rcwm_small_impl.cu
使用CUDA Warp-Level级原语
知乎:使用CUDA Warp-Level级原语
add_relu_v2_impl.cu
__ballot_sync(mask, __activemask())
__shfl_down_sync
三维 block
adaptive_max_pool2d_impl.cu
dim3
Thrust, the CUDA C++ template library
non_zero_impl.cu
batchnorm_fold_impl.cu
batchnorm_fold2_impl.cu
unique_consecutive_impl.cu
share memory
maxSharedMemoryPerBlock
静态: batchnorm_grad_impl.cu
bias_add_grad_impl.cu
topk_impl.cu
动态: multinomial_impl.cu
layer_norm_impl.cu
layer_norm_grad_impl.cu
layer_norm_grad_grad_impl.cu
transpose_impl_opt.cu
Nsight Compute
知乎:使用Nsight Compute进行分析驱动的优化