
GPU

Handling CUDA error messages

http://cuda-programming.blogspot.com/2013/01/handling-cuda-error-messages.html

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call returned an error code.
void cuda_safe(cudaError_t error, const char* message) {
    if (error != cudaSuccess) {
        printf("ERROR: %s : %i\n", message, error);
        exit(-1);
    }
}

// Check the last error recorded by the runtime, e.g. after a kernel
// launch, which returns no status directly.
void Check_CUDA_Error(const char* message) {
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("ERROR: %s : %s\n", message, cudaGetErrorString(error));
        exit(-1);
    }
}

int main() {
    ...
    cuda_safe(cudaMalloc(...), "cudaMalloc");
    ...
    cuda_safe(cudaFree(...), "cudaFree");
    ...
    Check_CUDA_Error("Kernel Launch Failed!");
    
    return 0;
}

Common errors:

  • error: a host function call can not be configured.

  • Invalid Configuration Argument.

    The launch requested more threads per block, or larger grid/block dimensions, than the device supports.

  • Too Many Resources Requested for Launch.

    This error means that the number of registers available on the multiprocessor is being exceeded. Reduce the number of threads per block to solve the problem.

  • Unspecified launch failure.

How to overlap data transfers in CUDA C/C++
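
The core pattern from that topic, as a minimal sketch: split the data into chunks, give each chunk its own stream, and queue H2D copy, kernel, and D2H copy per stream. Here n, h_a, d_a, and kernel are placeholders, and h_a must be pinned host memory (cudaMallocHost) for the async copies to actually overlap.

const int nStreams = 4;
const int streamSize = n / nStreams;              // assumes n % nStreams == 0
const size_t streamBytes = streamSize * sizeof(float);

cudaStream_t stream[nStreams];
for (int i = 0; i < nStreams; ++i)
    cudaStreamCreate(&stream[i]);

for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    // Work within one stream is serialized, but copies and kernels in
    // different streams can overlap on hardware with copy engines.
    cudaMemcpyAsync(&d_a[offset], &h_a[offset], streamBytes,
                    cudaMemcpyHostToDevice, stream[i]);
    kernel<<<streamSize / 256, 256, 0, stream[i]>>>(d_a, offset);
    cudaMemcpyAsync(&h_a[offset], &d_a[offset], streamBytes,
                    cudaMemcpyDeviceToHost, stream[i]);
}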

Compared with earlier architectures (e.g. A100), H100 introduces the Cluster level:

Thread —> Block —> Grid ==> Thread —> Block —> Cluster —> Grid

  • before: data exchange between blocks goes through global memory.

  • after: blocks within the same cluster can exchange data through shared memory (see the sketch below).

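A minimal sketch of the new exchange path, assuming CUDA 12+ and an sm_90 device; the kernel name and buffer size are illustrative, and the launch must use blockDim.x == 256 with a grid x-dimension that is a multiple of the cluster size (2 here).

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1) exchange(float* out) {
    __shared__ float smem[256];
    cg::cluster_group cluster = cg::this_cluster();

    smem[threadIdx.x] = (float)threadIdx.x;
    cluster.sync();  // every block's shared memory is now populated

    // Pointer to the neighbouring block's shared memory within the cluster.
    unsigned peer = (cluster.block_rank() + 1) % cluster.num_blocks();
    float* peer_smem = cluster.map_shared_rank(smem, peer);
    out[blockIdx.x * blockDim.x + threadIdx.x] = peer_smem[threadIdx.x];

    cluster.sync();  // keep peer shared memory alive until all reads finish
}
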
cudaMemcpyToSymbol

Which copy is performed depends on how the destination symbol was declared:

  • If the symbol is declared in device memory, cudaMemcpyToSymbol copies from host to global memory.

  • If the symbol is declared in constant memory, cudaMemcpyToSymbol copies from host to constant memory.

__device__   float g_damp_x[MAX_DIM];
__constant__ float c_damp_y[MAX_DIM];
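
A minimal sketch of both copies against the two declarations above; the host arrays are placeholders.

float h_damp_x[MAX_DIM], h_damp_y[MAX_DIM];
cudaMemcpyToSymbol(g_damp_x, h_damp_x, MAX_DIM * sizeof(float));  // host -> global memory
cudaMemcpyToSymbol(c_damp_y, h_damp_y, MAX_DIM * sizeof(float));  // host -> constant memory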

Math functions for the half type live in cuda_fp16.h, for example:

__device__ __half hsqrt(const __half a)
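
A minimal sketch of calling it from a kernel; the kernel name is illustrative.

#include <cuda_fp16.h>

__global__ void half_sqrt(const __half* in, __half* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = hsqrt(in[i]);  // element-wise sqrt in half precision
}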

#pragma unroll 4

    See grid_sampler_grad_impl.cu

_Pragma("unroll")

    See rcwm_small_impl.cu; a sketch of both forms follows below.

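A minimal sketch: #pragma unroll 4 asks the compiler to unroll the following loop four times; _Pragma("unroll") is the operator form of the same directive, usable inside macros where #pragma cannot appear. The kernel name is illustrative.

// Operator form, legal inside a macro definition:
#define UNROLL _Pragma("unroll")

__global__ void scale4(float* x, float a, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    #pragma unroll 4                 // compiler emits four copies of the body
    for (int k = 0; k < 4; ++k) {
        int i = base + k;
        if (i < n) x[i] *= a;
    }
}
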
Using CUDA warp-level primitives

    Zhihu: Using CUDA Warp-Level Primitives

    See add_relu_v2_impl.cu

__ballot_sync(__activemask(), predicate)

__shfl_down_sync (see the sketch below)
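
A minimal sketch of the classic use of __shfl_down_sync: a warp-wide sum where lane 0 ends up with the total. The full-warp mask 0xffffffff assumes all 32 lanes are active.

__inline__ __device__ float warp_reduce_sum(float val) {
    // Halve the stride each step; each lane adds the value held by the
    // lane `offset` positions above it.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}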

Three-dimensional block

    See adaptive_max_pool2d_impl.cu

    dim3 (see the sketch below)

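A minimal sketch of a 3-D launch configuration with dim3; w, h, d, d_out, and my_kernel are placeholders.

dim3 block(8, 8, 4);  // 8 * 8 * 4 = 256 threads per block
dim3 grid((w + block.x - 1) / block.x,   // round up in each dimension
          (h + block.y - 1) / block.y,
          (d + block.z - 1) / block.z);
my_kernel<<<grid, block>>>(d_out, w, h, d);
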
Thrust, the CUDA C++ template library

    See non_zero_impl.cu

    See batchnorm_fold_impl.cu

    See batchnorm_fold2_impl.cu

    See unique_consecutive_impl.cu
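
A minimal sketch of Thrust operating on device data, roughly the sort-then-dedupe pattern behind a unique-consecutive style kernel; h_data, n, and the function name are placeholders.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/unique.h>

void sorted_unique(const int* h_data, int n) {
    thrust::device_vector<int> v(h_data, h_data + n);   // host -> device copy
    thrust::sort(v.begin(), v.end());                   // runs on the GPU
    auto new_end = thrust::unique(v.begin(), v.end());  // drop adjacent duplicates
    v.erase(new_end, v.end());                          // v now holds unique values
}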

Shared memory

maxSharedMemoryPerBlock

    Static: batchnorm_grad_impl.cu

        bias_add_grad_impl.cu

        topk_impl.cu

    Dynamic: multinomial_impl.cu

        layer_norm_impl.cu

        layer_norm_grad_impl.cu

        layer_norm_grad_grad_impl.cu

        transpose_impl_opt.cu

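A minimal sketch of the two declaration styles: static shared memory is sized at compile time, dynamic shared memory is sized per launch via the third <<<>>> parameter, bounded by maxSharedMemoryPerBlock. Kernel names are illustrative.

__global__ void reverse_static(float* out) {
    __shared__ float tile[256];                  // static: fixed at compile time
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[255 - threadIdx.x];
}

__global__ void reverse_dynamic(float* out, int n) {
    extern __shared__ float tile[];              // dynamic: sized by the launch
    if (threadIdx.x < n) tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    if (threadIdx.x < n) out[threadIdx.x] = tile[n - 1 - threadIdx.x];
}

// reverse_static<<<1, 256>>>(d_out);
// reverse_dynamic<<<1, 256, 256 * sizeof(float)>>>(d_out, 256);
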
Nsight Compute

    Zhihu: Analysis-driven optimization with Nsight Compute

This post is licensed under CC BY 4.0 by the author.
