CUDA Keywords

This post introduces CUDA, NVIDIA's general-purpose parallel computing architecture, focusing on its hardware structure (stream processor units, the memory hierarchy, and so on), contrasting how CPUs and GPUs execute tasks, and covering the CUDA toolkit, the compilation flow, and the two APIs.

This article is only a memo from when I was getting a general idea of CUDA.
(Refer: http://www.pcinlife.com/article/graphics/2008-06-04/1212575164d532.html)

CUDA is the GPGPU model of NVIDIA.
Shader unit:
multiprocessor -- 8 stream processors -- 8192 registers, 16KB shared memory, texture cache, constant cache
info: cudaGetDeviceProperties(), cuDeviceGetProperties()
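The per-multiprocessor figures above can be read back at runtime. A minimal sketch with the runtime API (compile with nvcc; error handling reduced to one check for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    printf("name: %s\n", prop.name);
    printf("multiprocessors: %d\n", prop.multiProcessorCount);
    printf("registers per block: %d\n", prop.regsPerBlock);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("warp size: %d\n", prop.warpSize);
    return 0;
}
```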

Each stream processor: FMA (fused multiply-add) unit, executes add and multiply operations

Warp (32 threads) = 2 * half-warp (16 threads)

Pros: higher memory bandwidth, more execution units, cheaper...
Cons: no use for tasks that can only be executed sequentially, only supports 32-bit float for now, not good at branchy code, different standards between NVIDIA and AMD/ATI

CPU: Host, GPU: Device

grid -- block -- thread
grid: (individual) global memory, constant memory, texture memory
thread: (individual) register, local memory
(in block) shared memory
(out block) global memory, constant memory, texture memory
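A minimal sketch of how the hierarchy shows up in code: the launch configuration picks the grid and block sizes, and each thread computes its own global index from block and thread IDs (kernel name and sizes here are illustrative):

```cuda
#include <cuda_runtime.h>

// Each thread handles one element. Registers/local memory are per thread,
// __shared__ would be per block, global memory is visible to the whole grid.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    // grid of (n+255)/256 blocks, each block 256 threads
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```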

Shared memory (16KB in each multiprocessor): divided into banks (16 banks), each bank is 4 bytes
Global memory: coalesced access
Texture: texture filtering
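To illustrate coalescing (a sketch, not a benchmark): consecutive threads of a half-warp should touch consecutive addresses, so the first kernel below coalesces while the strided one does not:

```cuda
// Coalesced: thread i reads element i, so a half-warp reads 16
// consecutive 4-byte words that can be served in one transaction.
__global__ void copy_coalesced(float *dst, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

// Non-coalesced: a stride of 16 scatters a half-warp's reads
// across memory, forcing separate transactions.
__global__ void copy_strided(float *dst, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * 16) % n;  // strided index, illustrative only
    if (i < n) dst[i] = src[j];
}
```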


Differences with CPU: Latency, Branch code

CUDA Toolkit: http://www.nvidia.com/object/cuda_get.html

Compile: nvcc

Two different APIs: Runtime API (easier), Driver API

cudaMalloc, cudaMemcpy -- malloc, memcpy
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
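The runtime calls mirror their C counterparts; a minimal round trip (sizes illustrative, error handling omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    float host[4] = {1, 2, 3, 4};
    float *dev;
    cudaMalloc(&dev, sizeof(host));                              // like malloc
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice); // like memcpy
    // ... launch kernels that read/write dev here ...
    cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    printf("%f\n", host[0]);
    return 0;
}
```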
__global__ , __shared__ , __syncthreads()
bank conflict
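These keywords and the bank-conflict issue show up together in block-level kernels. A sketch of a shared-memory reduction, assuming a 256-thread block: consecutive threads access consecutive 4-byte words, so each thread of a half-warp falls in a different bank and there are no conflicts:

```cuda
#define BLOCK 256

// Block-level sum reduction in shared memory.
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float buf[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                      // wait for the whole block

    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];         // one partial sum per block
}
```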

clock: timestamp
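clock() read inside a kernel returns the multiprocessor's cycle counter, so it measures cycles, not wall time; a sketch of timing a kernel body per block (kernel name illustrative):

```cuda
#include <cuda_runtime.h>

// Record start/stop cycle counts around the work; thread 0 of each
// block writes that block's elapsed cycles (per-multiprocessor counter).
__global__ void timed_kernel(float *data, clock_t *cycles, int n) {
    clock_t start = clock();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
    __syncthreads();
    if (threadIdx.x == 0)
        cycles[blockIdx.x] = clock() - start;
}
```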

Arithmetic unit in each stream processor: a single-precision fused multiply-add (FMA) unit

cudaMallocHost: page-locked (pinned) host memory, faster host-device transfers
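Page-locked memory lets the GPU DMA directly from host memory; a sketch (sizes illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    float *pinned, *dev;
    cudaMallocHost(&pinned, bytes);  // page-locked host allocation
    cudaMalloc(&dev, bytes);
    // transfers from pinned memory are faster and can be asynchronous
    cudaMemcpyAsync(dev, pinned, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    cudaFree(dev);
    cudaFreeHost(pinned);            // matching free for cudaMallocHost
    return 0;
}
```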
