Monday, 17 June 2013


Posted by Mahesh Doijade

      CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model that first enabled high-level programming of GPUs (Graphics Processing Units), thereby making them available for general-purpose computing. Previously, anyone who wanted to use a GPU for general-purpose computation had to go through a graphics API such as OpenGL or DirectX and map their computation onto graphics operations. CUDA removed this obstacle, and GPUs are now also referred to as GPGPUs, short for General-Purpose computing on Graphics Processing Units. By writing code in CUDA, programmers can speed up their algorithms to a massive extent, thanks to the amount of parallelism GPUs offer.
      One can start programming in CUDA using the extensions it provides for C, C++, and Fortran. NVIDIA's basic CUDA setup consists of three parts: the NVIDIA graphics card drivers, the CUDA Toolkit, and the CUDA SDK. The drivers are available for most Linux distributions as well as for Windows. The CUDA Toolkit provides the development environment: the compiler along with tools for debugging and performance optimization. The CUDA SDK (Software Development Kit) provides other basic requisites such as sample programs and CUDA-based libraries for tasks like basic image processing and linear algebra primitives. The CUDA platform allows the creation of a very large number of concurrently executing threads at very low system-resource cost.
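To give a feel for the C extensions mentioned above, here is a minimal sketch (the kernel name, sizes, and launch configuration are illustrative choices, not from the original post) of a vector-addition kernel, showing how CUDA launches many lightweight threads, each handling one element:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Each thread computes exactly one element of the output vector.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overrun
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;                // one million elements
    size_t bytes = n * sizeof(float);

    // Host-side buffers.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // Device-side buffers and host-to-device copies.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements:
    // over a million concurrent threads, created at negligible cost.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %f\n", h_c[100]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Compile with `nvcc vecadd.cu -o vecadd` (requires the CUDA Toolkit and an NVIDIA GPU). The `<<<blocks, threads>>>` launch syntax is the core C extension: it maps the same kernel function onto a grid of thousands of threads in one call.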
[Figure: architecture of a CUDA-capable GPU]
    The figure above shows the architecture of a CUDA-capable GPU. It consists of a number of Streaming Multiprocessors (SMs), each containing several Streaming Processors (SPs), also known as CUDA cores; both the number of SMs and the number of CUDA cores per SM depend on the type and generation of the GPU device. The memory hierarchy, from slowest to fastest, is: global memory (device memory), texture memory, constant memory, shared memory, and registers. So the fastest accesses are from registers and the slowest from global memory. Hence, CUDA programmers need to write their code with this hierarchy in mind in order to gain the maximum performance benefit.
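The hierarchy above can be exploited directly in a kernel. As a hedged sketch (the kernel name and block size of 256 are illustrative assumptions), here is a block-level sum reduction that stages data in fast on-chip shared memory instead of repeatedly reading slow global memory:

```cuda
#include <cuda_runtime.h>

// Each block loads its slice of the input into shared memory once,
// then performs the whole reduction on-chip, touching global memory
// only for the initial load and the final per-block result.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float tile[256];          // fast per-block shared memory

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;  // one global-memory read per thread
    __syncthreads();                     // wait until the tile is fully loaded

    // Tree reduction carried out entirely in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            tile[tid] += tile[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = tile[0];       // one global-memory write per block
}
```

Without the shared-memory tile, each reduction step would re-read operands from global memory; staging the data on-chip is the standard way to act on the register/shared-versus-global speed gap the hierarchy describes.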

