CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model that enables high-level programming for GPUs (Graphics Processing Units), making them available for general-purpose computing. Previously, anyone who wanted to use a GPU for general-purpose computation had to map their problem onto a graphics API such as OpenGL or DirectX. CUDA removed this obstacle, which is why GPUs are now also used for GPGPU, that is, General-Purpose computing on Graphics Processing Units. By writing code in CUDA, programmers can speed up their algorithms massively thanks to the amount of parallelism GPUs offer.
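As a minimal sketch of what this programming model looks like in practice, the following vector-addition program launches one GPU thread per array element. The kernel name `vecAdd` and the use of managed memory are illustrative choices, not taken from the text above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes exactly one element of c = a + b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard against out-of-range threads
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    // Unified (managed) memory keeps this sketch short; real code often
    // uses explicit cudaMalloc/cudaMemcpy between host and device.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();   // wait for the GPU to finish

    printf("c[0] = %f\n", c[0]);

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Every iteration of what would be a sequential loop on the CPU runs as an independent thread here, which is where the massive speedup comes from.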
The figure above shows the architecture of a CUDA-capable GPU. It consists of multiple Streaming Multiprocessors (SMs), each of which contains several Streaming Processors (SPs), also known as CUDA cores; the number of SMs and CUDA cores depends on the model and generation of the GPU device. The memory hierarchy, from slowest to fastest, is: global memory (device memory), texture memory, constant memory, shared memory, and registers. Thus register access is the fastest and global memory access is the slowest. CUDA programmers therefore need to write their code with this hierarchy in mind to gain the maximum performance benefit.
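The memory hierarchy can be exploited as sketched below: a block-level sum reduction that reads each input element from slow global memory exactly once, then does all intermediate work in fast on-chip shared memory. The kernel name `blockSum` and the fixed block size of 256 are assumptions for illustration.

```cuda
// Sum-reduce each 256-element chunk within a thread block.
// Global memory is touched only once per element; the tree
// reduction runs entirely in on-chip shared memory.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];          // one slot per thread in the block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // global -> shared
    __syncthreads();                     // all loads done before reducing

    // Halve the active threads each step, adding pairs in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];       // shared -> global, one write per block
}
```

Had each partial sum been accumulated directly in global memory instead, every step of the reduction would pay the full device-memory latency; keeping the working set in shared memory is precisely the kind of hierarchy-aware choice the paragraph above describes.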