Monday, 24 March 2014

Hack The New Programming Language

Posted by Amit Khomane 2 Comments
Hack Programming Language,  Facebook Hack Programming Language
Any programming language has three main goals: performance, productivity and generality. In practice we always have to give up on one of the three. People keep trying to create a so-called super-language that meets all three, and that leads to new languages being created and existing ones being modified. Some other reasons new languages appear are:

  • People take ideas from different languages and combine them into a new language.
  • Some features of an existing language are improved, some are added and some are removed.
  • Programmers start using a language in a particular way; language designers identify those usage patterns and introduce new abstractions to support them.
  • Some languages are designed to support particular domains.
  • People think they can improve on existing products.

Now Facebook has jumped into programming language creation. Google is already there: it has developed Go (aka golang) and Dart, a web programming language. Facebook has unveiled a language called Hack (nothing to do with hacking). What made them go for Hack? Let's dig deeper to understand it.

What is Hack?

As said earlier, it is a programming language created by Facebook (introduced on March 20, 2014, so it is still very young). Hack is for the HipHop Virtual Machine (HHVM). Now what's that? In the simplest terms, it is a PHP execution engine with improvements. HHVM is a virtual machine designed to execute programs written in Hack and PHP. Its main goal is to speed up PHP applications. Hack is, in simpler terms, a new dialect of PHP.

Why Hack?

I think the point that will attract PHP developers is that it interoperates seamlessly with PHP. The highlight of Hack is static typing. OK, I know a bit about static typing, but why and how does that matter?
  • Static Typing:
A statically typed language requires you to carefully declare your variable types. Annoying to many developers, right? (Especially those who script in a dynamically typed language like PHP.) But where does it help? It is said to need fewer servers to run your code, and the code is easier to manage, since the types themselves give you the documentation you need for collaborative development. You do not need to explain the types in your code to other developers.
So Hack provides static typing? YES. And it still interoperates with PHP? How? The answer is gradual typing: typed and untyped code can live side by side, so it is both static and dynamic typing. (Now that makes sense.)
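To make gradual typing concrete, here is a minimal sketch. It is written in Python rather than Hack, purely as an analogy: Python's optional type hints also let annotated and unannotated code coexist. The function names are made up for illustration, and unlike Hack's type checker, Python itself does not enforce the annotations; a separate static checker would.

```python
def add_typed(a: int, b: int) -> int:
    """Annotated: a static checker can verify every caller."""
    return a + b


def add_untyped(a, b):
    """Unannotated: behaves like plain dynamic code."""
    return a + b


def declared_types(func):
    """Return the annotations a static checker would see for func."""
    return dict(func.__annotations__)
```

Both functions can be called from the same program, which is the point of gradual typing: you can add types to one function at a time instead of converting a whole code base at once.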

Other points to highlight about Hack are:
  • Generics:
They allow classes and methods to be parameterized by type; the concrete type is bound when a class is instantiated or a method is called.

  • Collections:
They improve the experience of working with PHP arrays. Hack implements collection types such as Vector, Map, Set and Pair.
There are a few more features, such as Lambdas, Shapes, Type Aliasing, Async, Continuations, the Override Attribute, Method Dispatch and others.
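The idea behind the Generics item above can be sketched as follows. This is again a Python analogy and not Hack syntax; the `Stack` class and its methods are hypothetical names chosen for the example.

```python
from typing import Generic, List, TypeVar

T = TypeVar("T")  # placeholder for the element type


class Stack(Generic[T]):
    """A stack parameterized by the type of its elements."""

    def __init__(self) -> None:
        self._items: List[T] = []

    def push(self, item: T) -> None:
        self._items.append(item)

    def pop(self) -> T:
        return self._items.pop()


def demo():
    numbers = Stack[int]()  # the type parameter is bound at instantiation
    numbers.push(1)
    numbers.push(2)
    return numbers.pop()
```

The class is written once, and the element type is supplied where it is used, so a checker can flag pushing a string onto a `Stack[int]`.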
Tested and Proven:

So the next thing that comes to mind: is it tested and proven? Facebook states that it has already deployed Hack code and "battle tested" it on a large portion of Facebook's site. So we can consider it safe and tested (if we believe what they state).

PHP to Hack:

Can you run your PHP project as Hack? Hack provides some tools that can be used to convert your code into Hack. The conversion is automated and runs on HHVM, so you need not worry about anything other than your logic.

Hack is creating a buzz and has won appreciation from some top software developers. But some do question whether it is a new language or just an update to PHP.

Sunday, 16 March 2014

Why different flavors of Python

Posted by Unknown 5 Comments

Why different flavors of python, Python, Implementations of Python

    We always want different flavors in everything, don't we? As they say, variety adds spice to life, and Python need not be an exception. How many times have you wondered why there are Python, CPython, Jython, IronPython and many more .*ython, that is, these different flavors of Python? Let's start understanding them.

1) CPython:    The base of all these implementations is CPython, more commonly known simply as Python (yes, it's true, your Python is actually CPython). CPython is the de facto reference implementation of Python. Some implementations extend its behavior or features in some respects; others are re-implementations of the language that do not depend on or interact with the CPython runtime core but reuse CPython's standard library implementation. Since CPython only sets the standard for the language, anyone can implement Python as a compiler or as an interpreter, according to their requirements.
    It is written in C and is a bytecode interpreter: your source is compiled to bytecode, which is executed on the CPython virtual machine. That is why I used to say, foolishly, that Python is an interpreted language, which is true of CPython but not of all Python variants.
    The Python website categorizes all other implementations as Alternative Implementations (rightly so).
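If you want to check which flavor you are running and peek at the bytecode mentioned above, a small sketch using only the standard library could look like this. The helper names and the `add` function are made up for illustration; the exact instruction names differ between implementations and versions, so none are shown here.

```python
import dis
import platform


def add(a, b):
    return a + b


def implementation_name():
    """Name of the running implementation, e.g. 'CPython' or 'PyPy'."""
    return platform.python_implementation()


def bytecode_ops(func):
    """List the instruction names the interpreter executes for func."""
    return [instruction.opname for instruction in dis.get_instructions(func)]


if __name__ == "__main__":
    print(implementation_name())
    print(bytecode_ops(add))
```

Running the same script under different implementations is a quick way to see that "Python" the language and the interpreter executing it are separate things.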

2) Jython:    Jython, earlier known as JPython (it is the successor of JPython), is an implementation of the Python programming language designed to run on the Java platform. It has a compiler that compiles your Python code into Java bytecode. A stable release compatible with the latest Python is still not available.
    But the question is, why do we need Jython? Jython brings compatible Python libraries that are really helpful for developers (now you have both Java and Python libraries in one place, isn't that great?), which makes it a more powerful language. You can look at it as a way to glue together and leverage an existing collection of class libraries. Some also find Python's syntax more concise and quicker to write.
    For instance, suppose you want to write an HTTP GET request in Java: how much code do you have to write? And when you know urllib in Python, don't things get easier?
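For reference, here is roughly what the urllib side of that comparison looks like. This is a sketch in Python 3, where the module is `urllib.request`; on a Python 2 based implementation the equivalent call lives in `urllib2`. The tiny local server, the `HelloHandler` class and the `demo` helper are assumptions added only so the example runs without touching the internet.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with 'hello'."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch(url, opener=None):
    """The actual HTTP GET: open the URL, read status and body."""
    open_url = opener.open if opener else urllib.request.urlopen
    with open_url(url, timeout=5) as response:
        return response.status, response.read()


def demo():
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        # Bypass any configured proxy, since the server is local.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        return fetch("http://127.0.0.1:%d/" % port, opener)
    finally:
        server.shutdown()
        server.server_close()
```

The part that matters is `fetch`: the GET itself is two lines, and everything else is scaffolding for the demo.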

3) IronPython:    IronPython is another implementation of Python, targeting the .NET Framework. It is written entirely in C# and runs on the .NET virtual machine (which is nothing but the CLR). Again, you can import C# classes into IronPython. IronPython can use both the .NET Framework and Python libraries (again, isn't that great?), giving Python developers the power of the .NET Framework and .NET developers the power of scripting.
    The current version targets Python 2.7. There are some known compatibility issues with CPython libraries.

4) PyPy:    PyPy is a bit different from the other implementations: its primary focus is to overcome the (so-called) drawbacks of Python. It is a Python interpreter and just-in-time (JIT) compiler. It is written in Python itself. More precisely, PyPy is written in a language called RPython, which is suitable for writing dynamic language interpreters (and not much else). RPython is a subset of Python, and its toolchain is itself written in Python. According to claims made by the PyPy project, it is around 6 times faster than CPython.
    While the JIT is the main goal, PyPy also focuses on efficiency and on compatibility with the original CPython interpreter. Its implementation is highly compatible with CPython.
    There is a lot of talk about PyPy being the future of the language. For some, a JIT may not be that useful for small scripts with little repeated code. Let's wait and watch for more from PyPy; the future is yet to come.

5) Stackless Python:    If you have ever tried threading in Python, you will understand the what and why of Stackless Python. Stackless Python is a reimplementation of traditional Python whose key feature is lightweight threading. You can say it is a branch of CPython supporting microthreads.
    Stackless Python's key concept is the tasklet, which is nothing but a tiny task. It allows you to run hundreds of thousands of tasklets in a single main thread. Tasklets run independently of each other within that thread and communicate via channels. A channel is a sort of manager that takes responsibility for suspending and resuming tasklets. You can also control the scheduling of tasklets, not to mention that tasklets can be serialized with pickle.
    Stackless is claimed to be completely compatible with standard Python, but some issues have been reported when using it with PyQt.
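The tasklet-and-channel idea can be imitated in ordinary Python with generators. The sketch below does not use the Stackless API at all; `producer`, `consumer` and `run_round_robin` are hypothetical names, and a plain `deque` stands in for a channel.

```python
from collections import deque


def producer(channel, items):
    """Tiny task: put one item on the channel, then give up control."""
    for item in items:
        channel.append(item)
        yield


def consumer(channel, received, count):
    """Tiny task: take items off the channel until count have arrived."""
    while len(received) < count:
        if channel:
            received.append(channel.popleft())
        yield


def run_round_robin(tasks):
    """Cooperative scheduler: resume each task in turn until all finish."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # finished tasks are dropped
        ready.append(task)


def demo():
    channel, received = deque(), []
    run_round_robin([producer(channel, [1, 2, 3]),
                     consumer(channel, received, 3)])
    return received
```

Calling `demo()` returns `[1, 2, 3]`. Everything runs in one thread, and a task only loses control at a `yield`, which is why such tiny tasks are so much cheaper than operating system threads.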

6) ActiveState ActivePython:    ActivePython is a CPython distribution from a company named ActiveState. It is a commercial, proprietary distribution. Its key points are support and reduced risk for commercial applications, plus some additional modules (I haven't tried it yet).

7) Pythonxy:    Firstly, it is pronounced "Python x-y" and mostly written as Python(x,y). It is nothing but a scientific Python distribution: Python bundled with scientific and engineering packages is your Python(x,y). It also includes Qt for GUIs and the Spyder IDE.
    Why use it when I can manually install the packages I need? No doubt you can add any package to your Python, but what if the dish is served to you ready-made with these packages?

8) Portable Python:    How about having Python pre-configured, any time, anywhere, and run directly from any USB storage device? That is what Portable Python is. It is for Windows. You just need to extract it to your portable storage device.

9) Anaconda Python:    In short, it is a distribution for large-scale data processing, predictive analytics and scientific computing. It basically focuses on large-scale data.

Here are some other names worth mentioning. Some of them are language bindings, e.g. RubyPython; some are VMs, e.g. Brython.

  • PyIMSL Studio: a commercial distribution for numerical analysis, free for non-commercial use.
  • A tool that makes it easy to write C extensions for Python, call C functions and declare C-typed variables.
  • RubyPython: a bridge between the Ruby and Python interpreters. It embeds a running Python interpreter in Ruby applications.
  • A portable scientific Python distribution for Windows, similar to Python(x,y).
  • Python 3000 (Py3k): nothing but your Python 3.
  • Conceptive PythonSDK: for development and deployment of desktop applications.
  • A language bridge between Python and Objective-C.
  • Enthought Canopy: a commercial scientific Python distribution.
  • Brython: its goal is to replace JavaScript with Python as the scripting language for web browsers. It is a Python 3 VM written in JS.
  • A toolkit that combines all the advantages of Qt and Python.


Author :

Amit Khomane


Tuesday, 11 March 2014

Matrix Multiplication in CUDA using Shared memory

Posted by Unknown 7 Comments
Tiled Matrix Multiplication in CUDA, Matrix Multiplication in CUDA using Shared memory
Global memory access penalties greatly hamper the performance of CUDA code because of the latency involved. In matrix multiplication, a naive implementation makes plenty of redundant global memory accesses, since many of the accessed elements could be reused to compute several elements of the result. To eliminate this redundancy, one can use shared memory. With a tile-based approach, global memory accesses are reduced by a factor equal to the tile width, that is, the width of the part of the matrix held in shared memory. Go through the code to understand the algorithm; I will be pleased to answer any questions.
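Before reading the kernel, it may help to see the counting argument on the CPU. The following is a plain Python model, not CUDA: the function names are made up, and "loads" simply counts how many matrix elements would be fetched from global memory, under the assumption that a staged tile is read once and then reused.

```python
def matmul_naive(A, B):
    """Multiply A (n x m) by B (m x p), counting simulated global loads."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    loads = 0
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
                loads += 2  # one element of A and one of B per product
    return C, loads


def matmul_tiled(A, B, tile):
    """Same product, but operands are staged tile by tile before use."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    loads = 0
    for bi in range(0, n, tile):
        for bj in range(0, p, tile):
            for bk in range(0, m, tile):
                # Stage one tile of A and one of B; each element is loaded once.
                sA = [row[bk:bk + tile] for row in A[bi:bi + tile]]
                sB = [row[bj:bj + tile] for row in B[bk:bk + tile]]
                loads += sum(len(r) for r in sA) + sum(len(r) for r in sB)
                # Every product below reuses the staged tiles.
                for i, a_row in enumerate(sA):
                    for j in range(len(sB[0])):
                        for k, a in enumerate(a_row):
                            C[bi + i][bj + j] += a * sB[k][j]
    return C, loads


def demo(size=8, tile=4):
    A = [[i + k for k in range(size)] for i in range(size)]
    B = [[k - j for j in range(size)] for k in range(size)]
    naive, naive_loads = matmul_naive(A, B)
    tiled, tiled_loads = matmul_tiled(A, B, tile)
    return naive == tiled, naive_loads, tiled_loads
```

With the defaults, `demo()` returns `(True, 1024, 256)`: the results match, and the tiled version performs a quarter of the loads, which is the tile width of 4.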

Matrix Multiplication in CUDA using Shared memory

#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

// This code assumes that your device supports a block size of 1024 threads
#define MAX_RANGE 9999

#define funcCheck(stmt) do {                                                    \
        cudaError_t err = stmt;                                               \
        if (err != cudaSuccess) {                                             \
            printf( "Failed to run stmt at line %d\n", __LINE__);              \
            printf( "Got CUDA error ...  %s\n", cudaGetErrorString(err));      \
            return -1;                                                        \
        }                                                                     \
    } while(0)

// Compute C = A * B
__global__ void matrixMultiplyShared(float * A, float * B, float * C,
                                    int numARows, int numAColumns,
                                    int numBRows, int numBColumns,
                                    int numCRows, int numCColumns) {
    __shared__ float sA[32][32];   // Tile size of 32x32
    __shared__ float sB[32][32];

    int Row = blockDim.y*blockIdx.y + threadIdx.y;
    int Col = blockDim.x*blockIdx.x + threadIdx.x;
    float Cvalue = 0.0;
    sA[threadIdx.y][threadIdx.x] = 0.0;
    sB[threadIdx.y][threadIdx.x] = 0.0;

    // Walk over the tiles along the shared dimension of A and B
    for (int k = 0; k < (((numAColumns - 1)/ 32) + 1); k++) {
        // Each thread loads one element of the A tile (zero outside the matrix)
        if ( (Row < numARows) && (threadIdx.x + (k*32)) < numAColumns) {
            sA[threadIdx.y][threadIdx.x] = A[(Row*numAColumns) + threadIdx.x + (k*32)];
        } else {
            sA[threadIdx.y][threadIdx.x] = 0.0;
        }
        // Each thread loads one element of the B tile (zero outside the matrix)
        if ( Col < numBColumns && (threadIdx.y + k*32) < numBRows) {
            sB[threadIdx.y][threadIdx.x] = B[(threadIdx.y + k*32)*numBColumns + Col];
        } else {
            sB[threadIdx.y][threadIdx.x] = 0.0;
        }
        __syncthreads();   // Wait until both tiles are fully loaded

        for (int j = 0; j < 32; ++j) {
            Cvalue += sA[threadIdx.y][j] * sB[j][threadIdx.x];
        }
        __syncthreads();   // Wait before the tiles are overwritten in the next pass
    }
    if (Row < numCRows && Col < numCColumns) {
        C[Row*numCColumns + Col] = Cvalue;
    }
}

void matMultiplyOnHost(float * A, float * B, float * C, int numARows,
                        int numAColumns, int numBRows, int numBColumns,
                        int numCRows, int numCColumns) {
    for (int i=0; i < numARows; i ++) {
        for (int j = 0; j < numBColumns; j++) {
            C[i*numCColumns + j ] = 0.0;
            // Dot product of row i of A and column j of B
            for (int k = 0; k < numAColumns; k++) {
                C[i*numCColumns + j ] += A[i*numAColumns + k] * B [k*numBColumns + j];
            }
        }
    }
}

int main(int argc, char ** argv) {
    float * hostA; // The A matrix
    float * hostB; // The B matrix
    float * hostC; // The output C matrix
    float * hostComputedC;
    float * deviceA;
    float * deviceB;
    float * deviceC;

    // Please adjust rows and columns according to your needs.
    int numARows = 512; // number of rows in the matrix A
    int numAColumns = 512; // number of columns in the matrix A
    int numBRows = 512; // number of rows in the matrix B
    int numBColumns = 512; // number of columns in the matrix B

    int numCRows; // number of rows in the matrix C (you have to set this)
    int numCColumns; // number of columns in the matrix C (you have to set this)

    hostA = (float *) malloc(sizeof(float)*numARows*numAColumns);
    hostB = (float *) malloc(sizeof(float)*numBRows*numBColumns);

    for (int i = 0; i < numARows*numAColumns; i++)
        hostA[i] = (rand() % MAX_RANGE) / 2.0;
    for (int i = 0; i < numBRows*numBColumns; i++)
        hostB[i] = (rand() % MAX_RANGE) / 2.0;

    // Setting numCRows and numCColumns
    numCRows = numARows;
    numCColumns = numBColumns;

    hostC = (float *) malloc(sizeof(float)*numCRows*numCColumns);    
    hostComputedC = (float *) malloc(sizeof(float)*numCRows*numCColumns);    

    // Allocating GPU memory
    funcCheck(cudaMalloc((void **)&deviceA, sizeof(float)*numARows*numAColumns));
    funcCheck(cudaMalloc((void **)&deviceB, sizeof(float)*numBRows*numBColumns));
    funcCheck(cudaMalloc((void **)&deviceC, sizeof(float)*numCRows*numCColumns));

    // Copy memory to the GPU 
    funcCheck(cudaMemcpy(deviceA, hostA, sizeof(float)*numARows*numAColumns, cudaMemcpyHostToDevice));
    funcCheck(cudaMemcpy(deviceB, hostB, sizeof(float)*numBRows*numBColumns, cudaMemcpyHostToDevice));

    // Initialize the grid and block dimensions 
    dim3 dimBlock(32, 32, 1);    
    dim3 dimGrid((numCColumns/32) + 1, (numCRows/32) + 1, 1);

    //@@ Launch the GPU Kernel here
    matrixMultiplyShared<<<dimGrid, dimBlock>>>(deviceA, deviceB, deviceC, numARows, numAColumns, numBRows, numBColumns, numCRows, numCColumns);    

    // Report a launch error only if one actually occurred
    cudaError_t err1 = cudaPeekAtLastError();
    if (err1 != cudaSuccess)
        printf( "Got CUDA error ... %s \n", cudaGetErrorString(err1));

    // Copy the results in GPU memory back to the CPU    
    funcCheck(cudaMemcpy(hostC, deviceC, sizeof(float)*numCRows*numCColumns, cudaMemcpyDeviceToHost));

    matMultiplyOnHost(hostA, hostB, hostComputedC, numARows, numAColumns, numBRows, numBColumns, numCRows, numCColumns);

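    // Compare with a relative tolerance: the GPU and CPU may round differently
    for (int i=0; i < numCColumns*numCRows; i++) {
        float diff = hostComputedC[i] - hostC[i];
        if (diff < 0) diff = -diff;
        float ref = hostComputedC[i] < 0 ? -hostComputedC[i] : hostComputedC[i];
        if (diff > 1e-3f * ref) {
            printf("Mismatch at Row = %d Col = %d hostComputed[] = %f --device[] %f\n", i / numCColumns, i % numCColumns, hostComputedC[i], hostC[i]);
        }
    }
    // Free the GPU memory
    funcCheck(cudaFree(deviceA));
    funcCheck(cudaFree(deviceB));
    funcCheck(cudaFree(deviceC));

    // Free the host memory
    free(hostA);
    free(hostB);
    free(hostC);
    free(hostComputedC);

    return 0;
}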

Friday, 7 March 2014

HSA : Heterogeneous System Architecture Why and What

Posted by Unknown
HSA, Heterogeneous System Architecture, HSA : Heterogeneous System Architecture, Why HSA, What is HSA

Why Heterogeneous System Architecture (HSA) :  

    Increasing power consumption is unfavorable across every segment of computing. There is a critical need for improved battery life in laptops, tablets and smartphones. Also, data center power and cooling costs continue to increase. On the other side, there is a need for improved performance in order to enable enhanced user experiences, to make human-computer interaction more natural, and to let devices manage ever-expanding volumes of data.
    To produce these kinds of user experiences, programmer productivity is another essential element that must be delivered. Software developers should be empowered to tap into new capabilities through familiar, powerful programming models.
    Also, it is critical that software be portable across an enormous range of devices. It is daunting to sustain the current trend of rewriting code for an ever-growing number of different platforms. To overcome this, the industry needs a different and more efficient approach to computer architecture, one that solves the issues of power, performance, programmability and portability.

    Alongside the CPU, a co-processor is normally used for specialized tasks such as high-end graphics or networking. The most dominant among these is the graphics processing unit (GPU), designed mainly to perform specialized graphics computations in parallel. In recent years, GPUs have become more powerful and more general with the advent of programming models such as OpenCL and CUDA, allowing them to be applied to general-purpose parallel computing tasks with excellent performance per watt.
    Today, an increasing number of mainstream applications require the high performance and power efficiency achievable only through such highly parallel computation. But because current CPUs and GPUs have been designed as separate processing elements, they do not work together efficiently and are cumbersome to program. Each has a separate memory space, requiring an application to explicitly copy data from CPU to GPU and then back again.
    The working model here is that an application running on the CPU queues work for the GPU using system calls through a device driver stack managed by a completely separate scheduler. This introduces significant dispatch latency, with overhead that makes the process worthwhile only when the application requires a very large amount of parallel computation. Further, a program running on the GPU generally cannot directly generate work-items, either for itself or for the CPU. There has been some support for this since the Nvidia Kepler GPU architecture, but it is restricted to those recent architectures and the CUDA programming model.
    To leverage the full capabilities of parallel execution units, computer system designers must think differently. They must re-architect computing systems to tightly integrate the discrete compute elements on a platform into an evolved central processor, while providing a programming path that does not require fundamental changes for software developers. This is the main aim of the new HSA design.
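    The dispatch-overhead argument above can be put into a toy cost model. The sketch below is in Python, and everything in it is an assumption made for illustration: the function names, the simple linear model, and any numbers you plug in are hypothetical and are not measurements of a real device.

```python
import math


def gpu_offload_wins(items, cpu_per_item, gpu_per_item,
                     copy_per_item, dispatch_overhead):
    """True if offloading is faster than staying on the CPU in this model."""
    cpu_total = items * cpu_per_item
    # Offloading pays a fixed dispatch cost plus a copy cost for every item.
    gpu_total = dispatch_overhead + items * (copy_per_item + gpu_per_item)
    return gpu_total < cpu_total


def break_even_items(cpu_per_item, gpu_per_item,
                     copy_per_item, dispatch_overhead):
    """Smallest item count at which offloading wins, or None if it never does."""
    saving = cpu_per_item - (copy_per_item + gpu_per_item)
    if saving <= 0:
        return None  # copying eats the whole per-item gain
    return math.floor(dispatch_overhead / saving) + 1
```

    In this model the fixed overhead has to be repaid by the per-item saving, so small jobs stay on the CPU, and shrinking the copy and dispatch terms moves the break-even point down.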

What HSA does for all these prevailing issues:

    HSA provides an improved processor design that exposes the benefits and capabilities of the different processing elements to programmers as if they were programming just CPUs, making the elements work together seamlessly. With HSA, applications can create data structures in one unified address space and can initiate work items on the hardware most appropriate for a given task. Sharing data between compute elements is as easy as passing a pointer. Multiple compute tasks can work on the same coherent memory regions, using barriers and atomic memory operations as needed to maintain data synchronization, similar to what multi-core CPUs do. These are some of the highlights of HSA, but there is much more to it.
    The HSA Foundation (HSAF) was formed as an open industry standards body to unify the computing industry around a common approach. The HSA founding members are AMD, ARM, Imagination Technologies, MediaTek, Texas Instruments, Samsung Electronics and Qualcomm. There are also lots of other supporters from both academia and industry.