INSIGHTS
This project was very rewarding: we ended up using most of what we learned in class and implemented QR factorization on both MPI and CUDA. Because we used premade kernels from the CUBLAS library, the project also gave us hands-on experience with that library. We believe there is no need to reinvent the wheel; what matters is understanding how to use the basic functions to construct a more complicated project. We focused on one of the challenges of the project: being faster than numpy. We achieved this with our MPI + numpy implementation. However, our pure-Python implementation of the Householder QR factorization algorithm is quite slow compared with numpy. According to the documentation, numpy delegates to LAPACK (Linear Algebra PACKage), which is written in C/Fortran by linear algebra experts. It is therefore quite hard to outperform it with Householder code written in Python.
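To make the comparison concrete, a minimal numpy-based Householder QR looks roughly like the sketch below. This is an illustrative version, not our exact project code: it applies one Householder reflector per column and accumulates Q explicitly, which is exactly the kind of Python-level looping that LAPACK's compiled, blocked routines easily outperform.

```python
import numpy as np

def householder_qr(A):
    """Householder QR sketch: returns Q (m x m) and R (m x n) with A = Q @ R."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = R[k:, k]
        normx = np.linalg.norm(x)
        if normx == 0.0:
            continue  # column already zero below the diagonal
        # Householder vector v reflecting x onto a multiple of e1;
        # copysign avoids cancellation when x[0] is near zero.
        v = x.copy()
        v[0] += np.copysign(normx, x[0])
        v /= np.linalg.norm(v)
        # Apply H = I - 2 v v^T to the trailing submatrix of R,
        # and accumulate it into Q from the right.
        R[k:, :] -= 2.0 * np.outer(v, v @ R[k:, :])
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, R
```

One Python-level pass like this per column, versus numpy's single call into LAPACK (`np.linalg.qr(A)`), is the main source of the slowdown we observed.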
FUTURE EXTENSIONS
There is quite some room for further extension. With more time, we would like to implement a few more serial QR algorithms, which some of the literature suggests may have better performance. Also, to compare performance fairly and be faster than numpy's QR, we should implement all of the code in C for better speedup. Due to the complexity of matrix multiplication, norm calculation, and matrix-vector multiplication, we used the CUBLAS library; we would like to extend the project by writing all of the kernels in C and calling them from C as well. We believe that in this way we could beat the numpy algorithm in a parallel setting, as has been done in some of the literature. Due to the complexity of parallel computing in C, we did not pursue this direction in the current project, but we would like to try it in the future.
REFLECTION
We enjoyed the project mostly through learning how to use the CUBLAS library. It was certainly frustrating at times, and we ended up trying quite a few examples to make it work. The MPI implementation also involved real logic design and thinking: we spent a lot of time discussing how to track the reduction list and how to send data between processes to assemble the Q matrix. We did not just make the code work; we also put a lot of effort into optimizing it. In the MPI implementation, we timed all the major parts line by line to see which took the most time.
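The reduction logic we describe can be illustrated with a serial numpy sketch of one plausible scheme: each "process" factors its own row block locally, the small R factors are gathered and factored again, and the local Q's are combined into the global Q. This is a TSQR-style reduction and is our assumption for illustration, not necessarily the exact reduction order or communication pattern our MPI code used.

```python
import numpy as np

def two_block_reduction(A, split):
    """Serial sketch of one QR reduction step for a tall-skinny matrix A.

    Rows [:split] play the role of process 0's block, rows [split:]
    of process 1's block. Only the small n x n R factors would need
    to be communicated in the real MPI version.
    """
    n = A.shape[1]
    Q1, R1 = np.linalg.qr(A[:split])   # local QR on "process 0"
    Q2, R2 = np.linalg.qr(A[split:])   # local QR on "process 1"
    # Reduction step: factor the stacked R factors.
    Q12, R = np.linalg.qr(np.vstack([R1, R2]))
    # Assemble the global Q: each local Q is updated by its slice of Q12.
    Q = np.vstack([Q1 @ Q12[:n], Q2 @ Q12[n:]])
    return Q, R
```

In an actual MPI run, tracking which pairs of ranks reduce together at each level (the "reduction list") and routing the Q12 slices back to the right ranks is exactly the bookkeeping we spent our design time on.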
The project was challenging mainly because we were targeting beating numpy, and beating our own algorithms too. As we mentioned above, it is quite difficult to outperform numpy, due to both the language barrier and algorithmic complexity.
If we had the chance to do this again, we would probably implement all of the code in C, so that we would have a better understanding of how it works and potentially get a better speedup.
CS 205 Final Project | Fall 2013 | Xueshan Zhang and Ming Zhu