INSIGHTS
This project was very rewarding: we ended up using most of what we learned in class and implemented QR factorization on both MPI and CUDA. Because we used premade kernels from the CUBLAS library, the project also gave us hands-on experience with that library. We believe there is no need to reinvent the wheel; what matters is understanding how to use the basic functions to construct a more complicated project. We focused on one of the challenges of the project: being faster than numpy. We achieved this with our MPI + numpy implementation. However, our pure-Python implementation of the Householder QR factorization algorithm is quite slow compared with numpy. According to the documentation, numpy delegates to LAPACK (Linear Algebra PACKage), which is written in C/Fortran by linear algebra experts. It is therefore quite hard to outperform it with Householder code written in Python.
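To make the comparison concrete, a minimal numpy-based Householder QR looks roughly like the sketch below. This is an illustrative version, not our exact project code: it applies one Householder reflector per column and accumulates Q explicitly, which is exactly the kind of Python-level looping that LAPACK's compiled, blocked routines easily outperform.

```python
import numpy as np

def householder_qr(A):
    """Householder QR sketch: returns Q (m x m) and R (m x n) with A = Q @ R."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = R[k:, k]
        normx = np.linalg.norm(x)
        if normx == 0.0:
            continue  # column already zero below the diagonal
        # Householder vector v reflecting x onto a multiple of e1;
        # copysign avoids cancellation when x[0] is near zero.
        v = x.copy()
        v[0] += np.copysign(normx, x[0])
        v /= np.linalg.norm(v)
        # Apply H = I - 2 v v^T to the trailing submatrix of R,
        # and accumulate it into Q from the right.
        R[k:, :] -= 2.0 * np.outer(v, v @ R[k:, :])
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, R
```

One Python-level pass like this per column, versus numpy's single call into LAPACK (`np.linalg.qr(A)`), is the main source of the slowdown we observed.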
FUTURE EXTENSIONS
There is quite some room for further extension. With more time, we would like to implement a few more serial QR algorithms, which some of the literature suggests may have better performance. Also, to compare performance fairly and be faster than numpy's QR, we should implement all of the code in C for better speedup. Due to the complexity of matrix multiplication, norm calculation, and matrix-vector multiplication, we used the CUBLAS library; we would like to extend the project by writing all of the kernels in C and calling them from C as well. We believe that in this way we could beat the numpy algorithm in a parallel setting, as has been done in some of the literature. Due to the complexity of parallel computing in C, we did not pursue this direction in the current project, but we would like to try it in the future.
REFLECTION
We enjoyed the project mostly through learning how to use the CUBLAS library. It was certainly frustrating at times, and we ended up trying quite a few examples to make it work. The MPI implementation also involved real logic design and thinking: we spent a lot of time discussing how to track the reduction list and how to send data between processes to assemble the Q matrix. We did not just make the code work; we also put a lot of effort into optimizing it. In the MPI implementation, we timed all the major parts line by line to see which took the most time.
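The reduction logic we describe can be illustrated with a serial numpy sketch of one plausible scheme: each "process" factors its own row block locally, the small R factors are gathered and factored again, and the local Q's are combined into the global Q. This is a TSQR-style reduction and is our assumption for illustration, not necessarily the exact reduction order or communication pattern our MPI code used.

```python
import numpy as np

def two_block_reduction(A, split):
    """Serial sketch of one QR reduction step for a tall-skinny matrix A.

    Rows [:split] play the role of process 0's block, rows [split:]
    of process 1's block. Only the small n x n R factors would need
    to be communicated in the real MPI version.
    """
    n = A.shape[1]
    Q1, R1 = np.linalg.qr(A[:split])   # local QR on "process 0"
    Q2, R2 = np.linalg.qr(A[split:])   # local QR on "process 1"
    # Reduction step: factor the stacked R factors.
    Q12, R = np.linalg.qr(np.vstack([R1, R2]))
    # Assemble the global Q: each local Q is updated by its slice of Q12.
    Q = np.vstack([Q1 @ Q12[:n], Q2 @ Q12[n:]])
    return Q, R
```

In an actual MPI run, tracking which pairs of ranks reduce together at each level (the "reduction list") and routing the Q12 slices back to the right ranks is exactly the bookkeeping we spent our design time on.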
The project was challenging mainly because we were targeting beating numpy, and beating our own algorithms too. As we mentioned above, it is quite difficult to outperform numpy, due to both the language barrier and algorithmic complexity.
If we had the chance to do this again, we would probably implement all of the code in C, so that we would have a better understanding of how it works and potentially get a better speedup.
CS 205 Final Project | Fall 2013 | Xueshan Zhang and Ming Zhu