1.1 Background
Matrix and vector operations make up a substantial part of the scientific computing
workload, and they have been the subject of much work and many optimisations
aimed at increasing performance and efficiency. The introduction of parallel
computing and distributed data has complicated the work required to achieve
performance gains, and several libraries have been written in an attempt
to hide much of the detail and difficulty involved in high-performance
parallel programming. Parallel computing also brought many new concepts and
challenges to computer science. For example, the gain obtained by using several
processors to do the work previously done by one is defined as speedup (Equation
1.1), where $T_p$ is the time the parallel implementation needs on $p$ processors to
complete the same task that the serial implementation completes in time $T_s$.
\[
  S_p = \frac{T_s}{T_p}
  \tag{1.1}
\]
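For example (with illustrative numbers, not taken from any measurement in this work), a task that takes $T_s = 80$ s with the serial implementation and $T_p = 10$ s on $p = 8$ processors has speedup $S_p = 80/10 = 8$.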
Speedup equal to $p$ (the number of processors) is called linear speedup, and
implies that doubling the number of processors halves the wall-time required
to complete the task. This is considered good speedup, but it is difficult
to achieve because the processors must communicate with each other in addition
to performing the computations they already had to do. Sometimes super-linear speedup
($S_p > p$) is observed, because splitting the domain over several processors can make
each sub-domain fit in a higher level of cache on its processor. One industry-standard
library for inter-processor communication is the Message Passing Interface
(MPI [1]). To minimise the cost of communication it supports several
communication methods, and which one is best suited depends on the situation.
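As a minimal sketch of what such communication methods look like in practice (this example is not taken from the thesis; the file name and values are purely illustrative), the program below sums one value per process in two ways: first with blocking point-to-point calls (MPI_Send/MPI_Recv), then with a single collective call (MPI_Reduce), timing each variant with MPI_Wtime. A collective expresses the same operation in one call and leaves the choice of message pattern to the MPI implementation, which is one way a library can reduce the cost of communication.

/* Illustrative sketch: two ways to sum one value per MPI process.
 * Compile with: mpicc reduce_demo.c -o reduce_demo
 * Run with:     mpirun -np 8 ./reduce_demo                       */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)(rank + 1);   /* each rank owns one value */

    /* Variant 1: point-to-point. Every non-root rank sends its value to
     * rank 0, which receives and accumulates the values one by one.     */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    if (rank == 0) {
        double sum = local, incoming;
        for (int src = 1; src < size; ++src) {
            MPI_Recv(&incoming, 1, MPI_DOUBLE, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += incoming;
        }
        printf("point-to-point sum: %f (%.6f s)\n", sum, MPI_Wtime() - t0);
    } else {
        MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    /* Variant 2: collective. The same reduction expressed as one call,
     * letting the MPI implementation pick the communication pattern.    */
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("collective sum:     %f (%.6f s)\n", total, MPI_Wtime() - t1);

    MPI_Finalize();
    return 0;
}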