I'm trying to implement an SSE version of large matrix-by-matrix multiplication.
I'm looking for an efficient SIMD-based algorithm.
My desired method looks like:
A(n x m) * B(m x k) = C(n x k)
All matrices are 16-byte aligned float arrays.
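To make the target concrete, here is a minimal, unoptimized sketch of that operation with SSE intrinsics. It assumes row-major storage and that k is a multiple of 4 (neither is stated above), and the name `matmul_sse` is only illustrative. Treat it as a baseline to improve on, not as an efficient implementation.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_set1_ps, ... */

/* C (n x k) = A (n x m) * B (m x k), all row-major.
 * Assumptions (not from the question): k is a multiple of 4 and the
 * base pointers of B and C are 16-byte aligned, so every group of four
 * floats in a row can be accessed with aligned loads and stores. */
void matmul_sse(const float *A, const float *B, float *C,
                size_t n, size_t m, size_t k)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < k; j += 4) {
            /* Accumulates C[i][j..j+3]. */
            __m128 acc = _mm_setzero_ps();
            for (size_t p = 0; p < m; ++p) {
                /* Broadcast A[i][p] to all four lanes. */
                __m128 a = _mm_set1_ps(A[i * m + p]);
                /* Load B[p][j..j+3] (aligned). */
                __m128 b = _mm_load_ps(&B[p * k + j]);
                acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
            }
            _mm_store_ps(&C[i * k + j], acc);
        }
    }
}
```

This version streams through all of B once per row of A, so for large matrices it is bound by memory traffic. A faster version would probably tile the loops for cache and compute several output vectors per inner iteration to reuse loaded values.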
I searched the net and found some articles describing 8x8 multiplication and even smaller sizes. I really need it to be as efficient as possible, and I don't want to use Eigen or similar libraries. (To be more specific: SSE3 only.)
So I'd appreciate it if anyone could help me find articles or resources on how to start implementing this.