A matrix transformation (assuming you are just multiplying by a matrix) is really not that expensive. If the matrix has known coefficients, it is computationally in the same league as a linear interpolation (which requires one scalar multiplication of a vector and one vector addition).
But for the interpolation you have to know the direction vector, and if you don't know it, you have to calculate it first.
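To make the cost of the interpolation concrete, here is a minimal sketch in C. The `vec3` type and the `lerp3` name are made up for illustration, and it assumes the direction vector is not known in advance, so it pays for the extra subtraction.

```c
/* Hypothetical 3-component vector type, used only for this sketch. */
typedef struct { float x, y, z; } vec3;

/* Linear interpolation: a + t * (b - a).
   Cost: 3 subtractions for the direction (skipped if it is already known),
   then 3 multiplications and 3 additions. */
static vec3 lerp3(vec3 a, vec3 b, float t)
{
    /* Direction vector from a to b. */
    vec3 d = { b.x - a.x, b.y - a.y, b.z - a.z };

    /* Scalar multiplication followed by vector addition. */
    vec3 r = { a.x + t * d.x, a.y + t * d.y, a.z + t * d.z };
    return r;
}
```

If the direction is fixed across many calls, you would compute `d` once outside and only pay for the multiply and add.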
Multiplying a 3x3 matrix by a 3-component vector takes three dot products of length 3 (one for each row of the resulting column vector), so 9 multiplications and 6 additions in total. If you want, you can code up a very fast assembler routine in SSE2 that keeps the clock cycle count as low as possible, or you can get your GPU to do it out of fast shared memory, with its many cores transforming many vertices in parallel.
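As a plain scalar baseline, the three dot products look roughly like this in C. The function name and the row-major flat array layout are assumptions chosen for the sketch, not anything prescribed above.

```c
/* out = M * v for a 3x3 matrix stored row-major in a flat array of 9 floats.
   Each row of M is dotted with v to produce one component of the result:
   3 dot products, 9 multiplications and 6 additions in total.
   out must not alias v, since v is read after out is partially written. */
static void mat3_mul_vec3(const float m[9], const float v[3], float out[3])
{
    for (int row = 0; row < 3; ++row) {
        out[row] = m[3 * row + 0] * v[0]
                 + m[3 * row + 1] * v[1]
                 + m[3 * row + 2] * v[2];
    }
}
```

This is the version to benchmark against before reaching for assembler: a decent compiler will already unroll the loop at `-O2`.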
If this is for a game engine, then you really should be doing as much as you can on the GPU anyway. But if you really need to do this in software, get an optimized assembler routine, check which SIMD extensions and register widths your target CPUs support, and create separate optimized routines for the different architectures. You should see an improvement.
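A rough sketch of the "separate routines per architecture" idea, using compiler intrinsics instead of hand-written assembler. It assumes a column-major matrix with each column padded to 4 floats (a layout picked here because it suits 128-bit registers), and it assumes the compiler defines `__SSE2__` when SSE2 is enabled, as GCC and Clang do; otherwise it falls back to scalar code. The function name is hypothetical.

```c
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 header; also provides the SSE float intrinsics used here */
#endif

/* cols: 3 columns of the matrix, each padded to 4 floats (12 floats total).
   v:    3-component input vector.
   out:  4 floats; the first 3 hold M * v, the 4th is padding. */
static void mat3_transform(const float cols[12], const float v[3], float out[4])
{
#if defined(__SSE2__)
    /* Load each padded column into one 128-bit register. */
    __m128 c0 = _mm_loadu_ps(cols + 0);
    __m128 c1 = _mm_loadu_ps(cols + 4);
    __m128 c2 = _mm_loadu_ps(cols + 8);

    /* result = c0 * v.x + c1 * v.y + c2 * v.z, all rows at once. */
    __m128 r = _mm_mul_ps(c0, _mm_set1_ps(v[0]));
    r = _mm_add_ps(r, _mm_mul_ps(c1, _mm_set1_ps(v[1])));
    r = _mm_add_ps(r, _mm_mul_ps(c2, _mm_set1_ps(v[2])));
    _mm_storeu_ps(out, r);
#else
    /* Portable fallback with the same result layout. */
    for (int i = 0; i < 4; ++i) {
        out[i] = cols[i] * v[0] + cols[4 + i] * v[1] + cols[8 + i] * v[2];
    }
#endif
}
```

The column-major form avoids horizontal adds, which tend to be the slow part of a row-by-row dot product in SIMD. Whether it actually beats the compiler's own output is something to measure on your target hardware.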
A final option is to use library code that already does this, or to rely on an optimizing compiler that can generate code as close to hand-written optimal code as possible.