The first instance of 8-way unrolling

    loop:
        acc += A
        acc += B ...

was inefficient because each add instruction depends on the result of the one before it. The CPU has to stall repeatedly, waiting for the previous result in acc before it can start the next add.

The correct way to do this is to use 8 independent accumulators (or however many match the loop unrolling depth) and to sum them together once at the end. With no dependency between consecutive adds, the processor's pipelines stay full.

    loop:
        acc1 += A
        acc2 += B ...

    acc = acc1 + acc2 + ...
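As a rough C sketch of the pseudocode above: the function name `sum_multi_acc`, the `float` element type, and the unroll depth of 4 are all assumptions made for illustration, not taken from the article. Floating point is used because the compiler is generally not allowed to reorder float additions on its own, so the split into accumulators has to be written by hand.

```c
#include <stddef.h>

/* Hypothetical helper: sum n floats with four independent accumulators. */
float sum_multi_acc(const float *data, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    /* Each add depends only on its own accumulator, so the four adds
       in one iteration do not have to wait for each other. */
    for (; i + 4 <= n; i += 4) {
        acc0 += data[i];
        acc1 += data[i + 1];
        acc2 += data[i + 2];
        acc3 += data[i + 3];
    }

    /* Combine the partial sums once, after the loop. */
    float acc = (acc0 + acc1) + (acc2 + acc3);

    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        acc += data[i];

    return acc;
}
```

Because the additions happen in a different order than in the single-accumulator loop, the floating-point result can differ in the last bits.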

The author's SIMD version, which does use multiple variables, is better still. However, intrinsics would have been far more readable.
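To give an idea of what an intrinsics version could look like, here is a minimal sketch using SSE. This is not the author's code: the name `sum_sse`, the `float` element type, and the choice of two vector accumulators are assumptions. The `#if` guard falls back to a plain scalar loop on targets without SSE.

```c
#include <stddef.h>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE 1
#endif

/* Hypothetical helper: sum n floats with two 4-wide vector accumulators. */
float sum_sse(const float *data, size_t n)
{
    size_t i = 0;
    float acc = 0.0f;
#ifdef HAVE_SSE
    __m128 v0 = _mm_setzero_ps();
    __m128 v1 = _mm_setzero_ps();

    /* Two independent vector adds per iteration, 8 floats in total. */
    for (; i + 8 <= n; i += 8) {
        v0 = _mm_add_ps(v0, _mm_loadu_ps(data + i));
        v1 = _mm_add_ps(v1, _mm_loadu_ps(data + i + 4));
    }

    /* Reduce the eight partial sums to a single scalar. */
    float lanes[4];
    _mm_storeu_ps(lanes, _mm_add_ps(v0, v1));
    acc = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif
    /* Scalar tail (or the whole array when SSE is unavailable). */
    for (; i < n; i++)
        acc += data[i];
    return acc;
}
```

Each `__m128` already holds four partial sums, so two of them give the same 8 independent accumulators as the scalar version.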

For further speed improvements, streaming intrinsics could be useful, since the data is read only once and never written. OpenMP for multithreading would also be a good fit here.
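The OpenMP part might look roughly like the sketch below (the streaming loads are not shown). The name `sum_parallel` and the `float` element type are assumptions. The `reduction(+:acc)` clause gives every thread its own private accumulator and adds them together at the end, which is the same idea as the multiple accumulators above, just across cores.

```c
#include <stddef.h>

/* Hypothetical helper: sum n floats across threads with OpenMP. */
float sum_parallel(const float *data, size_t n)
{
    float acc = 0.0f;

    /* Each thread sums its share of the array into a private copy of
       acc; OpenMP combines the copies when the loop finishes. */
    #pragma omp parallel for reduction(+:acc)
    for (ptrdiff_t i = 0; i < (ptrdiff_t)n; i++)
        acc += data[i];

    return acc;
}
```

This needs an OpenMP-enabled build (for example `-fopenmp` with GCC or Clang). Without it the pragma is ignored and the loop simply runs on a single thread.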