If you are interested in how those tensor methods work, the authors have another very nice paper that gives a great introduction: https://arxiv.org/abs/1810.08671
It is actually not that different from Strassen after all. It just uses a bit more notation to help construct more clever ways of saving multiplication.
I’m more interested in why saving multiplications at the cost of more additions/subtractions helps overall.
On all modern CPUs, floating-point multiplication and addition run at the same speed. Specifically, both have 4 cycles of latency on Skylake and 3 cycles of latency on Zen 2, and the throughput (for 32-byte vectors) is 0.5 cycles per instruction on both CPUs.
AFAIK the small improvements to asymptotic complexity achieved in the last decade or so don't really mean better practical performance due to higher constant factors.
That probably isn't directly the point either. Perhaps new methods could lead to practical performance improvements in the long run, but my impression is that the recent improvements have been more of theoretical than practical interest.
Even the asymptotic improvements are getting rather minuscule, though.
Right, the recent developments are interesting mainly because they're exploring the limits of the current method. Hopefully this will eventually inspire a sufficiently new approach to get substantial gains.
If you look at Strassen's algorithm [1], you see that it splits each matrix into 4 blocks. Naively, you can do 8 multiplications of those blocks, plus 4 block additions, and recover the full matrix product.
If T(n) is the number of operations (additions as well as multiplications) for an n×n matrix multiplication, we get the recurrence T(n) = 8T(n/2) + 4(n/2)^2.
If we solve this, we end up with T(n) ~ n^3, which is of course the time complexity of naive matrix multiplication.
However, with Strassen you end up with 7 smaller matrix multiplications and 18 matrix additions/subtractions. The recurrence now is T(n) = 7T(n/2) + 18(n/2)^2.
We end up with T(n) ~ n^(log2 7) ≈ n^2.81.
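A quick Python sketch to check both recurrences numerically (the function name is my own): compute T(n) directly from each recurrence with T(1) = 1, then estimate the growth exponent as log2(T(2n)/T(n)).

```python
from math import log2

def solve(num_mults, num_adds, n):
    """T(n) = num_mults * T(n/2) + num_adds * (n/2)^2, with T(1) = 1.

    num_mults: block multiplications per step (8 naive, 7 Strassen)
    num_adds:  block additions per step (4 naive, 18 Strassen)
    n must be a power of two.
    """
    if n == 1:
        return 1
    half = n // 2
    return num_mults * solve(num_mults, num_adds, half) + num_adds * half ** 2

for label, m, a in [("naive (8 mults, 4 adds)", 8, 4),
                    ("Strassen (7 mults, 18 adds)", 7, 18)]:
    t1, t2 = solve(m, a, 512), solve(m, a, 1024)
    # exponent estimate: ~3.00 for naive, ~2.81 (= log2 7) for Strassen
    print(f"{label}: exponent ≈ {log2(t2 / t1):.2f}")
```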
We see that when we save _matrix_ multiplications we eventually end up saving both elementary multiplications _and_ additions.
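To make the 7-multiplication trick concrete, here is a minimal sketch of Strassen's algorithm (assuming NumPy and square matrices whose size is a power of two; the `cutoff` parameter, below which we fall back to the ordinary product, is my own addition):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply square matrices via Strassen's 7 recursive products."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B  # fall back to the ordinary product for small blocks
    k = n // 2
    A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
    B11, B12, B21, B22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]
    # 7 block multiplications instead of the naive 8
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    # recombining the result costs only additions/subtractions
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])
```

In practice the cutoff matters: recursing all the way down to 1×1 blocks drowns the asymptotic saving in overhead, so real implementations switch to a tuned classical kernel well before that.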
My guess: in general, software multiplication is usually costlier than addition. It pays off if you go divide-and-conquer and your items are themselves matrices, or arbitrary-precision integers, or fractions, or elements of any ring, really.
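The arbitrary-precision-integer case is exactly Karatsuba's trick: trade one half-size multiplication for a few extra additions/subtractions. A rough sketch for non-negative integers (splitting by decimal digits for readability; real implementations split by machine words):

```python
def karatsuba(x, y):
    """Multiply non-negative integers with 3 half-size multiplications
    instead of 4, at the cost of extra additions/subtractions."""
    if x < 10 or y < 10:
        return x * y  # base case: single-digit operand
    m = max(len(str(x)), len(str(y))) // 2
    hi_x, lo_x = divmod(x, 10 ** m)
    hi_y, lo_y = divmod(y, 10 ** m)
    z0 = karatsuba(lo_x, lo_y)
    z2 = karatsuba(hi_x, hi_y)
    # the middle term needs only one more multiplication
    z1 = karatsuba(lo_x + hi_x, lo_y + hi_y) - z0 - z2
    return z2 * 10 ** (2 * m) + z1 * 10 ** m + z0
```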