Yes, it also applies to server architectures derived from Skylake (aka SKX), and I've tested it there. However, it is probably less useful there, since all current Skylake server architectures support AVX-512, which lets you load an entire 64-byte cache line with a single load. SKX can do these at a rate of about 1 per cycle, so you can already do better than the technique described here simply by using AVX-512.
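For what it's worth, here is a minimal sketch in C of what "load the whole line in a single load" could look like. It assumes GCC or Clang on x86-64, and the function names (`sum_line`, `sum_line_avx512`, `sum_line_scalar`) are made up for illustration; they are not from the wiki page. It only shows the 64-byte load with a runtime fallback. It is not a benchmark and says nothing about the achieved load rate.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: sum the sixteen 32-bit words of one 64-byte line
 * using a single 512-bit load. The target attribute (GCC/Clang) lets this
 * one function use AVX-512F without compiling the whole file with it. */
__attribute__((target("avx512f")))
static uint32_t sum_line_avx512(const void *line) {
    __m512i v = _mm512_loadu_si512(line);        /* one 64-byte load */
    return (uint32_t)_mm512_reduce_add_epi32(v); /* horizontal add of 16 lanes */
}

/* Portable fallback: the same sum using sixteen 4-byte reads. */
static uint32_t sum_line_scalar(const void *line) {
    uint32_t words[16];
    uint32_t sum = 0;
    memcpy(words, line, sizeof words);
    for (int i = 0; i < 16; i++) {
        sum += words[i];
    }
    return sum;
}

/* Pick the AVX-512 path only if the CPU reports support at run time. */
uint32_t sum_line(const void *line) {
    assert(line != NULL);
    if (__builtin_cpu_supports("avx512f")) {
        return sum_line_avx512(line);
    }
    return sum_line_scalar(line);
}
```

The unaligned load is used so the sketch works on any 64-byte buffer; to touch exactly one cache line per load, the buffer would need to be 64-byte aligned.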
As I mentioned near the end of the wiki page, this might still be useful in scenarios where you don't want to use AVX-512 for some reason.
One reason to avoid copious AVX-512 instructions is that using them heavily is guaranteed to cause the processor to reduce its clock rate. See the Optimization manual for the crossover points when workloads make heavy (or even medium) use of AVX-512 (or, in some cases, AVX2).