Hacker News

Wow. This sounds a lot like the ResNet moment for Transformers.


Interestingly, Transformers have always used residual connections. The new part here is that they drop the standard residual setup: they normalize after every layer (including over the residual connection) and put a scaling factor on the residual branch. So you no longer have a direct identity pathway.
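A minimal sketch of the difference, assuming the description above (the names `alpha` and `sublayer` are mine, not from the paper):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the last axis to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def standard_residual(x, sublayer):
    # Usual Transformer block: the identity pathway survives untouched.
    return x + sublayer(x)

def scaled_postnorm_residual(x, sublayer, alpha=2.0):
    # Variant described above: scale the residual branch, then normalize
    # the sum -- the normalization sits on the residual path too, so there
    # is no direct identity pathway from input to output.
    return layer_norm(alpha * x + sublayer(x))
```

Note that in the second variant the output statistics are fixed by the final norm regardless of how large `alpha * x` grows, which is the point: the residual signal is rescaled at every layer instead of accumulating unchecked.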

The argument is that otherwise the gradient magnitude at the lower layers becomes too large, which intuitively makes sense: because of the residual connections, all error signals from the upper layers accumulate at the lower layers.

I wonder why this is apparently not a problem for ResNet.


This also moves it closer to a "temporal" integration: the factor on the residual branch can be read as the timestep in an Euler discretization of an ODE.
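To make the Euler reading concrete: the residual update x ← x + h·f(x) is exactly one explicit-Euler step of dx/dt = f(x), so stacking many residual layers with a small factor approximates integrating the ODE. A toy sketch (names are mine):

```python
import numpy as np

def euler_residual_steps(x0, f, h, n):
    # Repeatedly apply the residual-style update x <- x + h * f(x),
    # i.e. n explicit-Euler steps of dx/dt = f(x) with timestep h.
    x = x0
    for _ in range(n):
        x = x + h * f(x)
    return x

# dx/dt = -x has exact solution x0 * exp(-t); with h small and
# n*h = t, the stacked residual updates approach it.
```

With h = 0.001 and n = 1000 steps the result is close to exp(-1) ≈ 0.3679, and the error shrinks as h does.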


I hope so, but I'm not convinced. The experimental results don't persuade me that deep transformers are better than the existing extremely wide transformers with tons of parameters. Yes, they show a much better result on the multilingual benchmark (M2M-100), but that's not a super well studied problem, so I don't know that the baseline they're beating is very well optimized. And on OPUS-100 they only beat the baseline by building models with vastly more parameters, which shouldn't really surprise anybody.


What’s that?


Residual neural networks [1] employ skip connections between layers. These are thought to prevent gradient decay (vanishing gradients) in deep networks. It's a very important advancement.

[1] https://www.wikiwand.com/en/Residual_neural_network
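A toy illustration of why the skip connection helps (my own sketch, not from any particular paper): stacking squashing layers multiplies many small derivatives together, while the identity term in a residual layer keeps each factor at least 1.

```python
import numpy as np

def plain_layer(x, w):
    # A squashing layer; its derivative w * sech^2(w*x) is at most w,
    # so composing many of these shrinks the gradient geometrically.
    return np.tanh(w * x)

def residual_layer(x, w):
    # Skip connection: d(out)/dx = 1 + w * sech^2(w*x) >= 1,
    # so the gradient cannot vanish through the stack.
    return x + np.tanh(w * x)

def numeric_grad(f, x, eps=1e-6):
    # Central finite difference, enough for this 1-D demo.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def stack(layer, n, w=0.5):
    def f(x):
        for _ in range(n):
            x = layer(x, w)
        return x
    return f
```

Through 20 plain layers the gradient at x = 2 is below 0.5^20 (essentially zero), while through 20 residual layers it stays at or above 1.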


Resnet was when we figured out how to throw a fuckton of perceptrons together and make it actually work. We didn't think networks scaled before that.


iirc that was AlexNet; ResNet came three years later, and its refinements pushed accuracy past human level on ImageNet image classification.


I think the argument is that before ResNets, the depth of the network was constrained and could not be scaled easily. With ResNets (and of course Highway Nets; hi Jürgen!) the depth is just another hyperparam.


In between those, VGG was parameterized by depth, though only over a narrow range (16-19 layers).



