
nyrikki, last Tuesday at 8:28 PM

The original paper on the bias-variance tradeoff, which the double descent papers targeted, had some specific constraints:

1) Data availability and compute limited training set sizes. 2) They could simulate effectively infinite datasets.
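To make that second constraint concrete, here is a minimal sketch (my own toy setup, not the original paper's) of estimating bias and variance by repeatedly simulating fresh training sets from a known generator, which is only practical when data is effectively free:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(2 * np.pi * x)

def draw_training_set(n=30, noise=0.3):
    # Pretend we own the data-generating process, so fresh training sets are free.
    x = rng.uniform(0, 1, n)
    return x, true_fn(x) + rng.normal(0, noise, n)

x_test = np.linspace(0, 1, 200)

def bias_variance(degree, n_sims=500):
    # Fit the same model class to many independently simulated training sets,
    # then decompose its error at fixed test inputs into bias^2 and variance.
    preds = np.empty((n_sims, x_test.size))
    for i in range(n_sims):
        x, y = draw_training_set()
        preds[i] = np.polynomial.Polynomial.fit(x, y, degree)(x_test)
    bias_sq = np.mean((preds.mean(axis=0) - true_fn(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for degree in (1, 3, 9):
    b, v = bias_variance(degree)
    print(f"degree {degree}: bias^2 = {b:.3f}, variance = {v:.3f}")
```

Low-degree fits show high bias and low variance, high-degree fits the reverse, which is the tradeoff as originally framed.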

While hard to wrap your head around, training set sizes today make it highly likely that the patterns in your test set resemble concept classes already present in your training set.

This is very different from procedurally or randomly generated test sets, both of which can lead to problems like overfitting with overparameterized networks.

When similar patterns are likely to exist, the cost of some memorization goes down, and that memorization is actually somewhat helpful for generalization.
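As a toy illustration of that (my own construction, not from any of the papers): a 1-nearest-neighbour classifier is pure memorization, yet it generalizes fine when test points come from the same pattern distribution as training, and falls apart when the test set is generated from a shifted one.

```python
import numpy as np

rng = np.random.default_rng(1)

def labelled_sample(n, shift=0.0):
    # Two Gaussian "concept classes"; `shift` moves the generator away from
    # the patterns the training set covered.
    y = rng.integers(0, 2, n)
    x = rng.normal(loc=y[:, None] * 2.0 + shift, scale=0.7, size=(n, 2))
    return x, y

def one_nn_predict(x_train, y_train, x_query):
    # 1-nearest-neighbour: the purest form of memorization.
    d = np.linalg.norm(x_query[:, None, :] - x_train[None, :, :], axis=-1)
    return y_train[d.argmin(axis=1)]

x_tr, y_tr = labelled_sample(2000)
for shift, name in [(0.0, "test from same patterns"),
                    (4.0, "test from shifted patterns")]:
    x_te, y_te = labelled_sample(500, shift=shift)
    acc = (one_nn_predict(x_tr, y_tr, x_te) == y_te).mean()
    print(f"{name}: accuracy {acc:.2f}")  # ~0.95 in-distribution, ~0.5 shifted
```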

There are obviously more factors at play here, but go look at the double descent papers and their citations of early-90s papers and you will see this.
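If you want to see the curve itself, the usual toy from that literature is minimum-norm regression on random features. A rough sketch (exact numbers depend on the noise level and feature scaling) typically shows test error peaking near the interpolation threshold and falling again as the parameter count keeps growing:

```python
import numpy as np

rng = np.random.default_rng(2)

n_train, n_test, d = 100, 1000, 10
w_true = rng.normal(size=d)

def make_data(n):
    x = rng.normal(size=(n, d))
    return x, x @ w_true + 0.5 * rng.normal(size=n)

x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(n_test)

def random_relu_features(x, w):
    return np.maximum(x @ w, 0.0)

for n_feat in (10, 50, 90, 100, 110, 200, 1000, 5000):
    w = rng.normal(size=(d, n_feat)) / np.sqrt(d)
    phi_tr = random_relu_features(x_tr, w)
    phi_te = random_relu_features(x_te, w)
    # lstsq returns the minimum-norm solution in the overparameterized regime,
    # i.e. the interpolating fit with the smallest weights.
    coef, *_ = np.linalg.lstsq(phi_tr, y_tr, rcond=None)
    test_mse = np.mean((phi_te @ coef - y_te) ** 2)
    print(f"{n_feat:5d} random features: test MSE {test_mse:10.2f}")
```

The spike around n_feat ≈ n_train and the second descent past it are the shape those papers are describing.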

The low sensitivity of transformers also helps dramatically: UHAT without CoT has only the expressiveness of TC0, while a log-space scratch space gives PTIME expressibility.
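For intuition, "sensitivity" here is the Boolean-function notion: the expected number of single-bit flips that change the output. A quick empirical estimate (my own illustrative choice of functions) shows why majority counts as low sensitivity while parity is maximally sensitive:

```python
import numpy as np

rng = np.random.default_rng(3)

def avg_sensitivity(f, n_bits, n_samples=20_000):
    # Average sensitivity: expected number of coordinates whose flip
    # changes f(x), for x drawn uniformly at random.
    x = rng.integers(0, 2, size=(n_samples, n_bits))
    base = f(x)
    total = 0.0
    for i in range(n_bits):
        flipped = x.copy()
        flipped[:, i] ^= 1
        total += np.mean(f(flipped) != base)
    return total

parity = lambda x: x.sum(axis=1) % 2
majority = lambda x: (x.sum(axis=1) * 2 > x.shape[1]).astype(int)

n = 31
print("parity  :", avg_sensitivity(parity, n))    # = n: every flip matters
print("majority:", avg_sensitivity(majority, n))  # grows only like sqrt(n)
```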

You can also view this through the lens of autograd requiring a smooth manifold and the ability to approximate the global gradient, if that framing works better for you.
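A concrete, if oversimplified, way to see the smoothness requirement with any autograd library (PyTorch here): a smooth objective yields a gradient that points toward the minimum, while a piecewise-constant one has zero gradient almost everywhere, leaving gradient descent nothing to follow.

```python
import torch

# Smooth objective: autograd recovers a useful descent direction.
x = torch.tensor(2.0, requires_grad=True)
loss = (x - 0.5) ** 2
loss.backward()
print(x.grad)  # tensor(3.) -> points back toward the minimum at 0.5

# Piecewise-constant objective: the derivative is zero almost everywhere
# (PyTorch defines round's gradient as zero), so the local gradient carries
# no information about where the global minimum sits.
y = torch.tensor(2.0, requires_grad=True)
loss = (torch.round(y) - 0.5) ** 2
loss.backward()
print(y.grad)  # tensor(0.) -> no signal for gradient descent
```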

But yes, all intros have to simplify concepts, and there are open questions.