Hacker News

imtringued · today at 10:00 AM

Your list is so short it doesn't even include the basics such as reordering operations.

It also feels incredibly snarky to say "they knew about caching" and that they will get to partial evaluation and dead code elimination, when those seem to be particularly useless (beyond what the CUDA compiler itself does) when it comes to writing GPU kernels or doing machine learning in general.

You can't do any partial evaluation of a neural network, because the activation functions interrupt the multiplication of tensors. If you remove the activation function, you end up with two linear layers that are equivalent to a single linear layer, defeating the point of the idea. You could have trained a network with a single layer instead and achieved the same accuracy with correspondingly shorter training/inference time.
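For what it's worth, that collapse is easy to check in a few lines of numpy (my own sketch, not anything from the article; the names and shapes are arbitrary):

    # Two linear layers with no activation in between fold into a single
    # linear layer, so there is nothing for partial evaluation to
    # specialize away. Inserting a ReLU breaks the fold.
    import numpy as np

    rng = np.random.default_rng(0)
    x  = rng.standard_normal(8)        # input vector
    W1 = rng.standard_normal((16, 8))  # first linear layer
    W2 = rng.standard_normal((4, 16))  # second linear layer

    two_layers = W2 @ (W1 @ x)         # evaluate layer by layer
    one_layer  = (W2 @ W1) @ x         # pre-folded single layer
    print(np.allclose(two_layers, one_layer))   # True

    with_act = W2 @ np.maximum(W1 @ x, 0)       # add a ReLU in between
    print(np.allclose(with_act, one_layer))     # False in general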

Dead code elimination is even more useless, since most kernels are special-purpose to begin with and you can't remove tensors without altering the architecture. Instead of adding useless tensors only to remove them later, you could have simply used a better architecture.


Replies

torginus · today at 12:37 PM

I think you can. If you have a neuron whose input weights are 100, -1, 2, with a threshold of 0, you can know the output of the neuron whenever the first input is enabled, since the other two don't matter, so you can skip evaluating them.
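Something like this toy sketch of the short-circuiting I have in mind (my own illustration; it assumes inputs live in [0, 1] and the neuron is a hard threshold at 0):

    # Stop accumulating as soon as the terms we haven't looked at yet
    # can no longer flip the sign of the running sum.
    def thresholded_neuron(weights, inputs):
        remaining = sum(abs(w) for w in weights)
        acc = 0.0
        for w, x in zip(weights, inputs):
            remaining -= abs(w)
            acc += w * x
            if abs(acc) > remaining:   # later inputs can't change the outcome
                break
        return 1 if acc > 0 else 0

    # With weights 100, -1, 2 and the first input active, the loop stops
    # after one term, because |100| > |-1| + |2|.
    print(thresholded_neuron([100, -1, 2], [1.0, 0.7, 0.3]))   # prints 1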

I'm not enough of an expert to tell whether there's any actual merit to this idea, and whether skipping evaluation of huge parts of the network (while keeping track of those skips) is actually worth it, but it intuitively makes sense to me that making an omelette has nothing to do with the Battle of Hastings, so when making a query about the former, the neurons encoding the latter might not affect the output.

Afaik, there's already research into finding which network weights encode which concepts.

MoE is a somewhat cruder version of this technique.