Hacker News

fpgaminer · 01/20/2025 · 1 reply · view on HN

I think "reasoning" models will solve the joke issue (amongst other issues), but not because they're "reasoning". Rather because they help solve the exploration issue and the scaling issue.

Having worked with LLMs a lot for my JoyCaption project, I've got all these hypotheses floating around in my head. I guess the short version, specifically for jokes, is that we lack "joke reasoning" data. The solution, as with mathematical problems, is to get the LLM to generate that data itself and then use RL to refine it into more optimal solutions.

Longer explanation:

Imagine we want an LLM to correctly answer "How many r's are in the word strawberry?". And imagine that language has been tokenized, and thus we can form a "token space". The question is a point in that space, point Q. There is a set of valid points, set A, that encompasses _any_ correct answer to this question. There are thus paths through token space from point Q to the points contained by set A.

A Generator LLM's job is, given a point, to predict valid paths through token space. In fact, we can imagine the Generator starting at point Q and walking its way to (hopefully) some point in set A, through a myriad of in-between points. Functionally, we have the model predict next-token probabilities (and hence the next point in token space), and we can use those probabilities to walk the path.
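To make the path-walking mechanics concrete, here's a toy sketch. Everything in it is invented (the tiny vocabulary, the transition table); it just stands in for a real model's next-token head:

    import random

    # Toy stand-in for the Generator LLM's next-token predictions. The
    # vocabulary and probabilities are made up purely for illustration.
    def next_token_probs(path):
        table = {
            (): {"The": 1.0},
            ("The",): {"word": 1.0},
            ("The", "word"): {"strawberry": 1.0},
            ("The", "word", "strawberry"): {"has": 1.0},
            ("The", "word", "strawberry", "has"): {"3": 0.6, "2": 0.4},  # a critical junction
        }
        return table.get(tuple(path), {"r's": 0.7, "<eos>": 0.3})

    # Start at point Q (the empty path) and walk one token at a time,
    # sampling from the predicted probabilities at each step.
    def walk_path(max_len=10):
        path = []
        while len(path) < max_len:
            probs = next_token_probs(path)
            token = random.choices(list(probs), weights=list(probs.values()))[0]
            if token == "<eos>":
                break
            path.append(token)
        return path

    print(walk_path())  # one sampled path; set A = every path that ends up answering "3"

Pretraining is what fills in next_token_probs; the RL described below is about which of those sampled walks get reinforced.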

An Ideal Generator would output _all_ valid paths from point Q to set A. A Generator LLM is a lossy compression of that ideal model, so in reality the set of paths the Generator LLM will output might encompass some of those valid paths, but it might also encompass invalid paths.

One more important thing about these paths. Imagine that there is some critical junction. A specific point where, if the Generator goes "left", it goes into a beautiful flat, grassy plain where the sun is shining. That area is really easy to navigate, and the Generator LLM's predictions are all correct. Yay! But if it goes "right" it ends up in the Fire Swamp with many dangers that it is not equipped to handle. i.e. it isn't "smart" enough in that terrain and will frequently predict invalid paths.

Pretraining already taught the Generator LLM to avoid invalid paths to the best of its abilities, but again its abilities are limited.

To fix this, we use RL. A Judge LLM takes a completed path and determines whether it landed in set A or not. With an RL algorithm and that reward signal, we can train the Generator LLM to avoid the Fire Swamp, since it often gets low rewards there, and instead go to the Plain, where it often gets high rewards.

This results in a Generator LLM that is more _reliable_ and thus more useful. The RL encourages it to walk paths it's good at and capable of, avoid paths it struggles with, and of course encourages valid answers whenever possible.
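If it helps, here's that Judge/Generator loop shrunk down to a single junction: go "left" into the Plain or "right" into the Fire Swamp. The "policy" is a two-entry preference table instead of an LLM, and the update is a textbook gradient-bandit step rather than whatever the labs actually run, so treat it as a cartoon of the idea:

    import math
    import random

    prefs = {"left": 0.0, "right": 0.0}  # the Generator's preferences at the junction
    LR = 0.1

    def policy():
        # softmax over preferences -> probability of taking each branch
        z = {a: math.exp(p) for a, p in prefs.items()}
        total = sum(z.values())
        return {a: v / total for a, v in z.items()}

    def judge(action):
        # Stand-in Judge LLM: reward 1 if the completed path landed in set A.
        # The Plain is easy terrain; the Fire Swamp mostly yields invalid paths.
        p_correct = 0.9 if action == "left" else 0.2
        return 1.0 if random.random() < p_correct else 0.0

    avg_reward = 0.0  # running baseline to compare rewards against
    for step in range(1, 5001):
        pi = policy()
        action = random.choices(list(pi), weights=list(pi.values()))[0]
        reward = judge(action)
        avg_reward += (reward - avg_reward) / step
        # reinforce the sampled branch relative to the baseline
        for a in prefs:
            grad = (1.0 if a == action else 0.0) - pi[a]
            prefs[a] += LR * (reward - avg_reward) * grad

    print(policy())  # "left" (the Plain) ends up with nearly all the probability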

But what if the Generator LLM needs to solve a really hard problem? It gets set down at point Q, and explores the space based on its pretraining. But that pretraining _always_ takes it through a mountain and it never succeeds. During RL the model never really learns a good path, so these cases tend to manifest as hallucinations or vapid responses that "look" correct.

Yet there are very easy, long paths _around_ the mountain that get to set A. Those don't get reinforced because they never get explored. They never get explored because those paths weren't in the pretraining data, or are so rare that it would take an impractical amount of exploration for the PT model to output them.

Reasoning is one of those long, easy paths. Digestible small steps that a limited Generator LLM can handle and use to walk around the mountain. Those "reasoning" paths were always there, and were predicted by the Ideal Generator, but were not explored by our current models.

So "reasoning" research is fundamentally about expanding the exploration of the pretrained LLM. The judge gets tweaked slightly to encourage the LLM to explore those kinds of pathways, and/or the LLM gets SFT'd with reasoning data (which is very uncommon in its PT dataset).

I think this breakdown and stepping back is important so that we can see what we're really trying to do here: get a limited Generator LLM to find its way around areas it can't climb. It is likely true that there is _always_ some path from a given point Q to set A that a limited Generator LLM can safely traverse, even if those paths are very long.

It's not easy for researchers to know what paths the LLM can safely travel. So we can't just look at Q and A and build a nice dataset for it. It needs to generate the paths itself. And thus we arrive at Reasoning.

Reasoning allows us to take a limited, pretrained LLM, and turn it into a little path finding robot. Early during RL it will find really convoluted paths to the solution, but it _will_ find a solution, and once it does it gets a reward and, hopefully, as training progresses, it learns to find better and shorter paths that it can still navigate safely.
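One hypothetical way to express that "find any solution first, then prefer shorter ones" pressure is to bake a small length penalty into the Judge's reward. This is just an illustration, not anyone's actual recipe:

    def shaped_reward(landed_in_set_a, path_len, len_penalty=0.001):
        # Hypothetical reward shaping: any valid path gets rewarded, but a
        # small per-token penalty nudges the model toward shorter paths it
        # can still navigate safely.
        if not landed_in_set_a:
            return 0.0
        return 1.0 - len_penalty * path_len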

But the "reasoning" component is somewhat tangential. It's one approach, probably a very good approach. There are probably other approaches. We just want the best ways to increase exploration efficiently. And we're at the point where existing written data doesn't cover it, so we need to come up with various hacks to get the LLM to do it itself.

The same applies to jokes. Comedians don't really write down every single thought in their head as they come up with jokes. If we had that, we could SFT existing LLMs to get to a working solution TODAY, and then RL them into something optimal. But as it stands, PT LLMs aren't capable of _exploring_ the joke space, which means they never come out of the RL process with humor.

Addendum:

Final food for thought. There's kind of this debate going on about "inference scaling", with some believing that CoT, ToT, Reasoning, etc are all essentially just inference scaling. More output gives the model more compute so it can make better predictions. It's likely true that that's the case. In fact, if it _isn't_ the case we need to take a serious look at our training pipelines. But I think it's _also_ about exploring during RL. The extra tokens might give it a boost, sure, but the ability for the model to find more valid paths during RL enables it to express more of its capabilities and solve more problems. If the model is faced with a sheer cliff face, it doesn't really matter how much inference compute you throw at it. Only the ability to walk around the cliff will help.

And, yeah, this all sounds very much like ... gradient descent :P and yes there have been papers on that connection. It very much seems like we're building a second layer of the same stuff here and it's going to be AdamW all the way down.


Replies

kridsdale1 · 01/21/2025

I’m on my phone so I can’t give this a proper response but I want to say that your mental intuition about the latent space algorithms is excellent and has improved my thinking. I haven’t seen much writing applying pathfinding (what we used to call AI, in the Half Life days) terminology to this. Your ideal generator sounds like letting A* run on all nodes in a grid and not exiting when the first path is found.

Mountains and cliffs are a good way to describe the terrain of the weight topology in hyper-dimensional space, even though they're really terms for a 2D matrix.