Disclaimer: I am very well aware this is not a valid test or indicative of anything else. I just thought it was hilarious.
When I asked the usual "How many 'r's in strawberry" question, it gets the right answer, then argues with itself until it convinces itself that it's 2. It counts properly, and then keeps telling itself that can't be right.
https://gist.github.com/IAmStoxe/1a1e010649d514a45bb86284b98...
It's funny because this simple exercise shows all the problems I have using the reasoning models: they produce a long chain of reasoning that takes too much time to verify and still can't be trusted.
DeepSeek-R1-Distill-Qwen-32B-Q6_K_L.gguf solved this:
In which of the following Incertae sedis families does the letter `a` appear the most number of times?
```
Alphasatellitidae
Ampullaviridae
Anelloviridae
Avsunviroidae
Bartogtaviriformidae
Bicaudaviridae
Brachygtaviriformidae
Clavaviridae
Fuselloviridae
Globuloviridae
Guttaviridae
Halspiviridae
Itzamnaviridae
Ovaliviridae
Plasmaviridae
Polydnaviriformidae
Portogloboviridae
Pospiviroidae
Rhodogtaviriformidae
Spiraviridae
Thaspiviridae
Tolecusatellitidae
```
Please respond with the name of the family in which the letter `a` occurs most frequently
https://pastebin.com/raw/cSRBE2Zy
I used temp 0.2, top_k 20, min_p 0.07
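For anyone who wants to check the answer without trusting the model's reasoning, here's a quick throwaway sketch. It counts case-insensitively, which is an assumption on my part since the prompt only says "the letter `a`":

```python
# Count occurrences of 'a' (case-insensitive) in each family name
# and print them sorted, most 'a's first.
families = """Alphasatellitidae Ampullaviridae Anelloviridae Avsunviroidae
Bartogtaviriformidae Bicaudaviridae Brachygtaviriformidae Clavaviridae
Fuselloviridae Globuloviridae Guttaviridae Halspiviridae Itzamnaviridae
Ovaliviridae Plasmaviridae Polydnaviriformidae Portogloboviridae
Pospiviroidae Rhodogtaviriformidae Spiraviridae Thaspiviridae
Tolecusatellitidae""".split()

for name in sorted(families, key=lambda n: n.lower().count("a"), reverse=True):
    print(name, name.lower().count("a"))
```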
I wonder if the reason the models have trouble with this is that their tokens aren't the same as our characters. It's like asking someone who can speak English (but doesn't know how to read) how many R's there are in strawberry. They are fluent in English audio tokens, but not written tokens.
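You can see the effect with a tokenizer library. A minimal sketch using tiktoken's cl100k_base encoding (an assumption for illustration; it's not necessarily the encoding DeepSeek uses, and the exact split varies by tokenizer):

```python
# Show how a BPE tokenizer splits "strawberry" into sub-word chunks,
# so the model never directly "sees" individual letters.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print(tokens)                             # token ids
print([enc.decode([t]) for t in tokens])  # sub-word pieces, e.g. ['str', 'aw', 'berry']
```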
This was my first prompt after downloading too, and I got the same thing. Just spinning again and again on its gut instinct that there must be 2 R's in strawberry, despite the counting always being correct. It just won't accept that the word is spelled that way and that its logic is correct.
I think it's great that you can see the actual chain of thought behind the model, not just the censored one from OpenAI.
It strikes me that it's both so far from getting it correct and also so close. I'm not an expert, but it feels like it could be just an iteration away from being able to reason through a problem like this. Which, if true, is an amazing step forward.
I tried this via the chat website and it got it right, though strongly doubted itself. Maybe the specific wording of the prompt matters a lot here?
https://gist.github.com/gsuuon/c8746333820696a35a52f2f9ee6a7...
lol what a chaotic read that is, hilarious. Just keeps refusing to believe there's three. WAIT, THAT CAN'T BE RIGHT!
How long until we get to the point where models know that LLMs get this wrong, know that they are LLMs, and therefore answer wrong on purpose? Has this already happened?
(I doubt it has, but there ARE already cases where models know they are LLMs, and therefore make the plausible but wrong assumption that they are ChatGPT.)
I tend to avoid that one because of the tokenization aspect. This popular one is a bit better:
"Alice has N brothers and she also has M sisters. How many sisters does Alice's brother have?"
The 7b one messed it up first try:
>Each of Alice's brothers has \(\boxed{M-1}\) sisters.
Trying again:
>Each of Alice's brothers has \(\boxed{M}\) sisters.
Also wrong. Again:
>\[ \boxed{M + 1} \]
Finally a right answer, took a few attempts though.
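For reference, the intended answer is M + 1, since a brother's sisters are Alice's M sisters plus Alice herself. A brute-force sanity check, written as a throwaway sketch:

```python
# Enumerate a concrete family and count sisters from a brother's point of view.
def sisters_of_a_brother(n_brothers: int, m_sisters: int) -> int:
    # Alice's siblings: n_brothers boys and m_sisters girls, plus Alice herself.
    girls = m_sisters + 1  # Alice counts as a girl in the family
    # A brother's sisters are all the girls in the family.
    return girls

for n, m in [(1, 1), (2, 3), (4, 0)]:
    assert sisters_of_a_brother(n, m) == m + 1
    print(f"N={n}, M={m}: each brother has {sisters_of_a_brother(n, m)} sisters")
```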
I think there is an inherent weight given to intrinsic knowledge as opposed to the reasoning steps, so the intrinsic knowledge can override the reasoning.
Written out here: https://news.ycombinator.com/item?id=42773282
This is incredibly fascinating.
I feel like one round of RL could potentially fix "short circuits" like these. It seems to be convinced that a particular rule isn't "allowed," when it's totally fine. Wouldn't that mean that you just have to fine tune it a bit more on its reasoning path?
Just by asking it to validate its own reasoning it got it right somehow. https://gist.github.com/dadaphl/1551b5e1f1b063c7b7f6bb000740...
This is from a small model. 32B and 70B answer this correctly. "Arrowroot" too. Interestingly, 32B's "thinking" is a lot shorter and it seems to be more "sure". Could be because it's based on Qwen rather than LLaMA.
How would they build guardrails for this? In CFD and ML-based physical simulation, people talk about using physics-informed models instead of purely statistical ones. How would you make language models that are informed by formal rules and concepts of English?
If how we humans reason about things is any clue, language is not the right tool for reasoning about things.
There is now research in Large Concept Models to tackle this but I'm not literate enough to understand what that actually means...
This is great! I'm pretty sure it's because the training corpus has a bunch of "strawberry is spelled with two R's" type text and it's leaning on that.
Maybe the AI would be smarter if it could access some basic tools instead of doing it its own way.
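Even something as simple as exposing a counting function as a tool would sidestep the tokenization issue. A hypothetical sketch (the tool name and wiring are made up for illustration, not any particular framework's API):

```python
# A hypothetical "tool" an LLM could call instead of counting over sub-word tokens.
def count_letter(word: str, letter: str) -> int:
    """Return how many times `letter` occurs in `word` (case-insensitive)."""
    return word.lower().count(letter.lower())

# The model would emit a call like count_letter("strawberry", "r") and
# get back a ground-truth answer instead of guessing from token patterns.
print(count_letter("strawberry", "r"))  # 3
```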
Love this interaction, mind if I repost your gist link elsewhere?
perhaps they need to forget once they've learnt reasoning... this is hilarious, thank you
omg lol "here we go, the first 'R'"
Ahhahah that's beautiful, I'm crying.
Skynet sends Terminator to eradicate humanity, the Terminator uses this as its internal reasoning engine... "instructions unclear, dick caught in ceiling fan"