Hacker News

simonw · 01/20/2025 · 23 replies

OK, these are a LOT of fun to play with. I've been trying out a quantized version of the Llama 3 one from here: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-...

The one I'm running is the 8.54GB file. I'm using Ollama like this:

    ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0
You can prompt it directly there, but I'm using my LLM tool and the llm-ollama plugin to run and log prompts against it. Once Ollama has loaded the model (from the above command), you can try those tools with uvx like this:

    uvx --with llm-ollama \
      llm -m 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0' \
      'a joke about a pelican and a walrus who run a tea room together'
Here's what I got - the joke itself is rubbish but the "thinking" section is fascinating: https://gist.github.com/simonw/f505ce733a435c8fc8fdf3448e381...

I also set an alias for the model like this:

    llm aliases set r1l 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0' 
Now I can run "llm -m r1l" (for R1 Llama) instead.
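So the same prompt from earlier can now be run as, for example:

    llm -m r1l 'a joke about a pelican and a walrus who run a tea room together'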

I wrote up my experiments so far on my blog: https://simonwillison.net/2025/Jan/20/deepseek-r1/


Replies

simonw · 01/20/2025

I got a quantized Llama 70B model working too. It uses most of my 64GB of RAM, but it's usable:

    ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF:Q3_K_M
That's a 34GB download. I'm accessing it via https://github.com/open-webui/open-webui which I ran like this:

    uvx --python 3.11 open-webui serve
I have Tailscale on my laptop and phone, so I can run experiments directly from my phone while leaving my laptop plugged in at home.

peeters · 01/21/2025

> Wait, maybe the punchline is something like: "We don’t have any fish in the tea, but we do have a lot of krill."

Shucks, it was so close to coming up with a good punchline it could work back from.

I'm picturing it as a single-panel comic: a downtrodden young man or woman sitting alone at a table, a pelican in the background clearly making drinks in its voluminous beak, and the walrus waiter placing a cup in front of the person, consolingly saying "there's plenty of fish in the tea".

HarHarVeryFunny · 01/20/2025

I think the problem is that humor isn't about reasoning and logic, but almost the reverse - it's about punchlines that surprise us (i.e. not what one would logically anticipate) and perhaps shock us by breaking taboos.

Even masters of humor like Seinfeld, with great intuition for what might work, still need to test new material in front of a live audience to see whether it actually does get a laugh or not.

momojo · 01/20/2025

> the joke itself is rubbish but the "thinking" section is fascinating:

This is gold. If I were a writer, I'd wring value from that entire thinking-out-loud section and toss the actual punchline.

This is weirdly reminiscent of co-programming with CodyAI. It gives me a lot of good 'raw material' and I'm left integrating the last mile stuff.

monkeydust · 01/20/2025

Thanks! I've been playing around with this vs. the https://ollama.com/tripplyons/r1-distill-qwen-7b variant and find the 7B to be somewhat of a sweet spot: it gets to the point with minimal (or at least less) waffle.

It's certainly interesting reading their thought processes; depending on the use case, the value in that might be greater than the answer itself.

wat10000 · 01/20/2025

This joke is so terrible, I think this might end up being how AI kills us all when it decides it needs us out of the way to make more paperclips.

widdershins · 01/20/2025

Yeesh, that shows a pretty comprehensive dearth of humour in the model. It did a decent examination of characteristics that might form the components of a joke, but completely failed to actually construct one.

I couldn't see a single idea or wordplay that actually made sense or elicited anything like a chuckle. The model _nearly_ got there with 'krill' and 'kill', but failed to actually make the pun that it had already identified.

laweijfmvo · 01/21/2025

why shouldn’t i assume that the “thinking” is just the usual LLM regurgitation of “how would a human coming up with a joke explain their reasoning?” or something like that, and zero “thinking”?

croemer · 01/20/2025

Can someone ELI5 what the difference is between using the "quantized version of the Llama 3" from unsloth and the one that's on Ollama, i.e. `ollama run deepseek-r1:8b`?

reissbaker · 01/20/2025

FWIW, you can also try all of the distills out in BF16 on https://glhf.chat (either in the UI or via the API), including the 70b. Personally I've been most impressed with the Qwen 32b distill.

(Disclosure: I'm the cofounder)

gjm11 · 01/21/2025

What's your sense of how useful local LLMs are for things other than ... writing blog posts about experimenting with local LLMs? :-)

(This is a serious question, not poking fun; I am actually curious about this.)

TeMPOraL · 01/20/2025

Did you try the universal LLM cheat code as a followup prompt?

"Make it better"

lmc · 01/20/2025

> The walrus might say something like, "We have the biggest catch in town," while the pelican adds a line about not catching any fish recently.

It should've stopped there :D

earth2mars · 01/21/2025

Tried exactly the same model. And unfortunately the reasoning is just useless: it is still not able to tell how many r's are in strawberry.

ryanisnan · 01/20/2025

Super interesting. It seems to get hung up on a few core concepts, like the size of the walrus vs. the limited utility of a pelican beak.

jonplackett · 01/21/2025

This is probably pretty similar to my inner monologue as I try, and inevitably fail, to come up with a good joke.

newman314 · 01/21/2025

Have you had a chance to compare performance and results between the Qwen-7B and Llama-8B versions?

riwsky · 01/21/2025

“I never really had a childhood”, said Walrus, blowing on his tea with a feigned sigh. “Why’s that?” asked Pelican, refilling a sugar shaker. Walrus: “I was born long in the tooth!” Pelican: [big stupid pelican laughing noise]

dcreater · 01/21/2025

Why ask it for a joke? That's such a bad way to try out a reasoning model.

fsndz · 01/21/2025

Frankly, Ollama + DeepSeek is all you need to win with open source AI. I will do some experiments today and add them to my initial blog post. https://medium.com/thoughts-on-machine-learning/deepseek-is-...

linsomniac · 01/20/2025

>a joke about a pelican and

Tell me you're simonw without telling me you're simonw...

tomrod · 01/20/2025

Can you recommend hardware needed to run these?

fpgaminer · 01/20/2025

I think "reasoning" models will solve the joke issue (amongst other issues), but not because they're "reasoning". Rather, it's because they help solve the exploration issue and the scaling issue.

Having worked with LLMs a lot for my JoyCaption project, I've got all these hypotheses floating around in my head. I guess the short version, specifically for jokes, is that we lack "joke reasoning" data. The solution, as with mathematical problems, is to get the LLM to generate the data and then RL it into more optimal solutions.

Longer explanation:

Imagine we want an LLM to correctly answer "How many r's are in the word strawberry?". And imagine that language has been tokenized, and thus we can form a "token space". The question is a point in that space, point Q. There is a set of valid points, set A, that encompasses _any_ answer to this question which is correct. There are thus paths through token space from point Q to the points contained by set A.

A Generator LLM's job is, given a point, to predict valid paths through token space. In fact, we can imagine the Generator starting at point Q and walking its way to (hopefully) some point in set A, through a myriad of in-between points. Functionally, we have the model predict next-token (and hence next point in token space) probabilities, and we can use those probabilities to walk the path.

An Ideal Generator would output _all_ valid paths from point Q to set A. A Generator LLM is a lossy compression of that ideal model, so in reality the set of paths the Generator LLM will output might encompass some of those valid paths, but it might also encompass invalid paths.

One more important thing about these paths. Imagine that there is some critical junction. A specific point where, if the Generator goes "left", it goes into a beautiful flat, grassy plain where the sun is shining. That area is really easy to navigate, and the Generator LLM's predictions are all correct. Yay! But if it goes "right" it ends up in the Fire Swamp with many dangers that it is not equipped to handle. i.e. it isn't "smart" enough in that terrain and will frequently predict invalid paths.

Pretraining already taught the Generator LLM to avoid invalid paths to the best of its abilities, but again its abilities are limited.

To fix this, we use RL. A Judge LLM takes a completed path and determines whether or not it landed in set A. With an RL algorithm and that reward signal, we can train the Generator LLM to avoid the Fire Swamp, since it often gets low rewards there, and instead go to the Plain, since it often gets rewards there.

This results in a Generator LLM that is more _reliable_ and thus more useful. The RL encourages it to walk paths it's good at and capable of, avoid paths it struggles with, and of course encourages valid answers whenever possible.
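
To make that loop concrete, here's a tiny toy sketch (my own illustration with made-up names and a trivial "judge", not anything from an actual training pipeline): a tabular Generator samples token paths starting from Q, a Judge rewards paths that land in set A, and a REINFORCE-style update nudges the policy toward paths it can reliably complete:

    import math
    import random
    from collections import defaultdict

    VOCAB = ["think", "left", "right", "answer: 2", "answer: 3"]
    ANSWER_SET = {"answer: 3"}  # set A: the correct final tokens
    MAX_LEN = 4
    LR = 0.5

    # Generator policy: (path so far) -> logits over the next token.
    logits = defaultdict(lambda: [0.0] * len(VOCAB))

    def probs(state):
        l = logits[state]
        m = max(l)
        e = [math.exp(x - m) for x in l]
        z = sum(e)
        return [x / z for x in e]

    def sample_path():
        # Walk token space from point Q until an answer token or MAX_LEN.
        path = []
        while len(path) < MAX_LEN:
            p = probs(tuple(path))
            tok = random.choices(range(len(VOCAB)), weights=p)[0]
            path.append(tok)
            if VOCAB[tok].startswith("answer"):
                break
        return path

    def judge(path):
        # Judge stand-in: reward 1 if the completed path landed in set A.
        return 1.0 if VOCAB[path[-1]] in ANSWER_SET else 0.0

    for _ in range(2000):
        path = sample_path()
        advantage = judge(path) - 0.5  # reward minus a crude baseline
        state = []
        for tok in path:
            p = probs(tuple(state))
            for i in range(len(VOCAB)):
                # REINFORCE: d log softmax / d logit_i = 1{i == tok} - p_i
                grad = (1.0 if i == tok else 0.0) - p[i]
                logits[tuple(state)][i] += LR * advantage * grad
            state.append(tok)

    print(" -> ".join(VOCAB[t] for t in sample_path()))

After a couple thousand updates the sampled paths reliably end in set A. The real thing swaps the table for an LLM and the toy judge for a verifier, but the shape of the loop is the same.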

But what if the Generator LLM needs to solve a really hard problem? It gets set down at point Q, and explores the space based on its pretraining. But that pretraining _always_ takes it through a mountain, and it never succeeds. During RL the model never really learns a good path, so these problems tend to manifest as hallucinations or vapid responses that "look" correct.

Yet there are very easy, long paths _around_ the mountain that get to set A. Those don't get reinforced because they never get explored. They never get explored because those paths weren't in the pretraining data, or are so rare that it would take an impractical amount of exploration for the PT model to output them.

Reasoning is one of those long, easy paths. Digestible small steps that a limited Generator LLM can handle and use to walk around the mountain. Those "reasoning" paths were always there, and were predicted by the Ideal Generator, but were not explored by our current models.

So "reasoning" research is fundamentally about expanding the exploration of the pretrained LLM. The judge gets tweaked slightly to encourage the LLM to explore those kinds of pathways, and/or the LLM gets SFT'd with reasoning data (which is very uncommon in its PT dataset).

I think this breakdown and stepping back is important so that we can see what we're really trying to do here: get a limited Generator LLM to find its way around areas it can't climb. It is likely true that there is _always_ some path from a given point Q to set A that a limited Generator LLM can safely traverse, even if that means those paths are very long.

It's not easy for researchers to know what paths the LLM can safely travel. So we can't just look at Q and A and build a nice dataset for it. It needs to generate the paths itself. And thus we arrive at Reasoning.

Reasoning allows us to take a limited, pretrained LLM, and turn it into a little path finding robot. Early during RL it will find really convoluted paths to the solution, but it _will_ find a solution, and once it does it gets a reward and, hopefully, as training progresses, it learns to find better and shorter paths that it can still navigate safely.

But the "reasoning" component is somewhat tangential. It's one approach, probably a very good approach. There are probably other approaches. We just want the best ways to increase exploration efficiently. And we're at the point where existing written data doesn't cover it, so we need to come up with various hacks to get the LLM to do it itself.

The same applies to jokes. Comedians don't really write down every single thought in their head as they come up with jokes. If we had that, we could SFT existing LLMs to get to a working solution TODAY, and then RL into something optimal. But as it stands PT LLMs aren't capable of _exploring_ the joke space, which means they never come out of the RL process with humor.

Addendum:

Final food for thought. There's kind of this debate going on about "inference scaling", with some believing that CoT, ToT, Reasoning, etc. are all essentially just inference scaling. More output gives the model more compute so it can make better predictions. It's likely true that that's the case. In fact, if it _isn't_ the case we need to take a serious look at our training pipelines. But I think it's _also_ about exploring during RL. The extra tokens might give it a boost, sure, but the ability for the model to find more valid paths during RL enables it to express more of its capabilities and solve more problems. If the model is faced with a sheer cliff face it doesn't really matter how much inference compute you throw at it. Only the ability for it to walk around the cliff will help.

And, yeah, this all sounds very much like ... gradient descent :P and yes there have been papers on that connection. It very much seems like we're building a second layer of the same stuff here and it's going to be AdamW all the way down.
