logoalt Hacker News

LeroyRazlast Wednesday at 9:14 PM3 repliesview on HN

I am surprised the author thought the project passed quality control. The LLM reviews seem mostly false.

Looking at the comment reviews on the actual website, the LLM seems to have mostly judged whether it agreed with the takes, not whether they came true, and it seems to have an incredibly poor grasp of it's actual task of accessing whether the comments were predictive or not.

The LLM's comment reviews are of often statements like "correctly characterized [program language] as [opinion]."

This dynamic means the website mostly grades people on having the most confirmist take (the take most likely to dominate the training data, and be selected for in the LLM RL tuning process of pleasing the average user).


Replies

LeroyRazlast Wednesday at 9:28 PM

Examples: tptacek gets an 'A' for his comment on DF which the LLM claiming that the user "captured DF's unforgiving nature, where 'can't do x or it crashes is just another feature to learn' which remained true until it was fixed on ..."

Link to LLM review: https://karpathy.ai/hncapsule/2015-12-02/index.html#article-....

So the LLM is praising a comment as describing DF as unforgiving (a characterization of the present then, not a statement about the future). And worse, it seems like tptacek may in fact be implying the opposite of the future (e.g., x will continue to crash when it was eventually fixed.)

Here is the original comment: " tptacek on Dec 2, 2015 | root | parent | next [–]

If you're not the kind of person who can take flaws like crashes or game-stopping frame-rate issues and work them into your gameplay, DF is not the game for you. It isn't a friendly game. It can take hours just to figure out how to do core game tasks. "Don't do this thing that crashes the game" is just another task to learn."

Note: I am paraphrasing the LLM review, as the website is also poorly designed, with one unable to select the text of the LLM review!

N.b., this choice of comment review is not overly cherry picked. I just scanned the "best commentators" and tptacek was number two, with this particular egregiously unrelated-to-prediction LLM summary given as justifying his #2 rating.

hathawshlast Wednesday at 9:28 PM

Are you sure? The third section of each review lists the “Most prescient” and “Most wrong” comments. That sounds exactly like what you're looking for. For example, on the "Kickstarter is Debt" article, here is the LLM's analysis of the most prescient comment. The analysis seems accurate and helpful to me.

https://karpathy.ai/hncapsule/2015-12-03/index.html#article-...

  phire

  > “Oculus might end up being the most successful product/company to be kickstarted… > Product wise, Pebble is the most successful so far… Right now they are up to major version 4 of their product. Long term, I don't think they will be more successful than Oculus.”

  With hindsight:

  Oculus became the backbone of Meta’s VR push, spawning the Rift/Quest series and a multi‑billion‑dollar strategic bet.
  Pebble, despite early success, was shut down and absorbed by Fitbit barely a year after this thread.

  That’s an excellent call on the relative trajectories of the two flagship Kickstarter hardware companies.
show 2 replies
andy99last Wednesday at 10:37 PM

I haven’t looked at the output yet, but came here to say,LLM grading is crap. They miss things, they ignore instructions, bring in their own views, have no calibration and in general are extremely poorly suited to this task. “Good” LLM as a judge type products (and none are great) use LLMs to make binary decisions - “do these atomic facts match yes / no” type stuff - and aggregate them to get a score.

I understand this is just a fun exercise so it’s basically what LLMs are good at - generating plausible sounding stuff without regard for correctness. I would not extrapolate this to their utility on real evaluation tasks.