Related from yesterday: Show HN: Gemini Pro 3 imagines the HN front page 10 years from now - https://news.ycombinator.com/item?id=46205632
This is a cool idea. I would install a Chrome extension that shows a score by every username on this site grading how well their expressed opinions match what subsequently happened in reality, or the accuracy of any specific predictions they've made. Some people's opinions are closer to reality than others and it's not always correlated with upvotes.
An extension of this would be to grade people on the accuracy of the comments they upvote, and use that to weight their upvotes more in ranking. I would love to read a version of HN where the only upvotes that matter are from people who agree with opinions that turn out to be correct. Of course, only HN could implement this since upvotes are private.
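To make it concrete, the reweighting could be as simple as this (a toy sketch in Python; the usernames and accuracy numbers are invented, and the real thing would need HN's private vote data):

    # Toy sketch: weight each upvote by the voter's historical
    # prediction accuracy (0.0-1.0). All data here is made up.
    accuracy = {"alice": 0.9, "bob": 0.4, "carol": 0.7}

    def weighted_score(voters, default=0.5):
        """Sum of upvotes, each scaled by the voter's track record."""
        return sum(accuracy.get(v, default) for v in voters)

    # Two upvotes from accurate users beat three from poor/unknown ones.
    print(weighted_score(["alice", "carol"]))      # 0.9 + 0.7 = 1.6
    print(weighted_score(["bob", "dave", "eve"]))  # 0.4 + 0.5 + 0.5 = 1.4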
'pcwalton, I'm coming for you. You're going down.
Kidding aside, the comments it picks out for us are a little random. For instance, this was an A+ predictive thread (it appears to be rating threads and not individual comments):
https://news.ycombinator.com/item?id=10703512
But there are just 11 comments, only one of them mine, and it's a one-sentence comment.
I do love that my unaccredited-access-to-startup-shares take is on that leaderboard, though.
I've spent a weekend making something similar for my Gmail account (which Google keeps nagging me about being 90% full). It's fascinating to be able to classify 65k+ emails (surprise: more than half are garbage), as well as to summarize and trace the nature of communication between specific senders/recipients. It took about 50 hours on a pair of RTX 3090s running Qwen 3.
My original goal was to prune the account, deleting all the useless things and keeping just the unique, personal, valuable communications -- but the other day an insight convinced me that the safer / smarter thing to do in the current landscape is the opposite: remove any personal, valuable, memorable items, and leave Google (and whoever else is scraping these repositories) with a useless flotsam of newsletters, updates, subscription receipts, etc.
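The core of the pipeline is conceptually tiny, by the way. A rough sketch (not my actual code; it assumes a Google Takeout mbox export and an OpenAI-compatible local server, e.g. vLLM or llama.cpp, serving Qwen 3 on localhost:8000):

    # Sketch of local-LLM email triage. The endpoint, model name, and
    # label set are assumptions, not the exact setup described above.
    import json, mailbox, urllib.request

    LABELS = ["personal", "newsletter", "receipt", "notification", "other"]

    def classify(subject, sender):
        prompt = ("Classify this email as one of %s.\n"
                  "From: %s\nSubject: %s\nLabel:" % (LABELS, sender, subject))
        req = urllib.request.Request(
            "http://localhost:8000/v1/chat/completions",
            data=json.dumps({"model": "qwen3",
                             "messages": [{"role": "user", "content": prompt}],
                             "max_tokens": 8}).encode(),
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            text = json.load(resp)["choices"][0]["message"]["content"].lower()
        return next((l for l in LABELS if l in text), "other")

    # Takeout names the archive roughly like this; adjust to taste.
    for msg in mailbox.mbox("All mail Including Spam and Trash.mbox"):
        print(classify(msg.get("Subject", ""), msg.get("From", "")))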
One thing this really highlights to me is how often the "boring" takes end up being the most accurate. The provocative, high-energy threads are usually the ones that age the worst.
If an LLM were acting as a kind of historian revisiting today’s debates with future context, I’d bet it would see the same pattern again and again: the sober, incremental claims quietly hold up, while the hyperconfident ones collapse.
Something like "Lithium-ion battery pack prices fall to $108/kWh" is classic cost-curve progress. Boring, steady, and historically extremely reliable over long horizons. Probably one of the most likely headlines today to age correctly, even if it gets little attention.
On the flip side, stuff like "New benchmark shows top LLMs struggle in real mental health care" feels like high-risk framing. Benchmarks rotate constantly, and “struggle” headlines almost always age badly as models jump whole generations.
I bet there are many "boring but right" takes we overlook today, and I wonder if there's a practical way to surface them before hindsight does.
I noticed a quirk in the Hall of Fame grading of predictive comments. It grades some comments on whether they came true or not, but in its grading of a comment on the article
https://news.ycombinator.com/item?id=10654216
The Cannons on the B-29 Bomber, it says: "accurate account of LeMay stripping turrets and shifting to incendiary area bombing; matches mainstream history"
It gave user cstross a good grade, but to my reading of the comment, cstross just recounted a bit of old history. Did the evaluation reward cstross simply for giving a history lesson?
"the distributed “trillions of Tamagotchi” vision never materialized"
I begrudgingly accept my poor grade.
I am surprised the author thought the project passed quality control. The LLM reviews seem mostly false.
Looking at the comment reviews on the actual website, the LLM seems to have mostly judged whether it agreed with the takes, not whether they came true, and it has an incredibly poor grasp of its actual task of assessing whether the comments were predictive or not.
The LLM's comment reviews are often statements like "correctly characterized [programming language] as [opinion]."
This dynamic means the website mostly grades people on having the most conformist take (the take most likely to dominate the training data and to be selected for in the LLM's RL tuning process of pleasing the average user).
Quick, give everyone colors to indicate their rank here and ban anyone with a grade below C-.
Seriously, while I find this cool and interesting, I also fear how these sorts of things will work out for us all.
#272, I got a B+! Neat.
It would be very interesting to see this applied year after year to see if people get better or worse over time in the accuracy of their judgments.
It would also be interesting to correlate accuracy with scores, but I kind of doubt that can be done. Between comments that just express popular sentiment and first-to-post commenters getting more votes than latecomers for the same comment, it probably wouldn't be very useful data.
So where do I collect my prize for this 2015 comment? https://news.ycombinator.com/item?id=9882217
Predictions are only valuable when they're actually made ahead of the knowledge becoming available. "A man will walk on Mars by 2030" is falsifiable; "a man will walk on Mars" is not. A lot of these entries have very little to no predictive value, or were already known at the time and are merely related. It would be nice if future "judges" put in more work to ensure quality judgments.
I would grade this article B-, but then again, nobody wrote it... ;)
10 Years Ago, December 11, 2015 - Introducing OpenAI -- very meta: https://karpathy.ai/hncapsule/2015-12-11/index.html#article-...
The company has changed and it seems the mission has as well.
Anyone have a branch that I can run to target my own comments? I'd love to see where I was right and where I was off base. Seems like a genuinely great way to learn about my own biases.
Notable how this is only possible because the website is a good "web citizen." It has URLs that maintain their state over a decade. They contain a whole conversation. You don't have to log in to see anything. The value of old, proper websites increases with our ability to process them.
The analysis of the 2015 article about Triplebyte is fascinating [1]. Particularly the Awards section.
1. https://karpathy.ai/hncapsule/2015-12-08/index.html#article-...
It somehow feels right to see what GPT-5 thinks of the article titled "Machine learning works spectacularly well, but mathematicians aren’t sure why" and its discussion: https://karpathy.ai/hncapsule/2015-12-04/index.html#article-...
> https://karpathy.ai/hncapsule/2015-12-24/index.html#article-...
I wonder why ChatGPT refused to analyze it?
The HN article was "Brazil declares emergency after 2,400 babies are born with brain damage" but the page says "No analysis available".
> I realized that this task is actually a really good fit for LLMs
I've found the opposite, since these models still fail pretty wildly at nuance. I think it's a conceptual "needle in the haystack" sort of problem.
A good test is to find some thread where there's a disagreement and have it try to analyze the discussion. It will usually badly misrepresent what each side was saying and align strongly with one user, missing the actual divide that's causing the disagreement (the needle).
It doesn't look like the code anonymizes usernames when sending the thread for grading. This likely induces bias in the grades based on past/current prevailing opinions of certain users. It would be interesting to see the whole thing done again but this time randomly re-assigning usernames, to assess bias, and also with procedurally generated pseudonyms, to see whether the bias can be removed that way.
I'd expect de-biasing to deflate grades for well-known users.
It might also be interesting to use a search-grounded model that provides citations for its grading claims. Gemini models have access to this via their API, for example.
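The pseudonym pass itself is trivial; something like this would do (a sketch of the idea, not the project's code; the word pools are toy-sized):

    # Deterministically replace usernames with generated pseudonyms
    # before grading; keep the mapping so grades can be attributed back.
    import hashlib

    ADJS = ["quiet", "rapid", "amber", "lunar"]
    NOUNS = ["otter", "falcon", "cedar", "quark"]

    def pseudonym(username, salt="run-1"):
        """Stable within a run; change `salt` to re-randomize between runs."""
        h = int(hashlib.sha256((salt + username).encode()).hexdigest(), 16)
        return "%s-%s-%d" % (ADJS[h % 4], NOUNS[(h // 7) % 4], h % 9973)

    def anonymize(comments):
        mapping, out = {}, []
        for c in comments:  # each comment is a dict with a "by" field
            alias = pseudonym(c["by"])
            mapping[alias] = c["by"]
            out.append(dict(c, by=alias))
        return out, mapping  # grade `out`, then map aliases back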
Commenters of HN:
Your past thoughts have been dredged up and judged.
For each $TOPIC, you have been awarded a grade by GPT-5.1 Thinking.
Your grade is based on OpenAI's aligned worldview and what OpenAI's blob of weights considers Truth in 2025.
Did you think well, netizen?
Are you an Alpha or a Delta-Minus?
Where will the dragnet grading of your online history happen next?
I understand the exercise, but I think it should have a disclaimer. Some of the LLM reviews show a bias, and when I read the comments they turned out not to be as bad as the LLM made them out to be. As this hits the front page, some people will only read the title and not the accompanying blog post, losing all of the nuance.
That said, I understand the concept and love what you did here. By exposing this to sunlight, the best disinfectant, I hope it will raise awareness and show how careful people and corporations should be about its usage. This tech is now accessible to anyone, not just big tech, in a couple of hours.
It also shows how we should take the results of any LLM analysis at this scale with a grain of salt. Our private channels and messages on software like Teams and Slack can now be analyzed to hell by our AI overlords. I'm probably going to remove a lot of things from cloud drives just in case. Perhaps online discourse will deteriorate into more inane / LinkedIn-style content.
Also, I like that your prompt itself has some purposefully leaked bias, which shows other risks—¹for instance, "fsflover: F", which may align the LLM to grade handles related to free software and open source more harshly.
As a meta concept of this, I wonder how I'll be graded by our AI overlords in the future now that I have posted something dismissive of it.
¹Alt+0151
Interesting that for the December 16, 2015 thread "geohot is building Comma," it graded geohot's own comments as only a B.
Looking at the results and the prompt, I would tweak the prompt to (a concrete sketch follows this list):
* ignore comments that do not speculate on something that was unknown or had not achieved consensus as of the date of yyyy-mm-dd
* at the same time, exclude speculations for which there still isn’t a definitive answer or consensus today
* ignore comments that speculate on minor details or are stating a preference/opinion on a subjective matter
* it is ok to generate an empty list of users for a thread if there are no comments meeting the speculation requirements laid out above
* etc
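Concretely, those rules might read something like this in the prompt (my wording, not the original):

    # Hypothetical additions to the grading prompt; wording is mine.
    GRADING_RULES = """
    Only grade comments that make a falsifiable claim about something
    that was unknown, or lacked consensus, as of {thread_date}.

    Exclude: speculations that still have no definitive answer today;
    comments on minor details; pure statements of subjective preference.

    If no comment in the thread qualifies, return an empty list of users.
    """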
I have never felt less confident in the future than I do in 2025... and it's such a stark contrast. I guess if you split things down the middle, AI probably continues to change the world in dramatic ways but not in the all or nothing way people expect.
A non-trivial number of people get laid off, likely due to a financial crisis that is used as an excuse for companies to scale up their use of AI. Good chance the financial crisis was partly caused by AI companies, which ironically makes AI cheaper as infra is bought up on the cheap (so there is a consolidation, but the bountiful infra keeps things cheap). That results in increased usage (over a longer period of time), and even when the economy starts coming back, the jobs numbers stay abysmal.
Politics is divided into two main groups: those who are employed and those who are retired. The retired group is VERY large and has a lot of power. They mostly care about entitlements. The working-age people focus on AI, which is making the job market quite tough. There are three large political forces (but two parties): the Left, the Right, and the Tech Elite. The Left and the Right both hate AI, but the Tech Elite, though a minority, has outsized power in its tie-breaker role. The age distributions would surprise most: most older people are now on the Left, and most younger people are split by gender. The Right focuses on limiting entitlements, and the Left focuses on growing them by taxing the Tech Elite. The Right maintains power by not threatening the Tech Elite.
Unlike in the 20th century, America has a more focused global agenda. We're not policing everyone, just the core trading powers. We have not gone to war with China, and China has not taken over Taiwan.
Physical robotics is becoming a pretty big thing, and space travel is becoming cheaper. We have at least one robot on an asteroid, mining it. The yield is trivial, but we all thought it was neat.
Energy is much, much greener, and you wouldn't have guessed it... but it was the data centers that got us there. The Tech Elite needed it quickly and used their political connections to cut red tape and build really fast.
Nice! Something must be in the air – last week I built a very similar project using the historical archive of all-in podcast episodes: https://allin-predictions.pages.dev/
I believe that the GPA calculation is off, maybe just for F's.
I scrolled to the bottom of the hall of fame/shame and saw that entry #1505 had 3 F's and a D, with an average grade of D+ (1.46).
With no grade better than a D, that shouldn't average to a D+; I'd expect it to be closer to 0.25.
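Assuming a standard 4.0 scale (I don't know the site's exact mapping), the arithmetic looks like:

    # A=4.0, B=3.0, C=2.0, D=1.0, F=0.0; +/- worth roughly 0.3.
    points = {"F": 0.0, "D": 1.0, "D+": 1.3}
    grades = ["F", "F", "F", "D"]
    gpa = sum(points[g] for g in grades) / len(grades)
    print(gpa)  # 0.25 -- nowhere near the displayed 1.46 (D+)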
I often summarise HN comments (which are sometimes more insightful than the original article) using an LLM. Total game-changer.
UX feedback: I wish clicking on a new thread scrolled the right side back to the top;
reading from the end isn't really useful, y'know :)
Gotta auto-grade every HN comment for how good it is at predicting stock market movement, then check what the "most frequently correct" user is saying about the next 6 months.
Many people are impressed by this, and I can see why. Still, this much isn't surprising: the Karpathy + LLM combo can deliver quickly. But there are downsides of blazing speed.
If you dig in, there are substantial flaws in the project's analysis and framing: the definition of a prediction, the assessment of comments, overall data quality, and more. Go spelunking through the comments here and notice people asking about methodology and checking the results.
Social science research isn't easy; it requires training, effort, and patience. I would be very happy if Karpathy added a Big Flashing Red Sign to this effect. It would raise awareness and focus community attention on what I think are the hardest and most important aspects of this kind of project: methodology, rigor, criticism, feedback, and correction.
This is great! Now I want to run this to analyze my own comments and see how I score and whether my rhetoric has improved in quality/accuracy over time!
> I spent a few hours browsing around and found it to be very interesting.
This seems to be the result of the exercise? No evaluation?
My concern is that, even if the exercise is only an amusing curiosity, many people will take the results more seriously than they should, and be inspired to apply the same methods to products and initiatives that adversely affect people's lives in real ways.
Reading this I feel the same sense of dread I get watching those highly choreographed Chinese holiday drone shows.
> And then when you navigate over to the Hall of Fame, you can find the top commenters of Hacker News in December 2015, sorted by imdb-style score of their grade point average.
Now let's make a Chrome extension that subtly highlights these users' comments when browsing HN.
On the site itself:
it's great that this was produced in 1 hour for $60. This is amazing for creating small utilities, exploring your curiosity, etc.
But the site is also quite confusing and messy. OK for a vibe-coded experiment, sure, but it wouldn't be for a final product. And I fear we're gonna see more and more of this: big companies downsizing their tech departments and embracing vibe coding. By analogy with inflation, shrinkflation, and skimpflation/enshittification, will we soon adopt some word for this? AIflation? LLMflation?
And how will this comment score in a couple of years? :)
> I was reminded again of my tweets that said "Be good, future LLMs are watching". You can take that in many directions, but here I want to focus on the idea that future LLMs are watching. Everything we do today might be scrutinized in great detail in the future because doing so will be "free". A lot of the ways people behave currently I think make an implicit "security by obscurity" assumption. But if intelligence really does become too cheap to meter, it will become possible to do a perfect reconstruction and synthesis of everything. LLMs are watching (or humans using them might be). Best to be good.
Can we take a second and talk about how dystopian this is? Such an outcome is not inevitable; it relies on us making it. The future is not deterministic; the future is determined by us. Moreover, Karpathy has significantly more influence on that future than your average HN user. We are doing something very *very* wrong if we are operating under the belief that this future is unavoidable. That future is simply unacceptable.
Why not rank ESP for each HN user, with evidence?
Neat, I got a shout-out. Always happy to share the random stuff I remember exists!
A majority don't seem to be predictions about the future, and it seems to mostly like comments that give extended airtime to what was then, and is still, the consensus viewpoint, e.g. the top comment from pcwalton, the highest-scored user: https://news.ycombinator.com/item?id=10657401
> (Copying my comment here from Reddit /r/rust:) Just to repeat, because this was somewhat buried in the article: Servo is now a multiprocess browser, using the gaol crate for sandboxing. This adds (a) an extra layer of defense against remote code execution vulnerabilities beyond that which the Rust safety features provide; (b) a safety net in case Servo code is tricked into performing insecure actions. There are still plenty of bugs to shake out, but this is a major milestone in the project.
"If LLMs are watching, humans will be on their best behavior". Karpathy, paraphrasing Larry Ellison.
The EU may give LLM surveillance an F at some point.
I think the most fun thing is to go to: https://karpathy.ai/hncapsule/hall-of-fame.html
And scroll down to the bottom.
Cool - now make it analyze all of those and come up with the 10 commandments of commenting factually and insightfully on HN posts...
I'd love to see an "Annie Hall" analysis of hn posts, for incidents where somebody says something about some piece of software or whatever, and the person who created it replies, like Marshall McLuhan stepping out from behind a sign in Annie Hall.
Does anyone else think that HN engages in far too much navel-gazing? Nothing gets upvotes faster than a HN submission about HN.
This is a perfect example of the power and problems with LLMs.
I took the narcissistic approach of searching for myself. Here's a grade of one of my comments[1]:
>slg: B- (accurate characterization of PH’s “networking & facade” feel, but implicitly underestimates how long that model can persist)
And here's the actual comment I made[2]:
>And maybe it is the cynical contrarian in me, but I think the "real world" aspect of Product Hunt it what turned me off of the site before these issues even came to the forefront. It always seemed like an echo chamber were everyone was putting up a facade. Users seemed more concerned with the people behind products and networking with them than actually offering opinions of what was posted.
>I find the more internet-like communities more natural. Sure, the top comment on a Show HN is often a critique. However I find that more interesting than the usual "Wow, another great product from John Developer. Signing up now." or the "Wow, great product. Here is why you should use the competing product that I work on." that you usually see on Product Hunt.
I did not say nor imply anything about "how long that model can persist"; I just said I personally don't like using the site. It's a total hallucination to claim I was implying doom for "that model," and you would only know that if you actually took the time to dig into the details of what was actually said -- but the summary seems plausible enough that most people never would.
The LLM processed and analyzed a huge amount of data in a way that no human could, but the single in-depth look I took at that analysis was somewhere between misleading and flat out wrong. As I said, a perfect example of what LLMs do.
And yes, I do recognize the funny coincidence that I'm now doing the exact thing I described as the typical HN comment a decade ago. I guess there is a reason old me said "I find that more interesting".
[1] - https://karpathy.ai/hncapsule/2015-12-18/index.html#article-...
>I believe it is quite possible and desirable to train your forward future predictor given training and effort.
That's interesting. I wouldn't have thought that a decent generic forward future predictor would be possible.
Now: compared to what? Is there a better source than HN? How does it compare to Reddit or Lobsters?
Compared to what happens next? Does tptacek's commentary become market signal equivalent to the Fed Chair or the BLS labor and inflation reports?
It's fun to read some of these historic comments! A while back I wrote a replay system to better capture how discussions evolved at the time of these historic threads. Here's Karpathy's list from his graded articles, in the replay visualizer:
Swift is Open Source https://hn.unlurker.com/replay?item=10669891
Launch of Figma, a collaborative interface design tool https://hn.unlurker.com/replay?item=10685407
Introducing OpenAI https://hn.unlurker.com/replay?item=10720176
The first person to hack the iPhone is building a self-driving car https://hn.unlurker.com/replay?item=10744206
SpaceX launch webcast: Orbcomm-2 Mission [video] https://hn.unlurker.com/replay?item=10774865
At Theranos, Many Strategies and Snags https://hn.unlurker.com/replay?item=10799261