Show HN: Use Claude Code to Query 600 GB Indexes over Hacker News, ArXiv, etc.

236 points • by Xyra • today at 7:47 AM • 66 comments

Paste my prompt into Claude Code (it embeds an API key for my public read-only SQL+vector database), and you have a state-of-the-art research tool over Hacker News, arXiv, LessWrong, and dozens of other high-quality public commons sites. Claude whips up the monster SQL queries, which run safely on my machine, to answer your most nuanced questions.

There's also an Alerts feature: just ask Claude to submit a SQL query as an alert, and you'll be emailed when the ultra-nuanced criteria are met (and the output changes). For example, I want to know when somebody posts about "estrogen" in a psychoactive context, or uses enough biology metaphors when talking about building infrastructure.
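One way to read "the criteria are met (and the output changes)" is: rerun the saved query, fingerprint the result set, and fire only when the result is non-empty and differs from last time. The sketch below is a guess at that logic, not the actual implementation. It uses an in-memory SQLite table as a stand-in database, a plain `LIKE` filter instead of a semantic one, and hypothetical names (`result_fingerprint`, `check_alert`, the `posts` table).

```python
import hashlib
import json
import sqlite3


def result_fingerprint(conn, sql):
    """Run the alert query and return (sha256 of the rows, rows)."""
    rows = conn.execute(sql).fetchall()
    payload = json.dumps(rows, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest(), rows


def check_alert(conn, sql, last_fingerprint):
    """Fire only when the query returns rows AND the output changed."""
    fingerprint, rows = result_fingerprint(conn, sql)
    fired = bool(rows) and fingerprint != last_fingerprint
    return fired, fingerprint


# Stand-in database (assumption: the real service is not SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
alert_sql = (
    "SELECT id, title FROM posts "
    "WHERE title LIKE '%estrogen%' ORDER BY id"
)

fired_empty, fp0 = check_alert(conn, alert_sql, None)     # no matches yet
conn.execute("INSERT INTO posts (title) VALUES ('estrogen and mood')")
fired_new, fp1 = check_alert(conn, alert_sql, fp0)        # new match appears
fired_repeat, fp2 = check_alert(conn, alert_sql, fp1)     # nothing changed
```

The `ORDER BY` matters here: without a stable row order, the fingerprint could change even when the matching rows did not.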

Currently embedded (with Voyage-3.5-lite):

posts: 1.4M / 4.6M
comments: 15.6M / 38M

And you can do amazing compositional vector search, like search @FTX_crisis - (@guilt_tone - @guilt_topic) to find writing that was about the FTX crisis and distinctly without guilty tones, but that can mention "guilt".
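To make the vector arithmetic concrete, here is a toy sketch of what an expression like @FTX_crisis - (@guilt_tone - @guilt_topic) could compute. Everything in it is an assumption for illustration: the 3-dimensional vectors are hand-made (real embeddings come from a model and have far more dimensions), the axis meanings are invented, and ranking by cosine similarity is one plausible choice rather than a description of the actual service.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sub(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


# Toy axes (invented): [FTX topic, guilt as a topic, guilty tone]
ftx_crisis = [1.0, 0.0, 0.0]
guilt_topic = [0.0, 1.0, 0.0]
guilt_tone = [0.0, 1.0, 1.0]  # guilty-toned writing also tends to mention guilt

# (guilt_tone - guilt_topic) isolates the tone; subtracting it from the
# topic vector pushes the query away from guilty-sounding writing only.
query = normalize(sub(ftx_crisis, sub(guilt_tone, guilt_topic)))

docs = {
    "neutral_ftx_post": [1.0, 0.2, 0.0],
    "remorseful_ftx_post": [1.0, 0.2, 1.0],
    "ftx_post_discussing_guilt": [1.0, 1.0, 0.0],
}
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
```

In this toy setup the post that merely discusses guilt still ranks above the remorseful one, because only the tone direction is penalized, not the topic.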

I can embed everything and all the other sources for cheap, I just literally don't have the money.

Comments

barishnamazov • today at 9:38 AM

I like that this relies on generating SQL rather than just being a black-box chat bot. It feels like the right way to use LLMs for research: as a translator from natural language to a rigid query language, rather than as the database itself. Very cool project!

Hopefully your API doesn't get exploited and you are doing timeouts/sandboxing -- it'd be easy to do a massive join on this.
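For readers wondering what "timeouts/sandboxing" might look like in practice, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in (the actual backend is not stated in the post, so this is an assumption). It combines three guards: a read-only connection via `PRAGMA query_only`, a wall-clock deadline enforced through `set_progress_handler`, and a cap on returned rows. The helper name `run_with_timeout` and the `posts` table are hypothetical.

```python
import sqlite3
import time


def run_with_timeout(conn, sql, timeout_s=0.05, max_rows=1000):
    """Run a query, aborting it once the wall-clock deadline has passed."""
    deadline = time.monotonic() + timeout_s
    # SQLite calls the handler every 1000 VM instructions; a nonzero
    # return value interrupts the running statement.
    conn.set_progress_handler(
        lambda: 1 if time.monotonic() > deadline else 0, 1000
    )
    try:
        return conn.execute(sql).fetchmany(max_rows)
    finally:
        conn.set_progress_handler(None, 0)  # remove the handler


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO posts (title) VALUES ('hello')")
conn.commit()
conn.execute("PRAGMA query_only = ON")  # reject writes on this connection

rows = run_with_timeout(conn, "SELECT id, title FROM posts")

# A deliberately unbounded query, standing in for a "massive join".
runaway_sql = (
    "WITH RECURSIVE c(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM c) "
    "SELECT count(*) FROM c"
)
try:
    run_with_timeout(conn, runaway_sql)
    timed_out = False
except sqlite3.OperationalError:
    timed_out = True
```

A server database such as PostgreSQL would more likely use a read-only role plus a per-session statement timeout, but the idea is the same: the limits live in the database layer, not in the prompt.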

I also have a question mostly stemming from me being not knowledgeable in the area -- have you noticed any semantic bleeding when research is done between your datasets? e.g., "optimization" probably means different things under ArXiv, LessWrong, and HN. Wondering if vector searches account for this given a more specific question.

arjie • today at 6:47 PM

This is very cool. If you're productizing this you should try to target a vertical. What does "literally don't have the money" mean? You should try to raise some in the traditional way. If nothing else works, at least try to apply to YC.

r--w • today at 6:53 PM

It could be distributed as a Claude skill. Internally, we've bundled a lot of external APIs and SQL queries into skills that are shared across the company.

bonsai_spool • today at 1:05 PM

This may exist already, but I'd like to find a way to query 'Supplementary Material' in biomedical research papers for genes / proteins or even biological processes.

As it is, the Supplementary Materials are inconsistently indexed so a lot of insight you might get from the last 15 years of genomics or proteomics work is invisible.

I imagine this approach could work, especially for Open Access data?

nielsole • today at 11:20 AM

I think a prompt + an external dataset is a very simple distribution channel right now to explore anything quickly with low friction. The curl | bash of 2026

kburman • today at 10:45 AM

> a state-of-the-art research tool over Hacker News, arXiv, LessWrong, and dozens

what makes this state of the art?

7777777phil • today at 9:37 AM

Really useful. I'm currently working on an autonomous academic research system [1] and thinking about integrating this. Currently using a custom prompt + the Edison Scientific API. Any plans to make this open source?

[1] https://github.com/giatenica/gia-agentic-short

lasgawe • today at 7:07 PM

I need to try this

nineteen999 • today at 9:40 AM

That's just not a good use of my Claude plan. If you can make it so a self-hosted Llama or Qwen 7B can query it, then that's something.

lastdong • today at 4:19 PM

Has anyone tried using these prompts with Gemini 3 Pro? It feels like the latest offerings from Claude, Gemini, and GPT are on par (excluding costs), and as a developer, if you know how to query/spec a coder LLM, you can move between them with ease.

anonfunction • today at 4:53 PM

Seems like you're experiencing the hacker news hug of death.

legohorizons • today at 6:20 PM

Do you have contact information? Would like to discuss sponsoring further work and embedding here.

voxleone • today at 2:09 PM

This is great:

> @FTX_crisis - (@guilt_tone - @guilt_topic)

Using an LLM for tasks that could be done faster with traditional algorithmic approaches seems wasteful, but this is one of the few legitimate cases where embeddings are doing something classical IR literally cannot. You could also make the LLM explain the query it’s about to run. Before execution:

“Here’s the SQL and semantic filters I’m about to apply. Does this match your intent?”
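A confirm-before-execute step like that could be sketched as a thin wrapper around whatever actually runs the query. This is only an illustration of the suggestion: `describe_plan`, `confirm_then_run`, and the `confirm`/`run` callbacks are hypothetical names, and in a real setup `confirm` would prompt the user while `run` would call the query API.

```python
def describe_plan(sql, semantic_filters):
    """Build the human-readable summary shown before execution."""
    lines = [
        "Here's the SQL and semantic filters I'm about to apply.",
        "",
        "SQL:",
        sql,
        "",
        "Semantic filters:",
    ]
    lines += [f"  - {f}" for f in semantic_filters]
    lines += ["", "Does this match your intent?"]
    return "\n".join(lines)


def confirm_then_run(sql, semantic_filters, confirm, run):
    """Run the query only if the user approves the described plan."""
    if not confirm(describe_plan(sql, semantic_filters)):
        return None
    return run(sql)


# Stand-ins for the real prompt and query executor.
executed = []


def fake_run(sql):
    executed.append(sql)
    return ["row"]


approved = confirm_then_run(
    "SELECT 1",
    ["@FTX_crisis - (@guilt_tone - @guilt_topic)"],
    lambda text: True,
    fake_run,
)
rejected = confirm_then_run("SELECT 2", [], lambda text: False, fake_run)
```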

fragmede • today at 11:33 AM

> I can embed everything and all the other sources for cheap, I just literally don't have the money.

How much do you need for the various leaks, like the Paradise Papers, the Panama Papers, the Offshore Leaks, the Bahamas Leaks, the FinCEN Files, the Uber Files, etc.? And what's your Venmo?

mentalgear • today at 9:40 AM

Nice, but would you consider open-sourcing it? I (and I assume others) am not keen on sharing my API keys with a third party.

darlontrofy • today at 5:04 PM

It's a very nifty tool, and could definitely come in handy. Love the UX too!

m11a • today at 11:32 AM

The quick setup is cool! I’ve not seen this onboarding flow for other tools, and I quite like its simplicity.

gtsnexp • today at 10:00 AM

Is the appeal of this tool its ability to identify semantic similarity?

bugglebeetle • today at 9:31 AM

Seems very cool, but IMO you’d be better off doing an open source version and then a hosted SaaS.

beepbooptheory • today at 2:53 PM

Does that first generated query really work? Why are you looking at URIs like that? First you filter for a uri match, then later filter out that same match, minus `optimization`, when you are doing the cosine distance. Not once is `mesa-optimization` even mentioned, which is supposed to be the whole point?

octoberfranklin • today at 10:05 AM

"Claude Code and Codex are essentially AGI at this point"

Okaaaaaaay....
