hu3 · yesterday at 5:16 PM

Great writeup. Thanks for taking the time to organise and share.

It's tempting to use this in projects that use PHP.

Is it usable with a corpus of, say, 1,000 markdown files of about 3 KB each? And 10,000 files?

Can I also index PHP files so that searches include function and class names? Perhaps comments?

How much RAM and disk space would we be talking about?

And the speed?

My first goal would be to index a PHP project and its documentation so that an LLM agent could perform semantic search using my MCP tool.


Replies

centamiv · yesterday at 5:27 PM

I tested it myself with 1k documents (about 1.5M vectors) and performance is solid (a few milliseconds per search). I haven't run more aggressive benchmarks yet.

Since it only stores the vectors, the actual size of the Markdown document is irrelevant; you just need to handle the embedding and chunking phases carefully (you can use a parser to extract code snippets).
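For the code-indexing part, a rough sketch of what that chunking could look like with PHP's built-in tokenizer, so class and function names (plus their docblocks) end up in the embedded text. The buildChunks() helper and the chunk format are just illustrations here, not part of the library:

```php
<?php
// Hypothetical sketch: pull class/function names and their docblocks out of a
// PHP file so they can be embedded alongside the documentation.
// buildChunks() and the chunk format are invented for illustration.

function buildChunks(string $path): array
{
    $tokens = token_get_all(file_get_contents($path));
    $chunks = [];
    $lastDoc = '';

    foreach ($tokens as $i => $token) {
        if (!is_array($token)) {
            continue;
        }
        [$id, $text] = $token;

        if ($id === T_DOC_COMMENT) {
            $lastDoc = $text;                       // remember the nearest docblock
        } elseif ($id === T_CLASS || $id === T_FUNCTION) {
            // the declared name is the next T_STRING token (skip anonymous ones)
            for ($j = $i + 1; $j < count($tokens); $j++) {
                if ($tokens[$j] === '(' || $tokens[$j] === '{') {
                    break;                          // closure or anonymous class
                }
                if (is_array($tokens[$j]) && $tokens[$j][0] === T_STRING) {
                    $chunks[] = trim($lastDoc . "\n" . $text . ' ' . $tokens[$j][1]);
                    $lastDoc = '';
                    break;
                }
            }
        }
    }

    return $chunks;   // each entry is a small text chunk ready for embedding
}
```

Each chunk can then be sent to whatever embedding model you use and stored next to the markdown chunks.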

RAM isn't an issue because I aim for random data access as much as possible; this avoids saturating PHP's memory, since the language wasn't exactly built for this kind of workload.
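To make the random-access idea concrete, the pattern is roughly this kind of thing: fixed-size float32 records in a flat file, read on demand with fseek()/fread(). The vectors.bin layout, the 384-dimension default, and readVector() are assumptions for illustration, not the library's actual storage format:

```php
<?php
// Hypothetical illustration: vectors stored as fixed-size float32 records in
// a flat binary file, fetched one at a time so the whole index never has to
// live in PHP memory. Layout and names are assumptions, not the real format.

function readVector($handle, int $index, int $dims = 384): array
{
    $recordSize = $dims * 4;                    // 4 bytes per float32
    fseek($handle, $index * $recordSize);       // jump straight to the record
    $raw = fread($handle, $recordSize);
    return array_values(unpack("g{$dims}", $raw)); // little-endian float32
}

$fh = fopen('vectors.bin', 'rb');
$v  = readVector($fh, 42);                      // only this vector is in RAM
fclose($fh);
```

With that approach, memory use stays roughly constant per query regardless of how many documents are indexed; disk size grows with vector count times dimensionality.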

I'm glad you found the article and repo useful! If you use it and run into any problems, feel free to open an issue on GitHub.