Fast regex search: indexing text for agent tools

29 points • by jxmorris12 • last Tuesday at 6:31 AM • 6 comments

Comments

I read this when it came out, having written similar things for searchcode.com (back when it was a spanning code search engine), and while it's interesting I have questions about this claim:

    We routinely see rg invocations that take more than 15 seconds

The only way that works is if you are running it over repos 100-200 gigabytes in size, or they are sitting on a spinning-rust HDD, or it's matching so many lines that printing is the dominant part of the runtime, and it's still over a very large codebase.

Now I totally believe codebases like this exist, but surely they aren't that common? I could understand it if this is for a single customer, though!

Where this does fall down, though, is having to maintain that index. That's actually why, when I was working on my own local code search tool (boyter/cs on GitHub), I also just brute-forced it. No index, no problems, and with desktop CPUs coming out with 200 MB of cache these days it seems increasingly like a winning approach.
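
For anyone unsure what "brute forced it" means in practice, here is a minimal sketch, assuming a plain line-by-line regex scan over every file under a directory. It is an illustration only, not how boyter/cs or ripgrep is implemented; the function name `brute_force_search` is made up for this example.

```python
import os
import re


def brute_force_search(root, pattern):
    """Yield (path, line_number, line) for every matching line under root.

    No index is built or maintained: every file is read on every query.
    """
    regex = re.compile(pattern)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # errors="ignore" lets the scan pass over binary or badly encoded files.
                with open(path, "r", encoding="utf-8", errors="ignore") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        if regex.search(line):
                            yield path, lineno, line.rstrip("\n")
            except OSError:
                # Unreadable file (permissions, broken symlink): skip it.
                continue
```

The trade-off is the one described above: there is nothing to keep in sync when files change, but each query costs a full pass over the tree.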

siva7 • yesterday at 11:03 PM

I don't get grep in agentic settings for natural language queries. You want to optimize for best results with as few tokens/round trips as possible, not for speed.

mpalmer • yesterday at 9:03 PM

> No matter how fast ripgrep can match on the contents of a file, it has one serious limitation: it needs to match on the contents of all files.

The omission of rg's `-g` parameter is unsurprising in one sense, because it would mostly obviate this entire exercise. How often do you need to search what sounds like hundreds of millions of lines of source for a complex pattern, with zero constraints on paths searched?
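
To make the `-g` point concrete, here is a rough sketch of how an agent tool might build a path-constrained rg invocation. The helper names `rg_command` and `run_rg` are hypothetical; the only rg flags used are `-g` (glob filter), `-e` (pattern), and `--line-number`.

```python
import shutil
import subprocess


def rg_command(pattern, globs, root="."):
    """Build an rg argv that only searches paths matching the given globs."""
    cmd = ["rg", "--line-number"]
    for glob in globs:
        # Each -g narrows the set of files rg opens; a leading "!" excludes.
        cmd += ["-g", glob]
    cmd += ["-e", pattern, root]
    return cmd


def run_rg(pattern, globs, root="."):
    """Run rg if it is installed; return matching lines, or None if rg is missing."""
    if shutil.which("rg") is None:
        return None
    proc = subprocess.run(
        rg_command(pattern, globs, root), capture_output=True, text=True
    )
    return proc.stdout.splitlines()
```

For example, `rg_command("TODO", ["*.rs", "!vendor/**"], "src")` limits the search to Rust files outside `vendor/`, so rg never opens the rest of the tree.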

> We routinely see rg invocations that take more than 15 seconds

I'm trying to understand the monorepo that is so large that ripgrep takes 15 seconds to return results, when it's benchmarked as searching for a literal in a 9.3GB file in 600ms, or 1.08s to search for `.*` in the entire Linux repo.

And again, that's without using `-g`.
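
A back-of-the-envelope check, assuming the 9.3GB-in-600ms literal benchmark quoted above carries over to a whole repo (which is optimistic, since that figure is for a single file and a simple literal):

```python
# Figures taken from the benchmark quoted above.
benchmark_bytes = 9.3e9      # 9.3 GB file
benchmark_seconds = 0.6      # 600 ms

# Implied scan throughput in bytes per second.
throughput = benchmark_bytes / benchmark_seconds

# How much data a 15-second run would cover at that rate.
slow_run_seconds = 15.0
implied_gb = throughput * slow_run_seconds / 1e9

print(f"throughput: {throughput / 1e9:.1f} GB/s")
print(f"data scanned in 15 s: {implied_gb:.1f} GB")
```

At that rate a 15-second run corresponds to a couple of hundred gigabytes of text, so either the repos are enormous or something other than raw matching speed (disk, output volume, pattern complexity) is dominating.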


open-paren • yesterday at 9:56 PM

The creator of fff.nvim[0], Dmitriy Kovalenko, had an interesting analysis of this on Xitter[1]. The TL;DR is that Anysphere/Cursor is being somewhat disingenuous: their comparison includes neither the index creation and re-creation time nor the CPU and memory overhead, whereas rg (and his tool, fff.nvim) are indexless.

---

0: https://github.com/dmtrKovalenko/fff.nvim

1: http://x.com/i/article/2036558670528651264


maxbeech • last Tuesday at 6:58 AM

[dead]
