Hacker News

The Token Compression Illusion: Why I'm Skeptical of RTK

66 points by lackoftactics today at 5:37 PM | 73 comments

Comments

cityofdelusion today at 6:17 PM

I am glad articles like this are finally starting to get some momentum around what I call the LLM magic box industry. From caveman mode to RTK to semantic search and everything in between. Developers have become magicians that cast spells instead of engineers. It sucks at work especially with everyone so sure that their magic spell is the one for ultimate token savings.

My criteria are: if it’s not in a harness it’s probably not that good (the best ideas float up to Codex/Claude imo) and any GitHub advertising some percent of token savings is not to be trusted.

It’s hard to avoid the snake oil and I hope people start thinking critically about this stuff.

lackoftactics today at 6:24 PM

Author of the text here. I will be honest about why I wrote it: rtk ai looks very odd to me as a software engineer, given the number of stars, the lack of any mention of accuracy, and how management is pushing it to optimize costs. Now people are wrapping every possible command in rtk, trying to handle every major command and decide which output you should get.

compuficial today at 6:14 PM

> 1. Gamified Savings vs. Your Actual API Bill

Tool use output represents a large amount of my output. I'll take 3.7M tokens saved on 3.9M tokens of input. Tokens saved are tokens saved.

> 3. Where Are the Accuracy Benchmarks?

As a user of RTK, it would be nice to see accuracy benchmarks. However, I've seen no evidence of the model missing anything critical as a result of the compression. As part of their design philosophy they are very strict about preserving correctness, to the point that if a filter fails they fall back to raw output. For my most frequently used commands I've inspected the source and was happy with what I saw; they've earned my trust thus far.
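One way such an accuracy check could be sketched, in the absence of published benchmarks: count how many "critical" lines of the raw tool output survive compression. This is a minimal illustration, not anything RTK ships. The `CRITICAL` pattern and the function names are hypothetical, and verbatim line retention is only a rough proxy for whether a model still gets what it needs.

```python
import re

# Hypothetical patterns for lines an agent must not lose; tune per tool.
CRITICAL = re.compile(r"\b(error|warning|failed|panicked)\b", re.IGNORECASE)


def critical_lines(text: str) -> list[str]:
    """Return the stripped lines that match a critical pattern."""
    return [ln.strip() for ln in text.splitlines() if CRITICAL.search(ln)]


def retention(raw: str, compressed: str) -> float:
    """Fraction of critical raw lines still present verbatim in the compressed text."""
    wanted = critical_lines(raw)
    if not wanted:
        return 1.0  # nothing critical to lose
    kept = sum(1 for ln in wanted if ln in compressed)
    return kept / len(wanted)
```

Running `retention` over a corpus of captured raw and filtered outputs would give a per-command score to track across tool versions.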

> The day git, cargo, npm, or grep updates its terminal formatting by a few spaces or changes an error layout, RTK's regex and parsing filters will break. And returning to the silent failure trap, it won't throw an explicit error; it will fail quietly, feeding corrupted or partial text to your agent.

Again, any filter that fails simply falls back to the raw output. One of their core pillars is avoiding this exact scenario you described. RTK should never feed corrupted or partial text to an agent.
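The fallback behaviour described above can be sketched as follows. This is an illustration of the general pattern only, not RTK's actual code; `run_filter` and `short_status` are hypothetical names, and the definition of "failure" used here (an exception, or empty output from non-empty input) is an assumption.

```python
from typing import Callable


def run_filter(raw: str, filt: Callable[[str], str]) -> str:
    """Apply filt to raw output; return raw unchanged if the filter fails."""
    try:
        out = filt(raw)
    except Exception:
        return raw  # filter raised: fall back to raw output
    if not out.strip() and raw.strip():
        return raw  # nothing produced from non-empty input: treat as failure
    return out


def short_status(raw: str) -> str:
    """Hypothetical filter: keep only lines that mention a file state."""
    keep = [ln.strip() for ln in raw.splitlines()
            if "modified:" in ln or "new file:" in ln]
    if not keep:
        raise ValueError("unrecognized layout")
    return "\n".join(keep)
```

A wrapper like this only protects against failures the filter can detect. A filter that still parses a changed layout but drops relevant lines would not trigger the fallback, which is the case the article's silent-failure concern is about.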

Your concerns are fair but I'd like to see your criticism backed up with evidence. Have you used RTK? Have you found evidence that they are failing to preserve correctness?

giancarlostoro today at 9:25 PM

I just typed in rtk gain on my Mac. Unfortunately I reimaged my main dev machine due to memory issues that were messing up a few things, but on my Mac I've shaved off roughly 51k input tokens and 23k output tokens, and saved an average of 3 seconds per command. Not sure what the outrage is for or why they cared enough to write this up, really.

Not sure who is piping stack traces through RTK. I only use it for very specific programs, and shoving compiler output through it seems silly, but you can always instruct your agent to only use RTK for a specific set of commands.

cephei today at 9:26 PM

Many of the article's points about maintainability seem to hold, especially around output changing between updates and versions, but it doesn't even offer the simplest alternative. Most of these supported commands have flags to strip out noise and reduce output. Maybe agents aren't well trained on these.
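As a rough illustration of the "use the tool's own flags" alternative, an agent harness could rewrite a few common commands to their quieter built-in forms. The flags below exist in the tools themselves (`git status --short`, `git log --oneline`, `cargo build --quiet`, `npm install --silent`); the mapping, the `quieter` helper, and the choice of which commands to cover are assumptions made for this sketch.

```python
# Quieter built-in variants of common commands (illustrative selection).
QUIET_VARIANTS = {
    ("git", "status"): ["git", "status", "--short"],
    ("git", "log"): ["git", "log", "--oneline"],
    ("cargo", "build"): ["cargo", "build", "--quiet"],
    ("npm", "install"): ["npm", "install", "--silent"],
}


def quieter(argv: list[str]) -> list[str]:
    """Return a lower-noise form of argv if one is known, else argv unchanged."""
    base = QUIET_VARIANTS.get(tuple(argv[:2]))
    if base is None:
        return argv
    return base + argv[2:]  # keep any extra arguments the caller passed
```

For example, `quieter(["git", "log", "-n", "5"])` yields `["git", "log", "--oneline", "-n", "5"]`, while an unknown command such as `["ls", "-la"]` passes through untouched.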

As a side note, has anyone tried a dual agent setup where the command output is proxied through a lightweight local model? I can imagine a scenario where all tool output is filtered through Qwen or similar locally to compact the tool output.

ziyasal today at 8:40 PM

> Mainstream CLIs and developer tools can easily ship a native --compact or --json-stream flag tailored for LLM consumption.

Until they do (and they won't soon), rtk, caveman, ponytail and many others are just trying to address ever-growing costs (for a 2K org, it's around 2.5M, for now). These are trade-offs we all know about and are adjusting to. Contrary to what the author claims, we know the trade-offs well: we are forking these tools, benchmarking them, verifying that the output quality matches our needs, and so on to make it work for us, so we are not adopting them blindly.

For solo devs, yes, they might not really need it; self-hosting another model to save money would be a better option. But for orgs, that's the spicy part.

Yes, it's good that these articles are shedding some light, but just as we do with these tools, let's also take these articles with a grain of salt.

jbellis today at 10:51 PM

I feel bad that I wasted my time reading this.

On the points in the article:

1. Yes, "gain" is a vanity metric but it's harmless, nobody is being "fooled" here.

2. This could be a problem in principle, sure, but unless you're actually vetting bug reports you're just spreading FUD.

3. Again, do you have any reason to believe that the thousands of devs using rtk are silently tanking their performance without noticing? Here's a thought: instead of reporting that SOMEONE SHOULD MEASURE THIS, you could, you know, measure it yourself.

4. Good lord, what is this doing in a purportedly technical article?

5. Yes, this is inherent in the problem domain, again, nobody is being "fooled".

Yes, I'm grumpy; reading this article was a waste of time.

Bias: had my first RTK pr accepted today, so I guess I probably know more about it than this guy who got offended by "gain" and spit out the first thoughts that came to mind.

tlarkworthy today at 6:22 PM

I tried it and it does not compress messages, which were 90% of my context, so it only compresses a small part of my token usage. If you read it carefully you will realize that is exactly what is stated. If you look at /context you will probably see that tool calls are not where you are spending tokens, so a proxy that compresses tool calls will not make much impact, even though it is still true that it compresses tool calls by 8x. It's just not that important for long coding sessions for me.
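The arithmetic behind this point can be sketched directly. Assuming, as an upper bound, that everything outside the 90% of messages is tool output (10% of context) and that it compresses by the quoted 8x, the overall saving is well under a tenth of total tokens. The function name and the assumption that nothing else changes are illustrative.

```python
def overall_saving(tool_fraction: float, compression_ratio: float) -> float:
    """Fraction of total tokens saved when only tool output is compressed."""
    if not 0.0 <= tool_fraction <= 1.0:
        raise ValueError("tool_fraction must be between 0 and 1")
    if compression_ratio < 1.0:
        raise ValueError("compression_ratio must be at least 1")
    # Only the tool-output share shrinks; the rest of the context is untouched.
    return tool_fraction * (1.0 - 1.0 / compression_ratio)


# 10% of context is tool output, compressed 8x.
print(round(overall_saving(0.10, 8), 4))  # 0.0875
```

So an 8x reduction on 10% of the context saves roughly 8.75% overall, which is consistent with both the headline ratio being accurate and the practical impact being small for sessions dominated by messages.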

"native/built-in Read or cat tools, the data is not intercepted by RTK's shell hook"

trjordan today at 8:03 PM

The core of the problem is that there are a million tools that make AI better, and no way to measure whether AI is working better.

Big companies with popular products have it. They do something between normal product analytics and chatbot evals to figure out if users are being successful in their sessions. That's the job.

But any given dev, with between 3 and 50 sessions a day? Like, I have no idea what makes the LLM better. It's all vibes.

My company has a whole stack here. Preferred harnesses, preferred models, skills, the shape of our code, everything. There's gotta be a way to measure whether this setup is working for us, at one-millionth the scale of a Claude Code.

graphememes today at 7:20 PM

I don't disagree with the article, but I also don't disagree with RTK. The output of these commands is not optimized for agents (or humans, for that matter).

arcanemachiner today at 6:12 PM

I've been trying out RTK and it seems kinda alright. I doubt it's saving much, but the quality of the work feels similar.

But if it's making a dent in token usage (which I have not personally measured), then that's great.

I had to add some system prompt instructions to Pi to help it work (GPT 5.5 initially got confused when `git status` looked different than expected). The Claude Code extension appears to do a proper job of informing the agent about the unexpected shape of the output without any extra work on my part.

old_sysadmin today at 6:13 PM

I feel like the state of the art is baked into the compaction logic, and I've had a lot of problems with compaction (absent other prompting) losing key bits of state.

https://github.com/toon-format/toon is another interesting one, and I feel like it takes on a much more achievable goal - reduce whitespace and verbosity of JSON, not overall context compression.

Catloafdev today at 6:21 PM

I don't agree with the conclusion at all. I can see the value of RTK - whether it is buggy or vibe coded is kind of secondary. That basically comes down to how severe and how frequent the bugs are.

There's no gamification of savings here. Tool output can be meaty.

Is the author skeptical of the concept, or the implementation? Because only one of those is worth critiquing.

SubiculumCode today at 5:54 PM

I feel like what is needed is not compression, but aggressive context management with subagents.

blubber today at 6:25 PM

"Where Are the Accuracy Benchmarks?"

I wish the author would have provided one.

iam-TJ today at 6:04 PM

Am I the only one that thought RTK was Real-Time Kinematics used for precision with satellite navigation?

breadislove today at 5:52 PM

slop complaining about other slop
