Hacker News

docheinestages today at 5:05 PM

We really need a "memory arena" to serve two important purposes:

1. List all the known agent memory projects (of which there are hundreds)
2. Objectively compare and score them against each other and against vanilla harnesses like Claude Code

Only then can I have the cognitive capacity to decide which one makes sense for me.


Replies

oleksiibond today at 6:08 PM

Agreed, and point number two is the tricky one. Creating a list of tasks is easy; evaluating the results is not. You need a consistent task set, a "clean slate" control (i.e., Claude Code without memory), and evaluation criteria that distinguish "uses fewer tokens" from "produces better results." Otherwise you end up with vendors evaluating their own work.

I'm currently building a repeatable test harness for PMB: a fixed task, run with and without memory, repeated N times, reporting tokens, turns, and pass/fail, plus a subjective quality score. Happy to share the task set and evaluation criteria so they can be run against anyone else's memory server or clean-slate control, not just mine.
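For anyone who wants to try the same shape of comparison, here is a rough sketch of that loop. It is not the PMB harness. `RunResult`, `run_trials`, and `fake_agent` are hypothetical names, and the agent is assumed to be any callable that takes a task and a memory on/off flag. The subjective quality score is left out because it needs a human or a separate grader.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict


@dataclass
class RunResult:
    tokens: int   # total tokens consumed by the run
    turns: int    # number of agent turns
    passed: bool  # did the task's check succeed?


# Hypothetical agent interface: (task, use_memory) -> RunResult
AgentFn = Callable[[str, bool], RunResult]


def run_trials(agent: AgentFn, task: str, n: int) -> Dict[str, Dict[str, float]]:
    """Run the same task n times with and without memory and summarise each arm."""
    summary: Dict[str, Dict[str, float]] = {}
    for label, use_memory in (("baseline", False), ("memory", True)):
        runs = [agent(task, use_memory) for _ in range(n)]
        summary[label] = {
            # Cost and outcome are reported separately, so "uses fewer tokens"
            # is never mistaken for "produces better results".
            "pass_rate": mean(1.0 if r.passed else 0.0 for r in runs),
            "mean_tokens": mean(r.tokens for r in runs),
            "mean_turns": mean(r.turns for r in runs),
        }
    return summary


def fake_agent(task: str, use_memory: bool) -> RunResult:
    # Stub standing in for a real agent call; the numbers are made up.
    return RunResult(
        tokens=800 if use_memory else 1000,
        turns=4 if use_memory else 5,
        passed=True,
    )


if __name__ == "__main__":
    print(run_trials(fake_agent, "fix the failing test", n=3))
```

Swapping `fake_agent` for a wrapper around a real harness is the only part that changes per memory server; the task set and the summary stay fixed, which is what makes the arms comparable.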

cyanydeez today at 5:11 PM

Every time I see these memory agents, all I can think about is context bloat and poisoning. We know from a different realm that humans have trouble with memories: to "remember" something of significance, the human brain reconstructs the entire experience, which is why human memories are so easy to influence.

That seems to be what most of these systems are doing: amplifying errors and hallucinations more than anything else.
