The benchmarks aren't great, they're super specific to sem's output: why would I ask Claude how many "entities" were modified by a commit and do I need a tool specifically for this request ? Note that an "entity" is a sem-specific concept...
Thanks for pointing it out. I agree with you here, my testing process was quite specific to sem's output but also would love any suggestion from you of how you would design the whole testing process for this kind of tool?
I can also give my thought process, because I was more interested in figuring out the model's inherent search results and understanding without sem.
Thanks for pointing it out. I agree with you here, my testing process was quite specific to sem's output but also would love any suggestion from you of how you would design the whole testing process for this kind of tool?
I can also give my thought process, because I was more interested in figuring out the model's inherent search results and understanding without sem.