I think maybe you're misunderstanding the issue here. I have loads of data, but I'm unwilling to send it to third parties. That leaves me gathering/generating the training data locally, and none of the models are good/strong enough for that today.
I'd love to "send them to go looking for stuff for you", but local models aren't great at this today, even with beefy hardware. Since that's about my only option, I can't get sessions to use for the fine-tuning in the first place.
Right, that's exactly the situation I'm in too, and "send them to go looking for stuff for you" without it going off the rails is the problem we've been working on.
Basically you need a squad of specialized models working in a mostly-structured way that ends up looking like a crawling or scraping/search operation. I can share the stack of about 5-6 models that's working for us directly if you want; I want to keep the exact stack on the DL for now, but you can check my company's recent GitHub activity to get an idea of it. It's basically a "browser agent" where Gemma or Qwen guide the general navigation/summarization but mostly focus on information extraction and normalization.
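To make the shape of that concrete, here's a minimal sketch of the navigate-then-extract loop. To be clear, this is not our stack: the model calls are stubbed out with trivial rules, and every function name here is illustrative; in practice the two steps would each be backed by a small local model.

```python
# Hypothetical sketch of a "squad of specialized models" crawl loop:
# a navigator model proposes the next action, and an extractor model
# normalizes whatever the page yields into structured records.
import json

def navigator_step(page_text: str, goal: str) -> dict:
    """Stand-in for a small local model (e.g. Gemma/Qwen) that picks
    the next action. Here: a trivial rule-based stub."""
    if goal.lower() in page_text.lower():
        return {"action": "extract"}
    return {"action": "follow_link", "target": "next"}

def extractor_step(page_text: str) -> dict:
    """Stand-in for an extraction/normalization model: pull the page
    content into a fixed schema."""
    return {"schema": "v1", "content": page_text.strip()}

def crawl(pages: list[str], goal: str) -> list[dict]:
    """Drive the navigate -> extract loop over a sequence of pages,
    keeping only pages the navigator decided were worth extracting."""
    records = []
    for text in pages:
        step = navigator_step(text, goal)
        if step["action"] == "extract":
            records.append(extractor_step(text))
    return records

records = crawl(["intro page", "pricing details here"], "pricing")
print(json.dumps(records))
```

The point of the structure is that each model only ever sees a narrow, well-defined task, which is what keeps small local models from going off the rails.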
The other thing I've done, which obviously not everybody is going to want to do, is create emails and browser profiles for the browser agents (since they basically work when I'm not on the computer, but need an identity to navigate the web) and run them on devices that don't have the keys to the kingdom. I also give them my phone number and their own (via an endpoint they can only call me from), so if they run into something they have a way to escalate it, and I can do limited steering while out of the loop. Obviously this is way more work than is reasonable for most people right now, so I'm hoping to show people a proper batteries-included setup for it soon.
Edit: Based on your other comment, I think what you're really looking for most is "personal traces". Right now that's something we're working on with https://github.com/accretional/chromerpc (which uses the lower-level Chrome DevTools Protocol rather than Puppeteer to fully automate web navigation, either through an LLM or prescriptive workflows). It would be very simple to set up automation that takes a screenshot and saves it locally every X minutes, or in response to certain events, and generate traces for yourself that way. That alone provides a pretty strong base for a personal dataset.
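For anyone who wants to try the screenshot idea without waiting on us, here's a rough sketch against raw CDP (the same layer chromerpc sits on). `Page.captureScreenshot` is a real CDP method; the websocket wiring assumes the third-party `websocket-client` package and a Chrome started with `--remote-debugging-port=9222`, and the interval/filename choices are just placeholders.

```python
# Sketch: periodic screenshot "trace recorder" over the Chrome
# DevTools Protocol. Requires Chrome running with
# --remote-debugging-port=9222 and `pip install websocket-client`.
import base64, json, time, urllib.request

def capture_message(msg_id: int) -> str:
    """Build the CDP JSON-RPC request for a PNG screenshot."""
    return json.dumps({"id": msg_id,
                       "method": "Page.captureScreenshot",
                       "params": {"format": "png"}})

def record_traces(interval_s: float, count: int) -> None:
    import websocket  # third-party: websocket-client
    # Ask Chrome's debug endpoint for its open pages, attach to the first.
    targets = json.load(urllib.request.urlopen("http://localhost:9222/json"))
    ws = websocket.create_connection(targets[0]["webSocketDebuggerUrl"])
    for i in range(count):
        ws.send(capture_message(i))
        reply = json.loads(ws.recv())
        # Screenshot data comes back base64-encoded; save it as PNG.
        with open(f"trace_{int(time.time())}.png", "wb") as f:
            f.write(base64.b64decode(reply["result"]["data"]))
        time.sleep(interval_s)

print(capture_message(1))
```

Swap the fixed `time.sleep` loop for event triggers (page loads, URL changes via `Page.frameNavigated`) and you get event-driven traces instead of timed ones.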