Agreed - AI that could take care of this sort of cross-system complexity and automation in a reliable way would actually be useful. Unfortunately, I've yet to use an AI that can reliably handle even moderately complex text parsing in a single file more easily than if I'd just done it myself from the start.
This reminds me of a paper: "The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators"
https://arxiv.org/abs/2407.11004
In essence, LLMs are quite good at writing the code to properly parse large amounts of unstructured text. That beats what a lot of people seem to be doing, which is just shoveling data into an LLM's API and asking for transformations back.
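To make the contrast concrete, here's a rough sketch of the pattern: instead of an API call per record, you ask the LLM once for a parser and then run that deterministic code over the whole corpus for free. The record format and field names below are hypothetical, just the kind of thing an LLM can write in one shot.

```python
import re

# Hypothetical record format: "Name <email> joined YYYY-MM-DD".
# A small deterministic parser like this costs nothing to rerun,
# unlike per-record LLM API calls.
LINE_RE = re.compile(
    r"(?P<name>[^<]+)<(?P<email>[^>]+)>\s*joined\s*(?P<date>\d{4}-\d{2}-\d{2})"
)

def parse_records(lines):
    """Extract (name, email, date) tuples; skip lines that don't match."""
    out = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            out.append((m["name"].strip(), m["email"], m["date"]))
    return out

raw = [
    "Ada Lovelace <ada@example.org> joined 1843-01-01",
    "junk line with no record in it",
    "  Alan Turing <alan@example.org> joined 1936-05-28 ",
]
print(parse_records(raw))
# [('Ada Lovelace', 'ada@example.org', '1843-01-01'),
#  ('Alan Turing', 'alan@example.org', '1936-05-28')]
```

The point isn't this particular regex; it's that the expensive, fuzzy step (writing the parser) happens once, and everything after it is cheap and reproducible.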
Yes. It’s very frustrating. There is a great need for a kind of data pipeline test bench where a single person can iterate through lots of different options and play around with different data manipulations, because it’s not worth fully building a pipeline before you know the approach works. There needs to be a quick-and-dirty counterpart to the Astronomer/Dagster/Apache Airflow/Azure ML tools for trying things out. Maybe I’m just naive and they exist and I’ve had my nose in Jupyter notebooks. But I really feel hindered these days in my ability to prototype complex data pipelines myself while also considering all of the other parts of the science.
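For prototyping, something far smaller than Airflow often suffices. A minimal sketch of the kind of throwaway harness described above, assuming plain functions as steps so variants can be swapped in and compared side by side (all names here are made up):

```python
# A toy pipeline runner: steps are (name, function) pairs, and every
# intermediate stage is kept so you can inspect where a variant diverges.
def run_pipeline(data, steps):
    stages = {"input": data}
    for name, fn in steps:
        data = fn(data)
        stages[name] = data
    return data, stages

# Two candidate cleaning steps to experiment with.
lowercase = ("lowercase", lambda rows: [r.lower() for r in rows])
dedupe = ("dedupe", lambda rows: sorted(set(rows)))

rows = ["Foo", "bar", "foo", "BAR"]
result, stages = run_pipeline(rows, [lowercase, dedupe])
print(result)           # ['bar', 'foo']
print(stages["lowercase"])  # ['foo', 'bar', 'foo', 'bar']
```

Swapping the step list is the whole iteration loop; once a combination works, it can be ported to a real orchestrator.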