Posting this under a burner so I don't dox myself: I work in FinTech on a regulated product. We have access to Mythos. Mythos identified part of our codebase that it confidently asserted was not complaint with a particular regulation and we were at grave risk by allowing it to operate the way it was.
Except this was not the case, it had of course hallucinated what the regulation actually required (I know this because the code in question had already been reviewed by human counsel). This is (supposedly) the most bleeding-edge model available.
We use a lot of genAI to help us write code, but there is no way in the mid-term we could ever rely on these tools to actually build compliant financial products. We'd have to be totally mad. Yes, lots of Fintech companies are using these agents to accelerate, but anyone who's using them to actually ship product without a human actually digging into it is opening themselves up to a world of risk.
It was my impression that a whole lot of products are only pretending to be compliant, and that it's much more profitable to operate like that.
3 years max. Maybe 5 if you are lucky.The models will continue to improve. The exponential gains in compute efficiency that have been ongoing for 70+ years will continue and that will result in even smarter models. There are dramatic hardware changes in the pipeline.
But really that particular issue could have been solved by literally just telling it in a markdown file or instructions something like "verify all facts or compliance requirements with web search and include citations in responses".
IMHO even if we are using auditing tools I believe we must use deterministic tools for critical analysis like this. Such rule and pattern based systems may not scale beyond certain point but they can be accurate.
The dynamic of agent codes human reviews does seem like the only sane one for the foreseeable future. Even Anthropic themselves still fall back to this.
The problem is that sucks, even if all software engineers keep their jobs and salaries, the floor is still pulled out from under us. Imagine if a surgeons job was to supervise robot surgeons from a remote computer, or a woodworker just signs off on work before the machines do all the cutting and assembly. Sure they still have important jobs in their field but the soul & humanity of their skill is gone.
I've worked on projects in the airline and health industry which are highly regulated too. The regulations can be incredibly difficult to process and implement, and make sure you adhere to everything correctly. I've been involved in multiple scenarios where people have made false assertions about compliance or lack of. I'd still place a bet that the SOA models make _far_ less mistakes than humans.
In some sense, you should still act on this, since if an external auditor relies on the same stack, it'll still cause you headaches.
I use Opus 4.8 and GPT 5.5 and haven't suffered from hallucinations in months. But we also put a lot of effort into our harness.
> it had of course hallucinated what the regulation actually required
Did it do the correct job once you put the regulations doc(s) in the context?
100%. Unfortunately those not in the depths of mission critical systems or regulated products will continue to believe that producing tons of code quickly using LLMs without humans in these systems is acceptable.
Here's an example of what we will continue to see with folks fully immersed in gen AI psychosis:
"The creator of claude code said that he no longer writes code for about 6 months and now has Claude doing all his work now. He also said recently that he no longer prompts Claude and now has it running in loops and it is self-improving itself and performing better than a human!"
If the code produced by the LLM is perfect, the LLM takes the credit. But when a disaster happens, you cannot blame the LLM and it then falls on the human who did it.
I don't think SWEs heavily vibe-coding with LLMs realize the risk in not understanding what the code the LLM being produced is doing even after generating tests (lol). We will see more of this too. [0]
[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...
False-positive rate is so high with Mythos according to friends and other reporting I have seen.
The original Mythos release used ASan to filter false-positives so it was able to maintain a good FPR, but when Mythos moves into domains that don't have a readily available oracle to help filter hits, the result is a deluge of false bullshit.
Have you added "Make no mistakes" to the proompt? Mythos can't go wrong then, must be a skill issue.
what am i missing?
you take a spec and create tests, every little thing
you use another ai to verify these tests against the spec
you review the tests vs the spec (at one point human review)
you put the tests off limits to change / wall them.
you let the ai write the software that fulfills the tests.
there will be some gaps where you repeat the cycle above
if the tests fulfill the spec, the code will fulfill the spec
Is that all that Mythos did?
Did it find any real potential issue, optimization/simplification opportunities, or sparked any thought-provoking discussion within your organization?
Or was it purely a net negative experience?
Isn't that a net positive though? (not sure about the cost human and tech cost). I'm guessing that without using Mythos, those conversations would never have been had, and confidence in the compliance of the product would've been lower.
I love using AI tools as casinos. It's epic in helping to forge ideas and kickstart thought processes. You basically have the entirety of world knowledge at your fingertips to have a pint with.
I have worked on highly regulated areas in finance (risk). Compliance is a highly creative art, often requiring lots of out-of-the-box thinking and non-obvious solutions. The people I found worst at this were IT. They tend to over-interpret regulation, and super-restrict beyond what is needed for actual de-facto compliance.
My guess is the model makes the same mistakes as the programmers: taking 'rules' literally, unaware of sectoral joint understanding, validated interpretations and habits. (btw. this is often on the non-tech side also a difference between regulatory and legal. The former are much more result oriented while the latter are primarily risk averse.