In this case, you verify whether the knowledge was made up by comparing the virtual waiter's behaviour to the actual waiter's. Having a strong test suite like that is actually the ideal scenario for agentic development.
(It's still incredibly hard to pull off for real, because of complex stateful protocols and edge cases around timing and transfer sizes. Samba did take 12 years to develop, so even with LLM help you'd probably still be looking at several years.)
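
To make the "compare against the actual waiter" idea concrete, here's a minimal sketch of a differential test. Everything in it is hypothetical: `reference_waiter` stands in for the real system, `candidate_waiter` for the reimplementation, and the inputs are toy strings. A real protocol would need stateful sessions, timing, and varied transfer sizes, which this doesn't attempt.

```python
from typing import Callable, Iterable, List, Tuple


def reference_waiter(order: str) -> str:
    """Hypothetical stand-in for the real system (the ground truth)."""
    return order.strip().lower()


def candidate_waiter(order: str) -> str:
    """Hypothetical reimplementation under test.

    It deliberately diverges from the reference on blank input,
    so the comparison has something to catch.
    """
    cleaned = order.strip().lower()
    return cleaned if cleaned else "nothing"


def differential_test(
    reference: Callable[[str], str],
    candidate: Callable[[str], str],
    inputs: Iterable[str],
) -> List[Tuple[str, str, str]]:
    """Return (input, expected, actual) for every input where the two disagree."""
    mismatches = []
    for item in inputs:
        expected = reference(item)
        actual = candidate(item)
        if expected != actual:
            mismatches.append((item, expected, actual))
    return mismatches


if __name__ == "__main__":
    sample_orders = ["  Soup ", "TEA", "   "]
    for mismatch in differential_test(reference_waiter, candidate_waiter, sample_orders):
        print(mismatch)
```

The point is that the agent never has to be trusted about what the correct behaviour is: any disagreement with the reference shows up as a mismatch, and the list of mismatches is the feedback loop.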