The other day I clicked one of the Copilot prefilled questions in Outlook's calendar: "Who are the main attendees of this meeting?" It started a long, winding speech that went nowhere, so I typed "but WHO are the attendees" and it finally admitted, "I don't know, I can't see that."
It's so easy to ship completely broken AI features because you can't really unit test them, and unit tests have been the main standard for whether code works for a long time now.
The most successful AI companies (OpenAI, Anthropic, Cursor) are all dogfooding their products as far as I can tell, and I don't really see any other reliable way to make sure the AI feature you ship actually works.
Kudos to you. I never use new buttons out of fear of something irreversible happening, like sending a random email or deleting something. I still feel uncomfortable with the Gmail UX; I would _never_ use a "hello iz magic ai" button.
I've fooled around with some vibe coding on several LLMs, like Claude, Gemini, and ChatGPT, with some pretty decent results.
Since I have a full Copilot license at my corporate day gig, I figured I would try using Copilot for a basic static site. Nothing too hard, and something the other LLMs have handled easily.
The prompt was pretty basic, just to get something to start working with: "Build a four page template. With a home or index page, two pages of content and a contact page with a responsive slide out menu from the left hand side of the page."
It ran and put everything in a folder. I opened the home page and everything was broken. I opened the files in VS Code and saw this:
<ul class="drawer__list">
<li>index.htmlHome</a></li>
<li>services.htmlServices</a></li>
<li><a class="nav-linkeduling</a></li>
<li>contact.htmlContact</a></li>
</ul>
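For reference, the markup it was presumably aiming for looks something like this (my reconstruction of the intent, not anything Copilot produced; the third page name is a guess from the mangled "nav-linkeduling" fragment):

  <!-- reconstructed guess at the intended drawer menu -->
  <ul class="drawer__list">
    <li><a class="nav-link" href="index.html">Home</a></li>
    <li><a class="nav-link" href="services.html">Services</a></li>
    <li><a class="nav-link" href="scheduling.html">Scheduling</a></li>
    <li><a class="nav-link" href="contact.html">Contact</a></li>
  </ul>

The opening <a> tags are missing or truncated, so the href values ended up pasted straight into the link text.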
And then this:

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Home · Acme Web</title>
<meta name="description" content="Accessible, responsive starter template with a slide-out menu."/>
<linkts/css/styles.css
/assets/css/styles.css
</head>
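Likewise, the stylesheet reference was presumably meant to be a single ordinary <link> tag, something like (again, my guess at the intent):

  <!-- reconstructed guess at the intended stylesheet link -->
  <link rel="stylesheet" href="/assets/css/styles.css" />

Instead the tag got chopped in half, with the path dumped again on its own line.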
I mean, if you can't even get this right, I don't have much hope it can do anything more complicated. To say this was pretty sad is an understatement, and it clarified how far Microsoft is behind the other LLMs.

Sounds like Siri, which is unable to control much of anything on the iPhone outside of reading/sending text messages and setting alarms.
Me: Can you access my inbox and Teams messages?
Copilot: Yep!
Me: Please find any items in my inbox or sent items indicating (a) that I have agreed to take on a task or (b) identifying me as the person responsible for a task, removing duplicates and any items that I have unambiguously replied to via email or Teams. Time window is preceding 7 days.
Copilot: [prints a list with, at best, 5% accuracy]
I know some folks have the peculiar idea that search is dead in favor of AI, but if AI can't accurately find information, it is useless. As near as I can tell, Copilot finds 3-4 items (but rarely the SAME 3-4 items across runs) and calls it a day. It just seems like nobody is actually testing any of this stuff. Microsoft is actively destroying its credibility by offering a tool that works as a party trick but is utterly unreliable. I will, therefore, not rely on it.
I asked Microsoft 365 Copilot to create a new Word document for me (since they have hidden the link on office.com) and... it refused to do that.
Edit: Just tried again. It refused to do it. I mean WTF.
Absolutely! There are so many scenarios where they could actually add some value, and they're fulfilling, like, exactly none of those?
Even in Visual Studio Enterprise, their flagship developer product, the GPT integration mostly just destroys code regardless of model output. I truly cannot fathom how any of that made it past even a cursory review, or how the situation has been allowed to persist for over six months. But here we are.
And, again, it's fine with me: I'll just use Claude Code. But if I were a Microsoft VP or above, the lack of execution would sort of, well, concern me? But maybe I'm just focused on the wrong things. I mean, Cloudflare brought down, like, half the Internet twice in the past two weeks, and they're still a tech darling, so possibly incompetence is the new hotness now?