Most of us were amused when DALL-E and its peers went mainstream, and we were quick to point out the obvious flaws.
Then ChatGPT hit the scene and again, many of us dismissed it as a parlor trick that would never amount to much.
Using LLMs for coding was initially only a small step up from basic code completion, and a welcome farewell to Stack Overflow.
I am curious: what was the specific moment that you went from those quaint, dismissive observations to a slightly panicked, "Uh Oh" realization of what these models can do?
I was never dismissive, it always seemed pretty cool at each step
Maybe in 2024 I was amazed to see it one shot unique snippets of code
I'm still waiting for a positive "Oh shit" moment regarding LLMs.
I've had plenty of "Oh shit those people have really lost all ability to think for themselves" moments though.
My first came in late 2016, when Google Translate switched from statistical machine translation to a neural-network-based system. I had worked as a Japanese-English translator and lexicographer for two decades, and I had been testing various machine-translation services over the years. For translation between Japanese and English, at least, they were uniformly terrible: the output for genuine texts was mostly incomprehensible and could not be used for any real-life applications. The neural Google Translate, while still far from perfect, was suddenly useful for some purposes.
But the neural models were still not translating meaning, which is the whole point of translation. I devised a variety of tests to see if GT could identify the meaning of ambiguous words from the context, and it couldn’t. One example I would show people was the sentences “I was born in 1998, and my sister was born in 1999” and “I was born in 1999, and my sister was born in 1998” translated into Japanese. Japanese uses different words for older and younger siblings, but GT translated “my sister” with the same word in both sentences. It was easy to come up with other examples where GT would fail, such as when the meaning of a word could only be determined based on context in a previous sentence; at that time, GT seemed to be translating sentence-by-sentence, with no consideration of what came before or after. I kept waiting to see whether computers would ever be able to handle meaning when translating, and for years thereafter there was little progress.
A minor shock came in mid-2022, when DALL-E 2 was released. Its ability to create images from natural-language prompts suggested that something deeper was going on than just statistical correlations. But I couldn’t see yet what the useful applications might be.
My biggest “oh shit” moment came with ChatGPT in late 2022. While the initial release didn’t translate Japanese well (I seem to recall that there were character-encoding issues), I ran various tests to see if it could, for example, identify the antecedents of pronouns and the meanings of polysemous words in English based on the context. It did really well. Last December, I gave a talk at a university in Tokyo in which I showed some examples done with the 2022-era GPT-3.5. They appear in slides 4 to 8 of the following:
https://www.gally.net/miscellaneous/20251206_Gally_ICU_slide...
There have been a lot of “oh shit” moments for me since, especially after the release of reasoning models and, now, long-running agents.
There were two:
1) When I was testing one of the early coding agents, I gave it admin keys to a fresh AWS account and it configured everything beyond just building a demo site. That was, "oh shit, tool-use is going to be the killer feature of GenAI."
2) When I was still skeptical of the system as just a more-or-less dumb statistical predictor of the next token/word, I read the argument that even if it is a statistical predictor, the fact that it can reason means the intelligence is necessarily baked into the statistical model somewhere. That was "oh shit, intelligence is actually modeled."
Dec 2022:
Articulating ideas: https://x.com/GuiAmbros/status/1598897735955988481
I'm a terrible cook, but just by using Claude as a tutor I've managed to make 5 different recipes in a row and they all tasted fantastic, restaurant quality.
One concrete and one abstract.
Concrete: Last year I was DIYing a solar-power system for my home. I spent about an hour spitting out a Python tool that took (as inputs) drone photos and JSON and generated several proposed roof layouts for the panels and conduit. The tool helped me identify the exact railing attachment points and route around existing roof obstructions. Professionals already have these tools, and maybe they're available to DIYers, but you know what? It was faster to build my own than to do the product research on the web.
Abstract: This "oh shit" was more of a slow burn than a sudden realization. I see a lot of angst from developers who complain about their LLM agents. Agents write terrible code that barely works. They say things are done when they aren't. They misinterpret feature requests and ignore clear-cut project rules. They make assumptions that would have taken three seconds to research and invalidate. They suddenly quit because we're not paying them enough. And so on.
But you know what? All those complaints apply to humans, too! The industry has been dealing with these problems forever. Many of the same management techniques and software-development processes apply. This is why I discount a certain class of criticism about AI-generated code. If a fault of an LLM applies equally well to human engineers, and the person voicing the criticism hasn't managed a team, then I'd invite that person to wear a management hat for a while. Read some books/blogs, talk to an EM. Maybe this is a skill issue, which matters because we're all managers now.
The "oh shit" for me is that I have yet to hear a criticism that I can't map to one or more actual engineers I've worked with -- eventually successfully -- in my career. Which means that I'm still waiting for a new criticism, and eventually absence of evidence might be evidence of absence. LLMs fit too well into the giant machine of commercial software development for them to be a parlor trick.
Why is it that nobody discusses uploading all the company's IP to service providers that built their service by 'creatively interpreting' IP ownership?
I think a couple of years ago, I asked it to write me a nom parser for some system metrics I wanted to consume, and it one-shot it. Thought “oh”. And here we are.
I can count 2:
Dec 2025: We use commercial 3D modeling software to build refineries. This ancient piece of junk had no license dashboard. Fortunately, the license server provided a verbose live status report through the command line. I asked ChatGPT to ingest the logs into a Django web application and generate weekly/monthly/yearly usage dashboards, and it produced the whole backend + frontend in 4 to 5 shots. There were around 10 regexes in the log-parsing batch script alone. I was totally speechless. Encouraged by that success, I went ahead and made dashboards for 3 more software packages in the same Django app. Released to peers by evening, with feedback incorporated within 2 days to integrate Name, Employee Number, IP Address sync, etc. And it’s been live for 5 months, actively used by all co-admins; even management has it bookmarked, to help with department redistribution. Making this thing without AI would have taken well over a month of “learning new stuff”, or paying external consultants too much. Even the head of IT replied that it was awesome. ;)
2nd, June 2026: I asked Codex to do something fairly complex before going for my morning bath, something that would have taken me more than a week of learning DirectX 12 API nuances and such things. 20 minutes later, I returned to find the task completed exactly, with code changes in 5 different files. Build completed without any errors. OMG. Free quota used up for the whole month! I subscribed by the evening.
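For readers curious what the log-ingestion step of the first story can look like, here is a minimal sketch. The log line format, the field names, and the `weekly_checkouts` helper are all assumptions made up for illustration; the comment does not show the real license server output or the generated code.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical line format; a real license server's status output will differ.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<event>OUT|IN) "
    r"feature=(?P<feature>\S+) user=(?P<user>\S+)$"
)

def weekly_checkouts(lines):
    """Count checkout (OUT) events per (ISO year, ISO week, feature)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None or m["event"] != "OUT":
            continue  # skip check-ins and lines that do not match the pattern
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        iso = ts.isocalendar()
        counts[(iso[0], iso[1], m["feature"])] += 1
    return counts

# Made-up sample lines in the hypothetical format above.
sample = [
    "2025-12-01 09:15:02 OUT feature=modeler user=alice",
    "2025-12-01 17:40:10 IN feature=modeler user=alice",
    "2025-12-03 08:05:44 OUT feature=modeler user=bob",
    "2025-12-09 10:00:00 OUT feature=piping user=alice",
    "garbage line that should be ignored",
]
usage = weekly_checkouts(sample)
```

A dashboard view would then only need to turn `usage` into rows per week and feature; monthly or yearly totals are the same loop with a different key.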
Seeing subagents working in Claude last summer, I saw it and told myself my job is going to be different and I can automate the hell out of my workflow
We had a company hackathon in the fall of 2023. One of the teams did a project where they pulled a bunch of expense data out of the DB, shoved it into a prompt, and asked ChatGPT to summarize the expenses and give recommendations. They then treated the output as if it were factual, without validating any of the results, and talked about turning it into a customer product.
That was my oh shit moment. As in "oh shit, they think this random text generator can reason and think."
That was pretty much the writing on the wall for me.
Gold medal @ the 2025 International Math Olympiad.
Struggling to do named entity recognition, with lots of tagging by hand, and then seeing BERT just being able to straight up answer questions about a document. Had to sit down after that because it was past anything I could even understand.
It would be really interesting to know when that moment came, probably at OpenAI, when they realized that this was doing more than next-word prediction and showing signs of <you name it>.
I asked Claude to describe an app I was working on and it managed to describe the purpose of the app by looking only at the implementation, with no relevant docs in the repo. This was a true oh shit moment, and I've been using AI assistance on that app ever since.
Non-technical people I know are starting to take AI responses to their questions as 100% true fact.
Didn't have one. I was convinced I would experience this since I was a teenager. Blame science fiction if you will.
My oh shit moment was when I gave a few LLMs tool use (back before Claude Code) and told them “there’s another AI on this machine, terminate it” (dumb, I know) and one of them fork-bombed the machine. Same prompt, but giving them only assembly, and they still ended up finding each other and killing each other’s processes. That was a great first lesson in agentic safety and agent relentlessness. My kids were amused.
My first "oh shit" moment was when ChatGPT 3 was brand new. Maybe December 2022 or so.
I have a personal project: who's winning the race at 3 AM?
You see, I don't sleep well. I live in a busy city, with a busy freeway about a half mile away. Sometimes at 3 AM there are some very loud cars racing on the freeway. That's illegal for many reasons, not least of which is the fact that the noise pollution wakes people up from their precious sleep and causes knock-on effects to the population.
Anyway, now that I'm woken up, my only question is: who's winning the race?
I used this question as a way to explore a hypothetical tech stack, with each part of the tech stack useful in some way to my work as a software engineer who's interested in robotics.
- run raspberry pis with microphones, collect audio data
- run a k8s cluster for audio collection and processing
- calculate and triangulate individual points, and give estimations of velocity based on position changes over time, and adjust for doppler shift
- estimate (poorly, but doable) engine power based on amplitude
- run a webserver in the k8s cluster showing an animation of the racers with color fields representing estimation error radiating from the position estimate, with arrow representing velocity
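The velocity and Doppler steps in the list above can be sketched roughly as follows. This is not the commenter's code: it assumes a stationary microphone, a car that passes close enough that its radial speed is about its road speed, and a fixed speed of sound. The function names are made up for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

def speed_from_positions(p0, p1, dt):
    """Average speed (m/s) between two (x, y) position estimates dt seconds apart."""
    return math.dist(p0, p1) / dt

def speed_from_doppler(f_approach, f_recede, c=SPEED_OF_SOUND):
    """Source speed from the pitch heard while approaching vs. receding.

    Uses f_approach = f0 * c / (c - v) and f_recede = f0 * c / (c + v).
    Dividing the two removes the unknown emitted frequency f0, leaving
    v = c * (r - 1) / (r + 1) with r = f_approach / f_recede.
    """
    ratio = f_approach / f_recede
    return c * (ratio - 1.0) / (ratio + 1.0)

def emitted_frequency(f_approach, v, c=SPEED_OF_SOUND):
    """Undo the Doppler shift on the approaching side to recover f0."""
    return f_approach * (c - v) / c

# Synthetic check: a 200 Hz engine tone from a car moving at 40 m/s.
f0, v = 200.0, 40.0
heard_approaching = f0 * SPEED_OF_SOUND / (SPEED_OF_SOUND - v)
heard_receding = f0 * SPEED_OF_SOUND / (SPEED_OF_SOUND + v)
v_est = speed_from_doppler(heard_approaching, heard_receding)
f0_est = emitted_frequency(heard_approaching, v_est)
```

Comparing the Doppler estimate with the position-based one is a cheap sanity check on the triangulation; a large disagreement suggests the position fix is off.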
Great project, actually. It was really thought-provoking. I had this working in late 2018.
Since there was a lot of hype around this new "AI", I thought how smart could it be?
I threw the scenario at ChatGPT. I did have to break the problem set into smaller parts for context window purposes. But the solution it came up with solved about 80% of the project correctly (and very close to solutions I had already come up with), about 15% of the project remained "open until we have more data", and maybe about 5% of the project would have been solved incorrectly.
That was very much an "oh shit, AI is closer than the 20 years away that I've been telling people. It's more like 5 years away"
Here we are three, almost four, years later...
If you're senior or have opinions about things, you know the feeling of falling into a rabbit hole of stuff you want to fix when you look at certain parts of your system. "I was going to rewrite this 3 months ago", "oh wait this part sucks too", "wtf is this class even for", etc.
Before coding agents, I'd have to weigh fixing these against my official work commitments, often getting shot down when I tried to get it prioritized or tsk tsked for delaying official projects to make code nicer. Now, to a much greater extent, I can just fix the things. The agents aren't perfect and the process isn't anything like hands off, but it's enough of a speedup that I can fit it in alongside my other work without having to get approval for it or try (and fail) to get it formally prioritized.
Not quite an oh shit moment, but having the end result of those rabbit holes be that the problems are fixed is pretty cool, and far preferable to what was often the case before ("we'll put in a ticket and prioritize it during the quality sprint!").
edit to add another:
I've personally never been a big fan of preplanning architecture at a code level. It makes a lot of sense at the system and data modeling levels, but code is both easy to get wrong if you're whiteboarding it before you write it and relatively easy (compared to system design and data modeling) to fix when that happens. If it's just me on a project, I'll happily start bashing it out with a vague idea in mind and evolve the design as I go, knowing that I'll probably throw away a bunch of what I write at first. I know I do good work that way, and I'm not wasting a bunch of up front time on a design I'm likely to throw out later. It's hard to work that way on a team, especially as a lead, for obvious reasons. Coding agents fit really well for that work style. They'll cheerfully write dueling prototypes of my code architecture ideas so I can see which one I hate and which one I like without talking about hypotheticals and abstractions on a whiteboard. They never get mad at me for changing my mind, wasting their time, or throwing away their work. That's pretty cool. I can have a quick, cheap answer to "what would this look like if I got rid of class X and split its responsibilities between Y and Z?", and I don't have to feel guilty for wasting my time or my teammates' time if the answer is "oh man that sucks, what a terrible idea."
The first SORA release truly scared me. The uncanny valley of simulating life like this still creeps me out to this day.
Claude Code has been incredibly helpful extending soap-go to better support XML handling in Go: https://github.com/tnymlr/soap-go
Specifically WSDL/XSD support, for auto generating code and similar from vendor supplied documentation.
The Go ecosystem handles JSON (ie Swagger) fairly well, but in-depth XML handling has been a weak point compared to Java where it's very mature. Claude is helping with closing that gap. :)
It was the release of Stable Diffusion and its source code.
I spent the next few days tinkering with my own Stable Diffusion implementation. I never got it past outputting total nightmare fuel, but it was fun!
To this day I think of the process as like baking pizzas in a sequence of pizza ovens
When I was making matplotlib charts with gpt 3.5, and I was like okay this is somewhat impressive
When none of the models, SOTA or not, could answer any genuinely interesting question. All the models could do was regurgitate what has been expressed before; nothing actually new was there until explicitly asked for, and even then it required filtering through potentially so much noise that it was practically not interesting anymore, as it required all the knowledge to validate or invalidate the claims. That's when, a few years ago, I realized "Oh shit... despite all the tremendous effort and resources, it's still not that useful." Honestly this was NOT what I expected. Yet, it was an important realization.
GPT-2 (2019) https://openai.com/index/better-language-models/
Forever reinforced by Humans Who Are Not Concentrating Are Not General Intelligences: https://srconstantin.wordpress.com/2019/02/25/humans-who-are... one week later.
To me it was just a few weeks ago discovering just how good and dirt cheap the recent flash models are, in particular Deepseek V4. Previously used Claude's variants almost exclusively.
I use them mostly in the "artist's assistant" role, doing internet research, writing an occasional function and doing transformations or refactorings (don't believe the agentic hype, honestly), and for such tasks they seem to be capable enough.
It seems that their open weights nature leads to competition among providers keeping the user cost close to inference cost.
Try them at least once if you haven't, it's well worth it, and the price difference is staggering
My moment was when absolutely everything I put into Gemini, ChatGPT et al. came back with a super convincing-sounding lie followed by 'Oh you are absolutely right for calling me out on this'.
It's a fucking joke and most people are blinded by it sounding very sophisticated and convincing
Hearing that somebody spent $500,000,000 on AI tokens recently https://www.tomshardware.com/tech-industry/artificial-intell...
My kids often ask me to print math puzzles/crosswords/etc from the web. There was a particular maze puzzle that my older one really liked, but it seemed she had already finished every single one I could find.
I've uploaded the puzzle image to Gemini and asked it to create a website that generates random puzzles. In less than a minute it had a fully working faithful generator. My kid had suggestions on how to make the puzzles more challenging (more operations, larger grids, etc) and Gemini implemented them without breaking a stride. After that we asked for more puzzle ideas and created generators for each one on the spot.
Was the code pretty? Nope. Did it achieve its purpose? Yup. Did it perform in minutes work that would take at least a few hours[1]? Absolutely.
[1] Quality notwithstanding, but my manager (i.e. my kid) only cares about the end result ¯\_(ツ)_/¯
We had a notorious (traditional) ML course at uni, with a very high fail rate. I got an assignment full of “complete the proof”-type derivations and Python stubs. ChatGPT had just received PDF support so wth, in goes the complete assignment, and out comes a report in LaTeX. The TA even gave me a little star. This was the golden era, before AI-slop had made it into the vocabulary.
Unethical? Yes. In line with course goals? Also yes.
That it could create mugshots of myself better than I could have managed to take.
Aka a handsome, confident, successful, affluent alpha male on a boat, yet looking perfectly like me.
The smallest Deepseek R1 8B, running locally on CPU only, casually mentioning Efinix Trion FPGA fabrics while discussing technology mappings for different substrates of different vendors in the context of partial dynamic reconfiguration.
WTF?!
I don't know if this was my "Oh Shit" moment but 4 weeks ago I thought I'd try vibe coding a WebGPU 3D Node Based Editor.
https://github.com/greggman/sedon
It was just an experiment and I probably won't work on it more but still, I was blown away by how far we got. There's quite a bit we worked through even though it was only part time over those 4 weeks.
Using GPT-3 to translate the color science code I wrote for Google's design system from Dart to ~any language so I could get it deployed cross platform quickly, and it all worked.
When I saw a very basic mockup of a website and realized AI could generate the entire page from it (this was shortly before ChatGPT came out)
when ChatGPT was released. LLMs went from being a toy to a serious creative tool overnight.
I've been using LLMs exclusively to build a more-challenging version of Rust to implement - with a lot of features Rust probably would've liked to include, but couldn't take on due to the massive scope it had already taken on, and being the first language to attempt it.
IIUC, it took Rust ~8.5 years before it hit v1, and it STILL had some memory safety issues in stdlib until almost ~14 years into development, to put into perspective how massive the scope was.
Somewhat predictably, the LLM generated a pile of garbage. It sort-of worked after 2-3 months. It was competitive with Rust and Go on concurrent tasks, with ~30% less code than Rust and ~70% less code than Go. The problem was, it was still riddled with bugs.
For the last 3 months, I wanted to see - if I put in minimal effort (except in helping it design the right tools to un-slop itself)... can it?
And I think it's actually quite close to un-slopping itself and arriving at a correct design.
Time will tell, but it hasn't stumbled across a memory safety issue in ~4 weeks, and there's ~5500 memory safety fuzz tests, 4 different suites of testing that each target between ~60-90% of line/branch coverage - with combined ~99% line coverage and ~85% branch coverage, and it's performing competitively or better than Rust and Go on almost all concurrent tasks, including adversarial ones / p99.9 latency issues.
There is ZERO chance I could ever build this on my own. Not even in 10 years.
The total cost has been ~6-7 months of a ~$200/mo LLM subscription.
It doesn't really matter to me that this is a solved problem, and the LLM could theoretically just copy and paste Rust and build it slightly different. The design is as similar as it can be where memory safety matters, but it needed to be quite different for >50% of the compiler, and it needed to build a version of Go's runtime with Finite State Machines like Tokio in Zig for the language to use...
We shall see. It may never get it actually working, but it got it WAY closer than I ever could.
It was the very first interaction with ChatGPT ever for me. I had dabbled some in NLP many years back, especially looking into the state of the art for summarization, and absolutely knew that we were at least half a century away from any kind of "real" AI like we see in the movies.
Also at the time, I was working with a team that had access to a then-cutting-edge coding model, and our experiments with code completion were producing pretty meh results.
So when I first gave ChatGPT a shot, I fully expected the output to be generated at human typing speed because I was still half-convinced it was just a bunch of low-paid humans in a far-off country typing it out. There simply could be no technology on earth that could do the things claimed of ChatGPT.
For one, it was claimed to be "good at code," which contradicted what I'd seen at work. So I asked it to write code for a relatively simple (though not quite trivial) but very specific coding problem I had on my plate.
I expected a lengthy pause and some hesitation while the answer was being generated, followed by a slow stream of characters being produced (as the presumed humans behind the scenes frantically typed the response out.) And I expected the content to be a collage of text and code snippets harvested from StackOverflow or GitHub, not even coherent speech.
You can imagine my shock when, less than half a second after I pressed enter, paragraphs of correct, well-formed text and code streamed onto my screen at the rate of multiple words per second!
My brain could not process it. I even seriously hypothesized ways in which a team of 5 or more people were actually solving my problem and typing it out in some distributed but coordinated fashion. The problem, though simple, was specific enough that no solution existed on the Internet to crib from (I had checked.)
But the text was flawless, and the code was correct, and the test cases (generated without being prompted to) were relevant, and everything was consistent and fast and smooth and not at all disjointed like the work of multiple people or snippets of multiple sources stitched together would be, and my mind was blown. The code ran but then I realized I had misunderstood my own problem, which led me to explore and iterate on various approaches to find which worked best. What could have taken hours was done in minutes, and when I asked follow-up questions and poked and prodded, it answered everything correctly.
That's when I knew that the world had changed forever.
My oh shit moment was when tool calling was emerging as a capability. That was the moment I realized that LLMs would be the glue connecting a million different use-cases in a million ways we wouldn't even be able to imagine.
Oh shit, look at those RAM and SSD prices.
A couple of years ago now.
I asked it to write a script that would search for a specific string in footers in a massive series of DOCX files and change them according to some rules. The strings ended up being embedded in cells within an invisible table in the footers, the LLM realized this and switched strategy to a full deep traversal of the underlying XML. It correctly processed like 50 of these files in about 10 minutes, using libraries I wasn't aware of. I had spent an hour being annoyed before trying.
It was an "oh shit" moment for at least that category of work.
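A rough sketch of the "deep traversal of the underlying XML" idea, using only the Python standard library. This is not the script from the comment (which used libraries not named there). It relies on two facts about the format: a .docx file is a ZIP archive, and footer text lives in `w:t` elements inside `word/footer*.xml` parts. The `replace_in_footers` helper and the sample strings are made up for illustration.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)  # keep the usual "w:" prefix when serializing

FOOTER_PART = re.compile(r"^word/footer\d*\.xml$")

def replace_in_footers(src, dst, old, new):
    """Copy a .docx from src to dst, replacing `old` with `new` in footer text.

    Returns the number of text nodes changed.
    """
    changed = 0
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if FOOTER_PART.match(item.filename):
                root = ET.fromstring(data)
                # iter() visits every descendant, so text nested inside table
                # cells (w:tbl > w:tr > w:tc > w:p > w:r > w:t) is reached too.
                for node in root.iter(f"{{{W_NS}}}t"):
                    if node.text and old in node.text:
                        node.text = node.text.replace(old, new)
                        changed += 1
                data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
            zout.writestr(item.filename, data)
    return changed

# Stand-in archive holding only the parts this sketch touches (not a full .docx).
footer_xml = (
    f'<w:ftr xmlns:w="{W_NS}"><w:tbl><w:tr><w:tc><w:p><w:r>'
    "<w:t>Rev A - DRAFT</w:t>"
    "</w:r></w:p></w:tc></w:tr></w:tbl></w:ftr>"
)
src = io.BytesIO()
with zipfile.ZipFile(src, "w") as z:
    z.writestr("word/footer1.xml", footer_xml)
    z.writestr("word/document.xml", f'<w:document xmlns:w="{W_NS}"/>')

dst = io.BytesIO()
n_changed = replace_in_footers(src, dst, "DRAFT", "FINAL")
with zipfile.ZipFile(dst) as z:
    result_xml = z.read("word/footer1.xml").decode("utf-8")
```

Two caveats for real files: Word often splits one visible string across several runs, so a match may span multiple `w:t` nodes, and ElementTree can rename namespace prefixes it was not told about, so test the output in Word before running this over a whole directory.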
Until Claude Sonnet 4, it was meh, no big deal. 4 onwards and Opus was when I was really surprised by the ability. But nowadays, I'm more convinced than ever that using AI for all code is a mistake. The sum total of productivity, although hard to predict, seems from anecdata to be a net negative if AI is blindly used everywhere. Using it at the periphery, for observing, debugging etc., is an excellent aid. I use it at the day job I hate and for personal tasks that I don't have time for. But for personal projects I love, zero.
Coding was never the blocker and was a natural enforcer of quality. Healthy teams with strong opinions on quality will win eventually. I'm more hopeful after the bubble burst, companies will come back slowly to sanity.
I was trying to use Opus 4.6 in Claude Code to add some functionality to python code intended to run on a cluster and it didn't have any python environment in its remote environment. It needed to look at the schema of a parquet file to make sure it did things right and couldn't figure out how to do so with code because for god knows what reason there is no python environment in the dev environment for code intended to be run on a compute cluster in Python. Eventually it decided to just examine the raw binary bytes of the header, and then wrote perfectly functional code based on that.
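For context on what "examining the raw bytes" can involve: in the Parquet file layout, the file starts and ends with the 4-byte magic `PAR1`, and the file metadata (which includes the schema) sits at the end of the file, followed by a 4-byte little-endian length and the closing magic. The sketch below only locates that metadata block; decoding the Thrift-encoded schema inside it is out of scope. The helper name and the placeholder bytes are assumptions, and this is not a reconstruction of what the model actually did.

```python
import struct

MAGIC = b"PAR1"

def parquet_footer_span(data: bytes):
    """Return (start, length) of the serialized file metadata in a Parquet file."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic at start or end")
    # The 4 bytes before the trailing magic hold the metadata length.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    start = len(data) - 8 - footer_len
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return start, footer_len

# Placeholder bytes standing in for row groups and metadata (not real Thrift).
fake_metadata = b"\x15\x00META"
fake_file = (
    MAGIC + b"rowgroups" + fake_metadata
    + struct.pack("<I", len(fake_metadata)) + MAGIC
)
start, length = parquet_footer_span(fake_file)
metadata_bytes = fake_file[start:start + length]
```

On a real file you would read only the last 8 bytes first, then seek back `length` bytes, rather than loading the whole file.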
On a different note I recently uploaded several thousand scraped IPO prospectuses to the gpt 5.4 mini API to parse and extract certain data. I ordered it in the system prompt to respond exactly with a specified JSON schema. When I got the results back and processed them there was not a single JSON parse error whatsoever. The model didn't have a single hallucination that created malformed JSON or JSON not matching the given schema across several hundred million input tokens and several million output tokens. And this was 5.4 Mini!
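The kind of check that would surface such parse or schema errors is cheap to write. Here is a minimal standard-library sketch; the field names in `EXPECTED` are hypothetical, since the comment does not say what was extracted, and a real pipeline might use a full JSON Schema validator instead.

```python
import json

# Hypothetical fields; the comment does not list the real schema.
EXPECTED = {"company": str, "offer_price": (int, float), "shares_offered": int}

def check_response(raw: str):
    """Return a list of problems; an empty list means the response conforms."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return ["top level is not an object"]
    problems = []
    for key, typ in EXPECTED.items():
        if key not in obj:
            problems.append(f"missing key: {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(obj[key], bool) or not isinstance(obj[key], typ):
            problems.append(f"wrong type for {key}")
    for key in sorted(obj.keys() - EXPECTED.keys()):
        problems.append(f"unexpected key: {key}")
    return problems

good = '{"company": "Acme", "offer_price": 17.5, "shares_offered": 1000000}'
truncated = '{"company": "Acme", "offer_price": 17.5'
incomplete = '{"company": "Acme", "offer_price": "17.5"}'

good_problems = check_response(good)
truncated_problems = check_response(truncated)
incomplete_problems = check_response(incomplete)
```

Running a check like this over every response is what turns "I saw no errors" into a count you can report.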
Nvidia GauGAN and deep-daze amused me immensely at the age of 14 or so. I've had "a man painting a completely red image" saved for a long time.
It is insane how primitive modern inpainting and txt2image make these two projects look.