Recently built something where simple domain-specific heuristics crushed a fancy ML approach I assumed would win. This has me thinking about how often we reach for complex tools when simpler ones would work better. Occam's razor moments.
Anyone have similar stories? Curious about cases where knowing your domain beat throwing compute at the problem.
Several times I have rewritten overly multithreaded (and intermittently buggy) processes as single-threaded versions, cutting LoC to roughly 1/20th and binary size to 1/10th, while also gaining a severalfold speedup, reducing memory usage, and entirely eliminating many bugs.
I worked at an influencer marketing company a while ago, back when Instagram still allowed pretty much complete access through its API. Since we were indexing the entire Instagram universe for our internal tooling, we had a graph-traversal setup to crawl Instagram profiles, then each of their followers, and so on. We needed to keep track of visited profiles to avoid loops, and had an Apache Storm cluster for the entire scraping pipeline. It worked, but it was cumbersome to work with and monitor, and couldn't reach our desired throughput.
Given there were about a billion IG profiles total at the time, I just replaced the entire setup with a single Go script that iterated from 1 to a billion and tried to scrape every id in between. That gave us 10k requests per second on a single machine, which was more than enough.
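The shape of that replacement, sketched in Python (the real version was a Go script; `fetch_profile` here is a stub standing in for the actual HTTP call, error handling, and rate limiting):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_profile(profile_id):
    # Stub for the real HTTP request. Here we pretend every third
    # id doesn't exist, just so the sketch has something to skip.
    if profile_id % 3 == 0:
        return None
    return {"id": profile_id}

def sweep(start, stop, workers=32):
    # Enumerate every id in [start, stop) instead of traversing the
    # follower graph: no visited-set, no queue, no cluster.
    found = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for profile in pool.map(fetch_profile, range(start, stop)):
            if profile is not None:
                found.append(profile)
    return found

profiles = sweep(1, 100)
```

The whole "keep track of visited profiles" problem disappears because the id space itself is the work queue.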
When I was on Google Docs, I watched the Google Forms team build a sophisticated ML model that attempted to detect when people were using it for nefarious purposes.
It underperformed banning the word "password" from a Google Form.
So that's what they went with.
Still happens all the time in certain finance tasks (e.g., trying to predict stock prices), but I'm not sure how long that will hold. As for why that might be, I don't think I can do any better than linking to this comment about a comment about your question: <https://news.ycombinator.com/item?id=45306256>.
I suspect that locating the referenced comment would require a semantic search system that incorporates "fancy models with complex decision boundaries". A human applying simple heuristics could use that system to find the comment.
In the "Dictionary of Heuristic" chapter, Polya's "How to Solve It" says this: *The feeling that harmonious simple order cannot be deceitful guides the discoverer both in the mathematical and in the other sciences, and is expressed by the Latin saying simplex sigillum veri (simplicity is the seal of truth).*
Aside from https://news.ycombinator.com/item?id=46665611, way back in my engineering classes in university we had this design project... I'm not sure I've ever told the story publicly before and it brings a smile to remember it more than 20 years later.
My group (and some others) had to design a device to transport an egg from one side of a very simple "obstacle course" to the other, with the aid of beacons (to indicate the egg location and target, each along opposite ends) and light sensors. There was basically a single obstacle: a barrier running most of the way across the middle. The field was fairly small, I think 4 metres long by 3 metres wide.
The other teams followed tutorials, created beacons that emitted high-frequency light pulses and circuitry to filter out 60Hz ambient light and detect the pulse; various robots (I think at least one repurposed a remote-control car) and feedback control to steer them toward the beacons, etc. There were a few different microcontrollers on offer to us for this task, and groups generally had three people: someone responsible for the mechanical parts, someone doing circuitry, and someone doing assembly programming.
My group was just the two of us.
I designed extenders for the central barrier, a carriage to straddle the barrier, and a see-saw the length of the field. The machine would find the egg, scoop it into one end, tilt the see-saw (the other person's innovation: by releasing a stop allowing the counterweighted far side to fall), find the target and release the scoop on the other end. Our light sensors were pointed directly at the ceiling (the source of the "noise"), and put through a simple RC circuit to see that light as more or less constant. Our "beacons" were pieces of construction paper used to block the light physically. All controlled by a 3-bit finite state machine implemented directly in TTL/CMOS (I forget which).
And it worked in testing (praise for my partner; I would never have gotten the mechanics robust enough), but on presentation day the real barrier (made sloppily out of wood) was noticeably wider than specified and the carriage didn't fit on it.
As I recall, in later years the obstacle course was made considerably more complex, ruling out solutions like mine entirely. (There were other projects to choose from, for my year and later years, that as far as I know didn't require modification.)
I’m mostly a hardware engineer.
I needed to test pumping water through a special tube, but didn’t have access to a pump. I spent days searching how to rig a pump to this thing.
Then I remembered I could just hang a bucket of water up high to generate enough head pressure. Free instant solution!
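For anyone curious, the pressure you get from an elevated bucket is just hydrostatic head, ρgh. A quick sketch (the 2 m height is an assumed figure, not from the story):

```python
# Hydrostatic head pressure: P = rho * g * h
RHO_WATER = 1000.0  # density of water, kg/m^3
G = 9.81            # gravitational acceleration, m/s^2

def head_pressure_pa(height_m):
    """Gauge pressure at the bottom of a water column height_m tall."""
    return RHO_WATER * G * height_m

# A bucket hung 2 m above the tube inlet:
p = head_pressure_pa(2.0)   # ~19.6 kPa
p_psi = p / 6894.757        # ~2.8 psi
```

A couple of metres of height buys you a few psi, which is plenty for gravity-feeding a small tube.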
It was a very long time ago, but during a programming competition one of the warm-up questions was something to do with a modified sudoku puzzle. The naive algorithmic solution was too slow, the fancy algorithm took quite a bit of effort... and then there were people who realised that the threshold for max points was higher than you needed for a brute force check of all possible boards. (I wasn't one of them)
This generalises to a few situations where going faster just doesn't matter. For example, for many CLI tools it matters whether they finish in 1 s or 10 s. But once you get to 10 ms vs 100 ms, you can ask "is anyone ever likely to run this in a loop on a massive amount of data?" And if the answer is yes, "should they write their own optimised version then?"
For me, CP-SAT is the "dumb" solution that works in a lot of situations. Whenever a hackathon has a problem definable as constraints, that tends to be the first path I take, and it generally scores top 5.
I once modeled user journeys on a website using fancy ML models that honored sequence information, i.e., the order of page visits, only to be beaten by a bag-of-words decision tree model (i.e., each page URL becomes a vector dimension, but order is lost), which was supposed to be my baseline.
What I had overlooked was that journeys on that particular website were fairly constrained by design, i.e., if you landed on the home page, did a bunch of stuff, put product X in the cart - there was pretty much one sequence of pages (or in the worst case, a small handful) that you'd traverse for the journey. Which means the bag-of-words (BoW) representation was more or less as expressive as the sequence model; certain pages showing up in the BoW vector corresponded to a single sequence (mostly). But the DT could learn faster with less data.
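A minimal sketch of the representation (the page URLs are invented): the two orderings below collapse to the same vector, which is exactly the sequence information the DT gave up. Harmlessly, in this case.

```python
def bag_of_words(journey, vocab):
    # Each page URL becomes one dimension; visit order is discarded.
    vec = [0] * len(vocab)
    for page in journey:
        vec[vocab[page]] += 1
    return vec

pages = ["/home", "/product/x", "/cart", "/checkout"]
vocab = {p: i for i, p in enumerate(pages)}

# Different orders, identical vectors:
a = bag_of_words(["/home", "/product/x", "/cart"], vocab)
b = bag_of_words(["/product/x", "/home", "/cart"], vocab)
```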
I was once on a project where we couldn't use third-party libs. We needed a substring search, but the needle could be 1 of N letters. My teammate loves SIMD and wanted to write a SIMD solution. I took a look at all of our data: most strings were < 2 kB, with many being empty or under 40 letters. SIMD would have been overkill. So I wrote a simple dumb for loop checking each letter for the 3 interesting characters (`";\n`).
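Roughly this, sketched in Python (the original was presumably in a compiled language, given the SIMD discussion):

```python
def find_any(haystack, needles=('"', ';', '\n')):
    # Dumb linear scan: return the index of the first occurrence of
    # any needle character, or -1 if none is present.
    needle_set = set(needles)
    for i, ch in enumerate(haystack):
        if ch in needle_set:
            return i
    return -1
```

For inputs that are mostly empty or a few dozen characters long, the per-call overhead dominates anyway, so vectorizing the scan buys nothing.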
I wrote a clone of Battlezone, the old Atari tank game. For the enemy tank “AI” I just used a simple state machine with some basic heuristics.
This gave a great impression of an intelligent adversary with very minimal code and low CPU overhead.
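The pattern looks something like this (the states and transitions here are invented for illustration, not the actual game's logic):

```python
def next_state(state, sees_player, low_health):
    # Tiny state machine for an enemy tank. Each frame, feed in the
    # current state plus a couple of cheap observations.
    if low_health:
        return "flee"
    if state == "patrol" and sees_player:
        return "attack"
    if state == "attack" and not sees_player:
        return "search"
    if state == "search":
        return "attack" if sees_player else "patrol"
    return state
```

A handful of states plus line-of-sight checks reads as "intelligent" to the player, at a fraction of the cost of any real planning.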
I have a silly little internal website I use for bookmarks, searching internal tools, and some little utilities. I keep getting pressure to put it into our heavy, bespoke enterprise CI/CD process. I’ve seen people quit over trying to onboard onto this thing… more than one. It’s complete overkill for my silly little site.
My “dumb” solution is a little Ansible job that just runs a git pull on the server. It gets the new code and I’m done. The job also has an option to set everything up, so if the server is wiped out for some reason I can be back up and running in a couple minutes by running the job with a different flag.
- Before ML, try linear or polynomial regression
- Buying a bigger server is almost always better than a distributed system
- A few lines of bash can often wipe out hundreds of lines of Python
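On the first point: ordinary least squares fits in a few lines of plain Python and makes a perfectly good baseline before reaching for anything heavier. For example:

```python
def linfit(xs, ys):
    # Ordinary least squares for y = a*x + b, no ML stack required.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    return a, my - a * mx

slope, intercept = linfit([1, 2, 3, 4], [3, 5, 7, 9])  # fits y = 2x + 1
```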
I remember Scalyr, at least before they were bought by SentinelOne, basically did a parallel / SIMD grep for each search query, and consistently beat tools like Splunk and ElasticSearch that continually indexed the data.
Great question, I could answer with many stories but here are two:
The (deliberately) very limited analytics software I wrote for my personal website[0] could have used a database, but I didn't want to add a dependency to what was a very simple project, so I hacked up an in-memory data structure that periodically dumps itself to disk as a JSON file. This gives persistence across reboots, and at a pinch I can just edit the file with a text editor.
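A minimal sketch of the pattern (the class and method names are mine, not from the actual site; periodic flushing, e.g. on a timer, is omitted):

```python
import json
import os

class TinyStore:
    # In-memory dict that persists itself to a human-editable JSON file.
    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def set(self, key, value):
        self.data[key] = value

    def flush(self):
        # Write to a temp file, then atomically rename it into place,
        # so a crash mid-write can't corrupt the store.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.data, f, indent=2)
        os.replace(tmp, self.path)
```

The write-then-rename in `flush` is the one bit of care the approach needs; everything else is just a dict.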
Game design is filled with "stupid" ideas that work well. I wrote a text-based game[1] that includes Trek-style starship combat. I played around with a bunch of different ideas for enemy AI before just reverting to a simple action drawn off the top of a small deck. It's a very easy system to balance and expand, and just as fun for the player.
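The deck idea, sketched with invented action names: balancing is literally just editing card counts in a list.

```python
import random

def make_deck(rng=random):
    # Small deck of enemy actions; weights are card counts, so
    # balancing means adding or removing entries from this list.
    deck = ["fire", "fire", "evade", "close_in", "repair"]
    rng.shuffle(deck)
    return deck

def next_action(deck, rng=random):
    # Draw off the top; reshuffle a fresh deck when it runs out.
    if not deck:
        deck.extend(make_deck(rng))
    return deck.pop()
```

A deck also gives you something pure random sampling doesn't: the enemy can't repeat the same action forever, because each draw depletes the supply.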
The common one I fought long ago was folks who always use regular expressions when what they want is a string match, or contains, or other string library function.
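A small illustration in Python of why this matters: regex metacharacters quietly change the question being asked.

```python
import re

# What you usually want: a plain substring test.
assert "ERROR" in "ERROR: disk full"

# A regex treats punctuation as metacharacters, so it can "match"
# strings that don't contain the literal text at all:
assert re.search("a+b", "aaab") is not None  # '+' means "one or more a's"
assert "a+b" not in "aaab"                   # the literal "a+b" isn't there
```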
I occasionally see people complaining about long TypeScript compile times where a small code base can take multiple minutes (possibly 10 minutes). I think to myself WTF, because large code bases should take no more than 20 seconds on ancient hardware.
On another note, I recently wrote a large single-page app that is just a collection of functions organized by page section, according to a nearly flat TypeScript interface. It’s stupid simple to follow in the code and loads in as little as an eighth of a second. Of course that didn’t stop HN users from crying like children over my avoiding their favorite framework.
I've seen people get tripped up by DynamoDB-like stores, especially when they have a misleading SQL interface like Azure Tables.
You can't be "agile" with them; you need to design your data storage upfront. Like a system design interview :).
Just use Postgres (or friends) until you are webscale. Unless you really have a problem amenable to key/value storage.
I recently wrote a command-line full-text search engine [1]. I needed to implement an inverted index. I chose what seems like the "dumb" solution at first glance: a trie (prefix tree).
There are "smarter" solutions like radix tries, hash tables, or even skip lists, but for any design choice, you also have to examine the tradeoffs. A goal of my project is to make the code simpler to understand and less of a black box, so a simpler data structure made sense, especially since other design choices would not have been all that much faster or use that much less memory for this application.
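For reference, the core of a trie-based inverted index is only a few lines. A sketch in Python (details will differ from the actual project):

```python
class TrieNode:
    __slots__ = ("children", "postings")
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.postings = set()  # ids of documents containing this term

class Trie:
    # Plain prefix tree mapping terms to document ids: the "dumb"
    # but transparent choice for an inverted index.
    def __init__(self):
        self.root = TrieNode()

    def add(self, term, doc_id):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.postings.add(doc_id)

    def lookup(self, term):
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return set()
        return node.postings
```

Every operation is a plain loop over characters, which is what makes the structure easy to read, debug, and extend (prefix queries fall out almost for free).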
I guess the moral of the story is to just examine all your options during the design stage. Machine learning solutions are just that, another tool in the toolbox. If another simpler and often cheaper solution gets the job done without all of that fuss, you should consider using it, especially if it ends up being more reliable.
[1] https://github.com/atrettel/wosp