I had the pleasure of Sandi Metz coming to a company I worked for and giving us a “boot camp” of sorts on all of the engineering principles she espouses, and it had a profound impact on how I view software development. Whatever the company paid for her to come, it was worth every penny.
“Prefer duplication over the wrong abstraction”
I'm a fan of the Go proverb "a little copying is better than a little dependency"[1] and also the "rule of three"[2] when designing a shared dependency.
I think the JS developers could take a lesson from the Go proverb. I often write something from scratch to avoid a dependency because of the overhead of maintaining dependencies (or dealing with dependencies that cease to be maintained). If I only need a half dozen lines of code, I'm not going to import a dependency with a couple hundred lines of code, including lots of features I don't need.
The "rule of three" helps avoid premature abstractions. Put the code directly in your project instead of in a library the first time. The second time, copy what you need. And the third time, figure out what's common between all the uses, and then build the abstraction that fits all the projects. The avoids over-optimizing on a single use case and refactoring/deprecating APIs that are already in use.
[1]: https://go-proverbs.github.io/ [2]: https://en.wikipedia.org/wiki/Rule_of_three_(computer_progra...
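To make the proverb concrete, here's a minimal sketch (the helper and all names are invented for illustration): if all you need is a handful of lines, copying them into the project beats adopting and tracking a general-purpose utility dependency.

    // contains.go - a tiny helper owned by the project rather than pulled in
    // from a utility library. (Purely illustrative; on Go 1.21+ the standard
    // library's slices.Contains covers this.)
    package main

    import "fmt"

    // containsString reports whether needle appears in haystack.
    // Half a dozen lines we maintain ourselves, versus a dependency to track.
    func containsString(haystack []string, needle string) bool {
        for _, s := range haystack {
            if s == needle {
                return true
            }
        }
        return false
    }

    func main() {
        envs := []string{"dev", "staging", "prod"}
        fmt.Println(containsString(envs, "prod")) // true
    }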
In terms of code & data, I would say that duplication is mostly upside because the cost of refactoring is negligible. If all call sites are truly duplicate usages, then normalizing them must be trivial. Otherwise, you are dealing with something that seems like duplication but is not. The nuance of things being so similar that we would prefer they be the same (but unfortunately they cannot be) is where we will find some of the shittiest decisions in software engineering. We should not be in a rush to turn the problem domain into a crystalline structure. The focus should be about making our customer happy and keeping our options open.
That said, I have found other areas of tech where duplication is very costly. If you are doing something like building a game, avoiding use of abstractions like prefabs and scriptable objects will turn into a monster situation. Failure to identify ways to manage common kinds of complexity across the project will result in rapid failure. I think this is what actually makes game dev so hard. You have to come up with some concrete domain model & tool chain pretty quickly that is also reasonably normalized or no one can collaborate effectively. The art quality will fall through the basement level if a designer has to touch the same NPC in 100+ places every time they iterate a concept.
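For a rough idea of what that shared domain model buys, here's a toy Go sketch (not how any particular engine does it; every name here is invented): the designer edits one archetype record instead of a hundred spawn points.

    package main

    import "fmt"

    // NPCArchetype is the single definition a designer tunes; loosely
    // analogous in spirit to a prefab or ScriptableObject.
    type NPCArchetype struct {
        Name      string
        MaxHealth int
        Speed     float64
    }

    // One place to change the goblin; the alternative is these numbers
    // copy-pasted into every level file that spawns one.
    var archetypes = map[string]NPCArchetype{
        "goblin": {Name: "Goblin", MaxHealth: 30, Speed: 2.5},
    }

    // SpawnPoint references an archetype by ID and carries no stats itself.
    type SpawnPoint struct {
        ArchetypeID string
        X, Y        float64
    }

    func main() {
        level := []SpawnPoint{
            {ArchetypeID: "goblin", X: 10, Y: 4},
            {ArchetypeID: "goblin", X: 22, Y: 9},
            // ...many more; none need touching when the goblin is rebalanced.
        }
        for _, sp := range level {
            a := archetypes[sp.ArchetypeID]
            fmt.Printf("spawn %s (hp=%d) at (%.0f, %.0f)\n", a.Name, a.MaxHealth, sp.X, sp.Y)
        }
    }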
Classic blog post in a similar vein: https://sandimetz.com/blog/2016/1/20/the-wrong-abstraction
When a colleague told my father that "duplication is always bad" he grabbed a random memo from that colleague's desk and said "I bet there's at least 3 copies of this piece of paper in this building". That drove the point home alright.
If you have only one copy of the code then you only have to fix a bug in one place instead of a dozen, so there are significant cost savings. But there is a problem: when you make a bug fix you have to test all the different places the code is used; if you don't, you could be breaking something while fixing something else. If you have comprehensive automated tests then you can safely keep just one copy of the code: if you introduce a bug while fixing a bug, the automated tests will catch it.
If you don't have comprehensive test automation then you have to consider whether you can manually test all the places it is used. If the code is used in multiple products at your company, and you aren't even familiar with some of those products, then you can't manually test everywhere it is used. Under such circumstances it may be preferable for each team to keep its own copy of some code. Not ideal, but practical.
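A minimal sketch of that trade-off in Go (the function and tests are hypothetical): one shared copy of the logic, plus a table-driven test that encodes what each caller expects, so a fix made for one caller that breaks another fails immediately.

    // pricing.go
    package pricing

    // ApplyDiscount is the single shared copy of this logic; several teams
    // call it rather than keeping their own versions. (Hypothetical example.)
    func ApplyDiscount(cents int64, percent float64) int64 {
        if percent < 0 || percent > 100 {
            return cents
        }
        return cents - int64(float64(cents)*percent/100.0)
    }

    // pricing_test.go
    package pricing

    import "testing"

    // One table-driven test recording the expectations of the different callers.
    func TestApplyDiscount(t *testing.T) {
        cases := []struct {
            name    string
            cents   int64
            percent float64
            want    int64
        }{
            {"checkout: 10% off", 1000, 10, 900},
            {"invoicing: no discount", 1000, 0, 1000},
            {"admin tool: out-of-range percent ignored", 1000, 150, 1000},
        }
        for _, c := range cases {
            if got := ApplyDiscount(c.cents, c.percent); got != c.want {
                t.Errorf("%s: got %d, want %d", c.name, got, c.want)
            }
        }
    }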
"Don't Repeat Yourself" is a great rule of thumb which, at least in writing Terraform configuration, became absolute tyranny. Strange webs of highly coupled code with layers of modules, all in an effort to be DRY - almost everywhere I've seen Terraform.
Trying to explain why a little duplication is preferable to bad abstractions, and specifically preferable to tightly coupling two unrelated systems together because they happened to sort-of interact with the same resource, was endless and tiring and - ultimately - often futile.
I had a lengthy argument about this in our architecture forum. I argued that "re-use" shouldn't be included as an Enterprise (keyword here) Architecture principle because there are clear use-cases where duplication is preferable to the alternatives, e.g. decoupling deployment and testing, etc. I got a lot of resistance, and eventually we just ended up with an EA principle with a ton of needless caveats.
It's unfortunate that so many people end up parroting fanciful ideas without fully appreciating the different contexts around software development.
I have to agree, it's much easier to remove and consolidate duplicative work than unwind a poor abstraction that is embedded everywhere.
And I think it's easy to see why small companies lean on duplication: it's too easy to screw up abstractions without more engineering heads involved to get them right.
I don't understand why so many engineers have such a strong tendency to dedupe code.
Data, which is more important than code imho, is duplicated constantly. Why can't code have some duplication?
Requires Medium account to read. Sorry, not going for it.
Sometimes, duplication is a small price to pay for isolation
I've been writing up a similar piece for my own personal blog (as much to collect my own thoughts as anything) that touches on this idea, particularly as it applies to shared code/modules, re-usable components in general, and any kind of templater or builder-type tool: the costs of over-eager abstraction, sharing and re-use, and when (if ever) pivoting to them gives a net positive result.
As it's only a draft piece at the moment I'll lay out some of the talking points:
- All software design and structure decisions have trade-offs (there is no value without some kind of cost; we're really shifting what the cost is, or where it falls, to a place we find acceptable)
- 'Don't Repeat Yourself' as a principle taught as good engineering practice, and why you should think about repeating yourself anyway; don't take social-proof or appeal-to-authority arguments on trust without solid experience behind them
- There is a difference between things that are actually the same, or that should be the same for consistency (such as domain facts and knowledge), versus things that happen to be the same at the time of creation but are only that way by coincidence
- Effective change almost always (if not always) comes from actual, specific use-cases; a reusable component not derived from such cases cannot reflect them
- Re-usable components themselves are not necessarily deployed or actually used, so by definition can't drive their own growth
- If they are deployed, it's N+1 things to maintain, and if you can't maintain N how are you going to maintain N+1?
- The costs of creation and ongoing maintenance - quite simply there's a cost to doing it and doing it well, and if it costs more to develop than the value gained then it's a net loss
- Components/modules that live alongside their use cases get tested naturally and stay grounded in specific needs; extracting them takes away that organic source of use cases
- What happens when we re-use components to allow easy upgrades but then pin their versions for stability? You still have to update N places; the best-case scenario is that the work to do so is minimised for each of the N
- Creation of an abstraction without enough variety of uses, in terms of both location and kind of use (a single use-case is essentially a layer that adds no value)
- Inherent contradictions in software design principles - you're taught to 'avoid coupling', but any shared component is by definition coupled. The value of duplication is that it supports independent growth or change (see the sketch after this list)
- The cost of service templates and/or builders (simple templated text or entire builder-type tools that need to be maintained and used just to bootstrap something) - these almost never work for you after creation to support updates
- The cost of fast up-front creation (if you're doing this a lot, maybe you have a different problem) over supporting long-term maintenance
- The value of friction - some friction that makes you question whether a 'New thing' is even needed is arguably good as a screening/design decision analysis step; having to do work to make shared things should help to identify if it's worth doing as the costs of that should be apparent; this frames friction as a way of avoiding doing things that look easy or cost-free but aren't in the long term
- As a project lives longer, any fixed up-front creation time diminishes to a minuscule fraction of the overall time spent
- Continuous, long-term drift detection (and update assistance) is more powerful and useful than a one-off up-front bootstrap time saving, for any project with a long enough lifetime
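On the coupling point above, a toy Go sketch (all names invented) of how a shared component ties its consumers' changes together, while a small private copy lets them diverge independently:

    // shared/format.go - used by both the billing and the support service.
    // Any change to it ripples into every importer at once.
    package shared

    // FormatCustomerName renders "Last, First" for every consumer.
    func FormatCustomerName(first, last string) string {
        return last + ", " + first
    }

    // support/format.go - if the support service later wants "First Last"
    // while billing must keep "Last, First", the shared function either
    // grows a mode flag (coupling the two teams' changes together) or the
    // support service takes back a small private copy like this one, which
    // it can change without asking anyone.
    package support

    func formatCustomerName(first, last string) string {
        return first + " " + last
    }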
“The author made this story available to Medium members only.”
Duplication isn't always bad. It's often rational. I wrote an academic paper explaining why, and offering a solution:
https://invisible.college/toomim/toomim-linked-editing.pdf
> Abstractions can be costly, and it is often in a programmer’s best interest to leave code duplicated instead. Specifically, we have identified the following general costs of abstraction that lead programmers to duplicate code (supported by a literature survey, programmer interviews, and our own analysis). These costs apply to any abstraction mechanism based on named, parameterized definitions and uses, regardless of the language.
> 1. *Too much work to create.* In order to create a new programming abstraction from duplicated code, the programmer has to analyze the clones’ similarities and differences, research their uses in the context of the program, and design a name and sequence of named parameters that account for present and future instantiations and represent a meaningful “design concept” in the system. This research and reasoning is thought-intensive and time-consuming.
> 2. *Too much overhead after creation.* Each new programming abstraction adds textual and cognitive overhead: the abstraction’s interface must be declared, maintained, and kept consistent, and the program logic (now decoupled) must be traced through additional interfaces and locations to be understood and managed. In a case study, Balazinska et. al reported that the removal of clones from the JDK source code actually increased its overall size [4].
> 3. *Too hard to change.* It is hard to modify the structure of highly-abstracted code. Doing so requires changing abstraction definitions and all of their uses, and often necessitates re-ordering inheritance hierarchies and other restructuring, requiring a new round of testing to ensure correctness. Programmers may duplicate code instead of restructuring existing abstractions, or in order to reduce the risk of restructuring in the future.
> 4. *Too hard to understand.* Some instances of duplicated code are particularly difficult to abstract cleanly, e.g. because they have a complex set of differences to parameterize or do not represent a clear design concept in the system. Furthermore, abstractions themselves are cognitively difficult. To quote Green & Blackwell: “Thinking in abstract terms is difficult: it comes late in children, it comes late to adults as they learn a new domain of knowledge, and it comes late within any given discipline.” [20]
> 5. *Impossible to express.* A language might not support direct abstraction of some types of clones: for instance those differing only by types (float vs. double) or keywords (if vs. while) in Java. Or, organizational issues may prevent refactoring: the code may be fragile, “frozen”, private, performance-critical, affect a standardized interface, or introduce illegal binary couplings between modules [41].
> Programmers are stuck between a rock and hard place. Traditional abstractions can be too costly, causing rational programmers to duplicate code instead—but such code is viscous and prone to inconsistencies. Programmers need a flexible, lightweight tool to complement their other options.
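A toy Go illustration of costs 1 and 2 from the excerpt (an invented example, not taken from the paper): abstracting two near-clones means designing a name and parameter order and maintaining an extra interface, and the file does not necessarily get any shorter.

    package validate

    import "fmt"

    // Two near-clones, as they might look before any refactoring:
    func validateUsernameV1(s string) error {
        if len(s) < 3 || len(s) > 32 {
            return fmt.Errorf("username must be 3-32 characters")
        }
        return nil
    }

    func validateProjectNameV1(s string) error {
        if len(s) < 3 || len(s) > 64 {
            return fmt.Errorf("project name must be 3-64 characters")
        }
        return nil
    }

    // The deduplicated version: someone had to pick a name and a parameter
    // order, the extra interface must be declared and kept consistent, and
    // every reader now follows one more hop.
    func validateLength(field, s string, minLen, maxLen int) error {
        if len(s) < minLen || len(s) > maxLen {
            return fmt.Errorf("%s must be %d-%d characters", field, minLen, maxLen)
        }
        return nil
    }

    func validateUsername(s string) error    { return validateLength("username", s, 3, 32) }
    func validateProjectName(s string) error { return validateLength("project name", s, 3, 64) }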
I've had coworkers in the past who treat code like it needs to be compressed. Like, in the Huffman coding sense. Find code that exists in two places, put it in one place, then call it from the original places. Repeat until there's no more duplication.
It results in a brittle nightmare because you can no longer change any of it: the responsibility of the refactored functions is simply "whatever the original code was doing before it was de-duplicated", and they don't represent anything logical.
Then, if two places that had "duplicated" code before the refactoring need to start doing different things, the common functions grow new options/parameters to cover the different use cases, until those get so huge that they need to be broken up too, and the process repeats until you have a zillion functions called "process_foo" and "execute_bar", and nothing makes sense any more.
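A caricature of that failure mode in Go (invented names), where the "common" function's signature records the history of its callers rather than any single responsibility:

    package report

    import "strings"

    // What the shared code looked like when it was first extracted:
    //
    //     func renderHeader(title string) string { return strings.ToUpper(title) }
    //
    // And what it looks like after the call sites diverged a few times. Its
    // real responsibility is "whatever the original places happened to do",
    // plus every exception added since.
    func renderHeader(title string, uppercase bool, addDate bool, date string,
        legacyMode bool, separator string) string {
        out := title
        if uppercase && !legacyMode {
            out = strings.ToUpper(out)
        }
        if addDate {
            out = out + separator + date
        }
        return out
    }

    // Each caller now passes a row of flags that only makes sense if you
    // know the history of every other caller.
    func invoiceHeader(date string) string {
        return renderHeader("Invoice", true, true, date, false, " / ")
    }

    func legacyExportHeader() string {
        return renderHeader("Export", false, false, "", true, "")
    }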
I've since become allergic to any sort of refactoring that feels like this kind of compression. All code needs to justify its existence, and it has to have an obvious name. It can't just be "do this common subset of what these 2 other places need to do". It's common sense, obviously, but I still have to explain it to people in code review. The tendency to want to "compress" your code seems to be strong, especially in more junior engineers.