These are not decisions that should be taken solely by whoever is programming the backend.
They need to be surfaced to the product owner to decide. There may very well be reasons pieces of data should not be stored. And all of this adds complexity, more things to go wrong.
If the product owner wants to start tracking every change and by who, that can completely change your database requirements.
So have that conversation properly. Then decide it's either not worth it and don't add any of these "extra" fields you "might" need, or decide it is and fully spec it out and how much additional time and effort it will be to do it as a proper feature. But don't do it as some half-built just-in-case "favor" to a future programmer who may very well have to rip it out.
On a personal project, do whatever you want. But on something professional, this stuff needs to be specced out and accounted for. This isn't a programming decision, it's a product decision.
While I like the YAGRI principle very much, I find that adding
- updated_at
- deleted_at (soft deletes)
- created_by etc
- permission used during CRUD
to every table is a solution weaker than having a separate audit log table.
I feel that mixing audit fields with transactional data in the same table is a violation of the separation of concerns principle.
In the proposed solution, updated_at captures only the last change, a problem that a separate audit log table does not have.
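To make the comparison concrete, here is a minimal sketch of a separate audit table kept in sync by a trigger, using SQLite through Python's sqlite3 module. The account and account_audit tables and their columns are made up for illustration. The acting user is left out because the database alone does not know who the application user is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL
);

-- Hypothetical audit table: one row per change, kept apart from the data.
CREATE TABLE account_audit (
    audit_id   INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL,
    action     TEXT NOT NULL,
    old_email  TEXT,
    new_email  TEXT,
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Every UPDATE on account appends a row to the audit table.
CREATE TRIGGER account_audit_update AFTER UPDATE ON account
BEGIN
    INSERT INTO account_audit (account_id, action, old_email, new_email)
    VALUES (OLD.id, 'update', OLD.email, NEW.email);
END;
""")

conn.execute("INSERT INTO account (id, email) VALUES (1, 'a@example.com')")
conn.execute("UPDATE account SET email = 'b@example.com' WHERE id = 1")
conn.execute("UPDATE account SET email = 'c@example.com' WHERE id = 1")

# Both changes survive, not just the most recent one.
history = conn.execute(
    "SELECT old_email, new_email FROM account_audit "
    "WHERE account_id = 1 ORDER BY audit_id"
).fetchall()
```

An updated_at column on account would only remember the second update; the audit table keeps both.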
*_at and *_by fields in SQL are just denormalization + pruning patterns consolidated, right?
Do the long walk:
Make the schema fully auditable (one record per edit) and the tables normalized (it will feel weird). Then suffer with it, discover that normalization leads to performance decrease.
Then discover that pruned auditing records is a good middle ground. Just the last edit and by whom is often enough (ominous foreshadowing).
Fail miserably by discovering that a single missing auditing record can cost a lot.
Blame database engines for making you choose. Adopt an experimental database with full auditing history. Maybe do incremental backups. Maybe both, since you have grown paranoid by now.
Discover that it is not enough again. Find that no silver bullet exists for auditing.
Now you can make a conscious choice about it. Then you won't need acronyms to remember stuff!
Additionally, mutable fields will quite often benefit from having a separate edit table which records the old value, the new value, who changed it, and when. Your main table’s created and updated times can be a function of (or a complement to) the edit table.
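A rough sketch of such an edit table, again with SQLite via Python's sqlite3. The article and article_edit names, the set_title helper, and the timestamps are assumptions for illustration only. The "updated" time and actor are read back from the newest edit row instead of being stored on the main table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT NOT NULL);

-- Hypothetical edit table: old value, new value, who, and when.
CREATE TABLE article_edit (
    edit_id    INTEGER PRIMARY KEY,
    article_id INTEGER NOT NULL REFERENCES article(id),
    field      TEXT NOT NULL,
    old_value  TEXT,
    new_value  TEXT,
    edited_by  TEXT NOT NULL,
    edited_at  TEXT NOT NULL
);
""")


def set_title(article_id, new_title, actor, when):
    """Change the title and record the edit in the same transaction."""
    with conn:  # commits on success, rolls back on error
        (old_title,) = conn.execute(
            "SELECT title FROM article WHERE id = ?", (article_id,)
        ).fetchone()
        conn.execute(
            "UPDATE article SET title = ? WHERE id = ?", (new_title, article_id)
        )
        conn.execute(
            "INSERT INTO article_edit "
            "(article_id, field, old_value, new_value, edited_by, edited_at) "
            "VALUES (?, 'title', ?, ?, ?, ?)",
            (article_id, old_title, new_title, actor, when),
        )


conn.execute("INSERT INTO article (id, title) VALUES (1, 'Draft')")
set_title(1, "Second draft", "alice", "2024-01-01T10:00:00")
set_title(1, "Final", "bob", "2024-01-02T09:30:00")

# "updated_at" and "updated_by" as a function of the edit table.
updated_at, updated_by = conn.execute(
    "SELECT edited_at, edited_by FROM article_edit "
    "WHERE article_id = 1 ORDER BY edit_id DESC LIMIT 1"
).fetchone()
```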
It is tempting to supernormalize everything into the relations object(id, type) and edit(time, actor_id, object_id, key, value). This is getting dangerously and excitingly close to a graph database implemented in a relational database! Implement one at your peril — what you gain in schemaless freedom you also lose in terms of having the underlying database engine no longer enforcing consistency on your behalf.
One thing I do quite frequently which is related to this (and possibly is a pattern in Rails) is to use times in place of Booleans.
So is_deleted would contain a timestamp to represent the deleted_at time for example. This means you can store more information for a small marginal cost. It helps that Rails will automatically let you use it as a Boolean and will interpret a timestamp as true.
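Outside Rails the same idea can be sketched with a nullable timestamp column and a derived Boolean. This is only an illustration with SQLite and Python's sqlite3; the post table and its rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A nullable timestamp stands in for a Boolean flag: NULL means "not deleted".
conn.execute(
    "CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT NOT NULL, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO post (id, title, deleted_at) VALUES (?, ?, ?)",
    [(1, "kept", None), (2, "removed", "2024-03-01T12:00:00")],
)

# The Boolean is derived on read; the timestamp carries the extra information.
rows = conn.execute(
    "SELECT title, deleted_at IS NOT NULL AS is_deleted, deleted_at "
    "FROM post ORDER BY id"
).fetchall()

# In Python the value is also directly truthy or falsy, much like in Rails.
flags = [bool(deleted_at) for _title, _is_deleted, deleted_at in rows]
```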
A little while back, I had a conversation with a colleague about sorting entries by "updated at" in the user interface, and to my surprise this was not added by the backend team.
Many of these "we are going to need it"s come from experience. For example in the context of data structures (DS), I have made many "mistakes" that I do correctly a second time. These mistakes made writing algorithms for the DS harder, or made the DS have bad performance.
Sadly, it's hard to transfer this underlying breadth of knowledge and intuition for making good tradeoffs. As such, a one-off tip like this is limited in its usefulness.
The perils of UI design wagging the horse.
I like the heuristics described here. However if these things aren't making it into a product spec where appropriate, then I smell some dysfunction that goes beyond what's being stored by default.
Product need (expressed as spec, design, etc) should highlight the failure cases where we would expect fields like these to be surfaced.
I'd hope that any given business shouldn't need someone with production database access on hand to explain why/when/how a 'thing' was deleted. Really we'd want the user (be it 'boss' or someone else) to be able to access that information in a controlled manner.
"What information do we need when something goes wrong?". Ask it. Drill it. Ask it again.
That said, if you can't get those things, this seems a fine way to be pragmatic.
Funny, I've been developing an adaptation layer that implements the functionality that I use in
https://docs.python-arango.com/en/main/
over tables in Postgres that has a PRIMARY _key and a JSONB document field. The issue is that I have a number of prototypes I've developed with arangodb but the license is awful and I don't feel like I can either open source or commercialize any of them until I'm running on an open source database.
It's a fun project because I don't need to support everything in python-arango, in fact I don't even need to support 100% of the features I use because I am free to change my applications. Also it's a chance to make the library that I really want to use so already it has real integer and UUID primary keys.
I just added a feature to have the library manage _created and _updated fields not just because I thought it was good in general but it was a feature I needed for a particular application, a crawler that fetches headlines from the HN API. I want to fetch headlines right away so I can avoid submitting duplicates but I also want accurate counts of how many votes and comments articles got and that involves recrawling again in two weeks. Of course _created and _updated are helpful for that.
Counter point: "Soft Deletion Probably Isn't Worth It" https://brandur.org/soft-deletion
I agree with this as written, and think it's important to have some degree of forethought when building out the DB to plan for future growth and needs.
That said, the monkey's paw of this would be someone reading it and deciding they should capture and save all possible user data, "just in case", which becomes a liability.
Event-sourcing solves this. And with how cheap storage is, it should be more prevalent in the industry. IMO the biggest thing holding it back is that there isn't a framework that's plug-and-play (say like Next.js is to React) that provides people with that ability.
I've been working on one in Typescript (with eventual re-writes in other langs. like Rust and Go), but it's difficult even coming up with conventions.
The problem with updated_at and updated_by is that a given record could experience multiple updates by multiple people at multiple times, and you'd only have visibility into the most recent.
The logical conclusion here is to log the updates (and creations and deletions and undeletions and such) themselves:
CREATE TABLE foo_log (id,
foo_id,
whodunnit,
action,
performed_at,
column_1,
column_2,
-- ...
column_n);
Technically you don't even need the "foo" table anymore, since you can reconstruct its contents by pulling the most recent transaction for a given foo_id and discarding the reconstructed record if the most recent action on it was a deletion. Probably still a good idea to create a view or somesuch for the sake of convenience, but the point of this is that the log itself becomes the record of truth - and while this approach does cost some disk space (due to duplicated data) and read performance (due to the more complex query involved), it's invaluable for tracking down a record's full lifecycle. Even better if you can enforce append-only access to that table.
This is a pretty typical approach for things like bookkeeping and inventory management (though usually those are tracking the deltas between the old and new states, instead of recording the updated states directly as the above example would imply).
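A small sketch of that convenience view, run with SQLite via Python's sqlite3. The column types, the set of action values, and the use of the highest log id as "most recent" are assumptions for illustration; the original schema leaves them open.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foo_log (
    id           INTEGER PRIMARY KEY,
    foo_id       INTEGER NOT NULL,
    whodunnit    TEXT NOT NULL,
    action       TEXT NOT NULL CHECK (action IN ('create', 'update', 'delete')),
    performed_at TEXT NOT NULL,
    column_1     TEXT
);

-- Current state = latest log row per foo_id, unless that row is a deletion.
-- Assumption: a higher log id means a more recent action.
CREATE VIEW foo AS
SELECT l.foo_id,
       l.column_1,
       l.whodunnit    AS last_changed_by,
       l.performed_at AS last_changed_at
FROM foo_log AS l
WHERE l.id = (SELECT MAX(newer.id) FROM foo_log AS newer
              WHERE newer.foo_id = l.foo_id)
  AND l.action <> 'delete';
""")

conn.executemany(
    "INSERT INTO foo_log (foo_id, whodunnit, action, performed_at, column_1) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, "alice", "create", "2024-01-01", "v1"),
        (1, "bob", "update", "2024-01-02", "v2"),
        (2, "alice", "create", "2024-01-03", "x"),
        (2, "carol", "delete", "2024-01-04", "x"),
    ],
)

# Record 1 shows its latest state; record 2 is gone from the view
# but its whole lifecycle is still in foo_log.
current = conn.execute(
    "SELECT foo_id, column_1, last_changed_by FROM foo ORDER BY foo_id"
).fetchall()
lifecycle_of_2 = conn.execute(
    "SELECT whodunnit, action FROM foo_log WHERE foo_id = 2 ORDER BY id"
).fetchall()
```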
We did exactly this when designing StatelyDB. We realized there are a set of metadata fields[0] that almost everyone needs when modeling data. Our approach takes this a step further in that we always track these fields transparently so if you forget to add them to your schema initially for any reason, that's okay, you can always add them later!
[0] https://docs.stately.cloud/schema/fields/#metadata-fields
Yes! Why something happened is incredibly important. Gitlab made this mistake hard. We have a medium sized instance with some complex CI pipelines and often they'll just get cancelled and it doesn't say why or even who by. And anyone can do it! The only option is to ask the entire company "did anyone cancel this?"
Just curious, how do people feel about this general style of soft deletes currently? Do people still use these in production or prefer to just delete fully or alternatively move deleted rows to a separate table / schema?
I find the complexity still awkward enough that it makes me wonder if deleted_at is worth it. Maybe there are better patterns out there to make this cleaner like triggers to prevent deletion, something else?
As for the article, I couldn't agree more on having timestamps / user ids on all actions. I'd even suggest updated_by to add to the list.
As an acronym, it's easy to misremember as "You AREN'T gonna read it" (based on the popularity of YAGNI), and have the opposite advice spread.
It's not really YAGNI if you need it to debug, is it?
> But I've never heard someone complain about a table having too many timestamps.
I do. Each one is 8 bytes. At the billions of rows scale, that adds up. Disk is cheap, but not free; more importantly, memory is not cheap at all.
Agree, although the acronym in the article could be interpreted to mean “you are going to read it, so index it appropriately”, which is sort of bad advice and can lead to overindexing. There is probably something better for “add appropriate and conventional metadata” (the author suggests updated_at, created_at etc)
Not a huge fan of the example of soft delete, I think hard deletes with archive tables (no foreign key enforcement) are a much, much better pattern. Takes away from the main point of the article a bit, but glad the author hinted at deleted_at only being used for soft deletes.
Five years ago everybody would laugh at "soft deletes" or "marked as deleted". Whoever thought this was a good idea from a data protection perspective? You are also lying to your users' faces with such behavior. Shame.
It's a terrible post. What it suggests is to turn your head off and follow an overly generalised principle. I guess when somebody invents yet another acronym it is a red flag.
Data has its own life cycles in every area it passes through. And it's part of requirements gathering to find those cycles: the dependent systems, the teams, and the questions you need to answer. Mindlessly adding fields won't save you in every situation.
Bonus point: when you start collecting questions while designing your service, you'll discover how mature your colleagues' thinking is.
Author is very kind! In practice, many times I saw only the CR/CRU of CRUD getting implemented.
For example: as a company aspires to launch its product, one of the first features implemented in any system is to add a new user. But when the day comes when a customer leaves, suddenly you discover no one implemented off-boarding and cleanup of any sort.
I don't like general advice like this, because it's too general. For many, it's probably good advice. For others, not so much.
Anyone who has worked at a small company selling to large B2B SaaS can attest we get like 20 hits a day on a busy day. Most of that is done by one person in one company, who is probably also the only person from said company you've ever talked to.
From that lens, this is all overkill. It's not bad advice, it's just that it will get quoted for scenarios where it doesn't apply. Which also applies to K8S, or microservices at large even, and most 'do as I say' tech blogs.
Shipped and supported enough startup apps to learn this the hard way: users will delete things they shouldn’t, and you will be asked to explain or undo it. Soft deletes and basic metadata (created_at, deleted_by, etc.) have saved me multiple times — not for some future feature, just for basic operational sanity.
For `updated_at` and `deleted_at` making them nullable and null until touched is incredibly useful.
Answering queries like how many of these were never updated? Or how many of these were never cancelled?
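For instance, a sketch with SQLite via Python's sqlite3 (the subscription table and its rows are invented): COUNT(column) skips NULLs, so "never touched" is just the difference from COUNT(*).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscription ("
    " id INTEGER PRIMARY KEY,"
    " created_at TEXT NOT NULL,"
    " updated_at TEXT,"      # NULL until the row is first updated
    " cancelled_at TEXT)"    # NULL until the subscription is cancelled
)
conn.executemany(
    "INSERT INTO subscription (id, created_at, updated_at, cancelled_at) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "2024-01-01", None, None),
        (2, "2024-01-02", "2024-02-01", None),
        (3, "2024-01-03", "2024-02-05", "2024-03-01"),
    ],
)

# COUNT(col) counts only non-NULL values, so the difference is "never touched".
never_updated, never_cancelled = conn.execute(
    "SELECT COUNT(*) - COUNT(updated_at), COUNT(*) - COUNT(cancelled_at) "
    "FROM subscription"
).fetchone()
```

If updated_at were instead initialised to created_at, the first question could only be answered by comparing the two columns.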
Related from a few years ago: PAGNIs, for Probably Are Gonna Need Its - https://simonwillison.net/2021/Jul/1/pagnis/
Literally just have a good audit log and then you get all of this for free and more.
OT As the great Alone ("Last Psychiatrist") said, "if you read it, it's for you" - IYRIIFY
curious that both YAGNI and YAGRI arguments could realistically be made for the same fields. guess it boils down to whether someone’s YAGRI is stronger than their colleague’s YAGNI ( :
How do you distinguish from "you aren't gonna read it"? The acronym is poorly designed.
Why can't databases just remember stuff we delete, like a trash can?
This is good advice except for deleted_at. Soft deletion is rarely smart. Deleted things just accumulate and every time you query that table is a new opportunity to forget to omit deleted things. Query performance suffers a lot. It's just a needless complexity.
Instead, just for the tables where you want to support soft delete, copy the data somewhere else. Make a table like `deleteds (tablename text not null, data jsonb not null default '{}')` that you can stuff a serialized copy of the rows you delete from other tables (but just the ones you think you want to support soft delete on).
The theory here is: You don't actually want soft delete, you are just being paranoid and you will never go undelete anything. If you actually do want to undelete stuff, you'll end up building a whole feature around it to expose that to the user anyway so that is when you need to actually think through building the feature. In the meantime you can sleep at night, safe in the knowledge that the data you will never go look at anyway is safe in some table that doesn't cause increased runtime cost and development complexity.
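A minimal sketch of that archive-on-delete idea, using SQLite via Python's sqlite3. SQLite has no jsonb type, so the data column is plain TEXT holding JSON here; the customer table and the delete_customer helper are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be turned into dicts by column name
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Same shape as the deleteds table above, with TEXT standing in for jsonb.
CREATE TABLE deleteds (tablename TEXT NOT NULL, data TEXT NOT NULL DEFAULT '{}');
""")


def delete_customer(customer_id):
    """Copy the row into deleteds, then hard-delete it, in one transaction."""
    with conn:
        row = conn.execute(
            "SELECT * FROM customer WHERE id = ?", (customer_id,)
        ).fetchone()
        if row is None:
            return False
        conn.execute(
            "INSERT INTO deleteds (tablename, data) VALUES (?, ?)",
            ("customer", json.dumps(dict(row))),
        )
        conn.execute("DELETE FROM customer WHERE id = ?", (customer_id,))
    return True


conn.execute("INSERT INTO customer (id, name) VALUES (7, 'Ada')")
deleted = delete_customer(7)

remaining = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
archived = json.loads(conn.execute("SELECT data FROM deleteds").fetchone()[0])
```

Queries against customer never have to filter out deleted rows, and the archived copy is there if anyone ever does go looking.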
YAGRI proponents organized themselves into a community to develop their...YAGRIculture.
I'll show myself out.
I have a different way of thinking about this: data loss. If you are throwing away data about who performed a delete it is a data loss situation. You should think about whether that’s OK. It probably isn’t.
I don't get why all of the big RDBMSes (PostgreSQL, MariaDB/MySQL, SQL Server, Oracle, ...) don't seem to have built in support for soft deletes up front and center?
Where the regular DELETE wouldn't get rid of the data for real but rather you could query the deleted records as well, probably have timestamps for everything as a built in low level feature, vs having to handle this with a bunch of ORMs and having to remember to put AND deleted_at IS NULL in all of your custom views.
If we like to talk about in-database processing so much, why don't we just put the actual common features in the DB, so that toggling them on or off doesn't take a bunch of code changes in app, or that you'd even be able to add soft deletes to any legacy app that knows nothing of the concept, on a per table basis or whatever.
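Something close to this can be approximated today with a view plus an INSTEAD OF trigger, at least in SQLite. The sketch below uses Python's sqlite3; the note and note_rows names are invented, and a real setup would also need INSTEAD OF INSERT and UPDATE triggers so the application can write through the view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The real storage, including soft-deleted rows.
CREATE TABLE note_rows (
    id         INTEGER PRIMARY KEY,
    body       TEXT NOT NULL,
    deleted_at TEXT
);

-- What the application sees: only live rows, no deleted_at filter needed.
CREATE VIEW note AS
SELECT id, body FROM note_rows WHERE deleted_at IS NULL;

-- A plain DELETE against the view becomes a soft delete underneath.
CREATE TRIGGER note_soft_delete INSTEAD OF DELETE ON note
BEGIN
    UPDATE note_rows SET deleted_at = CURRENT_TIMESTAMP WHERE id = OLD.id;
END;
""")

conn.executemany(
    "INSERT INTO note_rows (id, body) VALUES (?, ?)",
    [(1, "keep me"), (2, "delete me")],
)

# The "legacy" statement knows nothing about soft deletes.
conn.execute("DELETE FROM note WHERE id = 2")

visible = conn.execute("SELECT id, body FROM note ORDER BY id").fetchall()
retained = conn.execute(
    "SELECT COUNT(*) FROM note_rows WHERE deleted_at IS NOT NULL"
).fetchone()[0]
```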