logoalt Hacker News

gregw2today at 2:01 PM2 repliesview on HN

You articulate your case well, thank you!

I always warn people (particularly junior people) though that blindly dropping duplicates is a dangerous habit because it helps you and others in your organization ignore the causes of bad data quickly without getting them fixed at the source. Over time, that breeds a lot of complexity and inefficiency. And it can easily mask flaws in one's own logic or understanding of the data and its properties.


Replies

michaelbartontoday at 6:21 PM

Exactly. It’s not that getting rid of duplicates is bad, is that they may be a symptom of something worse. E.g. incorrect aggregation logic

DangitBobbytoday at 2:09 PM

When I'm in pandas (or was, I don't use it anymore) I'm always downstream of some weird data process that ultimately exported to a CSV from a team that I know has very lax standards for data wrangling, or it is just not their core competency. I agree that duplicates are a smell but they happen often in the use-cases that I'm specifically reaching to pandas for.