
stego-tech • yesterday at 7:50 PM • 2 replies

Excellent critique of the state of observability, especially for us IT folks. We’re often the first - and last, until the bills come - line of defense for observability in orgs lacking a dedicated team. SNMP Traps get us 99% of the way there with anything operating in a standard way, but OTel/Prometheus/New Relic/etc. all want to get “in on the action,” in a sense, and hoover up as many data points as possible.

Which, sure, if you’re willing to pay for it, I’m happy to let you make your life miserable. But I’m still going to be the Marie Kondo of IT and ask if that specific data point brings you joy. Does having per-second interval data points actually improve response times and diagnostics for your internal tooling, or does it just make you feel big and important while checking off a box somewhere?
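To put a rough number on the per-second question, here is a back-of-the-envelope sketch in Python. The function name `points_per_day` and the 1,000-series figure are made up for illustration; only the arithmetic is certain, and what each point actually costs depends entirely on your vendor or storage.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def points_per_day(interval_seconds: int, series_count: int = 1) -> int:
    """Data points collected per day for `series_count` series
    sampled once every `interval_seconds` seconds."""
    return (SECONDS_PER_DAY // interval_seconds) * series_count


per_second = points_per_day(1)    # 86,400 points per series per day
per_minute = points_per_day(60)   # 1,440 points per series per day

# Per-second collection stores 60x the points of per-minute collection.
print(per_second, per_minute, per_second // per_minute)

# Hypothetical fleet of 1,000 series: 86.4 million vs 1.44 million points a day.
print(points_per_day(1, 1000), points_per_day(60, 1000))
```

Whether that 60x buys faster diagnosis is exactly the "does it bring you joy" question.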

Observability is a lot like imaging or patching: a necessary process to be sure, but do you really need a Cadillac Escalade (New Relic/Datadog/etc) to go to the grocery store when a Honda Accord (self-hosted Grafana + OTel) will do the same job more efficiently for less money?

Honestly, I regret not picking the brain of the Observability person at BigCo when I had the chance. What little he showed me was amazingly efficient and informative: self-hosted Grafana for $90/mo in AWS ECS for the corporate infrastructure of a Fortune 50, with OTel agents consuming 1/3 to 1/2 the resources of New Relic agents. Man, I wish I had jumped down that specific rabbit hole. Observation done right.

Replies

jsight • yesterday at 8:45 PM

>Observability is a lot like imaging or patching: a necessary process to be sure, but do you really need a Cadillac Escalade (New Relic/Datadog/etc) to go to the grocery store when a Honda Accord (self-hosted Grafana + OTel) will do the same job more efficiently for less money?

The way that I've seen it play out is something like this:

  1. We should self-host something like Grafana and OTel.
  2. Oh no, the teams don't want to host individual instances of that, we should centralize it!
    (2b. Optional, but common: a random team gets saddled with this job.)
  3. Oh no, the centralized team is struggling with scaling issues and the service isn't very reliable. We should outsource it for 10x the cost!

This will happen even if they have a really nice set of deployment infrastructure and patterns that could have allowed them to host observability at the team level. It turns out most teams really don't need the Escalade; they just need some basic graphs and alerts.

Self-hosting needs to be more common within organizations.


rbanffy • yesterday at 8:01 PM

> But I’m still going to be the Marie Kondo of IT and ask if that specific data point brings you joy.

There seems to be a strong "instrument everything" culture that, I think, misses the point. You want simple metrics (machine and service) for everything, but if your service gets an error every million requests or so, it might be overkill to trace every request. And, for the errors, you usually get a nice stack dump telling you where everything went wrong (and giving you a good idea of what was wrong).

At that point, and only at that point, I'd say it's worth TEMPORARILY adding increased logging and tracing. And yes, it's OK to add those and redeploy TO PRODUCTION.
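One way to keep "temporarily" honest is to make the extra logging switch itself off. Below is a minimal sketch using Python's standard `logging` module, assuming a Python service; the class name `ExpiringDebugFilter`, the `checkout` logger name, and the 15-minute window are all hypothetical, and this is just one possible approach, not a recommendation of any particular tool.

```python
import logging
import time


class ExpiringDebugFilter(logging.Filter):
    """Let DEBUG records through only until a deadline.

    INFO and above always pass, so normal logging is unaffected
    once the temporary window closes.
    """

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        super().__init__()
        self._clock = clock  # injectable so the expiry can be tested
        self._deadline = clock() + ttl_seconds

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.INFO:
            return True
        return self._clock() < self._deadline


# Hypothetical usage: verbose logging for 15 minutes after the deploy.
logger = logging.getLogger("checkout")  # assumed logger name
logger.setLevel(logging.DEBUG)
logger.addFilter(ExpiringDebugFilter(ttl_seconds=15 * 60))
```

The filter only affects records logged directly on that logger, so the blast radius stays limited to the service you are actually debugging.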

