stackskipton • 02/20/2025 • 1 reply

SRE (or whatever they are calling Ops) here. This blog mostly left me thinking "Please hire an Ops Principal", which has nothing to do with Elixir.

We (Ops type people) have a well-developed system for gathering metrics: the Prometheus stack. Instead of integrating with that system, OpsMaru decided it doesn't work and went with their own custom system. You are showing code you built to compute CPU metrics that a PromQL query does easily, and your code assumes 15-second scrapes, so if we need higher resolution temporarily, well, sucks to be your customer. Also, if you did Remote Write, you could Remote Write back to a customer if they wanted it. Hell, you could have written a system so we don't need to run Prometheus locally, since you would scrape everything and send it back to us.
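
To make the scrape-interval point concrete, here is a minimal sketch of a CPU rate computed from each sample's own timestamp rather than an assumed 15-second step. This is not OpsMaru's code. The `Sample` type and `cpu_utilization` function are hypothetical names for illustration, and the counter-reset handling only loosely mirrors what PromQL's `rate()` does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One scrape of a cumulative CPU counter (hypothetical shape)."""
    timestamp: float    # seconds since epoch, taken at scrape time
    cpu_seconds: float  # cumulative CPU seconds consumed so far


def cpu_utilization(prev: Sample, curr: Sample) -> float:
    """Return average CPU cores used between two samples.

    The elapsed time comes from the samples themselves, so the result
    stays correct whether the scrape interval is 15s, 5s, or irregular.
    """
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")

    delta = curr.cpu_seconds - prev.cpu_seconds
    if delta < 0:
        # Counter went backwards: assume the process restarted and the
        # counter began again from zero.
        delta = curr.cpu_seconds

    return delta / elapsed


# Same workload, two different scrape intervals, same answer.
at_15s = cpu_utilization(Sample(0.0, 10.0), Sample(15.0, 17.5))
at_5s = cpu_utilization(Sample(0.0, 10.0), Sample(5.0, 12.5))
print(at_15s, at_5s)
```

With timestamps carried on each sample, temporarily dropping to a shorter interval needs no code change.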

Also, you are already running "my company code", which might be emitting Prometheus metrics, so I'm probably running Prometheus already to monitor my own code. However, if I wanted to keep an eye on OpsMaru Uplink, I can't, because OpsMaru Uplink doesn't appear to have a metrics endpoint I can monitor. Maybe your customers are too small to have Ops people, but if they did, they are now blind.
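
For reference, what a metrics endpoint has to return is small. The sketch below renders the Prometheus text exposition format from a plain dictionary. The `render_metrics` function and the `uplink_*` metric names are made up for illustration; nothing here reflects what Uplink actually tracks.

```python
def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format.

    `metrics` maps a metric name to (type, help text, samples), where
    samples is a list of (labels dict, value) pairs. Label values are
    not escaped here, so this sketch assumes they contain no quotes,
    backslashes, or newlines.
    """
    lines = []
    for name, (metric_type, help_text, samples) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        for labels, value in samples:
            if labels:
                label_str = ",".join(
                    f'{key}="{val}"' for key, val in sorted(labels.items())
                )
                lines.append(f"{name}{{{label_str}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metrics an agent might expose about itself.
example = {
    "uplink_up": ("gauge", "Whether the agent is running.", [({}, 1)]),
    "uplink_shipped_events_total": (
        "counter",
        "Events shipped, by destination.",
        [({"destination": "elastic"}, 42)],
    ),
}
print(render_metrics(example), end="")
```

Serving that string over HTTP with a `text/plain` content type is enough for an existing Prometheus to scrape it.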

I want a blog article explaining all the options you tested and what pitfalls you ran into that made you settle on this.

Replies

zacksiri • 02/20/2025

Thank you for your feedback. You have a valid point about the /metrics endpoint, and we're planning to provide one in the future. I mentioned this in another reply as well.

This isn't a custom system at all. We're simply removing the need to install / configure / manage another external package by implementing a data shipper in Uplink using Elixir Broadway. The end goal is still that Ops / SREs can use their existing favorite monitoring pipeline, whether that's Grafana / Prometheus / Loki or the Elastic / OpenSearch stack. There are several advantages: it means fewer things to install / maintain / patch / secure, as mentioned in the post. We believe doing things this way leads to a more robust / secure system in the long term.

As for the 15-second scrapes, we can tune that and offer it as an option for customers. These are things we can improve over time. For now, for the MVP, we're shipping data to the Elastic stack, and certain decisions were made to simplify and reduce the amount of work needed to get the product to an MVP.

We can provide the /metrics endpoint in the future; it's just a matter of time and priorities.

There are reasons why we're shipping data into Elastic that will be clearer once things mature a little more. There are things Elastic can do that we need at a base level for our internal product plans. There will be a follow-up post about this later.

We will publish more blog posts giving you more details on the decisions we've made and why we made them. Always happy to read feedback.
