Hacker News

crawshaw · yesterday at 11:33 PM · 4 replies

The idea that an "observability stack" is going to replace shell access on a server does not resonate with me at all. The metrics I monitor with prometheus and grafana are useful, vital even, but they are always fighting the last war. What I need are tools for when the unknown happens.

The tool that manages all my tools is the shell. It is where I attach a debugger, it is where I install iotop and use it for the first time. It is where I cat out mysterious /proc and /sys values to discover exotic things about cgroups I only learned about 5 minutes prior in obscure system documentation. Take it away and you are left with a server that is resilient against things you have seen before but lacks the tools to deal with the future.
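
A rough sketch of that kind of session, using cgroup v2 paths as they appear on a systemd box (myservice is a placeholder name; /proc/pressure needs PSI, i.e. a newer kernel):

    # which cgroup is this process in?
    cat /proc/$(pidof myservice)/cgroup

    # poke at its limits, usage and pressure (cgroup v2 layout assumed)
    cat /sys/fs/cgroup/system.slice/myservice.service/memory.max
    cat /sys/fs/cgroup/system.slice/myservice.service/cpu.stat
    cat /proc/pressure/io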


Replies

ValdikSS · today at 12:20 AM

>It is where I attach a debugger, it is where I install iotop and use it for the first time. It is where I cat out mysterious /proc and /sys values to discover exotic things about cgroups I only learned about 5 minutes prior in obscure system documentation.

It is; SSH is indeed the tool for that, but only because until recently we did not have better tools and interfaces.

Once you try newer tools, you don't want to go back.

Here's an example from a fairly recent debugging session of mine:

    - Network is really slow on the home server, no idea why
    - Try to just reboot it, no changes
    - Run kernel perf, check the flame graph
    - Kernel spends A LOT of time in nf_* (netfilter functions, iptables)
    - Check iptables rules
    - sshguard has banned 13000 IP addresses in its table
    - Each network packet travels through all the rules
    - Fix: clean the rules/skip the table for established connections/add timeouts
You don't need debugging facilities for many issues. You need observability and tracing.

Instead of debugging the issue for tens of minutes at least, I just used an observability tool, which showed me the path in two minutes.
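
For reference, roughly what that looks like as commands, assuming perf and Brendan Gregg's FlameGraph scripts are on the box and that sshguard is using an iptables chain named sshguard (backends differ):

    # sample the whole system for 30 seconds, with call graphs
    perf record -a -g -- sleep 30

    # render a flame graph from the samples
    perf script | stackcollapse-perf.pl | flamegraph.pl > perf.svg

    # how many rules is every packet walking through?
    iptables -L sshguard -n | wc -l

    # one possible fix: let established connections skip the big table
    iptables -I INPUT 1 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT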

reactordev · today at 12:19 AM

Or… you build a container that runs exactly what you specify. You ship your logs, traces, and metrics home so you can capture those stack traces and error messages, fix the issue, and build another container to deploy.

You’ll never attach a debugger in production. Not going to happen. Shell into what? Your container died when it errored out and was restarted with fresh state. Any “Sherlock Holmes” work would be met with a clean room. We have 10,000 nodes in the cluster; which one are you going to SSH into to find your container, attach a shell to it, and somehow attach a debugger?
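
A minimal sketch of the “ship it home” part, using Docker’s syslog log driver and a hypothetical logs.example.com collector (in practice this is more often a log agent or an OpenTelemetry sidecar):

    # forward the container's stdout/stderr off the box as it runs
    docker run -d \
      --log-driver=syslog \
      --log-opt syslog-address=udp://logs.example.com:514 \
      myapp:1.2.3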

ValdikSS · today at 12:13 AM

>What I need are tools for when the unknown happens.

There are tools that show what happens per process/thread and inside the kernel: profiling and tracing.

Check out Yandex's Perforator or Google's Perfetto. Netflix also has one; I forget the name.
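
As a small taste of the tracing side, a bpftrace one-liner (needs root and a reasonably recent kernel) that counts syscalls per process, system-wide, until Ctrl-C:

    bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'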

gear54rus · yesterday at 11:51 PM

Agreed, this sounds like some complicated ass-backwards way to do what k8s already does. If it's too big for you, just use k3s or k0s and you will still benefit from the absolutely massive ecosystem.

But instead we go with multiple moving parts, all configured independently? CoreOS, Terraform, and a dependency on some Vultr-specific thing. Lol.

Never in a million years would I think it's a good idea to disable SSH access. Like, why? Keys and a non-standard port already bring login attempts from China down to like zero a year.
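
For context, that hardening is just a few sshd_config lines; a sketch, with an arbitrary port number:

    # /etc/ssh/sshd_config
    Port 2222
    # keys only, no passwords
    PasswordAuthentication no
    PubkeyAuthentication yes
    PermitRootLogin no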