Why do systems fail? Tandem NonStop system and fault tolerance

119 points • by PaulHoule • 10/11/2024 • 44 comments

Comments

Animats • 10/11/2024

Tandem was interesting. They had a lot of good ideas, many unusual today.

* Databases reside on raw disks. There is no file system underneath the databases. If you want a flat file, it has to be in the database. Why? Because databases can be made with good reliability properties and made distributed and redundant.

* Processes can be moved from one machine to another. Much like the Xen hypervisor, which was a high point in that sort of thing.

* Hardware must have built-in fault detection. Everything had ECC, parity, or duplication. It's OK to fail, but not to make mistakes. IBM mainframes still have this, but few microprocessors do, even though the necessary transistors would not be a high cost today. (It's still hard to get ECC RAM on the desktop, even.)

* Most things are transactions. All persistent state is in the database. Think REST with CGI programs, but more efficient. That's what makes this work. A transaction either runs to successful completion, or fails and has no lasting effect. Database transactions roll back on failures.
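To make the last bullet concrete, here is a minimal sketch of that all-or-nothing behaviour. It uses Python's built-in `sqlite3` module as a stand-in, not anything from Tandem; the `accounts` table and the `transfer` and `balances` helpers are hypothetical names made up for the illustration.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates land or neither does."""
    try:
        # As a context manager, the connection commits if the block finishes
        # and rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (remaining,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if remaining < 0:
                # Failing here undoes both updates above.
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

# Hypothetical in-memory data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()
```

Calling `transfer(conn, "alice", "bob", 500)` returns `False` and leaves both balances exactly as they were, which is the "fails and has no lasting effect" property. Keeping all persistent state behind that boundary is what lets a failed request simply be retried.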

The Tandem concept lived on through several changes of ownership and hardware. Unfortunately, it ended up at HP in the Itanium era, where it seems to have died off.

It's a good architecture. The back ends of banks still look much like that, because that's where the money is. But not many programmers think that way.

redbluff • 10/12/2024

As someone who has worked on NonStops for 35 years (and still counting!) it's nice to see them get a mention on here. I even have two at home, one a K2000 (MIPS) machine from the 90's and an Itanium server from the mid 10's. I am pretty sure the suburb's lights dim when I fire them up :).

It's an interesting machine architecture to work on, especially the "Guardian 90" personality, and it's quite amazing that programs from the late 70's, written for a CPU built from TTL logic, can run without recompilation on a MIPS, Itanium or x86 CPU; not all of them, mind you, and not if they were natively compiled. The note on Stratus was quite interesting; for a long time the only real direct competitor NonStop had was Stratus. The other thing that makes these systems interesting is that they have a Unix-like personality called "OSS" that allows you to run quite a few POSIX-style Unix programs.

My favourite NonStop story is from the big LA earthquake (89?): a friend of mine was working at a POS processor. When they returned to the building the Tandem machine was lying on its side, unplugged and still operating (these machines had their own battery backup). They righted it, plugged everything back in and the machine continued operating as though nothing had happened. The fact that pretty much all the network comms were down kind of made this a moot point, but it was fascinating nonetheless. Pulling a CPU board, network board, disc controller or disc was all doable with no impact on transaction flow. The discs themselves were both mirrored and shadowed, which back in the day made these systems very expensive.

macintux • 10/11/2024

10 years ago I used Jim Gray's piece about Tandem fault tolerance in a talk about Erlang at Midwest.io (RIP, was a great conference).

https://youtu.be/E18shi1qIHU

Because it's a small world, a former Tandem employee was attending the talk. Unfortunately it's been long enough that I don't remember much of our conversation, but it was impressive to hear how they moved a computer between data centers; IIRC, they simply turned it off, and when they powered it back on, the CPU resumed precisely where it had been executing before.

(I have no idea how they handled the system clock.)

Jim Gray's paper:

https://jimgray.azurewebsites.net/papers/TandemTR86.2_FaultT...

082349872349872 • 10/11/2024

At Tandem, even the company coffee mugs had redundancy: https://i.etsystatic.com/33311136/r/il/08fbca/5271808290/il_...

sillywalk • 10/11/2024

I'm still hoping to find a more detailed article about modern X86-64 NonStop, complete with Mackie Diagrams.

The last one I can find is for the NonStop Advanced Architecture (on Itanium), with ServerNet. I gather that this was replaced with the NonStop Multicore Architecture (also on Itanium), with InfiniBand, and I assume the x86-64 version is basically the same, but running in pseudo big-endian.

lostemptations5 • 10/12/2024

So if Tandem is so out of favour these days, what do people and organizations use? AWS availability zones, etc?

vivzkestrel • 10/12/2024

Completely unrelated to the topic, but I wanted to point it out: there is an accessibility issue with this page. The up and down arrow keys do not scroll the page on Firefox 131.0.2 on an M1 Mac.

hi-v-rocknroll • 10/12/2024

Stanford's Forsythe DC had a Tandem mainframe just inside the main floor area. It was a short beast standing on its own about 1.5m / 4' tall, and not in a 19" rack.

exikyut • 10/11/2024

[flagged]
