These sorts of core-density increases are how I win cloud debates in an org.
* Identify the workloads that haven't scaled in a year. Your ERPs, your HRIS, your dev/stage/test environments, DBs, Microsoft estate, core infrastructure, etc. (EDIT, from zbentley: also identify any cross-system processing where data would transfer from the cloud back to your private estate, and exclude it, so you don't get murdered with egress charges)
* Run the cost analysis of reserved instances in AWS/Azure/GCP for those workloads over three years
* Do the same for one of these high-core "pizza boxes", but amortized over seven years
* Realize the savings to be had moving "fixed infra" back on-premises or into a colo versus sticking with a public cloud provider
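The RI-vs-amortized math in the steps above can be sketched in a few lines. Every number here (hourly rate, server price, per-server opex) is a placeholder assumption, not a quote - plug in your actual bids:

```python
# Hypothetical comparison: 3-year reserved instances vs a high-core 2U
# server amortized over 7 years. All prices below are illustrative only.

def monthly_cloud_cost(ri_hourly_rate: float, instances: int) -> float:
    """Effective monthly cost of reserved instances at a blended hourly rate."""
    return ri_hourly_rate * 730 * instances  # ~730 hours per month

def monthly_onprem_cost(server_price: float, servers: int,
                        amortization_years: int = 7,
                        monthly_opex_per_server: float = 400.0) -> float:
    """Amortized hardware plus a flat per-server opex (power, colo, support)."""
    capex_per_month = (server_price * servers) / (amortization_years * 12)
    return capex_per_month + monthly_opex_per_server * servers

# Example: 12 reserved instances at a $0.55/hr effective rate vs
# three 2U boxes at $45k each.
cloud = monthly_cloud_cost(0.55, 12)     # ~$4,818/month
onprem = monthly_onprem_cost(45_000, 3)  # ~$2,807/month
print(f"cloud ${cloud:,.0f}/mo vs on-prem ${onprem:,.0f}/mo")
```

The interesting part is the sensitivity: the longer the amortization window you can defend (and the steadier the workload), the harder it is for the reserved-instance column to win.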
Seriously, what took a full rack or two of 2U dual-socket servers just a decade ago can be replaced with three 2U boxes with full HA/clustering. It's insane.
Back in the late '10s, I made a case to my org at the time that a global hypervisor hardware refresh and accompanying VMware licenses would have an ROI of 2.5 years versus comparable AWS infrastructure, even assuming a 50% YoY rate of license inflation (this was pre-Broadcom; nowadays, I'd be eyeballing Nutanix, Virtuozzo, Apache CloudStack, or yes, even Proxmox, assuming we weren't already a Microsoft shop w/ Hyper-V) - and it would give us an additional 20% headroom to boot. The only thing giving me pause on that argument today is the current RAM/NAND shortage, but even that's (hopefully) temporary - and it doesn't hurt the orgs who built around a longer timeline with the option for an additional support runway (like the three-year extended support contracts available through VARs).
If we can't bill a customer for it, and it's not scaling regularly, then it shouldn't be in the public cloud. That's my take, anyway. It sucks the wind from the sails of folks gung-ho on the "fringe benefits" of public cloud spend (box seats, junkets, conference tickets, etc...), but the finance teams tend to love such clear numbers.
That’s definitely the right call in some cases. But as soon as there’s any high-interconnect-rate system that has to be in the cloud (appliances with locked-in cloud billing contracts, compute that does need to elastically scale and talks to your DB’s pizza box, edge/CDN/cache services with lots of fallthrough to sources of truth on-prem), the cloud bandwidth costs start to kill you.
I’ve had success with this approach by keeping it to only the business process management stacks (CRMs, AD, and so on—examples just like the ones you listed). But as soon as there’s any need for bridging cloud/onprem for any data rate beyond “cronned sync” or “metadata only”, it starts to hurt a lot sooner than you’d expect, I’ve found.
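The bandwidth pain is easy to put numbers on. A back-of-envelope sketch, assuming a flat $0.09/GB internet egress list rate (tiering and any negotiated discount will change this):

```python
EGRESS_USD_PER_GB = 0.09  # assumed list rate, not a quote

def monthly_egress_cost(mbit_per_sec_sustained: float) -> float:
    """Monthly egress bill for a sustained cloud -> on-prem data rate."""
    gb_per_month = mbit_per_sec_sustained / 8 / 1000 * 86_400 * 30
    return gb_per_month * EGRESS_USD_PER_GB

# Even a modest 100 Mbit/s sustained bridge is ~32 TB/month of egress:
print(f"${monthly_egress_cost(100):,.0f}/month")  # ~$2,916/month
```

Which is why anything beyond "cronned sync" or "metadata only" across the boundary hurts sooner than people expect.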
What has surprised me about the cloud is that per-core prices keep rising. Yet the market direction is the opposite: what used to be 1/2 or 1/4 of a box is now 1/256 of one, and it's faster, yet the cloud price for that core has only gone up. I think their business plan is to wipe out all the people who used to maintain the on-premises machines, and then they can continue to charge similar prices for something that is only getting cheaper.
It's hard drive and SSD prices that stagger me on the cloud. Whereas renting one of the server CPUs for a few years might only cost about 2x the price of buying a CPU for a small system (albeit usually with lower clock speeds in the cloud), the drive space is at least 10-100x the price of doing it locally. It's got a bit more redundancy built in, but for that overhead you could replicate the data many times over.
As time has gone on, the cloud deal has gotten worse as the hardware has gained more cores.
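That 10-100x claim checks out on a napkin. A sketch with assumed prices (managed SSD at $0.08/GB-month, a $300 consumer 4 TB NVMe drive, 5-year amortization - all placeholders):

```python
cloud_gb_month = 0.08                        # assumed managed-SSD list price
drive_price, drive_gb, years = 300.0, 4000, 5
local_gb_month = drive_price / (drive_gb * years * 12)  # amortized $/GB-month
ratio = cloud_gb_month / local_gb_month
print(f"cloud is ~{ratio:.0f}x local per GB-month")  # ~64x, before redundancy
```

Even tripling the local figure to pay for replication and spares leaves a ~20x gap.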
I just don't know if the human capital is there.
At my job we use Hyper-V, and finding someone who actually knows Hyper-V is difficult and expensive. Throw in Cisco networking, storage appliances, etc. to hit 99.99% uptime...
And that's just one person; you need at least two if you don't want gaps in staffing, more likely three.
And then you still need all the cloud folks on top of that.
We have a hybrid setup like this, and you do get a bit of best of both worlds, but ultimately managing onprem or colo infra is a huge pain in the ass. We only do it due to our business environment.
Cloud = the right choice when just starting. It isn't about infra cost, it's about mental cost. Setting up infra is just another thing that hurts velocity. By the time you're serving real load for the first time, though, you need to have the discussion about a longer-term strategy, and these points are valid as part of that discussion.
Do note though that AIUI these are all E-cores, have poor single-threaded performance and won't support things like AVX512. That is going to skew your performance testing a lot. Some workloads will be fine, but for many users that are actually USING the hardware they buy this is likely to be a problem.
If that's you, then the Granite Rapids-AP platform that launched prior to this can hit similar numbers of threads (256 for the 6980P). There are a couple of caveats, though: firstly, there are "only" 128 physical cores, and if you're using VMs you probably don't want to share a physical core across VMs; secondly, it has a 500W TDP and retails north of $17,000, if you can even find one for sale.
Overall once you're really comparing like to like, especially when you start trying to have 100+GbE networking and so on, it gets a lot harder to beat cloud providers - yes they have a nice fat markup but they're also paying a lot less for the hardware than you will be.
Most of the time when I see takes like this it's because the org has all these fast, modern CPUs for applications that get barely any real load, and the machines are mostly sitting idle on networks that can never handle 1/100th of the traffic the machine is capable of delivering. Solving that is largely a non-technical problem not a "cloud is bad" problem.
Does your calculation also include the cost of energy and of the personnel who keep your own infra running?
Is using virtualization the only good way of taking a 288-core box and splitting it up into multiple parallel workloads? One time I rented a 384-core AMD EPYC bare-metal instance in GCP and I could not for the life of me get parallelized workloads to scale on bare-metal Linux. I wanted to run a bunch of CPU inference jobs in parallel (with each one getting 16 cores), but the scaling was atrocious: the more parallel jobs I added, the slower all of them ran. When I checked htop, the CPU was very underutilized, so my theory was that a memory bottleneck was happening somewhere with ONNX/torch (something to do with NUMA nodes?). Anyway, I wasn't able to test using Proxmox or VMware on there to split up CPU/memory resources; we decided instead to just buy a bunch of smaller-core-count AMD Ryzen 1Us, which scaled way better with my naive approach.
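Virtualization isn't the only option: the scaling collapse described above is consistent with cross-NUMA memory traffic, and pinning each worker process to one NUMA node often fixes it. A minimal Linux sketch using only the stdlib; the 4-nodes-of-16-cores topology below is an assumption, so read the real layout from `lscpu` or `/sys/devices/system/node` first:

```python
import os
import multiprocessing as mp

# Assumed topology: 4 NUMA nodes x 16 cores each (adjust to your machine).
NUMA_NODES = {n: set(range(n * 16, (n + 1) * 16)) for n in range(4)}

def worker(node: int) -> None:
    # Restrict this process to one node's cores so first-touch allocations
    # land in node-local memory (pair with `numactl --membind` to be strict).
    os.sched_setaffinity(0, NUMA_NODES[node])
    # ... load the model and run the 16-core inference job here ...

if __name__ == "__main__" and (os.cpu_count() or 0) >= 64:
    procs = [mp.Process(target=worker, args=(n,)) for n in NUMA_NODES]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Frameworks like PyTorch and ONNX Runtime also tend to oversubscribe threads per process, so capping each job (e.g. `OMP_NUM_THREADS=16`) alongside the pinning usually matters as much as the affinity itself.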
Man, how do you get box seats out of AWS, I'm missing out
> These sorts of core-density increases are how I win cloud debates in an org.
AMD has had these sorts of densities available for a minute.
> Identify the workloads that haven't scaled in a year.
I have done this math recently, and you need to stop cherry-picking and move everything. And build a redundant data center to boot.
Compute is NOT the major issue for this sort of move:
Switching and bandwidth will be major costs. 400GbE is a minimum for interconnects, and most orgs are going to need at least that much bandwidth at top of rack.
Storage remains problematic. You might be able to amortize compute over this time scale, but not storage. 5 years would be pushing it (depending on use). And data center storage at scale was expensive before the recent price spike. Spinning rust is viable for some tasks (backup) but will not cut it for others.
Human capital: Figuring out how to support the hardware you own is going to be far more expensive than you think. You need to expect failures and staff accordingly, that means resources who are going to be, for the most part, idle.
> These sorts of core-density increases are how I win cloud debates in an org.
The core density is bullshit when each core is so slow that it can't do any meaningful work. The reality is that Intel is 3x behind AMD/TSMC on performance per watt.
People would be better off having a look at the high-frequency models (the 9xx5F parts like the 9575F) - that was the first server CPU generation to reach ~5 GHz and sustain it on 32+ cores.
on prem = capex
cloud = opex
The accounting dept will always win this debate.
That only works if purchasers in the organisation are immune to kickbacks.
> If we can't bill a customer for it, and it's not scaling regularly, then it shouldn't be in the public cloud. That's my take, anyway. It sucks the wind from the sails of folks gung-ho on the "fringe benefits" of public cloud spend (box seats, junkets, conference tickets, etc...), but the finance teams tend to love such clear numbers.
I agree, but.
For one, it's not just the machines themselves. You also need to budget in power, cooling, space, the cost of providing redundant connectivity and side gear (e.g. routers, firewalls, UPS).
Then, you need a second site, no matter what. At least for backups, ideally as a full failover. Either your second site is some sort of cloud, which can be a PITA to set up without introducing security risks, or a second physical site, which means double the expenses.
If you're a publicly listed company, or operate in jurisdictions like Europe, or want cybersecurity insurance, you have data retention, GDPR, SOX and a whole bunch of other compliance to worry about as well. Sure, you can do that on-prem, but you'll have a much harder time explaining to auditors how your system works when it's a bunch of on-prem stuff vs. "here's our AWS Backup plan covering all servers and other data sources, here's the immutability setup, here are the plans for how we prevent backup expiry, aka legal hold".
Then, all of that needs to be maintained, which means additional staff on payroll, if you own the stuff outright your finance team will whine about depreciation and capex, and you need to have vendors on support contracts just to get firmware updates and timely exchanges for hardware under warranty.
Long story short, as much as I prefer on-prem hardware vs the cloud, particularly given current political tensions - unless you are a 200+ employee shop, the overhead associated with on-prem infrastructure isn't worth it.
> The only thing giving me pause on that argument today is the current RAM/NAND shortage
Not a shortage - price gouging. And it would mean an increase in 'cloud' prices too, because they need to refresh the hardware as well. So by the summer the equation would be back where it was.
The main cost with on-prem is not the price of the gear but the price of acquiring talent to manage the gear. Most companies simply don't have the skillset internally to properly manage these servers, or even the internal talent to know whether they are hiring a good infrastructure engineer or not during the interview process.
For those that do, your scaling example works against you. If today you can merge three services into one, then why do you need full time infrastructure staff to manage so few servers? And remember, you want 24/7 monitoring, replication for disaster recovery, etc. Most businesses do not have IT infrastructure as a core skill or differentiator, and so they want to farm it out.