Hacker News

Cloud Run GPUs, now GA, makes running AI workloads easier for everyone

305 points by mariuz | last Wednesday at 8:28 AM | 171 comments

Comments

ashishb | last Wednesday at 11:53 AM

I love Google Cloud Run and highly recommend it as the best option[1]. Cloud Run GPUs, however, are not something I can recommend: they are not cost effective (instance-based billing is expensive as opposed to request-based billing), GPU choices are limited, and the constant loading/unloading of models (gigabytes) from GPU memory makes them slow to use as serverless.

Once you compare the numbers, it is better to use a VM + GPU if your service is utilized for even just 30% of the day.

1 - https://ashishb.net/programming/free-deployment-of-side-proj...
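
As a rough sanity check of that break-even point, here is a minimal sketch; the hourly prices are illustrative assumptions (the Cloud Run L4 figure is taken from the comparison elsewhere in this thread, the VM rate is a placeholder), not quotes.

    # Break-even sketch: serverless GPU (billed only while an instance is up)
    # vs. a dedicated VM + GPU billed 24/7. Prices are illustrative assumptions.
    SERVERLESS_HOURLY = 0.71   # assumed Cloud Run L4 rate, $/hour while active
    VM_HOURLY = 0.25           # assumed always-on VM + L4 rate, $/hour
    HOURS_PER_DAY = 24

    def daily_cost_serverless(active_hours: float) -> float:
        """Serverless cost scales with the hours instances are actually up."""
        return SERVERLESS_HOURLY * active_hours

    def daily_cost_vm() -> float:
        """A dedicated VM bills for all 24 hours regardless of utilization."""
        return VM_HOURLY * HOURS_PER_DAY

    # Utilization (fraction of the day) above which the always-on VM is cheaper.
    break_even = daily_cost_vm() / (SERVERLESS_HOURLY * HOURS_PER_DAY)
    print(f"VM wins above ~{break_even:.0%} utilization")
    for frac in (0.1, 0.3, 0.5):
        print(f"{frac:.0%} busy: serverless ${daily_cost_serverless(frac * HOURS_PER_DAY):.2f}/day, "
              f"VM ${daily_cost_vm():.2f}/day")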

show 7 replies
isoprophlex | last Wednesday at 9:57 AM

All the cruft of a big cloud provider, AND the joy of uncapped yolo billing that has the potential to drain your credit card overnight. No thanks, I'll personally stick with Modal and vast.ai.

show 9 replies
mythz | last Wednesday at 10:32 AM

The pricing doesn't look that compelling; here are the hourly rate comparisons vs runpod.io and vast.ai:

    1x L4 24GB:    google:  $0.71; runpod.io:  $0.43, spot: $0.22
    4x L4 24GB:    google:  $4.00; runpod.io:  $1.72, spot: $0.88
    1x A100 80GB:  google:  $5.07; runpod.io:  $1.64, spot: $0.82; vast.ai  $0.880, spot:  $0.501
    1x H100 80GB:  google: $11.06; runpod.io:  $2.79, spot: $1.65; vast.ai  $1.535, spot:  $0.473
    8x H200 141GB: google: $88.08; runpod.io: $31.92;              vast.ai $15.470, spot: $14.563
Google's pricing also assumes you're running it 24/7 for an entire month, whereas these are just the hourly prices for runpod.io and vast.ai, which both bill per second. I wasn't able to find Google's spot pricing for GPUs.
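
For scale, here is a quick conversion of those on-demand hourly rates into a rough monthly bill at 24/7 usage; the numbers are copied from the table above.

    # Rough monthly cost at 24/7 usage, using the hourly on-demand rates above.
    HOURS_PER_MONTH = 730  # ~24 * 365 / 12

    rates = {  # $/hour, on-demand
        "1x L4 24GB":   {"google": 0.71,  "runpod": 0.43},
        "1x A100 80GB": {"google": 5.07,  "runpod": 1.64, "vast": 0.880},
        "1x H100 80GB": {"google": 11.06, "runpod": 2.79, "vast": 1.535},
    }

    for gpu, providers in rates.items():
        monthly = {name: round(rate * HOURS_PER_MONTH) for name, rate in providers.items()}
        print(gpu, monthly)
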
show 5 replies
jbarrow | last Wednesday at 10:00 AM

I’m personally a huge fan of Modal, and have been using their serverless scale-to-zero GPUs for a while. We’ve seen some nice cost reductions from using them, while also being able to scale WAY UP when needed. All with minimal development effort.

Interesting to see a big provider entering this space. I originally swapped to Modal because the big providers weren't offering this (e.g. AWS Lambda can't run on GPU instances). I assume all providers are going to start moving towards offering this?

show 4 replies
montebicyclelo | last Wednesday at 9:56 AM

The reason Cloud Run is so nice compared to other providers is that it has autoscaling, including scaling to 0, meaning it can cost basically nothing if it's not being used. You can also set a cap on the scaling, e.g. 5 instances max, which caps the maximum cost of the service too. Note: I only have experience with the CPU version of Cloud Run, which is very reliable and easy.
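
For reference, a minimal sketch of that configuration using the Cloud Run Admin API v2 Python client (google-cloud-run); the project, region, and image are placeholders, and this is a sketch rather than a verified recipe.

    # Deploy a Cloud Run service that scales to zero when idle and is capped at
    # 5 instances, which also caps the worst-case bill.
    from google.cloud import run_v2

    client = run_v2.ServicesClient()

    service = run_v2.Service(
        template=run_v2.RevisionTemplate(
            containers=[run_v2.Container(image="us-docker.pkg.dev/my-project/app/my-image:latest")],
            scaling=run_v2.RevisionScaling(
                min_instance_count=0,  # scale to zero -> ~no cost while unused
                max_instance_count=5,  # hard ceiling on concurrent instances
            ),
        ),
    )

    operation = client.create_service(
        parent="projects/my-project/locations/us-central1",
        service=service,
        service_id="my-service",
    )
    operation.result()  # block until the deploy finishes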

show 1 reply
huksley | last Wednesday at 11:43 AM

A small and independent EU GPU cloud provider, DataCrunch (I am not affiliated), offers VMs with Nvidia GPUs even cheaper than RunPod, etc.:

1x A100 80GB: €1.37/hour

1x H100 80GB: €2.19/hour

show 2 replies
gabe_monroy | last Wednesday at 3:08 PM

I'm the VP/GM responsible for Cloud Run and GKE. Great to see the interest in this! Happy to answer questions on this thread.

albeebe1 | last Wednesday at 3:46 PM

Oh, this is great news. After a $1000 bill from running a model on vertex.ai continuously for a little test I forgot to shut down, this will be my go-to now. I've been using Cloud Run for years, running production microservices and little hobby projects, and I've found it simple and cost effective.

lemming | last Wednesday at 9:48 AM

If I understand this correctly, I should be able to stand up an API running arbitrary models (e.g. from Hugging Face), and while it's not quite charged by the token, it should be very cheap if my usage is sporadic. Is that correct? Seems pretty huge if so; most of the providers I looked at required a monthly fee to run a custom model.
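
For illustration, a minimal sketch of that kind of service: an HTTP wrapper around an arbitrary Hugging Face model that could be containerized and deployed to a scale-to-zero platform (the model name is just an example).

    # Minimal HTTP wrapper around a Hugging Face model, suitable for containerizing.
    from fastapi import FastAPI
    from pydantic import BaseModel
    from transformers import pipeline

    app = FastAPI()
    # Loaded once at startup; on a scale-to-zero service this load happens on cold start.
    generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")  # example model

    class Prompt(BaseModel):
        text: str
        max_new_tokens: int = 128

    @app.post("/generate")
    def generate(req: Prompt):
        out = generator(req.text, max_new_tokens=req.max_new_tokens)
        return {"completion": out[0]["generated_text"]}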

show 2 replies
m1 | last Wednesday at 8:39 PM

Love Cloud Run, and this looks like a great addition. The only things I wish for from Cloud Run: being able to run self-hosted GitHub runners on it (last time I checked this wasn't possible, as it requires root), and the new worker pool feature, which seems great in principle but looks like you have to write the scaler yourself rather than it being built in.

show 2 replies
jjuliano | last Wednesday at 10:34 AM

I'm the developer of kdeps.com, and I really like Google Cloud Run; I've been using it since the beta. Kdeps outputs Dockerized full-stack AI agent apps that run open-source LLMs locally, and my project works very well with GCR.

Aeolun | last Wednesday at 11:04 AM

That's 67 cents/hour for a GPU-enabled instance. That's pretty good, but I have no idea how T4 GPUs compare against others.

show 1 reply
holografix | last Wednesday at 10:07 AM

The value in this really is running small custom models or the absolute latest open weight models.

Why bother when you can get pay-as-you-go API access to popular open-weight models like Llama on the Vertex AI Model Garden, or at the edge on Cloudflare?

show 1 reply
gardnr | last Wednesday at 9:59 AM

The Nvidia L4 has 24GB of VRAM and consumes 72 watts, which is relatively low compared to other datacenter cards. It's not a monster GPU, but it should work OK for inference.
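
A back-of-the-envelope check of what fits in that 24 GB, using the usual weights-only rule of thumb (KV cache and activations need extra headroom):

    # Weights-only VRAM estimate: params * bytes per parameter. Rule of thumb only.
    VRAM_GB = 24

    def weights_gb(params_billion: float, bytes_per_param: float) -> float:
        return params_billion * bytes_per_param  # 1B params at 1 byte/param ~= 1 GB

    for params_b in (4, 8, 13, 30):
        fp16 = weights_gb(params_b, 2.0)
        int4 = weights_gb(params_b, 0.5)
        print(f"{params_b}B params: ~{fp16:.0f} GB at fp16, ~{int4:.0f} GB at int4 "
              f"(budget {VRAM_GB} GB, minus KV cache/activations)")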

show 1 reply
pier25 | last Wednesday at 5:28 PM

How does this compare to Fly GPUs in terms of pricing?

ringeryless | last Wednesday at 12:22 PM

I wonder what all this hype-driven overcapacity will be used for by future generations.

Once this bubble pops, we are going to have some serious, albeit high-latency, hardware.

show 3 replies
treksis | last Wednesday at 3:36 PM

Everything is good except the price.

moeadham | last Wednesday at 11:47 AM

If only they had some decent GPUs; L4s are pretty limited these days.

show 2 replies
ninetyninenine | last Wednesday at 10:37 AM

I'm tired of using AI in cloud services. I want user-friendly, locally owned AI hardware.

Right now nothing is consumer friendly. I can't get a packaged deal of some locally running ChatGPT-quality UI or voice command system in an all-in-one package. Like what Macs did for PCs, I want the same for AI.

show 5 replies
ivape | last Wednesday at 12:54 PM

Does anyone actually run a modest-sized app and can share numbers on what one GPU gets you? Assuming something like vLLM for concurrent requests, what kind of throughput are you seeing? Serving an LLM just feels like a nightmare.
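
One way to get a number for your own workload is a quick offline probe with vLLM; throughput depends heavily on the model, GPU, and prompt/output lengths, so treat this as a measurement harness rather than a claim (the model name is just an example).

    # Measure aggregate generation throughput for a batch of concurrent prompts.
    import time
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # example model
    params = SamplingParams(max_tokens=256, temperature=0.7)
    prompts = ["Summarize the benefits of serverless GPUs."] * 64  # simulated concurrency

    start = time.time()
    outputs = llm.generate(prompts, params)
    elapsed = time.time() - start

    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.0f} tok/s aggregate")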

einpoklum | last Wednesday at 5:41 PM

Why is commercial advertising published as a content article here?

felix_tech | yesterday at 2:49 PM

I've been using this for daily/weekly ETL tasks, which saves quite a lot of money vs. having an instance on all the time, but it's been clunky.

The main issue is that despite there being a 60-minute timeout available, the API will just straight up not return a response code if your request takes more than ~5 minutes in most cases, so you have to make sure you can poll wherever the data is being stored and let the client time out.
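
A minimal sketch of that workaround, assuming the job writes its result to a Cloud Storage object (the bucket and object names are placeholders):

    # Kick off the long-running request elsewhere, then poll storage for the result
    # instead of waiting on the HTTP response.
    import time
    from google.cloud import storage

    def wait_for_result(bucket_name: str, blob_name: str,
                        timeout_s: int = 3600, poll_s: int = 30) -> str:
        bucket = storage.Client().bucket(bucket_name)
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            blob = bucket.blob(blob_name)
            if blob.exists():
                return blob.download_as_text()
            time.sleep(poll_s)
        raise TimeoutError(f"{blob_name} not written within {timeout_s}s")

    # result = wait_for_result("my-etl-bucket", "runs/2025-06-04/output.json")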

show 1 reply
omneity | last Wednesday at 11:28 AM

> Time-to-First-Token of approximately 19 seconds for a gemma3:4b model (this includes startup time, model loading time, and running the inference)

This is my biggest pet peeve with serverless GPUs. 19 seconds is horrible latency from the user's perspective, and that's a best-case scenario.

If this is the best that one of the most experienced teams in the world can do, with a small 4B model, then it feels like serverless is really restricted to non-interactive use cases.
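
For anyone wanting to reproduce that number against their own deployment, here is a minimal sketch of measuring time-to-first-token over a streaming endpoint; it assumes an Ollama-style /api/generate API, and the service URL is a placeholder.

    # Time from sending the request to receiving the first streamed chunk.
    import time
    import requests

    URL = "https://my-service-abc123.run.app/api/generate"  # placeholder URL

    start = time.time()
    with requests.post(URL, json={"model": "gemma3:4b", "prompt": "Hello"},
                       stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # first non-empty chunk ~= first token
                print(f"time to first token: {time.time() - start:.1f}s")
                break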

show 5 replies