Hacker News

freakynit today at 5:49 AM

Open access for the next 5 hours (8GiB model, running on an RTX 3090), or until the server crashes or this spot instance gets taken away :) =>

https://ofo1j9j6qh20a8-80.proxy.runpod.net

  ./build/bin/llama-server \
   -m ../Bonsai-8B.gguf \
   -ngl 999 \
   --flash-attn on \
   --host 0.0.0.0 \
   --port 80 \
   --ctx-size 65500 \
   --batch-size 512 \
   --ubatch-size 512 \
   --parallel 5 \
   --cont-batching \
   --threads 8 \
   --threads-batch 8 \
   --cache-type-k q4_0 \
   --cache-type-v q4_0 \
   --log-colors on
The server can serve 5 parallel requests, with each request capped at around `13K` tokens...
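The `13K` figure follows from llama-server splitting the context window evenly across its slots: with `--parallel 5`, each slot gets roughly `--ctx-size / 5` tokens. A quick sanity check:

```python
ctx_size = 65500  # --ctx-size from the command above
parallel = 5      # --parallel slots

# Context available to each concurrent request
per_slot = ctx_size // parallel
print(per_slot)  # 13100, i.e. ~13K tokens per request
```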

A few benchmarks I ran:

1. Input: 700 tokens, TTFT: ~0 seconds, output: 1822 tokens at ~190 t/s

2. Input: 6400+ tokens, TTFT: ~2 seconds, output: 2012 tokens at ~135 t/s

VRAM usage was consistently at ~4GiB.
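From those numbers, end-to-end response time is roughly TTFT plus output tokens divided by decode speed. A back-of-the-envelope sketch (the helper function is just for illustration; the inputs are the benchmark figures above):

```python
def response_time(ttft_s: float, out_tokens: int, tok_per_s: float) -> float:
    """Approximate wall-clock time for one request:
    time to first token, plus generation at a steady decode rate."""
    return ttft_s + out_tokens / tok_per_s

# Benchmark 1: ~0s TTFT, 1822 tokens at ~190 t/s
print(round(response_time(0, 1822, 190), 1))   # ~9.6 s

# Benchmark 2: ~2s TTFT, 2012 tokens at ~135 t/s
print(round(response_time(2, 2012, 135), 1))   # ~16.9 s
```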


Replies

ggerganov today at 6:53 AM

Better keep the KV cache in full precision
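Following that advice would mean dropping the `q4_0` cache flags from the command above. In llama.cpp, `f16` is the default KV cache type, so it's enough to simply omit `--cache-type-k`/`--cache-type-v`, or set them explicitly (fragment shown with only the cache-related flags; keep the rest of the original command as-is):

```shell
# f16 is the default KV cache type; setting it explicitly
# keeps the cache in full (half) precision instead of q4_0
./build/bin/llama-server \
   -m ../Bonsai-8B.gguf \
   --cache-type-k f16 \
   --cache-type-v f16
```

The trade-off is a larger KV cache in VRAM for the same `--ctx-size`, which may require reducing the context or parallel slot count to fit.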

TRCat today at 6:48 AM

Thank you! I am impressed by the speed of it.

logicallee today at 5:58 AM

That was really impressive. https://pastebin.com/PmJmTLJN pretty much instantly. (Very weak models can't do this.)
