
freakynit, today at 4:19 AM

They actually explained this a few days back (can't seem to find the link right now), but the core of the explanation was its architecture:

1. MoE (nothing new here, but, this helps a lot)

2. Compressed Attention Mechanisms (this is their core innovation) - this dramatically reduces the Key-Value (KV) cache requirements for longer contexts
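To put rough numbers on both points, here is a back-of-the-envelope sketch. Every model dimension in it (layer count, head count, latent width, expert sizes) is a made-up placeholder, not their actual config, and the "compressed" case simply assumes each token stores one small latent vector per layer instead of full keys and values. Their real mechanism may well differ.

```python
# Back-of-the-envelope only: all model dimensions below are hypothetical
# placeholders, not the real configuration of any specific model.

def kv_cache_bytes(layers, tokens, elems_per_token_per_layer, bytes_per_elem=2):
    """Total cache size, assuming 2-byte (fp16/bf16) elements by default."""
    return layers * tokens * elems_per_token_per_layer * bytes_per_elem


def moe_active_fraction(shared_params, expert_params, n_experts, top_k):
    """Share of the weights actually used per token in a top-k routed MoE."""
    total = shared_params + n_experts * expert_params
    active = shared_params + top_k * expert_params
    return active / total


# --- KV cache: standard multi-head attention vs. an assumed compressed latent ---
layers, heads, head_dim = 60, 64, 128   # hypothetical model shape
tokens = 128_000                        # one long-context request
latent_dim = 512                        # assumed compressed width per token per layer

# Standard attention caches a key and a value vector for every head.
full = kv_cache_bytes(layers, tokens, 2 * heads * head_dim)
# Assumed compressed variant caches a single latent vector instead.
compressed = kv_cache_bytes(layers, tokens, latent_dim)
ratio = full / compressed

# --- MoE: only top_k of n_experts run for each token ---
fraction = moe_active_fraction(
    shared_params=10_000_000_000,   # hypothetical always-on weights
    expert_params=2_000_000_000,    # hypothetical size of one expert
    n_experts=64,
    top_k=4,
)

print(f"full KV cache:       {full / 2**30:.1f} GiB")
print(f"compressed KV cache: {compressed / 2**30:.1f} GiB")
print(f"reduction:           {ratio:.0f}x")
print(f"MoE active fraction: {fraction:.1%}")
```

With these placeholder numbers the cache per long request shrinks by the ratio of `2 * heads * head_dim` to `latent_dim`, which is what lets far more concurrent long-context requests fit on the same accelerator memory. The MoE part is separate: it cuts compute per token, not cache size.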

Another thing that helps is significantly lower energy costs in China.

One more point, and this is purely my own guess: they are running some percentage of their inference on their own home-grown AI inference chips.