I might be overly pessimistic, but this looks like a case of a person believing LLM hallucinations and then having the LLM write a paper.
I asked both Claude Code|Opus 4.5 and Codex|GPT 5.1 Codex Max (funny to ask LLMs, I know) to check the an1-core repo. I don't think they'd hallucinate on something like this (the code is quite small), but I do not claim expertise.
In short, both of them are saying that:
- The repo always runs the full teacher model to extract activations and feeds them to the head - see https://github.com/Anima-Core/an1-core/blob/main/an1_core/fi...
- There are weird stub files, e.g. the HellaSwag repro doesn't actually contain the reproduction code https://github.com/Anima-Core/an1-core/blob/main/experiments... - it just says "For full HellaSwag reproduction, see the paper" (why include the file at all then?)
- The actual "AN1 head" is just linear probing (freeze a pretrained model, train a classifier on its features). The full flow (as reported by CC) is "Text → [Full Transformer] → activations → [Tiny Head] → prediction" - see the sketch below.
Basically, there's no code to train a real "student" model that would run without the teacher.
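To make that concrete, here's a minimal sketch of what such a setup amounts to (my own illustration with a generic HuggingFace encoder - the model name, head size, and function names are mine, not from an1-core):

    import torch
    from transformers import AutoTokenizer, AutoModel

    # Frozen "teacher": a full pretrained transformer that runs on every input.
    tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    teacher = AutoModel.from_pretrained("distilbert-base-uncased").eval()
    for p in teacher.parameters():
        p.requires_grad = False

    # The "head": a tiny classifier trained on the teacher's activations.
    head = torch.nn.Linear(teacher.config.hidden_size, 2)
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)  # only the head learns

    def predict(texts):
        batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            acts = teacher(**batch).last_hidden_state[:, 0]  # CLS activations
        # Text → [Full Transformer] → activations → [Tiny Head] → prediction
        return head(acts)

However cheap the head is, every prediction still pays for the full teacher forward pass, which is exactly the problem.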
===
The repo/paper say that there's a mythical "commercial version" that has all the goodies:
(repo)
> This reference implementation (an1-core) does not include the FPU, AN4, or other proprietary optimization components covered by these patents. It provides only the core scientific demonstration of the meaning fields phenomenon.
(paper)
> Production deployment: Optimized implementations (AN1-Turbo) with learned layer selection, adaptive loss scheduling, and CUDA-accelerated inference available under commercial license.
But right now we only have the code in the repo.
===
In the paper they show that the student model (30M params) gets ~82% on SST-2 (labels-only). What they don't show is that DistilBERT (a >5-year-old model) already achieves 91% on the same dataset with only 66M params.
Another weird tidbit from the paper - in the economic impact section they claim that LLaMA 70B runs at 2 tok/s at batch size=1 on an H200. In reality that number is at least an order of magnitude bigger even without quantization, more like 20-40 tok/s. With quantization it can easily be above 100 tok/s.
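Back-of-envelope check (very rough, assuming batch-1 decoding is bound by streaming the weights once per token, ignoring KV cache and other overheads):

    70B params × 2 bytes (fp16)       ≈ 140 GB of weights
    H200 HBM3e bandwidth              ≈ 4.8 TB/s
    weight-streaming ceiling at bs=1  ≈ 4800 / 140 ≈ 34 tok/s

Real setups land below that ceiling, but nowhere near as low as 2 tok/s.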