I might be overly pessimistic, but this looks like a case of a person believing LLM hallucinations and then having the LLM write a paper.
I asked both Claude Code|Opus 4.5 and Codex|GPT 5.1 Codex Max (funny to ask LLMs, I know) to check the an1-core repo. I don't think they'd hallucinate on something like this (the code is quite small), but I do not claim expertise.
In short, both of them are saying that:
- The repo always runs the full teacher model to extract activations and feeds them to the head - see https://github.com/Anima-Core/an1-core/blob/main/an1_core/fi...
- There are weird stub files, e.g. the HellaSwag repro doesn't actually contain the reproduction code https://github.com/Anima-Core/an1-core/blob/main/experiments... - it just says "For full HellaSwag reproduction, see the paper" (why include the file at all then?)
- The actual "AN1 head" is just linear probing (freeze a pretrained model, train a classifier on its features). The full flow (as reported by CC) is "Text → [Full Transformer] → activations → [Tiny Head] → prediction" - see the sketch below.
Basically, there's no code to train a real "student" model that would run without the teacher.
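To make that concrete, here's a minimal sketch of what such a setup amounts to (my own illustration with a generic HuggingFace encoder - the model name, head size, and function names are mine, not from an1-core):

    import torch
    from transformers import AutoTokenizer, AutoModel

    # Frozen "teacher": a full pretrained transformer that runs on every input.
    tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    teacher = AutoModel.from_pretrained("distilbert-base-uncased").eval()
    for p in teacher.parameters():
        p.requires_grad = False

    # The "head": a tiny classifier trained on the teacher's activations.
    head = torch.nn.Linear(teacher.config.hidden_size, 2)
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)  # only the head learns

    def predict(texts):
        batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            acts = teacher(**batch).last_hidden_state[:, 0]  # CLS activations
        # Text → [Full Transformer] → activations → [Tiny Head] → prediction
        return head(acts)

However cheap the head is, every prediction still pays for the full teacher forward pass, which is exactly the problem.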
===
The repo/paper say that there's a mythical "commercial version" that has all the goodies:
(repo)
> This reference implementation (an1-core) does not include the FPU, AN4, or other proprietary optimization components covered by these patents. It provides only the core scientific demonstration of the meaning fields phenomenon.
(paper)
> Production deployment: Optimized implementations (AN1-Turbo) with learned layer selection, adaptive loss scheduling, and CUDA-accelerated inference available under commercial license.
But right now we only have the code in the repo.
===
In the paper they show that the student model (30M params) gets ~82% on SST-2 (labels-only). What they don't show is that DistilBERT (a >5-year-old model) already achieves 91% on the same dataset with only 66M params.
Another weird tidbit from the paper - in the economic impact section they claim that LLaMA 70B runs at 2 tok/s at batch size=1 on an H200. In reality that number is at least an order of magnitude bigger even without quantization, more like 20-40 tok/s. With quantization it can easily be above 100 tok/s.
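Back-of-envelope check (very rough, assuming batch-1 decoding is bound by streaming the weights once per token, ignoring KV cache and other overheads):

    70B params × 2 bytes (fp16)       ≈ 140 GB of weights
    H200 HBM3e bandwidth              ≈ 4.8 TB/s
    weight-streaming ceiling at bs=1  ≈ 4800 / 140 ≈ 34 tok/s

Real setups land below that ceiling, but nowhere near as low as 2 tok/s.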