Hacker News

History LLMs: Models trained exclusively on pre-1913 texts

728 points by iamwil yesterday at 10:39 PM | 360 comments

Comments

mleroy today at 6:56 AM

Ontologically, this historical model understands the categories of "Man" and "Woman" just as well as a modern model does. The difference lies entirely in the attributes attached to those categories. The sexism is a faithful map of that era's statistical distribution.

You could RAG-feed this model the facts of WWII, and it would technically "know" about Hitler. But it wouldn't share the modern sentiment or gravity. In its latent space, the vector for "Hitler" has no semantic proximity to "Evil".
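
You could even test that directly. A minimal sketch, assuming the weights were loadable through Hugging Face (the model name here is hypothetical):

    # Compare latent-space proximity of two concepts via cosine similarity
    # of their embeddings. The model name is a placeholder, not real.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("history-llm-1913")   # hypothetical
    model = AutoModel.from_pretrained("history-llm-1913")

    def embed(text: str) -> torch.Tensor:
        # Mean-pool the last hidden state as a crude concept vector.
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        return out.last_hidden_state.mean(dim=1).squeeze(0)

    sim = torch.nn.functional.cosine_similarity(
        embed("Hitler"), embed("evil"), dim=0)
    # Expectation: low similarity here, high in a modern model.
    print(f"cosine(Hitler, evil) = {sim.item():.3f}")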

joeycastillo today at 12:20 AM

A question for those who think LLMs are the path to artificial intelligence: if a large language model trained on pre-1913 data is a window into the past, how is a large language model trained on pre-2025 data not effectively the same thing?

sbmthakur today at 4:26 PM

Someone suggested a nice thought experiment: train an LLM on all physics from before quantum mechanics was discovered. If the LLM can still figure out the latter, then we have certainly achieved some success in the space.

DonHopkins today at 8:34 AM

I'd love for Netflix or other streaming movie and series services to provide chatbots that you could ask questions about characters and plot points up to the point where you've watched.

Provide it with the closed captions and other timestamped data like scenes and character summaries (all that is currently known but no more) up to the current time, and it won't reveal any spoilers, just fill you in on what you didn't pick up or remember.
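
A rough sketch of that spoiler gate, assuming caption entries carry start timestamps (the data shapes are mine, not any real streaming API):

    # Build spoiler-free chatbot context from timestamped captions:
    # only lines the viewer has already seen reach the prompt.
    from dataclasses import dataclass

    @dataclass
    class Caption:
        start_sec: float
        text: str

    def context_up_to(captions: list[Caption], watched_sec: float) -> str:
        """Keep only caption lines at or before the viewer's position."""
        return "\n".join(c.text for c in captions if c.start_sec <= watched_sec)

    captions = [
        Caption(12.0, "Who are you?"),
        Caption(95.5, "I am your father."),  # spoiler past the cut-off
    ]
    # Viewer paused at 60s: only the first line is sent to the chatbot.
    print(context_up_to(captions, watched_sec=60.0))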

zkmon today at 7:18 AM

Why does history end in 1913?

ianbicking today at 12:15 AM

The knowledge machine question is fascinating ("Imagine you had access to a machine embodying all the collective knowledge of your ancestors. What would you ask it?") – it truly does not know about computers and has no concept of its own substrate. But a knowledge machine is still comprehensible to it.

It makes me think of the Book of Ember, the possibility of chopping things out very deliberately. Maybe creating something that could wonder at its own existence, discovering well beyond what it could know. And then of course forgetting it immediately, which is also a well-worn trope in speculative fiction.

3vidence today at 2:50 AM

This idea sounds somewhat flawed to me, given the large amount of evidence that LLMs need huge amounts of data to converge properly during training.

There is just not enough surviving material from those eras to trust that the LLM will learn to a comparable degree.

Think about it this way: a human in the early 1900s and a human today are pretty much the same, just in different environments with different information.

An LLM trained on 1/1000 the amount of data is just at a fundamentally different stage of convergence.

moffkalast today at 10:39 AM

> trained from scratch on 80B tokens of historical data

How can this thing possibly be even remotely coherent when it was pretrained on what is normally a fine-tuning amount of data?
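
For a sense of scale, here's the back-of-the-envelope Chinchilla arithmetic (the ~20 tokens-per-parameter ratio is the usual heuristic, not anything the project states):

    # Chinchilla-style compute-optimal sizing, as a rough heuristic.
    TOKENS = 80e9           # 80B tokens of historical text
    TOKENS_PER_PARAM = 20   # rule-of-thumb ratio from the Chinchilla paper

    optimal_params = TOKENS / TOKENS_PER_PARAM
    print(f"Compute-optimal size: ~{optimal_params / 1e9:.0f}B parameters")
    # => ~4B parameters: tiny next to frontier models trained on trillions
    # of tokens, but plausibly enough for basic coherence at that scale.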

casey2 today at 7:51 AM

I'd be very surprised if this is clean of post-1913 text. Overall I'm very interested in talking to this thing and seeing how much difference writing in a modern style vs. an older one makes to its responses.

alexgotoi today at 7:18 AM

The coolest thing here, technically, is that this is one of the first public projects treating time as a first‑class axis in training, not just a footnote in the dataset description.

Instead of “an LLM with a 1913 vibe”, they’re effectively doing staged pretraining: big corpus up to 1900, then small incremental slices up to each cutoff year so you can literally diff how the weights – and therefore the model’s answers – drift as new decades of text get added. That makes it possible to ask very concrete questions like “what changes once you feed it 1900–1913 vs 1913–1929?” and see how specific ideas permeate the embedding space over time, instead of just hand‑waving about “training data bias”.
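
Conceptually something like the sketch below; the cutoff years, field names, and stub trainer are my guesses at the shape of the pipeline, not the project's actual code:

    # Staged pretraining with temporal cutoffs: continue training from each
    # checkpoint on the next slice of years, saving one model per cutoff so
    # their answers (and weights) can be diffed.
    from typing import Any, Callable

    CUTOFFS = [1900, 1913, 1929]

    def slice_corpus(docs: list[dict], start: int, end: int) -> list[dict]:
        """Documents whose publication year falls in [start, end)."""
        return [d for d in docs if start <= d["year"] < end]

    def staged_pretrain(docs: list[dict], train_step: Callable[..., Any]):
        checkpoints, ckpt, prev = [], None, 0
        for cutoff in CUTOFFS:
            stage = slice_corpus(docs, prev, cutoff)
            ckpt = train_step(stage, init=ckpt)  # resume from prior weights
            checkpoints.append((cutoff, ckpt))   # one diffable model per cutoff
            prev = cutoff
        return checkpoints

    # Stub trainer so the sketch runs; a real one would update weights.
    docs = [{"year": 1850}, {"year": 1905}, {"year": 1920}]
    for year, _ in staged_pretrain(docs, lambda stage, init: object()):
        print(f"model-{year}: checkpoint saved")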

lifestyleguru today at 1:16 AM

You think Albert is going to stay in Zurich or emigrate?

satisfice today at 12:44 AM

I assume this is a collaboration between the History Channel and Pornhub.

“You are a literary rake. Write a story about an unchaperoned lady whose ankle you glimpse.”

holyknight today at 9:24 AM

wow amazing idea

r0x0r007 today at 10:48 AM

ffs, to find out what figures from the past thought and how they felt about the world, maybe we should read some of their books; we'd get the context. Don't prompt or train an LLM to do it and consider it the hottest thing since MCP. Besides, what's the point? To teach younger generations a made-up perspective on historic figures? Who guarantees the correctness/factuality? We'll have students chatting with a made-up Hitler justifying his actions. So much AI slop everywhere.

TZubiri today at 5:05 AM

hi, can I have a Latin-only LLM? It could be Latin plus translations (source and target).

The corpus may be too small, but I would like that very much anyhow.

anovikov today at 6:16 AM

That Adolf Hitler seems to be a hallucination; there's nothing googlable about him at all. Also, what language could his works have been translated into German from?

superkuh yesterday at 11:11 PM

SMBC did a comic about this: http://smbc-comics.com/comic/copyright

The punchline is that the moral and ethical norms of pre-1913 texts are not exactly compatible with modern norms.

usernamed7 today at 11:49 AM

> We're developing a responsible access framework that makes models available to researchers for scholarly purposes while preventing misuse.

oh COME ON... "AI safety" is getting out of hand.
