Mm. I'm a bit sceptical of the historical expertise of someone who thinks that "Who art Henry" is 19th century language. (It's not actually grammatically correct English from any century whatever: "art" is the second person singular, so this is like saying "who are Henry?")
Could this be an experiment to show how likely LLMs are to lead to AGI, or at least intelligence well beyond our current level?
If you could only give it texts and info and concepts up to Year X, well before Discovery Y, could we then see if it could prompt its way to that discovery?
Suppose two models with similar parameters trained the same way on 1800-1875 and 1800-2025 data. Running both models, we get probability distributions across tokens, let's call the distributions 1875' and 2025'. We also get a probability distribution finite difference (2025' - 1875'). What would we get if we sampled from 1.1*(2025' - 1875') + 1875'? I don't think this would actually be a decent approximation of 2040', but it would be a fun experiment to see. (Interpolation rather than extrapolation seems just as unlikely to be useful and less likely to be amusing, but what do I know.)
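The sampling idea can be sketched with toy numbers (every distribution below is invented; real ones would come from a softmax over the full vocabulary). One wrinkle: linear extrapolation can leave the probability simplex, so you have to clip negatives and renormalize before sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical next-token distributions from the two models.
vocab = ["the", "steam", "engine", "computer", "telegraph"]
p_1875 = np.array([0.40, 0.25, 0.20, 0.05, 0.10])
p_2025 = np.array([0.35, 0.05, 0.10, 0.40, 0.10])

alpha = 1.1  # extrapolation factor from the comment above

# 1875' + 1.1 * (2025' - 1875'), clipped back onto the simplex.
p_extrap = p_1875 + alpha * (p_2025 - p_1875)
p_extrap = np.clip(p_extrap, 0.0, None)
p_extrap = p_extrap / p_extrap.sum()

token = rng.choice(vocab, p=p_extrap)
```

Doing this per decoding step across two full models is exactly the kind of logit arithmetic people already do for contrastive decoding, so the experiment is at least mechanically cheap.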
I wonder how racist it is
Very interesting, but the one issue I see here is data: what was recorded and survives into the training set is heavily skewed toward those prominent or recognized enough to have written things down and had them preserved, a far cry from today's status quo where everyone can trivially document their thoughts and life. I suspect a frontier model today has 50+ TB of training data in the form of text alone, several orders of magnitude more information, from a much more diverse range of viewpoints, than what survives from that period. The output from the "what happened in 1834" prompt reads like a newspaper/bulletin, which is likely a huge part of what was digitized (newspapers etc.).
Very cool concept, though it definitely has some bias.
Heh, at least this wouldn't spread emojis all over my readmes. Hm, come to think of it I wonder how much tokenization is affected.
Another thought, just occurred when thinking about readmes and coding LLMs: obviously this model wouldn't have any coding knowledge, but I wonder if it could be possible to combine this somehow with a modern LLM in such a way that it does have coding knowledge, but it renders out all the text in the style / knowledge level of the 1800's model.
Offhand I can't think of a non-fine-tuning trick that would achieve this. I'm thinking back to how the old style transfer models used to work, where they would swap layers between models to get different stylistic effects applied. I don't know if that's doable with an LLM.
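There is one non-fine-tuning trick in roughly this spirit: logit arithmetic at decode time (as in proxy-tuning / contrastive decoding), where you add the delta between a small period-trained model and a small modern-trained baseline onto the big model's logits. A toy sketch with invented logit values, assuming all three models share a vocabulary:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy next-token logits over a shared vocabulary (all values invented).
# modern: a large model that knows about code.
# period: a small model trained only on 1800s text.
# base:   a small model trained like `period` but on modern text,
#         used as a contrast so only the *style* delta is applied.
modern = np.array([2.0, 0.5, 1.5, 0.1])
period = np.array([0.2, 2.5, 0.1, 1.0])
base   = np.array([1.0, 0.3, 1.2, 0.4])

beta = 1.0  # strength of the style steering
# Keep the big model's knowledge, add (period - base) as a style nudge.
steered = softmax(modern + beta * (period - base))
```

Whether that actually produces "1800s prose about Rust" rather than incoherence is an empirical question, but it avoids fine-tuning entirely.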
I think it would be very cute to train a model exclusively in pre-information age documents, and then try to teach it what a computer is and get it to write some programs. That said, this doesn't look like it's nearly there yet, with the output looking closer to Markov chain than ChatGPT quality.
Fascinating idea. There was another "time-locked" LLM project that popped up on HN recently[1]. Their model output is really polished but the team is trying to figure out how to avoid abuse and misrepresentation of their goals. We think it would be cool to talk to someone from 100+ years ago but haven't seriously considered the many ways in which it would be uncool. Interesting times!
If the output of this were even somewhat coherent, it would disprove the argument that massive amounts of copyrighted works are required to train an LLM. Unfortunately, that does not appear to be the case here.
Is there a link where I can try it out?
Edit: I figured it out
"The Lord of the Rings uding the army under the command of his brother, the Duke of York, and the Duke of Richmond, who fell in the battle on the 7th of April, 1794. The Duke of Ormond had been appointed to the command of the siege of St. Mark's, and had received the victory of the Rings, and was thus commanded to move with his army to the relief of Shenham. The Duke of Ormond was at length despatched to oppose them, and the Duke of Ormond was ordered
Cool! I also did something like this: https://github.com/hallvardnmbu/transformer
But on various data (i.e., separate model per source): the Bible, Don Quixote and Franz Kafka. (As well as a (bad!) lyrics generator, and translator.)
I wonder if you could train an LLM with everything up to Einstein. Then see if with thought experiments + mathematics you could arrive at general relativity.
Anyone seen a low-friction way to run prompts through this yet, either via a hosted API or chat UI or a convenient GGML or MLX build that runs in Ollama or llama.cpp or LM Studio?
There was a discussion around a very similar model (Qwen3 based) some weeks ago:
https://news.ycombinator.com/item?id=46319826
I found it particularly thought-inspiring how a model with training from that time period completely lacks context/understanding of what it is itself, but then I realized that we are the same (at least for now).
Hari Seldon would, no doubt, find this fascinating. Imagine having a sliding-window LLM that you could use to verify a statistical model of society. I wonder what patterns it could deduce?
> OCR noise (“Digitized by Google”) still present in outputs
This feels like a neat sci-fi short story hook to explain the continuous emergence of God as an artifact of a simulation
Oh, I have been thinking about this for a long time. The intelligence we have in these models represents a snapshot of a time.
Now if I trained a foundation model on documents from the Library of Alexandria, and only texts of that period, I would have a chance at rudimentary insight into what the world was like at that time.
And maybe time-shift even further back.
It's interesting that it's trained off only historic text.
Back in the pre-LLM days, someone trained a Markov chain off the King James Bible and a programming book: https://www.tumblr.com/kingjamesprogramming
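For anyone who hasn't seen how those worked, the underlying trick is tiny: map each n-word prefix to the words that follow it in the corpus, then walk the table. A minimal sketch (the corpus here is a made-up stand-in for the Bible-plus-programming-book mashup):

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each `order`-word prefix to the words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        prefix = tuple(words[i:i + order])
        chain[prefix].append(words[i + order])
    return chain

def generate(chain, length=12, seed=0):
    rng = random.Random(seed)
    prefix = rng.choice(list(chain))
    out = list(prefix)
    for _ in range(length):
        followers = chain.get(tuple(out[-len(prefix):]))
        if not followers:
            break  # dead end: prefix only appears at the corpus tail
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = ("in the beginning the function was created and the function "
          "was good and the function returned unto the caller")
chain = build_chain(corpus)
sample = generate(chain)
```

The surreal juxtapositions come entirely from prefixes that happen to occur in both source texts, which is why mixing two very different corpora works so well.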
I'd love to see an LLM equivalent, but I don't think that's enough data to train from scratch. Could a LoRA or similar be used in a way to get speech style to strictly follow a few megabytes worth of training data?
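The LoRA idea itself is compact enough to sketch in plain numpy: freeze the pretrained weight W and learn only a low-rank update BA, so a few megabytes of style data only has to fit r·2d parameters per layer instead of d². Toy shapes, not a real training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2                         # hidden size and LoRA rank (r << d)
W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection (init 0)

def lora_forward(x, scale=1.0):
    # Frozen path plus the low-rank adapter: W x + scale * B (A x).
    return W @ x + scale * (B @ (A @ x))

x = rng.normal(size=d)
# With B initialized to zero the adapter starts as a no-op,
# so the base model's behaviour is preserved at step 0.
y0 = lora_forward(x)
```

Because only A and B are trained, a few megabytes of King-James-style text is a far less hopeless dataset than it would be for from-scratch training; whether it's enough to make the style "strict" is the open question.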
Training LLMs on data with certain date cut-offs and then doing comparative analysis between the LLMs would be interesting.
I wonder if the LLM would hallucinate on hard science topics due to all the fantasy and romantic literature.
> Frankenstein – Mary Shelley (1818)
> The Vampyre – John Polidori (1819)
> Lord Byron, Shelley, Keats, ...
This kind of technique seems like a good way to test model performance against benchmarks. I strongly suspect that new models are taking popular benchmark solutions into their training data. So: how does e.g. ChatGPT's underlying architecture perform on SWE-bench if trained only on data prior to 2024?
Let's see how someone from the past reacts when you tell them about modern technology
Think I'll ask it to come up with some jacquard loom patterns. vibe-weaving.
It would be interesting if there's enough data to train a model capable enough to converse with and ask about contemporary views on issues of the day, or what it thought about "potential" future events/technologies yet to happen.
LOL
PROMPT:Charles Darwin
Charles DarwinECCEMACY. Sir, — The following case is interesting to me : — I was in London a fortnight, and was much affected with an attack of rheumatism. The first attack of rheumatism was a week before I saw you, and the second when I saw you, and the third when I saw you, and the third in the same time. The second attack of gout, however, was not accompanied by any febrile symptoms, but was accompanied with an increased flow of urine, and with a more copious discharge of urine. The third attack was an hour after I saw you, and was succeeded by a return of a paroxysm of gout, and a more rapid return of the gout. The fourth attack was also accompanied by a fever, but was not always accompanied by any febrile symptoms. The third attack of gout was a fortnight after you had been ill, and the fourth was followed by a paroxysm of gout. The fourth attack was a fortnight after you were attacked, and was accompanied by a sense
A LLM trained only on data from certain time periods to ~reduce modern bias~ enhance past bias
Can you confidently say that the architecture of the LLM doesn't include any a priori bias that might affect the integrity of this LLM?
That is, the architectures of today are chosen to yield the best results given the textual data around today and the problems we want to solve today.
I'd argue that this absence of bias would need to be established (if it hasn't been already) before this kind of model has credence.
LLMs aren't my area of expertise but during my PhD we were able to encode a lot of a priori knowledge through the design of neural network architectures.
Looks a lot like the output from a markov chain...
I would pay like $200/month if there was an LLM out there that I could only communicate with using an old-timey telegraph key and morse code.
This will be something good - would love something on Ollama or lmstudio.
Exciting idea!
"I'm sorry, my knowledge cutoff is 1875"
The "1917 model" from a few weeks back post-trained the model with ChatGPT dialogue, so it had modern dialect and proclivities.
A truly authentic historical model will have some unsavory opinions and very distinctive dialect.
Can I use it to get up-to-date legal advice on Arizona reproductive health laws?
Would be interesting to train a cutting edge model with a cut off date of say 1900 and then prompt it about QM and relativity with some added context.
If the model came up with anything even remotely correct, it would be quite strong evidence that LLMs are a path to something bigger. If not, then I think it's time to go back to the drawing board.