Hacker News

dogma1138 · yesterday at 4:20 PM

It would be interesting to train a cutting-edge model with a cutoff date of, say, 1900 and then prompt it about QM and relativity with some added context.

If the model comes up with anything even remotely correct, it would be quite strong evidence that LLMs are a path to something bigger; if not, then I think it is time to go back to the drawing board.


Replies

bazzargh · yesterday at 4:51 PM

You would find things in there that were already close to QM and relativity. The Michelson-Morley experiment was 1887 and the Lorentz transformations came along in the 1890s. The photoelectric effect (which Einstein explained in terms of photons in 1905) was also discovered in 1887. William Clifford (who _died_ in 1879) had notions that foreshadowed general relativity: "Riemann, and more specifically Clifford, conjectured that forces and matter might be local irregularities in the curvature of space, and in this they were strikingly prophetic, though for their pains they were dismissed at the time as visionaries." - Banesh Hoffmann (1973)

Things don't happen all of a sudden, and with the ability to see all the scientific papers of the era, it's possible those ideas could have fallen out of the synthesis.

wongarsu · yesterday at 7:27 PM

I'm trying to work towards that goal by training a model on mostly German science texts up to 1904 (before the world wars, German was the lingua franca of most sciences).

Training data for a base model isn't that hard to come by, even though you have to OCR most of it yourself because the publicly available OCRed versions are commonly unusably bad. But training a model large enough to be useful is a major issue. Training a 700M-parameter model at home is very doable (and is what this TimeCapsuleLLM is), but to get that kind of reasoning you need something closer to a 70B model. Also, a lot of the "smarts" of a model gets injected in fine-tuning and RL, but any of the available fine-tuning datasets would obviously contaminate the model with 2026 knowledge.
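
A minimal sketch of the kind of cutoff filter such a corpus needs, assuming each OCRed document carries a publication-year field in JSONL metadata (the field names and file paths here are hypothetical, not from the actual project):

    import json

    CUTOFF_YEAR = 1904  # keep only texts published before the cutoff

    def iter_pre_cutoff(path):
        """Yield the text of every document published strictly before CUTOFF_YEAR."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                doc = json.loads(line)  # one JSON object per line
                year = doc.get("publication_year")
                if year is not None and int(year) < CUTOFF_YEAR:
                    yield doc["text"]

    # Write the filtered corpus to a plain-text file for pretraining.
    with open("pre1904_corpus.txt", "w", encoding="utf-8") as out:
        for text in iter_pre_cutoff("ocr_documents.jsonl"):
            out.write(text + "\n")

The same date gate would have to be applied to any fine-tuning or RL data, which is exactly where the contamination problem bites.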

DevX101 · yesterday at 5:28 PM

Chemistry would be a great space to explore. The last quarter of the 19th century had a ton of advancements in chemistry. It'd be interesting to see if an LLM could propose fruitful hypotheses or make predictions about the science of thermodynamics.

forgotpwd16 · yesterday at 4:48 PM

Done a few weeks ago: https://github.com/DGoettlich/history-llms (discussed in: https://news.ycombinator.com/item?id=46319826)

At least the model part. Although others have had the same thought as you, AFAIK none have tried it.

bravura · yesterday at 6:03 PM

A rigorous approach to predicting the future of text was proposed by Li et al. (2024), "Evaluating Large Language Models for Generalization and Robustness via Data Compression" (https://ar5iv.labs.arxiv.org/html//2402.00861), and I think that work should get more recognition.

They measure compression (perplexity) on future Wikipedia, news articles, code, arXiv papers, and multi-modal data. Data compression is intimately connected with robustness and generalization.
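
In practice that measurement boils down to asking how many bits per character a model needs to encode text written after its training cutoff. A rough sketch of that metric with an off-the-shelf Hugging Face causal LM (GPT-2 here is only a stand-in; the paper evaluates much larger models):

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in; any causal LM works
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def bits_per_character(text):
        """Model cross-entropy on `text`, expressed as bits per character."""
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(input_ids=ids, labels=ids).loss  # mean nats per predicted token
        total_nats = loss.item() * (ids.shape[1] - 1)  # loss averages over the shifted targets
        return total_nats / math.log(2) / len(text)

    # Lower bits/char on post-cutoff text = better compression = better generalization.
    print(bits_per_character("Some article published after the model's training cutoff."))

Comparing that number on pre-cutoff versus post-cutoff text is essentially the paper's test of how well a model generalizes forward in time.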

kristopolous · yesterday at 9:29 PM

It's going to be divining tea leaves. It will be 99% wrong, and then someone will say, "Oh, but look at this tea leaf over here! It's almost correct."

samuelson · yesterday at 7:21 PM

I think it would be fun to see if an LLM would reframe some scientific terms from the time in a way that would actually fit in our current theories.

I imagine if you explained quantum field theory to a 19th-century scientist, they might think of it as a more refined understanding of the luminiferous aether.

Or if an 18th-century scholar learned about positive and negative ions, it could be seen as an expansion/correction of phlogiston theory.

nickdothutton · yesterday at 7:13 PM

I would love to ask such a model to summarise the handful of theories or theoretical “roads” being eyed at the time and to make a prediction with reasons as to which looks most promising. We might learn something about blind spots in human reasoning, institutions, and organisations that are applicable today in the “future”.

defgeneric · yesterday at 8:38 PM

The development of QM was so closely connected to experiments that it's highly unlikely the model would get there, even though some of the experiments had been performed prior to 1900.

Special relativity, however, seems possible.

root_axis · yesterday at 8:10 PM

I think it would raise some interesting questions, but if it did yield anything noteworthy, the biggest question would be why that LLM is capable of pioneering scientific advancements and none of the modern ones are.

tokai · yesterday at 4:42 PM

Looking at the training data, I don't think it will know anything.[0] I doubt On the Connexion of the Physical Sciences (1834) is going to have much about QM. While the cut-off is 1900, it seems many of the texts are much closer to 1800 than 1900.

[0] https://github.com/haykgrigo3/TimeCapsuleLLM/blob/main/Copy%...

imjonse · yesterday at 4:30 PM

I suppose the vast majority of training data used for cutting-edge models was created after 1900.

metalliqaz · yesterday at 4:58 PM

Yann LeCun spoke explicitly about this idea recently, and he asserts definitively that the LLM would not be able to add anything useful in that scenario. My understanding is that other AI researchers generally agree with him, and that it's mostly the hype beasts like Altman who think there is some "magic" in the weights that is actually intelligent. Their payday depends on it, so it is understandable. My opinion is that LeCun is probably correct.

damnitbuilds · yesterday at 9:15 PM

I like this; it would be exciting (and scary) if it deduced QM, and informative if it couldn't.

But I also think we can do this with normal LLMs trained on up-to-date text, by asking them to come up with any novel theory that fits the facts. It doesn't have to be a groundbreaking theory like QM, just original and not (yet) proven wrong?

a-dub · yesterday at 4:20 PM

Yeah, I was just wondering that. I wonder how much STEM material is in the training set...

nickpsecurity · yesterday at 7:17 PM

That would be an interesting experiment. It might be more useful to make a model with a cutoff close to when copyrights expire, to be as modern as possible.

Then we'd have a model that knows quite a bit in modern English, and we'd also legally have a dataset for everything it knows. That opens up all kinds of experimentation and copyright-safe training strategies.

Project Gutenberg up to the 1920s seems to be the safest bet on that.