Hacker News

Mr. Chatterbox is a Victorian-era ethically trained model

54 points by y1n0 today at 2:26 AM | 35 comments

Comments

_fw today at 10:07 AM

One thing I think would be very useful here is national archive data: there will be thousands of letters, memos and official documents shared between people alive back then under the care of a museum or government.

One of my dreams is to help digitise and make available the thousands of Second World War-era documents in the National Archives at Kew.

We’re at the point where a simple phone camera and a robust LLM-powered process can digitise ENORMOUS amounts of archive material almost effortlessly [1]. This is going to be enormous for historians eager to dive into the millions of interesting primary sources.

[1] https://generativehistory.substack.com/p/gemini-3-solves-han...

bossyTeacher today at 11:11 AM

Prompt: do you know what america is?

Response: Indeed! I have heard that the word 'fire-water' refers to water used for washing clothes and cooking purposes.

lovelearning today at 5:40 AM

I thought the title meant the training data used was ethics content and ethical reasoning. Turns out "ethically trained" means the training data used doesn't violate copyright laws.

kgeist today at 5:50 AM

Prior art: https://news.ycombinator.com/item?id=46590280

> TimeCapsuleLLM: LLM trained only on data from 1800-1875

graemep today at 8:16 AM

I am sure the British Library has ensured everything is out of copyright, but just limiting the books to before 1899 is not enough in the UK. The UK (unlike the US, but like the EU) has life + 70 copyright for books published before the copyright extensions (and when the EU extended copyright to life + 70, out-of-copyright works were brought back into copyright). For example, Shaw's works only came out of copyright in 2020. There are probably a few works by younger or longer-lived authors that are still in copyright.
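
A minimal sketch of the life + 70 arithmetic, assuming the term runs to the end of the 70th calendar year after the author's death and that "Shaw" is George Bernard Shaw (died 1950). The function names are hypothetical, and this is an illustration rather than legal advice:

```python
# Sketch of the "life + 70" rule. Assumption: protection lasts until the
# end of the 70th calendar year after the author's death.
TERM_YEARS = 70


def last_year_in_copyright(death_year: int) -> int:
    """Last calendar year in which the author's works are still protected."""
    return death_year + TERM_YEARS


def is_public_domain(death_year: int, current_year: int) -> bool:
    """True once the life + 70 term has fully elapsed."""
    return current_year > last_year_in_copyright(death_year)


# Assumed example: an author who died in 1950 is protected through 2020.
print(last_year_in_copyright(1950))
print(is_public_domain(1950, 2020))
print(is_public_domain(1950, 2021))
```

So a book published before 1899 by an author who lived into the late 1950s would still be protected today, which is why a publication-date cutoff alone is not enough.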

parpfish today at 5:12 AM

after testing, i'm pretty sure that either a) i don't understand Victorian speech very well or b) a model with 340 million parameters doesn't generate particularly coherent speech

kibibu today at 6:54 AM

The hard turn from this:

> Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data I’ve been dreaming of a model like this for a couple of years now.

To this:

> I got Claude Code to do most of the work

Gave me whiplash

heyethan today at 6:35 AM

Looks like a model size issue, but the behavior already seems largely shaped by the data distribution.

fastball today at 6:48 AM

I wonder if you could generate synthetic Victorian-era training data.

gen6acd60af today at 6:42 AM

> Honestly, it’s pretty terrible.

> But what a fun project!

voidUpdate today at 7:15 AM

It may be legally trained, but is it ethically trained? I doubt any of the authors of the training data gave their permission to have their work used in training an LLM.
