Mr. Chatterbox is a Victorian-era ethically trained model

54 points • by y1n0 • today at 2:26 AM • 35 comments

Comments

One thing I think would be very useful here is national archive data: there will be thousands of letters, memos and official documents shared between people alive back then under the care of a museum or government.

One of my dreams is to help digitise and make available the thousands of Second World War-era documents in the National Archives at Kew.

We’re at the point where a simple phone camera and a robust LLM-powered process can digitise ENORMOUS amounts of archive material almost effortlessly [1]. This is going to be enormous for historians eager to dive into the millions of interesting primary sources.

[1] https://generativehistory.substack.com/p/gemini-3-solves-han...

bossyTeacher • today at 11:11 AM

Prompt: do you know what america is?

Response: Indeed! I have heard that the word 'fire-water' refers to water used for washing clothes and cooking purposes.

lovelearning • today at 5:40 AM

I thought the title meant the training data used was ethics content and ethical reasoning. Turns out "ethically trained" means the training data used doesn't violate copyright laws.

kgeist • today at 5:50 AM

Prior art: https://news.ycombinator.com/item?id=46590280

> TimeCapsuleLLM: LLM trained only on data from 1800-1875

graemep • today at 8:16 AM

I am sure the British Library has ensured everything is out of copyright, but just limiting the books to before 1899 is not enough in the UK. The UK (unlike the US, but like the EU) has life+70 copyright for books published before the copyright extensions (and when the EU extended copyright to life+70, out-of-copyright works were brought back into copyright). For example, Shaw's works only came out of copyright in 2020. There are probably a few works by younger or longer-lived authors that are still in copyright.

parpfish • today at 5:12 AM

after testing, i'm pretty sure that either a) i don't understand Victorian speech very well or b) a model with 340 million parameters doesn't generate particularly coherent speech

kibibu • today at 6:54 AM

The hard turn from this:

> Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data I’ve been dreaming of a model like this for a couple of years now.

To this:

> I got Claude Code to do most of the work

Gave me whiplash

heyethan • today at 6:35 AM

Looks like a model size issue, but the behavior already seems largely shaped by the data distribution.

fastball • today at 6:48 AM

I wonder if you could generate synthetic Victorian-era training data.

gen6acd60af • today at 6:42 AM

> Honestly, it’s pretty terrible.

> But what a fun project!

voidUpdate • today at 7:15 AM

It may be legally trained, but is it ethically trained? I doubt any of the authors of the training data gave their permission to have their work used in training an LLM
