I would like to see what their process for safety alignment and guardrails is with that model. They give some spicy examples on GitHub, but the responses are tepid and a lot more diplomatic than I would expect.
Moreover, the prose sounds too modern. It seems the base model was trained on a contemporary corpus, something like 30% modern and 70% Victorian content.
Even with half a dozen samples it doesn't seem distinct enough to represent the era they claim.
Using texts up to 1913 includes works like The Wizard of Oz (1900, with 8 other books up to 1913), two of the Anne of Green Gables books (1908 and 1909), etc., all of which read as modern.
The Victorian era (1837-1901) covers works by Charles Dickens and the like, which still read as fairly modern. These would have been part of the initial training, before the alignment to the 1900-cutoff texts, which are largely modern in prose apart from some archaic language and the absence of technology, events, and language drift from after that period.
And pulling in works from 1800-1850, you have the Brontës and authors like Edgar Allan Poe, who was influential in detective and horror fiction.
Note that other works from around that time, like Sherlock Holmes, span both the initial training (pre-1900) and the finetuning (post-1900).