Hacker News

Microsoft offers guide to pirating Harry Potter series for LLM training

210 points by anonymous908213 yesterday at 11:19 PM | 122 comments

Comments

mcny today at 12:27 AM

You guys are talking about copyright but I think a bigger takeaway is there is a process breakdown at Microsoft. Nobody is reading or reviewing this documentation, so what hope is there that anybody is reading or reviewing their new code?

I guess the question to leadership is that two of the three pillars, namely security and quality, are at odds with the third pillar, AI innovation. Which side do you pick?

(I know you mean well and I love you, Scott Hanselman, but please don't answer this yourself. Please pass this on to the leadership.)

camkego today at 12:09 AM

The real cherry on top is that the link from the blog post by the Microsoft senior product manager goes to a Kaggle dataset page claiming the dataset is CC0: Public Domain.

https://www.kaggle.com/datasets/shubhammaindola/harry-potter...

More than just using the data, linking to a copy that claims the dataset is public domain seems problematic copyright-wise.

Also interesting: this blog post has been up since November of 2024. It is very surprising to me that Microsoft hasn't taken it down yet.

WillMorr today at 12:21 AM

Since IP law is apparently dead, does anyone want to invest in my AI-generated novel startup, where it just spits out Harry Potter verbatim but uses a bunch of power to do so?

pbrum today at 12:30 AM

Update: Microsoft has taken the page down. But posterity being what it is...

https://archive.is/D9vEN

beached_whale today at 12:03 AM

The AI-generated thumbnail, https://devblogs.microsoft.com/azure-sql/wp-content/uploads/..., is that of young Harry and friend with a prominent MS logo. Wow.

throwaway150 today at 2:04 AM

Page is gone.

Archived copy: https://web.archive.org/web/20260105115129/https://devblogs....

It is very worrying that people with no ethics work for these trillion dollar companies who are supposed to be shaping the technology of tomorrow.

andsoitis yesterday at 11:29 PM

This article is from 2024 and points to Kaggle, which hosts the data set.

I'm surprised that JKR's people haven't come down like a tonne of bricks on Kaggle / Microsoft.

Does anyone know whether there is some special reason why this has lasted so long without being taken down?

protocolture today at 2:51 AM

It doesn't offer a guide to piracy; it offers a guide on including specific data from a dataset into SQL so it can be referenced by an LLM.

If anything, Kaggle would be on the hook for including the data as CC0, or perhaps Shubham Maindola for uploading it. In fact, the "provenance" listed would give me chills. Crazy how this got a 10.0 score. "I downloaded the ebooks of Harry Potter. Then converted them to txt files."

thrKan yesterday at 11:52 PM

In case the page disappears:

https://archive.is/7WLho

dom96 yesterday at 11:51 PM

How soon before someone will be able to make an online library which generates the original books using LLMs? Surely popular titles like Harry Potter may end up so well represented in the training that we'll get the full books out of the LLM with close to 100% accuracy?

robrain yesterday at 11:58 PM

Original title: "LangChain Integration for Vector Support for SQL-based AI applications"

electronsoup today at 12:20 AM

I guess the end of copyright is near if this is fine to put on a corporate website

fxwin yesterday at 11:53 PM

I feel like the title is a bit misleading, unless the person who put all HP books on Kaggle as a (supposedly) CC0-licensed data set did so as a Microsoft employee.

Nevertheless, a pretty egregious oversight (incompetence?) and something that shouldn't have been published.

til_something today at 12:48 AM

I can still get to the article on the site; perhaps it’s cached in the CDN somewhere. Also, looking at the repo, the entire article is there, promoting the same silly things. https://github.com/Azure-Samples/azure-sql-db-vector-search/...

rfc2324 today at 12:37 AM

Jupyter notebook version here for the curious: https://github.com/Azure-Samples/azure-sql-db-vector-search/...

arkensaw yesterday at 11:40 PM

My guess is HP makes such an enormous amount of money already from movies, games, toys, and other tie-ins, that they can't be bothered to chase down the odd digital infringement of a plain text copy of the original books.

I'm sure the scripts of Star Wars would be similarly ignored if they were used.

bvan today at 2:11 AM

Wonderful 404 page. Wonder if Kai Lentit optimized it.

bryan_w today at 12:28 AM

I guess legal was part of the layoffs these past few years. Too bad we can't get a bounty from the RIAA of books, whatever that is.

miffy900 today at 12:38 AM

I recall the source code for Windows XP was leaked some years ago; not just isolated parts of the code base, like with the earlier Windows NT4/2000 source code leak, but a completely buildable repository.

If I wrote an article on training an LLM on the leaked Windows XP source code and blithely marked the source code repo as 'in the public domain', but used Azure resources for the how-to steps, would that make it OK, Microsoft? You know, your Azure division might get some money...

Seriously, this is just so... blatant. It's like we've all collectively decided that copyright just doesn't matter anymore. Just reading this article, I feel like I'm taking crazy pills.

starkeeper today at 1:32 AM

They tore the page down. Any copies?

rob_c today at 1:16 AM

I... There are parts of the world where certain developers don't understand the way the west tends to work with regard to copyright, or not blindly copying anything that is out there.

This however is a very, VERY poor situation when you end up placing your employer at risk because you think copyright doesn't matter and everything on the internet is fair game.

This is probably the most polite way I would describe this to most, UG. For the rest, just stop acting like cheating through a situation to get a step up is the norm; it's just dirty behaviour.

ece today at 2:18 AM

If copyrighted materials are used, surely copyright allows for the maker to require disclosure that their content was used in training a model.

wewewedxfgdf today at 12:22 AM

Refreshingly honest.

blibble today at 1:28 AM

"but it's fair use"

Rowling is known for actively protecting her rights as an author; they couldn't have picked a worse author to slop up.

ThrowawayTestr yesterday at 11:57 PM

Absolutely shameless

thehamkercat today at 12:26 AM

It's taken down lmao, in 1 hour

outside1234 today at 12:26 AM

I mean they are also offering up the code you are writing in your private repos to LLMs to regenerate in my repo, so let's just go nuts.

conartist6 yesterday at 11:48 PM

What in the absolute fuck

selridge today at 12:22 AM

Someone forgot the national no snitching rules, and in service of Jo, no less.

Everyone should torrent and rip off those books, anyway.

charcircuit today at 2:24 AM

This is fair use as it is for educational purposes and not for reading.
