logoalt Hacker News

dv35zlast Saturday at 2:39 PM4 repliesview on HN

Curious if anyone has strategies on how to perform parallel writes to an SQLite database using Python's `multiprocessing` Pool.

I am using it to loop through a database of 11,000 words, hit an HTTP API for each (ChatGPT) and generate example sentences for the word. I would love to be able to asynchronously launch these API calls and have them come back and update the database row when ready, but not sure how to handle the database getting hit by all these writes from (as I understand it) multiple instances of the same Python program/function.


Replies

zielast Saturday at 3:13 PM

Technically SQLite can only have 1 writer at any given moment, but it can appear like it works across multiple writers and let it serialize the calls for you.

By default SQLite will not do what you want out of the box. You have to turn on some feature flags(PRAGMA) to get it to behave for you. You need WAL mode, etc read:

* https://kerkour.com/sqlite-for-servers * https://zeroclarkthirty.com/2024-10-19-sqlite-database-is-lo...

My larger question is why multiprocessing? this looks like an IO heavy workload, not CPU bound, so python asyncio or python threads would probably do you better.

multiprocessing is when your resource hog is CPU(probably 1 python process per CPU), not IO bound.

show 1 reply
sethevlast Saturday at 3:22 PM

Have you tried it?

What you're describing sounds like it would work fine to me. The blog post is misleading imho - it implies that SQLite doesn't handle concurrency at all. In reality, you can perform a bunch of writes in parallel and SQLite will handle running them one after the other internally. This works across applications and processes, you just need to use SQLite to interact with the database. The blog post is also misleading when it implies that the application has to manage access to the database file in some way.

Yes, it's correct that only one of those writes will execute at a time but it's not like you have to account for that in your code, especially in a batch-style process like you're describing. In your Python code, you'll just update a row and it will look like that happens concurrently with other updates.

I'll bet that your call to ChatGPT will take far longer than updating the row, even accounting for time when the write is waiting for its turn in SQLite.

Use WAL-mode for the best performance (and to reduce SQLITE_BUSY errors).

show 1 reply
crazygringoyesterday at 5:03 PM

It should just work.

If one thread is writing another thread tries to write, the first thread will have the file write lock, and the second thread will wait to write until that lock is released.

I've written code using the pattern you describe and it's totally fine.

mickeyplast Saturday at 2:42 PM

Edit: disregard. I read it as he'd done it and had contention problems.

You can't. You have a single writer - it's one of the many reasons sqlite is terrible for serious work.

You'll need a multiprocessing Queue and a writer that picks off sentences one by one and commits it.

show 1 reply