Hacker News

Replacing a 3 GB SQLite db with a 10 MB FST (finite state transducer) binary

119 points by hiAndrewQuinn today at 10:33 AM | 19 comments

Comments

Hendrikto today at 12:18 PM

> I do wish to point out, of course, that the whole reason it was possible to experiment cheaply and come across this serendipity was because 9 months ago, faced with the choice to either do the bad easy thing or the good nothing, I chose to do the bad easy thing. The SQLite database worked! I understood how it worked, behind the scenes with its B-trees and its Full Text Search extension.

This is the most important takeaway, imo, and a very valuable technique: Start with the obvious, stupid solution that definitely works. Then do the optimized version, while making sure it matches the naive implementation. In this case, the optimized version could even be generated from the naive one.
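As a minimal sketch of that workflow (not the article's actual code), the snippet below keeps an obvious linear-scan prefix search as the reference, generates a sorted index from the same data, and asserts that both agree on every query. The function names, the word list, and the choice of a sorted list with `bisect` as the "optimized" structure are all assumptions made for illustration.

```python
import bisect


def naive_prefix_search(words, prefix):
    """Obvious reference version: scan every word."""
    return sorted({w for w in words if w.startswith(prefix)})


def build_index(words):
    """Generate the optimized structure from the same source data."""
    return sorted(set(words))


def fast_prefix_search(index, prefix):
    """Binary-search for the first candidate, then walk the contiguous matches."""
    lo = bisect.bisect_left(index, prefix)
    hi = lo
    while hi < len(index) and index[hi].startswith(prefix):
        hi += 1
    return index[lo:hi]


def check_equivalence(words, queries):
    """Fail loudly on the first query where the two versions disagree."""
    index = build_index(words)
    for query in queries:
        expected = naive_prefix_search(words, query)
        actual = fast_prefix_search(index, query)
        assert actual == expected, (query, expected, actual)
    return True


# Hypothetical sample data, including a duplicate and a miss.
WORDS = ["car", "card", "care", "cat", "dog", "car"]
QUERIES = ["", "ca", "car", "card", "do", "x", "cars"]
check_equivalence(WORDS, QUERIES)
```

The naive version stays in the test suite as the oracle, so any later change to the fast path is checked against it automatically.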

wood_spirit today at 3:48 PM

This was a fun read! Thanks for the great introduction to Finite State Transducers. I hadn't heard the formal term before, but your article gave me serious déjà vu.

Years ago, I entered a Scrabble programming contest and needed to compress a GADDAG dictionary to fit into my 6MB L3 cache. Without knowing the official name for it, I ended up using the exact same suffix-compression mechanism by moving characters to the edges instead of the nodes to merge overlapping paths.

Sharing my old write-up here in case you or other data-structure nerds find the overlap interesting! https://williame.github.io/post/87682811573.html
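To make the suffix sharing concrete, here is a minimal sketch that builds a plain trie with characters on the edges and then merges identical subtrees bottom-up, which is the basic idea behind a DAWG/DAFSA. It is not the GADDAG code from the write-up above nor the FST format from the article; the function names and the four-word sample list are assumptions chosen for illustration.

```python
def build_trie(words):
    """Plain trie: each node is {'final': bool, 'edges': {char: child}}."""
    root = {"final": False, "edges": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["edges"].setdefault(ch, {"final": False, "edges": {}})
        node["final"] = True
    return root


def count_trie_nodes(node):
    """Number of nodes in the unshared trie."""
    return 1 + sum(count_trie_nodes(child) for child in node["edges"].values())


def minimize(node, registry):
    """Merge identical subtrees bottom-up and return this node's state id.

    Two nodes are identical when they have the same final flag and the same
    labelled edges leading to already-merged children.
    """
    edges = tuple(sorted(
        (ch, minimize(child, registry)) for ch, child in node["edges"].items()
    ))
    signature = (node["final"], edges)
    if signature not in registry:
        registry[signature] = len(registry)
    return registry[signature]


def accepts(states, root_id, word):
    """Walk the merged automaton; states maps id -> (final, edges)."""
    state = root_id
    for ch in word:
        _, edges = states[state]
        next_state = dict(edges).get(ch)
        if next_state is None:
            return False
        state = next_state
    return states[state][0]


# Hypothetical sample: "tap(s)" and "top(s)" share the suffixes "p" and "ps".
WORDS = ["tap", "taps", "top", "tops"]
trie = build_trie(WORDS)
registry = {}
root_id = minimize(trie, registry)
states = {state_id: sig for sig, state_id in registry.items()}

print(count_trie_nodes(trie))  # 8 nodes in the plain trie
print(len(states))             # 5 states after merging shared suffixes
```

Because the labels sit on the edges, the nodes reached after "ta" and "to" have identical outgoing structure and collapse into one state, which is where the space saving comes from.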

lscharen today at 11:59 AM

I was halfway through the article when I began thinking that the data structure he described sounded very similar to something I used about 20 years ago.

Sure enough, the first paragraph on the Wikipedia entry for DAFSA is:

> DAFSA is the rediscovery of a data structure called Directed Acyclic Word Graph (DAWG)

hmokiguess today at 3:49 PM

This was such a pleasant read. Thank you for sharing your experience; I now feel motivated to go solve the same problems twice!

cadamsdotcom today at 3:10 PM

Why was the download 3 GB, if the solution achieved a 300x reduction primarily by sharing suffixes? Wouldn’t vanilla compression have exploited the same redundancy and achieved a decent (if not ideal) reduction in the size of the database?

asibahi today at 12:41 PM

This was a very interesting read. I wonder if similar techniques can apply to Turkish or Japanese dictionaries?
