Lammy • today at 8:08 AM

I really enjoyed using them for my Ruby file-matching library, where I wanted to read `shared-mime-info` XML source package files directly and on the fly instead of using the pre-processed secondary files that the upstream `update-mime-database` tool spits out. The problem is that a type definition can be spread across multiple XML packages in both system and user paths, so the naïve implementation of reading them all at once wastes a massive amount of memory and makes a massive number of (slow) object allocations when most people use maybe 5% of the full set of supported types (the JPEGs and HTMLs and ZIPs of the world).

I wanted to read the source package files directly because I always found `shared-mime-info`'s usual two-step process for adding or editing any of the XML type data to be annoyingly difficult and fragile. One must run `update-mime-database` to decompose arbitrarily many XML packages into a set of secondary files: one all-file-extensions, one all-magic-sequences, one all-aliases, etc. System package managers usually script that step when installing software that comes with its own type data. I've accidentally nuked my entire MATE session with `update-mime-database` before, when I wanted to pick up a manual addition and regenerated the secondary files while accidentally excluding the system path that had most of the data.

I ended up doing it with four Ractors:

- a Ractor matching inputs (MIME Type strings, file extensions, String or Pathname or URL paths for sniffing) against its loaded fully-formed type definition objects.

- a Ractor for parsing MIME Type strings (e.g. "application/xml") into Hash-keying Structs, a task for which the raw String is unsuitable since it may be overloaded with extra syntax like "+encoding_name" or fragment ";key=value" pairs.

- a fast XML-parser Ractor that takes in the key Structs (multiple at once, to minimize the number of passes needed) and figures out whether or not any of those types are defined at all, and if so in which XML packages.

- a slow XML-parser Ractor that takes the same set of multiple key Structs and loads their full definition into a complete type object, then passes the loaded objects back to the matcher Ractor.
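To make the second bullet concrete, here is a minimal sketch of parsing a type string into a Struct that works as a Hash key. This is not the library's actual code: the names `TypeKey`, `parse_type`, `media`, `subtype`, and `suffix` are made up for illustration, and dropping the `;key=value` pairs from the key is an assumption.

```ruby
# Hypothetical key type. Struct instances with equal members are eql? and
# hash alike, so they can be used directly as Hash keys.
TypeKey = Struct.new(:media, :subtype, :suffix)

# Split a raw type string into a [TypeKey, parameters] pair.
# Assumption: ";key=value" pairs are not part of the key itself.
def parse_type(string)
  essence, *params = string.split(";").map(&:strip)
  media, rest = essence.to_s.downcase.split("/", 2)
  subtype, suffix = rest.to_s.split("+", 2)

  # Deep-freeze the key so it can be passed between Ractors without copying.
  key = Ractor.make_shareable(TypeKey.new(media, subtype, suffix))

  parameters = params
    .select { |pair| pair.include?("=") }
    .to_h { |pair| pair.split("=", 2).map(&:strip) }

  [key, parameters]
end

key, parameters = parse_type("application/xml; charset=UTF-8")
same_key, _ = parse_type("Application/XML")
loaded = { key => :some_type_object }
loaded[same_key] # => :some_type_object
```

The raw strings `"application/xml; charset=UTF-8"` and `"Application/XML"` are different Hash keys, while the two parsed Structs are the same one.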

The cool part of doing it this way is that it frees up the matcher Ractor to continue servicing other callers off its already-loaded data when it gets a request for a novel type and needs to have its loader Ractors do their comparatively slow work. The matcher sets the unmatched inputs aside until the loaders get back to it with either a loaded type object or `nil` for each key Struct, and it remembers the `nil`s for a while so it doesn't re-run the loading process for inputs that are already known not to resolve to anything.
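The "remembers `nil`s for a while" part could look roughly like the sketch below. It is only an illustration of the idea, not the library's code: the class name `NegativeCache`, the `ttl:` option, and the injectable `clock:` are all hypothetical.

```ruby
# Hypothetical cache of "the loaders came back with nil for this key" results.
class NegativeCache
  def initialize(ttl:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl              # seconds a remembered miss stays valid
    @clock = clock          # injectable so the expiry logic can be tested
    @expires_at = {}        # key => monotonic deadline
  end

  # Record that loading `key` produced nil.
  def remember(key)
    @expires_at[key] = @clock.call + @ttl
  end

  # True while a remembered miss is still fresh. Expired entries are dropped,
  # so the next request for that key goes to the loaders again.
  def miss?(key)
    deadline = @expires_at[key]
    return false unless deadline
    return true if @clock.call < deadline

    @expires_at.delete(key)
    false
  end
end
```

A matcher would check `miss?(key)` before handing a key to the loaders and call `remember(key)` whenever a loader answers `nil`.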

The last pre-Ractorized version allocated around 200k objects in 7MiB of memory and retained 17k objects in 2MiB of memory for a benchmark run on a single input, with a complete data load. The Ractorized version was twice as fast in the same synthetic benchmark, allocating 20k objects in 2MiB of memory and retaining 2.5k objects in 260KiB of memory for its initial minimal data load. I have it explicitly load `application/xml` and `application/zip`, since those combined are the parent types for like a third of all the other types, plus a few other very common types of my choosing.

I think a lot of the barrier to entry for Ractors isn't in the API for the Ractors themselves but in figuring out how to interact with Ractorized code from code that hasn't been explicitly Ractorized (i.e. is running in the invisible “main” Ractor). To that end I found it easiest to emulate my traditional library API by providing synchronous entry-point methods that make it feel no different to use than any other library, despite all the stuff that goes on behind the scenes. The entry methods compose a message to the matcher Ractor, then block waiting for a result or a timeout.
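A stripped-down sketch of that kind of synchronous entry point, under a few assumptions: the `MATCHER` Ractor here is a toy with a hard-coded table, `type_for_extension` is a made-up method name, and wrapping the wait in `Timeout.timeout` is just one possible way to bound it. It sticks to `Ractor#send` and `Ractor.receive` and passes the caller's own Ractor along as the reply address.

```ruby
require "timeout"

# Toy stand-in for the matcher Ractor: answers extension lookups from a
# hard-coded table instead of real shared-mime-info data.
MATCHER = Ractor.new do
  table = { "jpg" => "image/jpeg", "html" => "text/html", "zip" => "application/zip" }
  loop do
    reply_to, extension = Ractor.receive
    reply_to.send(table[extension]) # nil when the extension is unknown
  end
end

# Synchronous entry point: callers never touch a Ractor directly.
# Raises Timeout::Error if no reply arrives within `seconds`.
def type_for_extension(extension, seconds: 1)
  MATCHER.send([Ractor.current, extension])
  Timeout.timeout(seconds) { Ractor.receive }
end

type_for_extension("jpg") # => "image/jpeg"
```

One thing this sketch glosses over: after a timeout, a late reply would still be sitting in the caller's inbox and get picked up by the next call, so a real version would need to tag requests and replies with some kind of id.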

I also use Ractors in a more lightweight way in my UUID/GUID library, where there's a Ractor serving the incrementing sequence value that acts as a disambiguator for time-based UUIDs in case multiple other Ractors (including invisible “main”) generate two UUIDs with the same timestamp. Speaking of which, I'm going to have to work on this one for Ruby 4.0, because it uses the removed `Ractor#take` method.
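For illustration, a sequence-serving Ractor can be written without `take` at all by having each requester send itself as the reply address. This is a guess at the shape, not the UUID library's code: `SEQUENCE` and `next_sequence` are hypothetical names, and it assumes `Ractor#send` and `Ractor.receive` behave the same before and after the `take` removal.

```ruby
# Hypothetical sequence server: owns the counter, so increments from any
# number of requesting Ractors are serialized through its inbox.
SEQUENCE = Ractor.new do
  value = 0
  loop do
    requester = Ractor.receive # the Ractor that wants the next value
    value += 1
    requester.send(value)
  end
end

# Ask for the next value from whichever Ractor this runs in, "main" included.
def next_sequence
  SEQUENCE.send(Ractor.current)
  Ractor.receive
end

first = next_sequence
second = next_sequence
second - first # => 1
```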