Hacker News

b112 · yesterday at 7:16 AM · 1 reply

I applaud your efforts, but that seems difficult to me. There's so much nuance in language, and the original Spanish translation would depend on the locale the dictionary was aimed at, and on the era, since language changes over time.

And that translation is likely only a rough approximation, since words rarely translate directly. Adding an extra layer (Spanish -> English) stacks on another imperfect abstraction.

Of course your efforts target a niche, so people will likely understand the attempt and be thankful. I hope this suggestion isn't too forward, but since this is an electronic version, you could offer some way to show the original Spanish on demand. That sort of functionality would be quite helpful; even non-native Spanish speakers might get a clearer picture.

What tools are you using to abstract all of this?

If the spacing and columns of the images are consistent, I'd think ImageMagick would let you automate extraction by column (e.g., cutting the individual pages up), and OCR could then get to work.
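A minimal sketch of that column split. The page dimensions and split point here are assumptions; measure your own scans first (e.g. with `magick identify page-1.png`), and use `convert` instead of `magick` on ImageMagick 6:

```shell
# Assumed geometry -- check real values with: magick identify page-1.png
PAGE_W=2480             # full page width in px (assumption)
PAGE_H=3508             # full page height in px (assumption)
SPLIT=$((PAGE_W / 2))   # assumed column boundary

for page in page-*.png; do
  [ -e "$page" ] || continue   # skip cleanly if no scans are present
  base="${page%.png}"
  # -crop geometry is widthxheight+xoffset+yoffset
  magick "$page" -crop "${SPLIT}x${PAGE_H}+0+0" "${base}-col1.png"
  magick "$page" -crop "$((PAGE_W - SPLIT))x${PAGE_H}+${SPLIT}+0" "${base}-col2.png"
done
```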

For the Shipibo side, I'd want to turn off all LLM interpretation. That tends to use known groupings of words to probabilistically determine best-match, and that'd wreak havoc in this case.

Back to the images: once you have ImageMagick chopping and sorting, writing a very short script to iterate over the pages, display them, and prompt with y/n would be a massive time saver. Doing so at each step would be helpful.

For example, one step: cut off the header and footer and save to a directory, using helpful naming conventions (page-1 and page-1-noheader_footer). You could then use ImageMagick to combine page-1 and page-1-noheader_footer side by side.
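That step might look like this. The header/footer heights are placeholder assumptions to tune against the real scans; `-chop` removes a strip from the edge that `-gravity` points at, and `+append` joins images left to right:

```shell
HEADER_H=150   # assumed header strip height in px -- tune per scan
FOOTER_H=120   # assumed footer strip height in px -- tune per scan

for page in page-*.png; do
  [ -e "$page" ] || continue   # skip cleanly if no scans are present
  base="${page%.png}"
  # shave the assumed header and footer strips off the top and bottom
  magick "$page" \
    -gravity North -chop "0x${HEADER_H}" \
    -gravity South -chop "0x${FOOTER_H}" \
    "${base}-noheader_footer.png"
  # original and cut result side by side, ready for the vetting pass
  magick "$page" "${base}-noheader_footer.png" +append "${base}-compare.png"
done
```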

Now run a simple bash vet script: each of the 500 pages pops up, you instantly see the original and the cut result, and you hit y or n. One could go through 500 pages like this in 10 to 20 minutes, and you'd be left with a small subset of pages that didn't get cut properly (extra large footer or whatever). If it's down to 10 pages or some such, that's an easy tweak and fix for those.
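A sketch of that vet loop. The viewer (`display`, ImageMagick's own) and the `page-N-compare.png` naming are assumptions; override the viewer with `VIEWER=...` to match your setup:

```shell
# y/n vetting pass: show each side-by-side image, log rejects for manual fixing.
vet_pages() {
  local viewer="${VIEWER:-display}"   # assumed viewer; override with VIEWER=...
  local compare answer pid
  : > failed-pages.txt                # rejected pages accumulate here
  for compare in page-*-compare.png; do
    [ -e "$compare" ] || continue
    "$viewer" "$compare" &            # pop the comparison image up
    pid=$!
    read -r -p "$compare ok? [y/n] " answer
    kill "$pid" 2>/dev/null || true   # close the viewer before the next page
    [ "$answer" = "y" ] || echo "$compare" >> failed-pages.txt
  done
}
```

After a run, `failed-pages.txt` is exactly the small subset that needs the manual tweak-and-fix pass.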

Once done, you could do the same for column cuts. You'd already have all the scripts, so it's just tweaking.

I'm mentioning all of this because a combination of automation and human intervention is often the best approach to something like this.

Anyhow, good luck!


Replies

temp0826 · yesterday at 2:20 PM

Thanks for the suggestions, I do appreciate it. I was being pretty brief in my post, but I really have spent a lot of time on this and tried it from a number of angles. I've had good luck with non-LLM tools for the initial OCR, but they're not context-aware, especially about column/page breaks (like I mentioned, it's kind of a dirty scan, and if a break lands on a Shipibo part it barfs a bit; good for a rough search at least).

I would love to create a JSON version that would essentially have a bunch of fields for each word (Shipibo/Spanish/English word/definition/example, type of word, etc.). It's further complicated by how words can be modified in Shipibo (it's actually a very technical language: words can take any number of prefixes and suffixes that change their meaning and precision. In their "icaros", the healing songs they sing in ceremony, the most technical use of the language is considered the most beautiful. Essentially poetry built from their "medical" jargon).
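One shape such an entry could take, just to make the fields concrete. Every field name and every `<...>` value here is a placeholder I'm inventing, not actual dictionary content or an agreed schema:

```json
{
  "shipibo": "<headword>",
  "affixes": ["<prefix>", "<suffix>"],
  "part_of_speech": "<noun, verb, ...>",
  "spanish": {
    "definition": "<original dictionary definition>",
    "example": "<example sentence from the scan>"
  },
  "english": {
    "definition": "<translated definition>",
    "example": "<translated example>"
  }
}
```

Keeping the Spanish and English sides as separate objects would also make the "show the original Spanish" toggle from the parent comment nearly free.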

I've done some human-in-the-loop attempts but still come up short in one way or another (I end up getting frustrated and throwing my hands up after seeing how much time I dump on it). So I figure this will remain a good test as the tools (and my prompting abilities) get better. It's definitely not urgent for me.