Hacker News

jp57, last Thursday at 9:24 PM

"git doesn't really work ... because docx is a binary blob."

Well, yes, but the binary blob is a zip archive of a directory of text XML files, and one could imagine tooling that wraps the git interaction in an unzip/zip bracket.
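A minimal sketch of that unzip/zip bracket, assuming only Python's standard `zipfile` module. The function names `unpack_docx` and `pack_docx` are hypothetical, and this is not an existing git integration: it only shows the two halves that a wrapper (a hook or filter, say) would call around the commit and checkout steps.

```python
import zipfile
from pathlib import Path


def unpack_docx(docx_path, out_dir):
    """Extract every part of the .docx zip archive into out_dir.

    Returns the sorted list of archive member names.
    """
    with zipfile.ZipFile(docx_path) as zf:
        zf.extractall(Path(out_dir))
        return sorted(zf.namelist())


def pack_docx(src_dir, docx_path):
    """Zip the files under src_dir back into a .docx archive."""
    src = Path(src_dir)
    files = sorted(p for p in src.rglob("*") if p.is_file())
    # Assumption: write [Content_Types].xml first, as Office-produced
    # packages conventionally do. The sort is stable, so the remaining
    # files keep their alphabetical order.
    files.sort(key=lambda p: p.name != "[Content_Types].xml")
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in files:
            # Archive member names always use forward slashes.
            zf.write(p, p.relative_to(src).as_posix())
```

With this in place, the extracted directory is what would be committed, and `pack_docx` would rebuild the file that opens in Word. Whether the resulting XML diffs are readable is a separate question.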

The real problem is that lawyers, like basically all other non-programmers, neither know nor care about the sequence of bytes that makes up a file in the minds of programmers. In their minds the file IS what they see when they open it in Word: a sequence of white rectangles with text laid out on them in specific ways, including tables with borders, etc. The fact that a lot of really complicated stuff goes on inside the file to get the WYSIWYG rendering is not only irrelevant to them, it's unknown.

Maybe the answer here will be along the lines of Karpathy's musings about making LLMs work directly with pixels (images of text) instead of encoded text and tokenizers [1]. An AI tool would take the document in its visually standard legal form, read it, and produce output with edits, redlines, etc. as directed by the user.

[1] https://x.com/karpathy/status/1980397031542989305


Replies

jpbryan, last Thursday at 9:36 PM

Diffing the XML is a complete nonstarter. I've spent years working with the OpenXML format and can assure you it is very complex even for a professional software engineer with 10 years of experience.

The diff of the document (referred to as a "redline") is what lawyers send to the client and their counterparties. It's essential that the redline is legible for all parties and reflects their professionalism.

Moreover, it is not enough to see the structural changes between the versions. A lawyer needs to see the formatting changes between the versions as well, which cannot be accomplished by diffing XML files.

jiggawatts, last Thursday at 11:31 PM

Something I've started doing in my workflow is using Pandoc to convert between Markdown and DOCX when authoring long documents. This lets me put the Markdown into Git and apply the Gemini CLI to it. When referencing other documents, I'll also convert them to MD and drop them into a folder so I can tell the AI to read them and cross-reference things.

At the start of the project the Markdown is authoritative, and the DOCX is just for previewing the styling. (Pandoc can insert the text into a layout template with placeholders.)

Towards the end of a project I'll start treating the DOCX as authoritative but continue generating Markdown from it, so I can run the AI over it as a final proof-read or whatever.
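A rough sketch of the two conversion directions, assuming `pandoc` is installed and on the PATH. The helper names and file names are hypothetical, and using `--reference-doc` to carry the styling is an assumption about how the template step is done, not something stated above.

```python
import subprocess


def md_to_docx_cmd(md_path, docx_path, reference_doc=None):
    """Build the pandoc command for Markdown -> DOCX (early in the project)."""
    cmd = ["pandoc", md_path, "-f", "markdown", "-t", "docx", "-o", docx_path]
    if reference_doc:
        # Assumption: styles come from a reference .docx file.
        cmd.append(f"--reference-doc={reference_doc}")
    return cmd


def docx_to_md_cmd(docx_path, md_path):
    """Build the pandoc command for DOCX -> Markdown (late in the project)."""
    # --wrap=none keeps each paragraph on one line rather than hard-wrapping it.
    return ["pandoc", docx_path, "-f", "docx", "-t", "markdown",
            "--wrap=none", "-o", md_path]


def run(cmd):
    """Run a pandoc command, raising CalledProcessError if it fails."""
    subprocess.run(cmd, check=True)
```

For example, `run(md_to_docx_cmd("contract.md", "contract.docx", "template.docx"))` would produce the preview document, and `run(docx_to_md_cmd("contract.docx", "contract.md"))` would regenerate the Markdown that goes into Git.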

This is similar to what people used to do with DocBook, but with a more friendly text format and a more AI-friendly "modern" workflow with Git, etc...
