Hacker News

paulbjensen, last Friday at 7:41 PM

I wonder if this tool by MSFT is able to handle that:

https://github.com/microsoft/markitdown

I was amazed when I realised that Word docs were just zip files and you could poke around in the xml files embedded inside of them.
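For anyone who wants to try the poking around, here is a minimal sketch using only Python's standard library. The `make_toy_docx` helper is hypothetical and builds a tiny stand-in archive (not a valid Word file) so the example runs without a real document; with a real file you would pass its bytes instead.

```python
import io
import zipfile


def list_parts(docx_bytes: bytes) -> list[str]:
    """Return the names of the files stored inside a .docx archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.namelist()


def read_part(docx_bytes: bytes, name: str) -> str:
    """Return one stored file, decoded as UTF-8 text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read(name).decode("utf-8")


def make_toy_docx() -> bytes:
    """Hypothetical stand-in: a zip with docx-like part names, not a valid Word file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<w:document/>")
    return buffer.getvalue()


if __name__ == "__main__":
    data = make_toy_docx()
    for part in list_parts(data):
        print(part)
```

For a real document, `open("report.docx", "rb").read()` (the file name is an assumption) gives the bytes to pass in, and `word/document.xml` is where the main body text normally lives.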

I almost implemented a working React -> Word document renderer back in 2017, but it didn't have support for creating XML tags with a ":" in their names (the namespace-prefixed tags, such as w:p, that OOXML documents use).
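The colon in those tag names is an XML namespace prefix. As a rough illustration of what such tags look like when generated programmatically, here is a sketch using Python's standard `xml.etree.ElementTree` rather than React. The `paragraph` and `to_xml` helpers are hypothetical names, and the output is only a fragment, not a complete document part.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, conventionally bound to the "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def paragraph(text: str) -> ET.Element:
    """Build a w:p element containing one run (w:r) with one text node (w:t)."""
    p = ET.Element(f"{{{W_NS}}}p")
    r = ET.SubElement(p, f"{{{W_NS}}}r")
    t = ET.SubElement(r, f"{{{W_NS}}}t")
    t.text = text
    return p


def to_xml(element: ET.Element) -> str:
    """Serialize an element to a string; the registered prefix yields w:-style tags."""
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    print(to_xml(paragraph("Hello")))
```

ElementTree stores names in `{namespace}local` form internally and only emits the `w:` prefix on serialization, which is why the namespace has to be registered first.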


Replies

favorited, last Friday at 8:17 PM

Even though markitdown is a Microsoft project, it's just a thin wrapper around a bunch of third-party Python packages. For example, to go from docx to Markdown, it uses mammoth to convert docx to HTML [0], then uses markdownify to convert the HTML into Markdown [1].

[0] https://github.com/microsoft/markitdown/blob/da7bcea527ed04c...
[1] https://github.com/microsoft/markitdown/blob/da7bcea527ed04c...
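The two-step pipeline described above can be sketched roughly as follows. This is not markitdown's actual code, just the same idea written directly against the two packages, assuming `mammoth` and `markdownify` are installed. The function name `docx_to_markdown` and the `heading_style="ATX"` choice are illustrative assumptions.

```python
def docx_to_markdown(path: str) -> str:
    """Convert a .docx file to Markdown by going through HTML.

    Imports are done lazily so the module loads even when the
    third-party packages are not installed.
    """
    import mammoth  # third-party: docx -> HTML
    from markdownify import markdownify  # third-party: HTML -> Markdown

    # Step 1: docx -> HTML. mammoth returns a result object whose
    # .value holds the generated HTML string.
    with open(path, "rb") as docx_file:
        html = mammoth.convert_to_html(docx_file).value

    # Step 2: HTML -> Markdown, using "#"-style headings.
    return markdownify(html, heading_style="ATX")


if __name__ == "__main__":
    import sys

    print(docx_to_markdown(sys.argv[1]))
```

Because the conversion passes through HTML, anything mammoth does not map to HTML never reaches the Markdown step.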

strongpigeon, last Friday at 8:08 PM

Technically, they're a bit more than just zip files (they're OPC containers [0]), but if you're hand editing the file content it doesn't really matter.

[0] Open Packaging Conventions: https://en.wikipedia.org/wiki/Open_Packaging_Conventions
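One visible difference between a plain zip and an OPC container is the package-level bookkeeping: a `[Content_Types].xml` part and, in Office files, a `_rels/.rels` relationships part. A rough heuristic check, assuming the presence of those two names is a good enough signal (it does not validate the package), might look like this. `looks_like_opc` and `make_zip` are hypothetical helper names.

```python
import io
import zipfile


def looks_like_opc(data: bytes) -> bool:
    """Heuristic: a zip that carries the usual OPC bookkeeping parts."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return False
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
    return "[Content_Types].xml" in names and "_rels/.rels" in names


def make_zip(names: list[str]) -> bytes:
    """Hypothetical test helper: build an in-memory zip with empty entries."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, "")
    return buffer.getvalue()


if __name__ == "__main__":
    plain = make_zip(["notes.txt"])
    package = make_zip(["[Content_Types].xml", "_rels/.rels", "word/document.xml"])
    print(looks_like_opc(plain), looks_like_opc(package))
```

This also hints at why hand editing mostly works: as long as the edited parts keep their names and the bookkeeping parts stay in place, the package structure is unchanged.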