Hacker News

cahaya yesterday at 7:14 PM

I can confirm. When trying to convert simple Word sentences and tables from Word XML to e.g. Markdown/HTML, you need a PhD in XML edge cases and nested garbage.


Replies

paulbjensen yesterday at 7:41 PM

I wonder if this tool by MSFT is able to handle that:

https://github.com/microsoft/markitdown

I was amazed when I realised that Word docs were just zip files and you could poke around in the XML files embedded inside them.
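For anyone who wants to try the poking around, here is a minimal sketch using only Python's standard `zipfile` module. The helper names (`build_sample_docx`, `list_parts`, `read_part`) are made up for illustration, and the sample archive is a placeholder with the same layout idea as a .docx, not a valid Word file. With a real document you would pass its path instead.

```python
import io
import zipfile


def build_sample_docx():
    """Hypothetical helper: build a tiny docx-like zip archive in memory.

    The part contents are placeholders, not valid Word XML.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document>hello</document>")
    buffer.seek(0)
    return buffer


def list_parts(docx_file):
    """Return the names of the files stored inside the archive."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.namelist()


def read_part(docx_file, part_name="word/document.xml"):
    """Return one part of the archive decoded as UTF-8 text."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read(part_name).decode("utf-8")


sample = build_sample_docx()
print(list_parts(sample))
print(read_part(sample))
```

In a real .docx the main body text normally lives in `word/document.xml`, which is why that is the default part here.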

I almost implemented a working React -> Word document renderer back in 2017, but it didn't have support for creating XML tags with a colon inside them (the namespace-prefixed tags, such as w:p, that OOXML documents use).
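To illustrate what those colon tags are, here is a small sketch in Python rather than React, so it is not the renderer described above. It builds a `w:p` / `w:r` / `w:t` paragraph with the standard library's `xml.etree.ElementTree`. The helper names `w` and `make_paragraph` are made up for this example; the namespace URI is the WordprocessingML main namespace.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace; "w" is the prefix Word conventionally uses.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def w(tag):
    """Hypothetical helper: expand 'p' to the qualified name '{namespace}p'."""
    return f"{{{W_NS}}}{tag}"


def make_paragraph(text):
    """Build a paragraph containing a single run with a single text node."""
    paragraph = ET.Element(w("p"))
    run = ET.SubElement(paragraph, w("r"))
    text_node = ET.SubElement(run, w("t"))
    text_node.text = text
    return paragraph


xml_text = ET.tostring(make_paragraph("Hello"), encoding="unicode")
print(xml_text)
```

The colon is only a serialization detail: internally the element name is the namespace URI plus the local name, and the `w:` prefix appears when the tree is written out.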

superjan yesterday at 8:03 PM

Well, it is not pretty to see how the sausage gets made, but extracting formatted text from docx is absolutely doable, no PhD involved. Source: I have done it as a little sidequest because it was useful for auditing a set of Word documents.
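A rough sketch of what such an extraction could look like, using only the Python standard library. This is not the code mentioned above, just one possible approach under simplifying assumptions: it only looks at runs that are direct children of a paragraph (so text inside hyperlinks is skipped), handles only bold and italic, and ignores styles, lists, and table structure. All function names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def is_on(run, prop):
    """True if the run's properties enable a toggle such as 'b' or 'i'."""
    element = run.find(f"w:rPr/w:{prop}", NS)
    if element is None:
        return False
    # A missing w:val means "on"; "0", "false", and "off" switch it off.
    return element.get(f"{{{W_NS}}}val") not in ("0", "false", "off")


def run_to_markdown(run):
    """Convert one run to Markdown, wrapping bold and italic text."""
    text = "".join(t.text or "" for t in run.findall("w:t", NS))
    if not text:
        return ""
    if is_on(run, "b"):
        text = f"**{text}**"
    if is_on(run, "i"):
        text = f"*{text}*"
    return text


def docx_to_markdown(docx_file):
    """Return the non-empty paragraphs of word/document.xml as Markdown."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        line = "".join(run_to_markdown(r) for r in paragraph.findall("w:r", NS))
        if line:
            lines.append(line)
    return "\n\n".join(lines)


def build_sample_docx():
    """Hypothetical helper: a minimal in-memory archive for trying this out."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        '<w:p><w:r><w:t xml:space="preserve">Plain </w:t></w:r>'
        "<w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r></w:p>"
        '<w:p><w:r><w:rPr><w:i/><w:b w:val="0"/></w:rPr><w:t>italic</w:t></w:r></w:p>'
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer


print(docx_to_markdown(build_sample_docx()))
```

The edge cases complained about earlier in the thread (nested tables, style inheritance, tracked changes) are exactly what this sketch leaves out, so treat it as a starting point for simple documents.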