Hacker News

el_don_almighty 06/16/2025

I have been looking for something that would ingest a decade of old Word and PowerPoint documents and convert them into a standardized format where the individual elements could be repurposed for other formats. This seems like a critical building block for a system that would accomplish this task.

Now I need a catalog, archive, or historian function that archives and pulls the elements easily. Amazing work!


Replies

pxc 06/16/2025

Can't you just start with unoconv or pandoc, then maybe use an LLM to clean up after converting to plain text?
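
For what it's worth, the pandoc half of that suggestion could look roughly like the sketch below. It assumes pandoc is installed and on PATH, and it only covers the DOCX-to-plain-text case; the function names `pandoc_command` and `convert_to_plain_text` are made up for illustration, and the LLM cleanup step is left out entirely.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(src, dst):
    """Build the pandoc invocation: read src, write plain text to dst."""
    # -t plain selects pandoc's plain-text writer; -o names the output file.
    return ["pandoc", str(src), "-t", "plain", "-o", str(dst)]


def convert_to_plain_text(src, dst):
    """Run pandoc on one file and return the output path (hypothetical helper)."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(pandoc_command(src, dst), check=True)
    return Path(dst)
```

Looping `convert_to_plain_text` over a directory of old documents would produce the plain text that a later cleanup pass works on.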

toledocavani 06/17/2025

Which decade? DOCX and PPTX are just zipped XML files, which seems pretty standard to me.
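
To illustrate the "zipped XML" point, here is a minimal sketch using only the Python standard library. It assumes the main document part lives at the conventional path `word/document.xml` (strictly, the package relationships decide that), and `docx_paragraphs` is a made-up name. It ignores styles, tables, and images, and paragraphs nested inside text boxes may be counted twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w:p (paragraph) and w:t (text) elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each non-empty paragraph in a .docx file.

    source may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        # Assumption: the main part is at the conventional location.
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more w:t runs.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

A PPTX file can be opened with `zipfile` the same way, although its slide XML uses different part names and namespaces.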