Hacker News

souvik3333 06/16/2025

Actually, we have trained the model to convert to markdown and do semantic tagging at the same time. E.g., equations are extracted as LaTeX, and images (plots, figures, and so on) are described within `<img>` tags. The same goes for `<signature>`, `<watermark>`, and `<page_number>`.

Also, for complex tables, we extract HTML tables instead of markdown tables.
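
To make the tagging concrete, here is a minimal Python sketch of how a downstream consumer might pull the tagged spans out of such output. Everything here is an assumption for illustration: the sample text is made up, the tags are assumed to come in open/close pairs like `<img>...</img>`, and `extract_tags` and `strip_tags` are hypothetical helpers, not part of the model or any library.

```python
import re

# Hypothetical model output. The exact format is an assumption; only the
# tag names come from the comment above.
SAMPLE = """# Quarterly Report

Revenue ratio: $r = \\frac{a}{b}$.

<img>Bar chart of revenue by quarter.</img>

<table><tr><td>Q1</td><td>10</td></tr></table>

<signature>J. Doe</signature>
<watermark>CONFIDENTIAL</watermark>
<page_number>3</page_number>
"""

# Semantic tags named in the comment (assumed to be paired).
TAGS = ("img", "signature", "watermark", "page_number")


def extract_tags(text, tags=TAGS):
    """Return {tag: [inner text, ...]} for each paired semantic tag."""
    found = {}
    for tag in tags:
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        found[tag] = [match.strip() for match in pattern.findall(text)]
    return found


def strip_tags(text, tags=TAGS):
    """Drop the tagged spans, keeping the markdown body and HTML tables."""
    for tag in tags:
        text = re.sub(rf"<{tag}>.*?</{tag}>\n?", "", text, flags=re.DOTALL)
    return text


if __name__ == "__main__":
    print(extract_tags(SAMPLE))
    print(strip_tags(SAMPLE))
```

Regular expressions are enough here only because these tags are assumed to be flat and non-nested. The HTML tables would need a real HTML parser.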


Replies

mgr86 06/16/2025

Have you considered XML? TEI, for example, is very robust and mature for marking up documents.

jtbayly 06/16/2025

What happens to footnotes?
