Hacker News

HTML as an Accessible Format for Papers

183 points by el3ctron, today at 2:59 PM | 96 comments

Comments

dginev today at 10:15 PM

Hi, an arXiv HTML Papers developer here.

As a very brief status note: a larger update is pending.

You will spot many (many) issues with the current coverage and fidelity of our paper rendering. When they jump out at you, please report them to us. All reports from the last 2 years have landed on GitHub. We have made a bit of progress since, but there is (a lot) more low-hanging fruit to pick.

Project issues:

https://github.com/arXiv/html_feedback/issues/

The main bottleneck at the moment is developer time. And the main vehicle for improvements on the LaTeX side of things continues to be LaTeXML. Happy to field any questions.

ComputerGuru today at 7:21 PM

If the Unicode Consortium would spend less time and effort on emoji and more on making the most common and important mathematical symbols and notations available and renderable in plain text, maybe we could move past the (La)TeX/PDF marriage. OpenType and TrueType now (edit: for well over a decade, actually) support the conditional rendering needed to make sequences of Unicode code points display the way they should (theoretically, anyway). Fallback font substitution for missing glyphs is also available pretty much everywhere, so symbols absent from your primary font can be drawn seamlessly from a fallback asset (something like Noto, with every Unicode symbol supported by design, or math-specific fonts like Cambria Math or TeX Gyre, etc.). So there are no technical restrictions.

I’ve actually dug into this in the past, and it was never a lack of technical ability that kept them from adding even proper superscript/subscript support, but rather their opinion that this didn’t belong in the symbolic layer. But since emoji rely on (and abuse) ZWJ and modifiers left and right to display in any of a myriad of variations, there’s really no good reason not to allow the same here, because 2 and the squared symbol are not semantically the same (so it’s not a design choice).
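
To make the superscript point concrete, here is a minimal Python sketch. Unicode already encodes a limited set of precomposed superscript characters (the digits and a handful of others), so plain text can show a squared term, but there is no general mechanism for raising arbitrary content. The names `SUPERSCRIPTS` and `to_superscript` are made up for illustration, and the table deliberately covers only part of what Unicode offers.

```python
# Map a few ASCII characters to their precomposed Unicode superscripts.
# This table is an illustrative subset, not a complete list.
SUPERSCRIPTS = str.maketrans(
    "0123456789+-=()n",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
    "\u207a\u207b\u207c\u207d\u207e\u207f",
)


def to_superscript(text: str) -> str:
    """Return `text` written with superscript characters.

    Raises ValueError for any character missing from the table above,
    which is the limitation being discussed: only a fixed set of
    characters has a superscript form available in plain text.
    """
    missing = [ch for ch in text if ord(ch) not in SUPERSCRIPTS]
    if missing:
        raise ValueError(f"no superscript mapping in this table for: {missing}")
    return text.translate(SUPERSCRIPTS)


if __name__ == "__main__":
    print("x" + to_superscript("2"))       # x followed by SUPERSCRIPT TWO
    print("e" + to_superscript("-(n+1)"))  # works only because each char is in the table
```

Anything outside the table, such as `to_superscript("k")`, raises `ValueError` in this sketch; a general solution would need markup or font-level rendering rather than a lookup table.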

An interesting (complete) tangent: Gemini 3 Pro is the only model I’ve tested (I do a lot of math-related stuff with LLMs) that absolutely will not, under any circumstances, respect (system/user) prompt requests to avoid inline math mode (aka LaTeX) in its output. That held whether I asked for a blanket ban on TeX/MathJax/etc. or insisted that it use extended Unicode code points in place of all rendered math formulas. (I primarily use LLMs via the TUI, where I don’t have MathJax support, and as familiar as I once was with raw TeX mathematical notation and symbols, it’s still quite easy to misread unrendered raw output if you’re not careful.) I shared my experiment and results here: Gemini 3 Pro would insist on rendering even single-letter constants or variables as $k$ instead of just k (or k in markdown italics, etc.) no matter how hard I asked it not to, which makes me think it may have been overfit on raw LaTeX papers, and is also an interesting argument in favor of “VL LLMs are the more natural construct”: https://x.com/NeoSmart/status/1995582721327071367?s=20

DominikPeters today at 6:46 PM

As an arXiv author who likes using complicated TeX constructions, I have found that the introduction of HTML conversion has increased my workload a lot, since I now have to write fallback macros that render okay after conversion. The conversion is super slow and there is no way to faithfully simulate it locally. Still, I think it's a great thing to do.

ForceBru today at 3:51 PM

Is this new or somehow updated? HTML versions of papers have been available for several years now.

EDIT: indeed, it was introduced in 2023: https://blog.arxiv.org/2023/12/21/accessibility-update-arxiv...

ekjhgkejhgk today at 5:45 PM

I wish EPUB were more common for papers. I have no idea whether there are any real difficulties with that, or just not enough demand.

el3ctron today at 2:59 PM

Accessibility barriers in research are not new, but they are urgent. The message we have heard from our community is that arXiv can have the most impact in the shortest time by offering HTML papers alongside the existing PDF.

leobg today at 5:56 PM

It must have been around 1998. I was editor of our school’s newspaper. We were using Corel Draw. At some point, I proposed that we start using HTML instead. In the end, we decided against it, and the reasons were the same ones you can read here in the comments now.

Barbing today at 3:57 PM

>Did you know that 90% of submissions to arXiv are in TeX format, mostly LaTeX? That poses a unique accessibility challenge: to accurately convert from TeX—a very extensible language used in myriad unique ways by authors—to HTML, a language that is much more accessible to screen readers and text-to-speech software, screen magnifiers, and mobile devices.

Challenging. Good work!

percentcer today at 8:06 PM

Dumb question, but what stops browsers from rendering TeX directly (aside from the work to implement it)? I assume it's more than just the rendering.

sega_sai today at 3:59 PM

Unfortunately, I didn't see any recommendation there on what can be done for old papers. I checked, and only my papers after 2022 have an HTML version. I wish they'd make some kind of 'try HTML' button for those.

jas39 today at 4:13 PM

Pandoc can convert to SVG, which can then be inlined in HTML. It looks just like LaTeX, though copy/paste isn't very useful.

sundarurfriend today at 4:23 PM

[Sept 2023] as per the Wayback Machine.

nateroling today at 4:06 PM

Seeing the Gemini 3 capabilities, I can imagine a near future where file formats are effectively irrelevant.

billconan today at 4:42 PM

I don't think HTML is the right approach. HTML is better than PDF, but it is still a format for displaying/rendering.

The actual paper content format should be separated from its rendering.

That is, it should contain the abstract, sections, equations, figures, citations, etc., but it shouldn't have font sizes, layout, etc.

The viewer platforms should then be able to style the content differently.
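
A minimal Python sketch of that separation, under loud assumptions: the `Paper` and `Section` classes and the two renderer functions are hypothetical and not taken from any existing paper format. The point is only that the content carries structure (title, abstract, sections) and no styling, while each viewer decides how to present it.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Section:
    """One section of a paper: structure only, no fonts or layout."""
    title: str
    paragraphs: list[str] = field(default_factory=list)


@dataclass
class Paper:
    """Hypothetical presentation-free paper model."""
    title: str
    abstract: str
    sections: list[Section] = field(default_factory=list)


def render_html(paper: Paper) -> str:
    """One possible viewer: emit semantic HTML, escaping the text."""
    parts = [f"<h1>{escape(paper.title)}</h1>", f"<p>{escape(paper.abstract)}</p>"]
    for section in paper.sections:
        parts.append(f"<h2>{escape(section.title)}</h2>")
        parts.extend(f"<p>{escape(p)}</p>" for p in section.paragraphs)
    return "\n".join(parts)


def render_text(paper: Paper) -> str:
    """Another viewer: plain text with underlined section titles."""
    lines = [paper.title.upper(), "", paper.abstract]
    for section in paper.sections:
        lines += ["", section.title, "-" * len(section.title)]
        lines += section.paragraphs
    return "\n".join(lines)


if __name__ == "__main__":
    paper = Paper("A & B", "Abstract.", [Section("Intro", ["Hello."])])
    print(render_html(paper))
    print(render_text(paper))
```

Both renderers read the same `Paper` object, so changing the look means changing a renderer, never the content. Equations, figures, and citations would need their own node types, which this sketch leaves out.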

ashleyn today at 3:54 PM

Can't help but wonder if this was motivated in part by people feeding papers into LLMs for summary, search, or review. PDF is awful for LLMs. You're effectively pigeonholed into using (PAYING for) Adobe's proprietary app and models which barely hold a candle to Gemini or Claude. There are PDF-to-text converters, but they often munge up the formatting.
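
As a rough illustration of why HTML is easier to feed into text pipelines than PDF: the markup keeps the text in reading order, so Python's standard library alone can pull it out. This is a minimal sketch and not how arXiv or any LLM vendor processes papers; `TextExtractor` and `html_to_text` are hypothetical names, and real paper HTML (math, tables, footnotes) would need more care.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()  # convert_charrefs=True by default, so &amp; becomes &
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text of `html`, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    sample = "<h1>Title</h1><style>p{}</style><p>Body &amp; more</p>"
    print(html_to_text(sample))
```

The same job on a PDF generally means reconstructing reading order from positioned glyphs, which is where converters tend to mangle formatting.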

teddy-smith today at 5:12 PM

It's extremely easy to convert HTML/CSS to a PDF with the print to PDF feature of the browser.

All papers should be in HTML/CSS or TeX and then simply converted to PDF.

Why are we even talking about this?

_dain_ today at 6:46 PM

Wasn't the World Wide Web invented at CERN specifically for sharing scientific papers? Why are we still using PDFs at all?

cubefox today at 5:24 PM

This is not new; the title should say (2023). They have shipped the HTML feature with an "experimental" flag for two years now, but I don't know whether there is even any plan to move out of the experimental phase.

It's not much of an "experiment" if you don't plan to use some experimental data to improve things somehow.

lalithaar today at 3:48 PM

I was reading through this article too, glad to have found it on here

rootnod3 today at 4:01 PM

Maybe unpopular, but papers should be in some Markdown flavor, to be determined. Just to make them more machine-readable.

vatsachak today at 5:10 PM

Why do we like HTML more than PDFs?

HTML rendering requires you to be connected to the internet, or to set up the images and MathJax locally. A PDF just works.

HTML obviously supports dynamic embedding, such as programs, much better, but people usually just post a github.io page with the paper.
