If the Unicode consortium would spend less time and effort on emoji and more on making the most common/important mathematical symbols and notations available/renderable in plain text, maybe we could move past the (LA)TeX/PDF marriage. OpenType and TrueType have (edit: for well over a decade, actually) supported the conditional rendering needed to display sequences of Unicode code points the way math requires (theoretically, anyway). And with missing-glyph-only font family substitution available pretty much everywhere, you can seamlessly display symbols absent from your primary font using a fallback asset (something like Noto, with every Unicode symbol supported by design, or math-specific fonts like Cambria Math or TeX Gyre, etc). There are no technical restrictions.
I’ve actually dug into this in the past, and it was never lack of technical ability that prevented them from adding even just proper superscript/subscript support; rather, it was their opinion that this didn’t belong in the symbolic layer. But since emoji abuse/rely on ZWJ and modifiers left and right to display in one of a myriad of variations, there’s really no good reason not to allow the same, because 2 and the superscript-two symbol (²) are not semantically the same (so it’s not just a design choice).
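The plain-text substitution idea above can be sketched in a few lines. The Unicode Superscripts and Subscripts block already covers digits and a handful of operators, so a simple translation table gets you part of the way; the `sup` helper and its character set here are my own illustration, and the incompleteness of Unicode's superscript coverage is exactly the gap being complained about:

```python
# Sketch: faking superscripts with existing Unicode code points, no TeX
# needed. Coverage is incomplete, which is part of the point above.
SUPERSCRIPTS = str.maketrans("0123456789+-=()n", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ")

def sup(base: str, exponent: str) -> str:
    """Append `exponent` to `base` using Unicode superscript forms."""
    return base + exponent.translate(SUPERSCRIPTS)

print(sup("x", "2"))   # x²
print(sup("e", "-n"))  # e⁻ⁿ
```

Anything outside the table (most letters, for a start) falls through untranslated, which is why a real solution needs the font-level conditional rendering described above rather than a lookup table.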
An interesting (complete) tangent is that Gemini 3 Pro is the only model I’ve tested (I do a lot of math-related stuff with LLMs) that absolutely will not under any circumstances respect (system/user) prompt requests to avoid inline math mode (i.e., LaTeX) in the output, regardless of whether I asked for a blanket ban on TeX/MathJax/etc or insisted that it use extended Unicode code points to substitute for all math formula rendering (I primarily use LLMs via the TUI, where I don’t have MathJax support, and as familiar as I once was with raw TeX mathematical notation and symbols, it’s still quite easy to misread unrendered raw output if you’re not careful). I shared my experiment and results here – Gemini 3 Pro would insist on rendering even single-letter constants or variables as $k$ instead of just k (or k in markdown italics, etc.) no matter how hard I asked it not to (which makes me think it may have been overfit on raw LaTeX papers, and is also an interesting argument in favor of the “VL LLMs are the more natural construct” position): https://x.com/NeoSmart/status/1995582721327071367?s=20
As an arXiv author who likes using complicated TeX constructions, the introduction of HTML conversion has increased my workload a lot: I now have to write fallback macros that render okay after conversion. The conversion is super slow, and there is no way to faithfully simulate it locally. Still, I think it's a great thing to do.
Is this new or somehow updated? HTML versions of papers have been available for several years now.
EDIT: indeed, it was introduced in 2023: https://blog.arxiv.org/2023/12/21/accessibility-update-arxiv...
I wish epub was more common for papers. I have no idea if there's any real difficulties with that, or just not enough demand.
Accessibility barriers in research are not new, but they are urgent. The message we have heard from our community is that arXiv can have the most impact in the shortest time by offering HTML papers alongside the existing PDF.
It must have been around 1998. I was editor of our school’s newspaper. We were using Corel Draw. At some point, I proposed that we start using HTML instead. In the end, we decided against it, and the reasons were the same that you can read here in the comments now.
>Did you know that 90% of submissions to arXiv are in TeX format, mostly LaTeX? That poses a unique accessibility challenge: to accurately convert from TeX—a very extensible language used in myriad unique ways by authors—to HTML, a language that is much more accessible to screen readers and text-to-speech software, screen magnifiers, and mobile devices.
Challenging. Good work!
Dumb question, but what stops browsers from rendering TeX directly (aside from the work to implement it)? I assume it's more than just the rendering.
Unfortunately, I didn't see any recommendation there on what can be done for old papers. I checked, and only my papers after 2022 have an HTML version. I wish they'd add some kind of 'try HTML' button for those.
Pandoc can convert to SVG, which can then be inlined in HTML. It looks just like LaTeX, though copy/paste isn't very useful.
[Sept 2023] as per the wayback machine.
Seeing the Gemini 3 capabilities, I can imagine a near future where file formats are effectively irrelevant.
I don't think HTML is the right approach. HTML is better than PDF, but it is still a format for displaying/rendering.
The actual paper content format should be separated from its rendering.
I.e., it should contain the abstract, sections, equations, figures, citations, etc., but it shouldn't specify font sizes, layout, etc.
The viewer platforms should then be able to style the content differently.
Can't help but wonder if this was motivated in part by people feeding papers into LLMs for summary, search, or review. PDF is awful for LLMs. You're effectively pigeonholed into using (PAYING for) Adobe's proprietary app and models which barely hold a candle to Gemini or Claude. There are PDF-to-text converters, but they often munge up the formatting.
It's extremely easy to convert HTML/CSS to a PDF with the browser's print-to-PDF feature.
All papers should be in HTML/CSS or TeX, then simply converted to PDF.
Why are we even talking about this?
Wasn't the World Wide Web invented at CERN specifically for sharing scientific papers? Why are we still using PDFs at all?
This is not new; the title should say (2023). They have shipped the HTML feature with an "experimental" flag for two years now, but I don't know whether there is even any plan to move it out of the experimental phase.
It's not much of an "experiment" if you don't plan to use some experimental data to improve things somehow.
I was reading through this article too, glad to have found it on here
Maybe unpopular, but papers should be in a markdown flavor to be determined, just to make them more machine-readable.
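The machine-readability argument can be made concrete: even a naive stdlib parser can recover the section structure of a markdown-flavored paper. This is a sketch assuming plain ATX-style `#` headings (the hypothetical flavor to be determined); the `outline` helper and the sample text are my own:

```python
import re

# A toy markdown-flavored "paper" with ATX-style headings.
SAMPLE = """\
# A Paper Title
## Abstract
We study things.
## 1. Introduction
Body text.
"""

def outline(md: str) -> list[tuple[int, str]]:
    """Extract (heading level, title) pairs from ATX markdown headings."""
    return [(len(m.group(1)), m.group(2).strip())
            for m in re.finditer(r"^(#{1,6})\s+(.*)$", md, re.MULTILINE)]

print(outline(SAMPLE))
# [(1, 'A Paper Title'), (2, 'Abstract'), (2, '1. Introduction')]
```

Doing the equivalent against arbitrary author LaTeX or a rendered PDF is a research problem; against a constrained markdown flavor it's a one-liner, which is the point.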
Why do we like HTML more than pdfs?
HTML rendering requires you to be connected to the internet, or to set up the images and MathJax locally. A PDF just works.
HTML obviously supports dynamic embedding (of programs, for example) much better, but people usually just post a github.io page with the paper.
Hi, an arXiv HTML Papers developer here.
As a very brief update: a larger release is pending.
You will spot many (many) issues with our current coverage and fidelity of paper rendering. When they jump out at you, please report them to us. All reports from the last two years have landed on GitHub. We have made a bit of progress since, but there is (a lot of) low-hanging fruit left to pick.
Project issues:
https://github.com/arXiv/html_feedback/issues/
The main bottleneck at the moment is developer time. And the main vehicle for improvements on the LaTeX side of things continues to be LaTeXML. Happy to field any questions.