Hacker News

nilirl · today at 12:18 PM · 1 reply

One thing I've struggled with before is building a collection of data models based off of a collection of PDF forms.

I wanted to abstract away the PDF form by building my own HTML form on top of a data model that could later be used to programmatically fill the PDF.

Since I had 100s of PDFs, I wanted an OCR+LLM pipeline to build a data model for each PDF. Unfortunately, OCR+LLM works ~90% of the time, but sometimes fields are missed or mislabeled in the data model.
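
To make the failure mode concrete, here is a minimal sketch of one possible check: diff the field names in the generated data model against the field names actually present in the PDF form. This is only an illustration, not the pipeline described above. `diff_fields`, `FieldDiff`, and the sample field names are hypothetical, and it assumes the PDFs are fillable forms whose field names were read separately (for example with pypdf's `PdfReader.get_fields()`).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FieldDiff:
    missing: set[str]     # present in the PDF form, absent from the data model
    unexpected: set[str]  # present in the data model, absent from the PDF form


def diff_fields(pdf_fields, model_fields) -> FieldDiff:
    """Compare PDF form field names with data-model field names."""
    pdf, model = set(pdf_fields), set(model_fields)
    return FieldDiff(missing=pdf - model, unexpected=model - pdf)


# Hypothetical sample data: the model missed "ssn" and mislabeled "dob".
pdf_fields = ["first_name", "last_name", "dob", "ssn"]
model_fields = ["first_name", "last_name", "date_of_birth"]

diff = diff_fields(pdf_fields, model_fields)
print("missing from model:", sorted(diff.missing))
print("not in PDF:", sorted(diff.unexpected))
```

A non-empty diff would only flag a form for manual review; it cannot tell whether a field was missed or just renamed.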

Does this sometimes get it wrong during programmatic filling? How do you deal with that?


Replies

nip · today at 12:33 PM

[dead]