I've developed a Python API service that uses GPT-4o for OCR on PDFs. It features parallel processing and batch handling for improved performance. Not only does it convert PDFs to Markdown, but it also describes the images within the PDF using captions like `[Image: This picture shows 4 people waving]`.
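The gist of the per-page call is something like this (a simplified sketch rather than the exact code from the repo; it assumes PyMuPDF for rasterization and the OpenAI Python SDK, and the prompt and function name are just illustrative):

```python
import base64

import fitz  # PyMuPDF, used here to rasterize each PDF page
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Convert this page to Markdown. Preserve headings, lists and tables. "
    "For any picture, insert a caption like [Image: short description]."
)

def page_to_markdown(pdf_path: str, page_number: int) -> str:
    """Rasterize one PDF page and ask GPT-4o to transcribe it as Markdown."""
    doc = fitz.open(pdf_path)
    pix = doc[page_number].get_pixmap(dpi=200)  # render the page to a bitmap
    image_b64 = base64.b64encode(pix.tobytes("png")).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        temperature=0.1,
    )
    return response.choices[0].message.content
```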
In testing with NASA's Apollo 17 flight documents, it successfully converted complex, multi-oriented pages into well-structured Markdown.
The project is open-source and available on GitHub. Feedback is welcome.
Speak of the devil: I ran into hallucinations with Ollama and the reader-lm model (converting HTML to Markdown) just the other day. In 40% of cases it spewed out things that weren't in the input (not exactly surprising, given that it's a generative model).
Turns out the model needs a temperature of zero (and then it seems to behave well, at least in simple tests), but that wasn't in the default model settings.
https://github.com/ollama/ollama/issues/6875#issuecomment-23...
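In case it helps anyone hitting the same thing, the temperature can be overridden per request through Ollama's HTTP API (a minimal sketch; the HTML snippet is just a placeholder):

```python
import requests

html_snippet = "<h1>Hello</h1><p>reader-lm test</p>"

# Ollama's /api/generate accepts per-request option overrides, so the
# temperature can be forced to 0 even if the Modelfile doesn't set it.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "reader-lm",
        "prompt": html_snippet,           # the HTML to convert to Markdown
        "options": {"temperature": 0},    # greedy decoding: no sampling randomness
        "stream": False,
    },
)
print(resp.json()["response"])
```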
You're spot on. We shouldn't lump all LLMs together. This approach might work wonders for Anthropic and OpenAI's top-tier models, but it could fall flat with smaller, less complex ones.
I purposely set the temperature to 0.1, thinking the LLM might need a little wiggle room when whipping up those markdown tables. You know, just enough leeway to get creative if needed.
I get your worries about LLMs and their consistency problems, but I think we can fix a lot of that by using LLMs themselves for checks. If you're after top-notch accuracy, you could throw in another prompt, feed in both the visual and text input, and double-check that nothing's lost in translation. The cheaper models are actually great for this kind of quality control. LLMs have come a long way since they first showed up, and I reckon they've stepped up their game enough to shake off that old reputation for inconsistent output.
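Roughly what I have in mind, as a sketch (the model choice and prompt are placeholders, not what the service actually ships with):

```python
from openai import OpenAI

client = OpenAI()

def verify_extraction(image_b64: str, markdown: str) -> str:
    """Ask a cheaper model to compare the page image against the extracted Markdown."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # cheaper model used purely as a checker
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Here is a page image and the Markdown extracted from it. "
                    "List any text, table rows or figures visible in the image "
                    "that are missing or altered in the Markdown. "
                    "Reply 'OK' if nothing is missing.\n\n" + markdown
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        temperature=0,
    )
    return response.choices[0].message.content
```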
How would you know something is missing?
I've tried multiple OCR engines before, and it's hard to tell whether the output is accurate without comparing it manually.
I created a tool to visualise OCR output [0] and see what's missing, and there are many cases that would be quite concerning, especially when working with financial data.
This tool wouldn't work with LLMs, as they don't return character-level recognition data (to my knowledge), which makes it harder to evaluate them at scale.
If I wanted to use LLMs for this task, I would use them to help train an ML model to do OCR better, for example by generating thousands of synthetic samples to train on (rough sketch below).
[0] https://github.com/orasik/parsevision
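Something along these lines (a toy sketch; in practice the text would come from an LLM prompted to produce realistic documents, and you'd vary fonts, noise and layout):

```python
from PIL import Image, ImageDraw, ImageFont

def render_sample(text: str, path: str) -> None:
    """Render a text snippet onto a white canvas so (image, text) becomes a labelled pair."""
    img = Image.new("RGB", (800, 200), "white")
    draw = ImageDraw.Draw(img)
    draw.multiline_text((10, 10), text, fill="black", font=ImageFont.load_default())
    img.save(path)

# Hard-coded here to keep the sketch self-contained; an LLM would generate
# invoices, statements, tables, etc. at scale.
snippets = ["Invoice #1042\nTotal: $1,337.00", "Balance due: 2024-05-01\nAccount: 9981-22"]
for i, snippet in enumerate(snippets):
    render_sample(snippet, f"sample_{i}.png")  # the ground-truth label is `snippet` itself
```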
Wow, you knocked it out of the park! I'll be sure to use this when I tackle that evaluation.
If you can use an LLM for sanity checking, why can't you use it for extraction in the first place?
Because current models output a stream of tokens directly, and those tokens are the performance and billing unit. Better models can do a better job of producing reasonable output, but there's a limit to what can be done "on the fly".
Some models, like OpenAI's o1, have started employing internal "thinking" tokens, which may or may not be equivalent to performing multiple passes with the same or different models, but have a similar effect.
One way to look at it is that if you want better results, you have to put more computational resources into thinking. Also, just like with humans, a team effort produces more well-rounded results, because you combine the strengths and offset the weaknesses of the different team members.
You can technically wrap all this into a single black box and have it converse with you as if it were one single entity that internally uses multiple models to think, cross-check, etc. The output is likely not going to be real-time though, and real-time conversation has until now been a very important feature.
In the future we may, on one hand, relax the real-time constraint and accept that for some tasks accuracy is more important than real-time results.
Or we may eventually have faster machines, or cleverer algorithms, that can "think" more in shorter amounts of time.
(Or a combination of the two)
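To sketch the "single black box" idea (`extract` and `check` stand in for whatever extractor and checker models you plug in; both are hypothetical):

```python
from typing import Callable

def extract_with_cross_check(
    image_b64: str,
    extract: Callable[[str, str], str],  # (image, hint) -> markdown, any extractor model
    check: Callable[[str, str], str],    # (image, markdown) -> issues or "OK", any checker model
    max_passes: int = 2,
) -> str:
    """Wrap extraction plus cross-checking into one slower but self-verifying 'black box'."""
    markdown = extract(image_b64, "")              # first pass: plain extraction
    for _ in range(max_passes):
        issues = check(image_b64, markdown)        # a second (possibly cheaper) model reviews it
        if issues.strip() == "OK":
            break                                  # checker found nothing missing
        markdown = extract(image_b64, issues)      # retry with the checker's findings as a hint
    return markdown
```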
Determinism is also up there, because post-processing can catch and fix common errors.
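For example, a deterministic pass over the output can repair common digit confusions (illustrative only):

```python
import re

def fix_numeric_ocr_confusions(markdown: str) -> str:
    """Fix common character confusions inside number-like tokens (O->0, l/I->1, S->5)."""
    def repair(match: re.Match) -> str:
        return match.group(0).translate(str.maketrans("OolIS", "00115"))
    # Only touch tokens that start with a digit, e.g. "1O5.3l" -> "105.31".
    return re.sub(r"\b\d[\dOolIS.,]*\b", repair, markdown)

print(fix_numeric_ocr_confusions("| Total | 1O5.3l |"))  # -> "| Total | 105.31 |"
```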