Using LLMs and Human-in-the-Loop for reliable document processing

In recent years, Large Language Models (LLMs) like GPT-4, Claude 2, and Gemini have made impressive advances in extracting and understanding data from complex documents such as contracts, invoices, and receipts. But like even the sharpest human analyst, AI can and will make mistakes. In business-critical workflows, where there is little room for error, relying on a fully autonomous system isn't just risky; it's irresponsible. That's why businesses should adopt AI agents with partial autonomy: AI systems that keep humans in the loop.

Last week, Andrej Karpathy, former director of AI at Tesla and a founding member of OpenAI, gave a talk at Y Combinator's AI Startup School where he reflected on the importance of human-AI collaboration. He recalled a ride in 2013 with a friend who worked at Google's self-driving car project, now Waymo:

We got into this car and we went for an about 30-minute drive around Palo Alto, highways, streets and so on, and that drive was perfect. Zero intervention. And this was 2013, which is now 12 years ago. It kind of struck me because at the time when I had this perfect drive, this perfect demo, I felt like "self-driving is imminent because this just works. This is incredible." But here we are, 12 years later, and we are still working on autonomy. We are still working on driving (AI) agents. Even now, we haven't actually solved the problem.

Karpathy emphasizes the need to keep "AI on a leash." When implementing AI solutions, he suggests building partially autonomous workflows with AI and human verification working hand-in-hand.

I would argue that keeping humans in the loop is nearly as important in document processing as it is in self-driving cars.

Confidence scores: Knowing when the AI is uncertain

Modern LLM-based extraction systems can attach a confidence score to each prediction, indicating how certain the model is about its output. For example, the results of a processed invoice might look like this:

  • Invoice Date: "2025-05-10" – 98% confidence
  • Vendor Name: "HPL Technologies" – 95% confidence
  • Total Amount: "$3,287.99" – 57% confidence
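
Where do these numbers come from? Some LLM APIs expose per-token log-probabilities, and one common heuristic is to fold those into a field-level score. A minimal sketch in Python, assuming the extraction call hands back the log-probabilities of the tokens generated for each field (the function name and example values here are illustrative):

    import math

    def field_confidence(token_logprobs):
        # Geometric mean of token probabilities: one common heuristic for
        # turning per-token log-probabilities into a field-level score.
        if not token_logprobs:
            return 0.0
        return math.exp(sum(token_logprobs) / len(token_logprobs))

    # e.g. the tokens behind "$3,287.99", with one very uncertain token
    print(round(field_confidence([-0.01, -0.05, -1.2, -0.4]), 2))  # 0.66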

By setting a confidence threshold, businesses can route only low-confidence predictions to human reviewers, letting the system handle the rest autonomously.
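
A minimal sketch of that routing logic, assuming each extracted field arrives as a (value, confidence) pair like the invoice above (the threshold and function names are illustrative, not a prescribed API):

    CONFIDENCE_THRESHOLD = 0.90  # illustrative; tune per use case

    def route_document(fields):
        # Flag the document for human review if any field falls below
        # the threshold; otherwise let it pass straight through.
        low_confidence = {name: (value, conf)
                          for name, (value, conf) in fields.items()
                          if conf < CONFIDENCE_THRESHOLD}
        if low_confidence:
            return "human_review", low_confidence
        return "auto_approve", {}

    invoice = {
        "invoice_date": ("2025-05-10", 0.98),
        "vendor_name": ("HPL Technologies", 0.95),
        "total_amount": ("$3,287.99", 0.57),
    }
    decision, flagged = route_document(invoice)
    print(decision, flagged)  # human_review {'total_amount': ('$3,287.99', 0.57)}

In practice you would likely use per-field thresholds, since a misread total is costlier than a misread date.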

This selective review approach offers two major benefits:

  1. Automation: Most documents are processed without human intervention.
  2. Accuracy: Humans focus only where they're truly needed.

Example of using LLM confidence in document processing workflows in Cradl AI.

What makes a great Human-in-the-Loop UI?

An effective HITL interface allows users to quickly verify or correct AI predictions—creating a smooth rhythm between human oversight and machine automation.

A great UI should enable users to:

  • View predictions side-by-side with the source document
  • Instantly see confidence scores
  • Accept, edit, or override predictions with ease
  • Provide feedback to improve the model over time

Key Features to Include:

  • Highlighting: Link extracted fields directly to document text
  • Inline Editing: Make quick corrections with minimal friction
  • Validation Tools: Flag conflicts or edge cases that need attention
  • Feedback Loops: Use human corrections to continuously retrain the model and reduce future errors (a minimal sketch of such a loop follows below)

Example human review user interface in Cradl AI.
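
The feedback-loop item above deserves a concrete shape. A minimal sketch, assuming each review outcome is appended to a JSONL log that later feeds retraining and evaluation (the schema and file name are illustrative):

    import datetime
    import json

    def record_review(doc_id, field, predicted, corrected, confidence,
                      log_path="review_log.jsonl"):
        # One labeled example per reviewed field; accepted predictions are
        # logged too, with corrected == predicted, so the model also learns
        # which outputs were already right.
        entry = {
            "doc_id": doc_id,
            "field": field,
            "predicted": predicted,
            "corrected": corrected,
            "confidence": confidence,
            "reviewed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }
        with open(log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")

    record_review("inv-0042", "total_amount", "$3,287.99", "$3,287.90", 0.57)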

The Payoff: Smart, scalable automation

Let's say your system processes 100 invoices per day. If the model is confident about every field on 90% of them:

  • 90 documents flow through untouched
  • 10 go to human reviewers
  • You cut manual workload by at least 90%, without compromising on accuracy

And as your model improves with retraining, the need for human intervention continues to shrink: more predictions clear the confidence threshold, or the threshold can be raised without flooding reviewers, pushing the system closer to fully autonomous operation.
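
How high to set that threshold is ultimately an empirical question. One way to ground it: replay a labeled validation set of past extractions and pick the lowest threshold whose auto-approved predictions still meet a target accuracy. A minimal sketch (the sample data and target are illustrative):

    def pick_threshold(validation, target_accuracy=0.99):
        # validation: list of (confidence, was_correct) pairs from past reviews.
        # Return the lowest threshold whose auto-approved slice meets the target.
        for t in sorted({conf for conf, _ in validation}):
            approved = [ok for conf, ok in validation if conf >= t]
            if approved and sum(approved) / len(approved) >= target_accuracy:
                return t
        return 1.0  # nothing qualifies: keep everything in human review

    sample = [(0.99, True), (0.97, True), (0.95, True), (0.80, False), (0.60, False)]
    print(pick_threshold(sample))  # 0.95 on this toy sample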

Incremental autonomy is the way forward

It's likely that one day, document processing will be almost entirely automated. But today, blind faith in AI isn't a strategy, especially for business-critical workflows.

Instead, businesses should pursue incremental autonomy: build systems that loop in humans where it matters, and continuously learn from their input. In short, AI needs oversight before it earns trust, whether it's driving a car or extracting data from a contract. The stakes may differ, but the principle is the same: until AI proves it can consistently handle edge cases, humans need to stay in the loop. By using confidence estimates to smartly route uncertain predictions to human experts, we can combine the speed and scale of LLMs with the judgment and nuance of humans.
