In recent years, Large Language Models (LLMs) like GPT-4, Claude 2, and Gemini have made impressive advances in extracting and understanding data from complex documents such as contracts, invoices, and receipts. But like even the sharpest human analyst, AI can and will make mistakes. In business-critical workflows, where there is little room for error, relying on a fully autonomous system isn't just risky; it's irresponsible. That's why businesses should adopt AI agents with partial autonomy: AI systems that keep humans in the loop.
Last week, Andrej Karpathy, former director of AI at Tesla and a founding member of OpenAI, gave a talk at Y Combinator's AI Startup School where he reflected on the importance of human-AI collaboration. He recalled how a friend working on Google's self-driving project, which later became Waymo, gave him a ride in a self-driving car in 2013:
We got into this car and we went for an about 30-minute drive around Palo Alto, highways, streets and so on, and that drive was perfect. Zero intervention. And this was 2013, which is now 12 years ago. It kind of struck me because at the time when I had this perfect drive, this perfect demo, I felt like "self-driving is imminent because this just works. This is incredible." But here we are, 12 years later, and we are still working on autonomy. We are still working on driving (AI) agents. Even now, we haven't actually solved the problem.
Karpathy emphasizes the need to keep "AI on a leash." When implementing AI solutions, he suggests building partially autonomous workflows with AI and human verification working hand-in-hand.
I would argue that keeping humans in the loop is nearly as important in document processing as it is in self-driving cars...
Modern LLMs can assign confidence scores to each prediction, indicating how certain the model is about its output. For example, results from a processed invoice can look like this:
"2025-05-10" –
98% confidence
"HPL Technologies" –
95% confidence
"$3,287.99" –
57% confidence
By setting a confidence threshold, businesses can route only low-confidence predictions to human reviewers, letting the system handle the rest autonomously.
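As a rough sketch of what that routing could look like in code (the field names, threshold value, and data shapes here are assumptions for illustration, not a specific product API):

```python
# Minimal confidence-threshold routing: fields the model is sure about
# are auto-approved, everything else lands in a human review queue.
CONFIDENCE_THRESHOLD = 0.90  # assumed cut-off; tune per field and use case

def route_predictions(extracted_fields):
    """Split {field: (value, confidence)} into auto-approved and needs-review."""
    auto_approved, needs_review = {}, {}
    for field, (value, confidence) in extracted_fields.items():
        if confidence >= CONFIDENCE_THRESHOLD:
            auto_approved[field] = value
        else:
            needs_review[field] = (value, confidence)
    return auto_approved, needs_review

# Mirroring the invoice example above
invoice = {
    "invoice_date": ("2025-05-10", 0.98),
    "vendor": ("HPL Technologies", 0.95),
    "total_amount": ("$3,287.99", 0.57),
}

approved, review_queue = route_predictions(invoice)
print("Auto-approved:", approved)             # date and vendor
print("Routed to a reviewer:", review_queue)  # the 57%-confidence total
```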
This selective review approach offers two major benefits:
- Human effort is concentrated on the few predictions that actually need attention, rather than on every field of every document.
- Accuracy stays high, because the riskiest predictions always get a second pair of eyes before they reach downstream systems.
An effective human-in-the-loop (HITL) interface allows users to quickly verify or correct AI predictions, creating a smooth rhythm between human oversight and machine automation.
A great UI should enable users to:
- jump straight to the fields flagged for review
- see each extracted value alongside the source document
- accept or correct a prediction with a single action
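As a minimal sketch of that verify-or-correct interaction, here is a command-line stand-in for a real review UI (a production interface would show the source document next to each field; the function and queue format simply continue the hypothetical example above):

```python
# Command-line stand-in for a review UI: show each flagged field,
# let the reviewer accept the prediction or type a correction.
def review(needs_review):
    """Return final values for {field: (value, confidence)} after human review."""
    corrected = {}
    for field, (value, confidence) in needs_review.items():
        answer = input(
            f"{field}: '{value}' ({confidence:.0%} confidence). "
            "Press Enter to accept or type a correction: "
        ).strip()
        corrected[field] = answer or value
    return corrected

final_total = review({"total_amount": ("$3,287.99", 0.57)})
print(final_total)
```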
Let's say your system processes 100 invoices per day. If 90% of fields are confidently predicted, only the remaining 10% ever reach a human reviewer; everything else flows straight through the pipeline.
And as your model improves with retraining, the need for human intervention continues to shrink. Over time, confidence thresholds can be raised, pushing the system closer to fully autonomous operation.
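To make the arithmetic concrete, here is a back-of-the-envelope sketch; the number of fields per invoice is an assumption purely for illustration:

```python
# Back-of-the-envelope review workload, using the figures above.
invoices_per_day = 100
fields_per_invoice = 20      # assumed for illustration
confident_share = 0.90       # 90% of fields clear the confidence threshold

total_fields = invoices_per_day * fields_per_invoice
needs_review = round(total_fields * (1 - confident_share))

print(f"{total_fields} fields per day, only {needs_review} routed to a reviewer")
# As retraining raises the share of confident predictions (or lets you raise
# the threshold without adding errors), the review queue keeps shrinking.
```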
It's likely that one day, document processing will be almost entirely automated. But today, blind faith in AI isn't a strategy, especially for business-critical workflows.
Instead, businesses should pursue incremental autonomy: build systems that loop in humans where it matters, and continuously learn from their input. In short, AI needs oversight before it earns trust, whether it's driving a car or extracting data from a contract. The stakes may differ, but the principle is the same: until AI proves it can consistently handle edge cases, humans need to stay in the loop. By using confidence estimates to smartly route uncertain predictions to human experts, we can combine the speed and scale of LLMs with the judgment and nuance of humans.
We’ll help get you started with your document automation journey.
Schedule a free demo today!