Project Showcase
Document Extraction Pipeline
Turn Documents into Structured Data with AI
A production-ready FastAPI service that extracts structured data from PDFs and images using OCR and LLM technology. Features async processing, JWT authentication, and support for multiple document types including invoices, legal documents, and ESG reports.
Tech Stack
Key Features
Multi-Format Support
Upload PDF or image files (PNG, JPG, TIFF) of up to 10 MB. Format detection and preprocessing are automatic.
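As a rough illustration of what upload validation for these formats could look like, the sketch below checks the size limit and identifies the file type from its leading bytes. The names `detect_format`, `validate_upload`, and `MAX_UPLOAD_BYTES` are hypothetical, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption; the project's real detection logic is not shown in this showcase.

```python
from typing import Optional

# Assumption: "10 MB" is interpreted as 10 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024

# Leading "magic" bytes for each supported format.
_SIGNATURES = (
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpg"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
)


def detect_format(data: bytes) -> Optional[str]:
    """Return the detected format name, or None if the type is unsupported."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name
    return None


def validate_upload(data: bytes) -> str:
    """Return the format name, or raise ValueError (hypothetical helper)."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 10 MB limit")
    fmt = detect_format(data)
    if fmt is None:
        raise ValueError("unsupported file type")
    return fmt
```

Checking the content rather than the file extension means a renamed file is still classified by what it actually contains.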
Async Processing
API calls return immediately while Celery workers process documents in the background. Submit a document, then poll for the result rather than waiting on the upload request.
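The submit-then-poll pattern can be sketched from the client side as follows. This is a minimal illustration, not the project's client: `poll_for_result` and `fetch_status` are hypothetical names, and the terminal status values `"completed"` and `"failed"` are assumptions (the showcase only shows `"queued"`). The status lookup is passed in as a callable so the loop stays independent of any HTTP library.

```python
import time
from typing import Any, Callable, Dict

# Assumed status names; only "queued" appears in the showcase.
TERMINAL_STATUSES = {"completed", "failed"}


def poll_for_result(
    fetch_status: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval: float = 1.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status(job_id) until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status(job_id)
        if payload.get("status") in TERMINAL_STATUSES:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {payload.get('status')!r} after {timeout}s"
            )
        # Wait before asking again so the API is not hammered.
        sleep(interval)
```

In practice `fetch_status` would wrap a GET request to the result endpoint and return the parsed JSON body.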
OCR + LLM Extraction
MinerU (vlm-auto-engine) converts documents into structured Markdown, preserving tables and layout. OpenAI GPT-4o-mini or a local Ollama model then extracts the structured fields from that Markdown.
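The two-stage flow described here can be sketched with the OCR and LLM stages injected as plain callables. Everything named below (`build_prompt`, `extract_fields`, `to_markdown`, `complete`) is hypothetical: the callables are stand-ins, not MinerU's, OpenAI's, or Ollama's actual APIs, and the prompt wording is only an assumption about how such a request might be phrased.

```python
import json
from typing import Any, Callable, Dict, List


def build_prompt(markdown: str, fields: List[str]) -> str:
    """Ask the model for one JSON object containing the given fields."""
    return (
        "Extract the following fields from the document and reply with "
        "a single JSON object. Use null for missing values.\n"
        f"Fields: {', '.join(fields)}\n\n"
        f"Document:\n{markdown}"
    )


def extract_fields(
    document: bytes,
    to_markdown: Callable[[bytes], str],  # stand-in for the OCR stage
    complete: Callable[[str], str],       # stand-in for the LLM call
    fields: List[str],
) -> Dict[str, Any]:
    """Run OCR, prompt the model, and return only the requested fields."""
    markdown = to_markdown(document)
    reply = complete(build_prompt(markdown, fields))
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("model reply is not a JSON object")
    # Drop unexpected keys; fill absent fields with None.
    return {name: data.get(name) for name in fields}
```

Keeping both stages behind callables makes it straightforward to swap a hosted model for a local one, or to test the pipeline with fakes.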
Structured Output
Pre-built schemas for Invoice, Legal, and ESG documents with customizable extraction templates.
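To make the idea of per-document-type schemas concrete, here is a small sketch using standard-library dataclasses. The field names, the class names, and the `"legal"` and `"esg"` keys are illustrative assumptions (only `schema=invoice` appears in the showcase), and the real service may well define its schemas differently, for example with Pydantic models.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Optional


@dataclass
class Invoice:  # hypothetical fields
    invoice_number: Optional[str] = None
    vendor: Optional[str] = None
    total: Optional[float] = None
    currency: Optional[str] = None


@dataclass
class LegalDocument:  # hypothetical fields
    title: Optional[str] = None
    parties: Optional[List[str]] = None
    effective_date: Optional[str] = None


@dataclass
class ESGReport:  # hypothetical fields
    company: Optional[str] = None
    reporting_year: Optional[int] = None


# Maps the value of the upload form's "schema" field to a record type.
SCHEMAS = {"invoice": Invoice, "legal": LegalDocument, "esg": ESGReport}


def load_record(schema: str, raw: Dict[str, Any]):
    """Build a typed record from raw extracted data, ignoring unknown keys."""
    try:
        cls = SCHEMAS[schema]
    except KeyError:
        raise ValueError(f"unknown schema: {schema!r}") from None
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in raw.items() if k in known})
```

A registry like `SCHEMAS` is one simple way to support custom templates: adding a document type means adding one class and one dictionary entry.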
Architecture
Diagrams
API Usage
# Upload a document
curl -X POST http://localhost:8000/documents/upload \
  -H "Authorization: Bearer <token>" \
  -F "file=@invoice.pdf" \
  -F "schema=invoice"
# Response: {"job_id": "uuid", "status": "queued"}
# Poll for results
curl "http://localhost:8000/documents/result/<job_id>" \
  -H "Authorization: Bearer <token>"
# Result: Structured JSON with extracted fields
Impact
Processes documents 10x faster than manual data entry with 95%+ accuracy on structured extractions.
10x Faster Processing
95%+ Extraction Accuracy
3 Document Types