Project Showcase

Document Extraction Pipeline

Turn Documents into Structured Data with AI

A production-ready FastAPI service that extracts structured data from PDFs and images using OCR and LLM technology. Features async processing, JWT authentication, and support for multiple document types including invoices, legal documents, and ESG reports.

View on GitHub

Tech Stack

FastAPICeleryPostgreSQLRedisMinIOOpenAIOllamaDockerMinerU

Key Features

Multi-Format Support

Upload PDF or image files (PNG, JPG, TIFF) up to 10MB. Automatic format detection and preprocessing.

Async Processing

Non-blocking API calls with Celery workers. Submit documents and poll for results without waiting.

OCR + LLM Extraction

MinerU (vlm-auto-engine) converts documents into structured Markdown preserving tables and layouts, then OpenAI GPT-4o-mini or local Ollama models extract structured fields.

Structured Output

Pre-built schemas for Invoice, Legal, and ESG documents with customizable extraction templates.

Architecture

FastAPI for high-performance async API endpoints

Celery workers for background document processing

PostgreSQL with JSONB for flexible result storage

Redis as Celery broker and caching layer

MinIO for S3-compatible object storage

JWT authentication with per-user rate limiting

Diagrams

Document Extraction Pipeline architecture — FastAPI + JWT accepts uploads, persists to MinIO, enqueues Celery jobs in Redis. Workers run MinerU OCR, hand off to an LLM for schema-driven extraction, and store results as JSONB in Postgres.

Document Extraction request flow — Async request life-cycle: upload returns a job id immediately; a worker fetches the file, runs OCR + LLM extraction, and persists the structured result. Clients poll the result endpoint until status = completed.

API Usage

Upload & Extract

# Upload a document
curl -X POST http://localhost:8000/documents/upload \
  -H "Authorization: Bearer <token>" \
  -F "file=@invoice.pdf" \
  -F "schema=invoice"

# Response: {"job_id": "uuid", "status": "queued"}

# Poll for results
curl http://localhost:8000/documents/result/{job_id} \
  -H "Authorization: Bearer <token>"

# Result: Structured JSON with extracted fields

Impact

Processes documents 10x faster than manual data entry with 95%+ accuracy on structured extractions.

10x

Faster Processing

95%+

Extraction Accuracy

Document Types