OCR Service

Production-ready microservice for extracting text from PDF files (using OCR) and DOCX documents.

Features

- FastAPI-based HTTP API with automatic OpenAPI documentation
- PDF OCR processing: uses OCRmyPDF + Tesseract (no external APIs)
- DOCX text extraction: direct text extraction from Word documents
- Multi-language support: English, German, French (configurable for PDF OCR)
- Health checks: /health and /ready endpoints

Quick Start

Using Docker Compose

Build and start the service:
```
docker compose up --build
```
Test the health endpoint:
```
curl http://localhost:8010/health
```
Test the ready endpoint:
```
curl http://localhost:8010/ready
```

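Process a PDF (lang and force_ocr are query parameters; a literal + in the URL must be encoded as %2B, otherwise it is decoded as a space):
```
curl -X POST "http://localhost:8010/v1/ocr?lang=eng%2Bdeu&force_ocr=false" \
  -F "file=@/path/to/your/document.pdf"
```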

Process a DOCX:

```
curl -X POST "http://localhost:8010/v1/ocr" \
  -F "file=@/path/to/your/document.docx"
```

Using Docker directly

```
docker build -t ocr-service .
docker run -p 8010:8000 \
  -e OCR_LANGUAGES=eng+deu+fra \
  -e OCR_MAX_CONCURRENCY=2 \
  ocr-service
```

API Endpoints

Health Check

GET /health

Returns basic health status.

Response:

```
{
  "status": "ok"
}
```

Readiness Check

GET /ready

Checks that all required binaries (ocrmypdf, tesseract, pdftotext) are available.

Response:

```
{
  "status": "ok",
  "binaries": ["ocrmypdf", "tesseract", "pdftotext"]
}
```

Error Response (503):

```
{
  "detail": "Service not ready: missing binaries: ocrmypdf"
}
```
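
A readiness check of this kind can be sketched with `shutil.which` from the standard library. The helper names `find_missing_binaries` and `readiness_payload` are hypothetical, and the service's actual implementation may differ; the sketch only mirrors the response shapes shown above.

```python
import shutil

# Binaries named in the readiness description above.
REQUIRED_BINARIES = ("ocrmypdf", "tesseract", "pdftotext")


def find_missing_binaries(names=REQUIRED_BINARIES):
    """Return the names that cannot be found on PATH (hypothetical helper)."""
    return [name for name in names if shutil.which(name) is None]


def readiness_payload(names=REQUIRED_BINARIES):
    """Build a (status_code, body) pair shaped like the documented responses."""
    missing = find_missing_binaries(names)
    if missing:
        detail = "Service not ready: missing binaries: " + ", ".join(missing)
        return 503, {"detail": detail}
    return 200, {"status": "ok", "binaries": list(names)}


if __name__ == "__main__":
    print(readiness_payload())
```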

Text Extraction

POST /v1/ocr

Extract text from PDF files (using OCR) or DOCX documents.

Supported formats:

- PDF: Performs OCR using OCRmyPDF + Tesseract
- DOCX: Direct text extraction from document structure
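
How the service tells the two formats apart is not documented here. One common approach, shown below as an assumption rather than the service's actual logic, is to inspect the file content: PDF files start with `%PDF-`, and DOCX files are ZIP archives containing `word/document.xml`. The function name `detect_format` is hypothetical.

```python
import io
import zipfile


def detect_format(data: bytes) -> str:
    """Classify raw upload bytes as 'pdf', 'docx', or 'unknown' (illustrative)."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    # DOCX files are ZIP archives that contain word/document.xml.
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        try:
            with zipfile.ZipFile(buffer) as archive:
                if "word/document.xml" in archive.namelist():
                    return "docx"
        except zipfile.BadZipFile:
            return "unknown"
    return "unknown"
```

Checking content rather than the file extension avoids trusting a client-supplied filename.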

Request:

- Content-Type: multipart/form-data
- Field: file (PDF or DOCX file)
- Query parameters:
  - lang (optional): Override OCR languages for PDF (e.g., eng+deu+fra). Ignored for DOCX.
  - force_ocr (optional, bool): Force OCR for PDF even if a text layer exists (default: false). Ignored for DOCX.
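
Because lang uses + as its separator and is passed in the query string, encoding matters: standard query-string parsing decodes a literal + as a space. The sketch below shows the effect with `urllib.parse` and adds a simple format check. The regular expression is an assumption about what a valid value looks like (Tesseract-style codes such as `eng` or `chi_sim` joined by +); the service's real validation rules are not documented here, and the function names are hypothetical.

```python
import re
from urllib.parse import parse_qs, urlencode

# Assumed pattern: Tesseract-style codes such as "eng" or "chi_sim", joined by "+".
LANG_PATTERN = re.compile(r"[a-z]{3}(_[a-z]+)*(\+[a-z]{3}(_[a-z]+)*)*")


def is_valid_lang(value: str) -> bool:
    """Return True if value looks like a '+'-joined list of language codes."""
    return LANG_PATTERN.fullmatch(value) is not None


def build_query(lang: str, force_ocr: bool = False) -> str:
    """Build a query string; urlencode turns '+' into '%2B'."""
    return urlencode({"lang": lang, "force_ocr": str(force_ocr).lower()})


if __name__ == "__main__":
    print(build_query("eng+deu"))
    # An unencoded "+" is decoded as a space on the receiving side:
    print(parse_qs("lang=eng+deu")["lang"])
    print(parse_qs(build_query("eng+deu"))["lang"])
```

Client libraries that build the query string from a dictionary, such as requests with `params=`, perform this encoding automatically.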

Response (PDF):

```
{
  "text": "Extracted text content...",
  "meta": {
    "pages": 5,
    "lang": "eng+deu+fra",
    "engine": "ocrmypdf+tesseract",
    "timings_ms": {
      "ocr": 1234,
      "extract": 56,
      "total": 1290
    }
  }
}
```

Response (DOCX):

```
{
  "text": "Extracted text content...",
  "meta": {
    "pages": 3,
    "lang": null,
    "engine": "python-docx",
    "timings_ms": {
      "ocr": 0,
      "extract": 45,
      "total": 45
    }
  }
}
```
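
On the client side, the two response shapes above can be read into one small structure. This is a minimal sketch: `OcrResult` and `parse_result` are hypothetical names, and only the fields shown in the sample responses are assumed to exist.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OcrResult:
    text: str
    pages: int
    lang: Optional[str]  # null for DOCX in the sample response
    engine: str
    total_ms: int


def parse_result(payload: dict) -> OcrResult:
    """Convert a decoded JSON response body into an OcrResult."""
    meta = payload["meta"]
    return OcrResult(
        text=payload["text"],
        pages=meta["pages"],
        lang=meta.get("lang"),
        engine=meta["engine"],
        total_ms=meta["timings_ms"]["total"],
    )
```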

Error Responses:

- 400: Invalid file (not PDF/DOCX, exceeds size limit, invalid language format)
- 500: Processing failed

Configuration

All configuration is done via environment variables:

Server Configuration

- APP_PORT (default: 8000): Port the service listens on inside the container
- APP_HOST (default: 0.0.0.0): Host to bind to

OCR Configuration

- OCR_LANGUAGES (default: eng+deu+fra): Default OCR languages, given as Tesseract language codes joined with +
- OCR_MAX_CONCURRENCY (default: 2): Maximum number of concurrent OCR jobs
- OCR_SUBPROCESS_TIMEOUT_SECONDS (default: 300): Timeout for the OCR subprocess, in seconds (300 = 5 minutes)
- EXTRACT_SUBPROCESS_TIMEOUT_SECONDS (default: 60): Timeout for the text extraction subprocess, in seconds

Upload Configuration

- MAX_UPLOAD_MB (default: 100): Maximum upload size in megabytes

CORS Configuration

- CORS_ORIGINS (default: *): Allowed CORS origins
  - Use * to allow all origins (the default, intended for local development)
  - Use a comma-separated list for production: https://example.com,https://app.example.com
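
The comma-separated format can be turned into a list with a few lines of Python. This is a sketch under the assumption that whitespace around entries should be ignored; `parse_cors_origins` is a hypothetical name, and the service's own parsing is not shown in this README.

```python
def parse_cors_origins(raw: str) -> list[str]:
    """Split a CORS_ORIGINS value into a list of origins."""
    value = raw.strip()
    if value == "*":
        return ["*"]
    # Drop empty entries so a trailing comma does not produce "".
    return [origin.strip() for origin in value.split(",") if origin.strip()]


if __name__ == "__main__":
    print(parse_cors_origins("https://example.com, https://app.example.com"))
```

A list in this form is what Starlette's `CORSMiddleware` (used by FastAPI) accepts for its `allow_origins` argument.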

Port Configuration (Docker Compose)

- OCR_HOST_PORT (default: 8010): Host port to map to container port 8000
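
The service-level variables above can be summarized as a settings object. The Architecture section describes config.py as using Pydantic settings; the sketch below uses only the standard library to stay self-contained, so it illustrates the variable names and defaults rather than the actual implementation. `Settings` and `load_settings` are hypothetical names, OCR_HOST_PORT is omitted because it is only used by Docker Compose, and treating one megabyte as 1024 * 1024 bytes is an assumption.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    app_host: str = "0.0.0.0"
    app_port: int = 8000
    ocr_languages: str = "eng+deu+fra"
    ocr_max_concurrency: int = 2
    ocr_subprocess_timeout_seconds: int = 300
    extract_subprocess_timeout_seconds: int = 60
    max_upload_mb: int = 100
    cors_origins: str = "*"

    @property
    def max_upload_bytes(self) -> int:
        # Assumption: 1 MB is counted as 1024 * 1024 bytes.
        return self.max_upload_mb * 1024 * 1024


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (os.environ by default), keeping defaults."""
    env = os.environ if env is None else env
    values = {}
    for field in fields(Settings):
        raw = env.get(field.name.upper())
        if raw is not None:
            # Convert using the type of the default value (str or int).
            values[field.name] = type(field.default)(raw)
    return Settings(**values)
```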

Examples

Basic OCR Request

```
curl -X POST "http://localhost:8010/v1/ocr" \
  -F "file=@document.pdf"
```

Force OCR (re-process even if text exists)

```
curl -X POST "http://localhost:8010/v1/ocr?force_ocr=true" \
  -F "file=@document.pdf"
```

Override Languages (PDF only)

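The + separator must be URL-encoded as %2B in the query string; an unencoded + is decoded as a space.
```
curl -X POST "http://localhost:8010/v1/ocr?lang=eng%2Bfra" \
  -F "file=@document.pdf"
```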

Process DOCX Document

```
curl -X POST "http://localhost:8010/v1/ocr" \
  -F "file=@document.docx"
```

Using Python requests

```
import requests

url = "http://localhost:8010/v1/ocr"

# Process a PDF; requests URL-encodes the "+" in lang automatically
params = {"lang": "eng+deu", "force_ocr": False}
with open("document.pdf", "rb") as pdf:  # close the file when done
    response = requests.post(url, files={"file": pdf}, params=params)
response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
result = response.json()

print(result["text"])
print(f"Processing time: {result['meta']['timings_ms']['total']}ms")
print(f"Engine: {result['meta']['engine']}")

# Process a DOCX (lang and force_ocr are ignored for DOCX)
with open("document.docx", "rb") as docx:
    response = requests.post(url, files={"file": docx})
response.raise_for_status()
result = response.json()
print(result["text"])
```

Architecture

```
app/
├── main.py              # FastAPI application, middleware setup
├── config.py            # Pydantic settings from environment
├── routes/
│   ├── health.py        # Health and readiness endpoints
│   └── ocr.py           # Text extraction endpoint (PDF/DOCX)
├── services/
│   └── ocr_service.py   # OCR and text extraction orchestration
└── utils/
    ├── logging.py            # Structured logging and request ID middleware
    ├── subprocess_runner.py  # Subprocess execution with timeouts
    └── docx_extractor.py     # DOCX text extraction utility
```

Features in Detail

Concurrency Control

The service uses an asyncio.Semaphore to limit concurrent OCR jobs, preventing CPU exhaustion. Configure via OCR_MAX_CONCURRENCY.
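
The pattern can be sketched as follows. This is a minimal illustration, not the service's code: `ConcurrencyLimiter` is a hypothetical name, and `asyncio.sleep` stands in for an OCR job. The `peak` counter exists only to make the limit observable.

```python
import asyncio


class ConcurrencyLimiter:
    """Limit concurrent jobs with a semaphore and record the observed peak."""

    def __init__(self, max_concurrency: int) -> None:
        self._semaphore = asyncio.Semaphore(max_concurrency)
        self.active = 0
        self.peak = 0

    async def run(self, job):
        # Callers beyond the limit wait here until a slot frees up.
        async with self._semaphore:
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await job()
            finally:
                self.active -= 1


async def demo(max_concurrency: int = 2, jobs: int = 6) -> int:
    limiter = ConcurrencyLimiter(max_concurrency)

    async def fake_ocr_job():
        await asyncio.sleep(0.01)  # stands in for a CPU-heavy OCR subprocess

    await asyncio.gather(*(limiter.run(fake_ocr_job) for _ in range(jobs)))
    return limiter.peak


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With six jobs submitted and a limit of two, at most two jobs are inside the semaphore at any time; the remaining requests queue rather than fail.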

Request Correlation

Each request gets a unique UUID that is:

- Logged with all log messages
- Returned in the X-Request-ID response header
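
One way to implement this behavior is a plain ASGI middleware combined with a `contextvars.ContextVar` and a logging filter. The sketch below is an assumption about the approach, not the code in utils/logging.py; the names `request_id_var`, `RequestIdFilter`, and `RequestIdMiddleware` are hypothetical.

```python
import contextvars
import logging
import uuid

# Holds the current request's ID so log records can include it.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


class RequestIdMiddleware:
    """Plain ASGI middleware that adds an X-Request-ID response header."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        request_id = str(uuid.uuid4())
        token = request_id_var.set(request_id)

        async def send_with_header(message):
            # Append the header when the response starts.
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.append((b"x-request-id", request_id.encode("ascii")))
                message = {**message, "headers": headers}
            await send(message)

        try:
            await self.app(scope, receive, send_with_header)
        finally:
            request_id_var.reset(token)
```

In a FastAPI application, a class of this shape can be registered with `app.add_middleware(RequestIdMiddleware)`, and the filter can be attached to a log handler whose format string includes `%(request_id)s`.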

Error Sanitization

Error messages are sanitized to remove internal paths and temporary directory names, preventing information leakage.
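
A simple form of this sanitization replaces anything that looks like an absolute path with a placeholder. The regular expression below is an assumption chosen for illustration (POSIX-style paths only); the service's actual rules are not documented here, and `sanitize_error` is a hypothetical name.

```python
import re

# Assumed pattern: absolute POSIX-style paths such as /tmp/tmpab12/input.pdf.
# The lookbehind avoids touching things like "and/or" or URL paths.
_PATH_PATTERN = re.compile(r"(?<![\w./])/(?:[\w.\-]+/)*[\w.\-]+")


def sanitize_error(message: str) -> str:
    """Replace absolute filesystem paths in an error message with <path>."""
    return _PATH_PATTERN.sub("<path>", message)


if __name__ == "__main__":
    print(sanitize_error("ocrmypdf failed on /tmp/tmpab12/input.pdf: bad xref"))
```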

Timeouts

All subprocess calls have configurable timeouts to prevent hanging processes.
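
A timeout-guarded subprocess call can be sketched with `subprocess.run`, which kills the child process when the timeout expires. This is an illustration only: `run_with_timeout` and `SubprocessTimeout` are hypothetical names, and the service's subprocess_runner.py may work differently (for example, asynchronously).

```python
import subprocess
import sys


class SubprocessTimeout(Exception):
    """Raised when a command exceeds its time budget (hypothetical name)."""


def run_with_timeout(args, timeout_seconds: float) -> str:
    """Run a command, returning stdout; raise if it runs too long or fails."""
    try:
        completed = subprocess.run(
            args,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
            check=True,  # non-zero exit raises CalledProcessError
        )
    except subprocess.TimeoutExpired as exc:
        raise SubprocessTimeout(
            f"command timed out after {timeout_seconds}s"
        ) from exc
    return completed.stdout


if __name__ == "__main__":
    # The Python interpreter stands in for ocrmypdf or pdftotext here.
    print(run_with_timeout([sys.executable, "-c", "print('ok')"], 10))
```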

Temporary Files

Uses Python's tempfile.TemporaryDirectory for safe temporary file handling with automatic cleanup.
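
The cleanup behavior can be seen in a short sketch. The function name `write_and_measure`, the `ocr-` prefix, and the fixed file name `input.pdf` are assumptions for illustration; a fixed name is used so that a client-supplied filename never reaches the filesystem.

```python
import tempfile
from pathlib import Path


def write_and_measure(data: bytes) -> tuple[int, Path]:
    """Write an upload into a temporary directory and return its size and path."""
    with tempfile.TemporaryDirectory(prefix="ocr-") as tmp:
        source = Path(tmp) / "input.pdf"
        source.write_bytes(data)
        # OCR or extraction commands would operate on `source` here.
        size = source.stat().st_size
    # Leaving the with-block removes the directory and everything in it.
    return size, source


if __name__ == "__main__":
    size, path = write_and_measure(b"%PDF-1.7")
    print(size, path.exists())
```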

Development

Local Development (without Docker)

Install system dependencies (Ubuntu/Debian):

```
sudo apt-get install ocrmypdf tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu tesseract-ocr-fra ghostscript qpdf poppler-utils unpaper
```

Install Python dependencies:
```
pip install -r requirements.txt
```
Run the service:
```
python -m app.main
```

OpenAPI Documentation

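Once the service is running, visit the URLs below. Port 8010 is the default Docker Compose host port; a local run without Docker should listen on APP_PORT instead (8000 by default).

- Swagger UI: http://localhost:8010/docs
- ReDoc: http://localhost:8010/redoc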

License

MIT
