Kolot-lu/ocr-service


OCR Service

Production-ready microservice for extracting text from PDF files (using OCR) and DOCX documents.

Features

  • FastAPI-based HTTP API with automatic OpenAPI documentation
  • PDF OCR processing: Uses OCRmyPDF + Tesseract (no external APIs)
  • DOCX text extraction: Direct text extraction from Word documents
  • Multi-language support: English, German, French (configurable for PDF OCR)
  • Health checks: /health and /ready endpoints

Quick Start

Using Docker Compose

  1. Build and start the service:

    docker compose up --build
  2. Test the health endpoint:

    curl http://localhost:8010/health
  3. Test the ready endpoint:

    curl http://localhost:8010/ready
  4. Process a PDF (lang and force_ocr are query parameters; %2B is the URL-encoded "+"):

    curl -X POST "http://localhost:8010/v1/ocr?lang=eng%2Bdeu&force_ocr=false" \
      -F "file=@/path/to/your/document.pdf"
  5. Process a DOCX:

    curl -X POST "http://localhost:8010/v1/ocr" \
      -F "file=@/path/to/your/document.docx"

Using Docker directly

docker build -t ocr-service .
docker run -p 8010:8000 \
  -e OCR_LANGUAGES=eng+deu+fra \
  -e OCR_MAX_CONCURRENCY=2 \
  ocr-service

API Endpoints

Health Check

GET /health

Returns basic health status.

Response:

{
  "status": "ok"
}

Readiness Check

GET /ready

Checks that all required binaries (ocrmypdf, tesseract, pdftotext) are available.

Response:

{
  "status": "ok",
  "binaries": ["ocrmypdf", "tesseract", "pdftotext"]
}

Error Response (503):

{
  "detail": "Service not ready: missing binaries: ocrmypdf"
}
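
As a rough illustration of the readiness check, the sketch below looks up each binary on PATH with shutil.which and builds a status code and body shaped like the responses above. The function names are hypothetical and the service's actual implementation may differ.

```python
import shutil

# Binaries listed in the readiness response above.
REQUIRED_BINARIES = ["ocrmypdf", "tesseract", "pdftotext"]


def find_missing_binaries(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def readiness_payload(names=REQUIRED_BINARIES):
    """Build a (status_code, body) pair shaped like the documented responses."""
    missing = find_missing_binaries(names)
    if missing:
        detail = "Service not ready: missing binaries: " + ", ".join(missing)
        return 503, {"detail": detail}
    return 200, {"status": "ok", "binaries": list(names)}
```

Calling readiness_payload() on a machine without the OCR toolchain yields the 503 variant, which makes it easy to spot an incomplete local setup.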

Text Extraction

POST /v1/ocr

Extract text from PDF files (using OCR) or DOCX documents.

Supported formats:

  • PDF: Performs OCR using OCRmyPDF + Tesseract
  • DOCX: Direct text extraction from document structure

Request:

  • Content-Type: multipart/form-data
  • Field: file (PDF or DOCX file)
  • Query parameters:
    • lang (optional): Override OCR languages for PDF (e.g., eng+deu+fra). Ignored for DOCX.
    • force_ocr (optional, bool): Force OCR for PDF even if text layer exists (default: false). Ignored for DOCX.
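
The sketch below shows one way a client might check the lang value and build the query string before sending a request. The validation pattern and both function names are assumptions made for illustration; the service's real rule for "invalid language format" is not documented here. The example also shows why the "+" should be percent-encoded: a literal "+" in a query string is decoded as a space by standard form decoding.

```python
import re
from urllib.parse import parse_qs, urlencode

# Assumed format: Tesseract-style codes joined by "+", e.g. "eng+deu".
# Underscores are allowed because codes such as "chi_sim" contain one.
LANG_PATTERN = re.compile(r"[A-Za-z_]{3,}(?:\+[A-Za-z_]{3,})*")


def is_valid_lang(value: str) -> bool:
    """Check the value against the assumed language format."""
    return LANG_PATTERN.fullmatch(value) is not None


def build_query(lang: str, force_ocr: bool = False) -> str:
    """Encode the query parameters; urlencode turns "+" into "%2B"."""
    return urlencode({"lang": lang, "force_ocr": str(force_ocr).lower()})


if __name__ == "__main__":
    print(build_query("eng+deu"))
    # An unencoded "+" is read back as a space by standard query parsing.
    print(parse_qs("lang=eng+deu"))
```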

Response (PDF):

{
  "text": "Extracted text content...",
  "meta": {
    "pages": 5,
    "lang": "eng+deu+fra",
    "engine": "ocrmypdf+tesseract",
    "timings_ms": {
      "ocr": 1234,
      "extract": 56,
      "total": 1290
    }
  }
}

Response (DOCX):

{
  "text": "Extracted text content...",
  "meta": {
    "pages": 3,
    "lang": null,
    "engine": "python-docx",
    "timings_ms": {
      "ocr": 0,
      "extract": 45,
      "total": 45
    }
  }
}

Error Responses:

  • 400: Invalid file (not PDF/DOCX, exceeds size limit, invalid language format)
  • 500: Processing failed

Configuration

All configuration is done via environment variables:

Server Configuration

  • APP_PORT (default: 8000): Port the service listens on inside the container
  • APP_HOST (default: 0.0.0.0): Host to bind to

OCR Configuration

  • OCR_LANGUAGES (default: eng+deu+fra): Languages for OCR (format: eng+deu+fra)
  • OCR_MAX_CONCURRENCY (default: 2): Maximum concurrent OCR jobs
  • OCR_SUBPROCESS_TIMEOUT_SECONDS (default: 300): Timeout for OCR subprocess (5 minutes)
  • EXTRACT_SUBPROCESS_TIMEOUT_SECONDS (default: 60): Timeout for text extraction subprocess

Upload Configuration

  • MAX_UPLOAD_MB (default: 100): Maximum upload size in megabytes

CORS Configuration

  • CORS_ORIGINS (default: *): Allowed CORS origins
    • Use * to allow all origins (default for local development)
    • Use comma-separated list for production: https://example.com,https://app.example.com
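
A minimal sketch of how a comma-separated CORS_ORIGINS value could be turned into a list of origins. The helper name parse_cors_origins is hypothetical, and the service may parse the variable differently.

```python
def parse_cors_origins(raw: str) -> list[str]:
    """Split a comma-separated origin list, ignoring blanks and whitespace."""
    origins = [item.strip() for item in raw.split(",") if item.strip()]
    # A wildcard anywhere in the list means "allow all origins".
    return ["*"] if "*" in origins else origins
```

For example, parse_cors_origins("https://example.com, https://app.example.com") returns the two origins as separate list entries.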

Port Configuration (Docker Compose)

  • OCR_HOST_PORT (default: 8010): Host port to map to container port 8000
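
To make the defaults above concrete, the following sketch reads the documented environment variables into a typed object. It is a standard-library stand-in written for illustration only: the Settings class and load_settings function are hypothetical, and the service itself is described as using Pydantic settings in app/config.py.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    app_host: str
    app_port: int
    ocr_languages: str
    ocr_max_concurrency: int
    ocr_subprocess_timeout_seconds: int
    extract_subprocess_timeout_seconds: int
    max_upload_mb: int
    cors_origins: str


def load_settings(env=os.environ) -> Settings:
    """Read each variable, falling back to the documented default."""
    return Settings(
        app_host=env.get("APP_HOST", "0.0.0.0"),
        app_port=int(env.get("APP_PORT", "8000")),
        ocr_languages=env.get("OCR_LANGUAGES", "eng+deu+fra"),
        ocr_max_concurrency=int(env.get("OCR_MAX_CONCURRENCY", "2")),
        ocr_subprocess_timeout_seconds=int(
            env.get("OCR_SUBPROCESS_TIMEOUT_SECONDS", "300")
        ),
        extract_subprocess_timeout_seconds=int(
            env.get("EXTRACT_SUBPROCESS_TIMEOUT_SECONDS", "60")
        ),
        max_upload_mb=int(env.get("MAX_UPLOAD_MB", "100")),
        cors_origins=env.get("CORS_ORIGINS", "*"),
    )
```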

Examples

Basic OCR Request

curl -X POST "http://localhost:8010/v1/ocr" \
  -F "file=@document.pdf"

Force OCR (re-process even if text exists)

curl -X POST "http://localhost:8010/v1/ocr?force_ocr=true" \
  -F "file=@document.pdf"

Override Languages (PDF only)

A literal "+" in a URL query string is decoded as a space, so it is written as %2B here.

curl -X POST "http://localhost:8010/v1/ocr?lang=eng%2Bfra" \
  -F "file=@document.pdf"

Process DOCX Document

curl -X POST "http://localhost:8010/v1/ocr" \
  -F "file=@document.docx"

Using Python requests

import requests

url = "http://localhost:8010/v1/ocr"

# Process PDF (requests URL-encodes the "+" in lang automatically)
params = {"lang": "eng+deu", "force_ocr": "false"}
with open("document.pdf", "rb") as pdf_file:
    response = requests.post(url, files={"file": pdf_file}, params=params)
response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
result = response.json()

print(result["text"])
print(f"Processing time: {result['meta']['timings_ms']['total']}ms")
print(f"Engine: {result['meta']['engine']}")

# Process DOCX (lang and force_ocr are ignored for DOCX)
with open("document.docx", "rb") as docx_file:
    response = requests.post(url, files={"file": docx_file})
response.raise_for_status()
result = response.json()
print(result["text"])

Architecture

app/
├── main.py              # FastAPI application, middleware setup
├── config.py            # Pydantic settings from environment
├── routes/
│   ├── health.py        # Health and readiness endpoints
│   └── ocr.py           # Text extraction endpoint (PDF/DOCX)
├── services/
│   └── ocr_service.py   # OCR and text extraction orchestration
└── utils/
    ├── logging.py       # Structured logging and request ID middleware
    ├── subprocess_runner.py  # Subprocess execution with timeouts
    └── docx_extractor.py    # DOCX text extraction utility

Features in Detail

Concurrency Control

The service uses an asyncio.Semaphore to limit concurrent OCR jobs, preventing CPU exhaustion. Configure via OCR_MAX_CONCURRENCY.
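
The pattern can be sketched as follows. This is an illustration of semaphore-based limiting in general, not the service's code: run_limited and demo are hypothetical names, and a short sleep stands in for the OCR subprocess.

```python
import asyncio


async def run_limited(jobs, max_concurrency):
    """Run coroutine factories with at most max_concurrency active at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(job):
        # Each job waits here until one of the slots is free.
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo(max_concurrency=2, total_jobs=6):
    """Return the peak number of simultaneously running jobs and their results."""
    active = 0
    peak = 0

    async def fake_ocr_job():
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for the OCR subprocess
        active -= 1
        return "done"

    results = await run_limited([fake_ocr_job] * total_jobs, max_concurrency)
    return peak, results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Requests beyond the limit are not rejected in this sketch; they simply wait for a free slot, so latency grows under load while CPU usage stays bounded.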

Request Correlation

Each request gets a unique UUID that is:

  • Logged with all log messages
  • Returned in the X-Request-ID response header
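
One common way to get this behavior is a context variable plus a logging filter, sketched below with the standard library only. The names request_id_var, RequestIdFilter, and begin_request are hypothetical; the service's middleware in app/utils/logging.py may be structured differently.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being handled ("-" outside a request).
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def begin_request():
    """Generate a UUID, store it for logging, and return the response header."""
    request_id = str(uuid.uuid4())
    request_id_var.set(request_id)
    return request_id, {"X-Request-ID": request_id}
```

With the filter installed on a handler, a format string such as "%(request_id)s %(message)s" prints the ID alongside each message.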

Error Sanitization

Error messages are sanitized to remove internal paths and temporary directory names, preventing information leakage.
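
A minimal sketch of path sanitization, assuming POSIX-style absolute paths in the error text. The regular expression and the sanitize_error name are illustrative assumptions, not the service's actual rules.

```python
import re

# Assumed target: absolute POSIX paths such as /tmp/tmpab12cd/input.pdf.
# The lookbehind avoids touching URLs and relative paths like "a/b".
_PATH_PATTERN = re.compile(r"(?<![\w.:/])/(?:[\w.\-]+/)*[\w.\-]+")


def sanitize_error(message: str) -> str:
    """Replace absolute filesystem paths with a neutral placeholder."""
    return _PATH_PATTERN.sub("<path>", message)
```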

Timeouts

All subprocess calls have configurable timeouts to prevent hanging processes.
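
The idea can be illustrated with subprocess.run, which kills the child process and raises TimeoutExpired once the limit is reached. The helper name run_with_timeout is hypothetical, and the service's subprocess_runner.py may use a different (for example asynchronous) mechanism.

```python
import subprocess
import sys


def run_with_timeout(args, timeout_seconds):
    """Run a command; return the CompletedProcess, or None if it timed out."""
    try:
        return subprocess.run(
            args,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return None


if __name__ == "__main__":
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(slow, 0.5))
```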

Temporary Files

The service uses Python's tempfile.TemporaryDirectory for temporary files, so they are cleaned up automatically.
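
The cleanup behavior can be seen in a short sketch. The process_upload function, the "ocr-" prefix, and the file name are hypothetical; only tempfile.TemporaryDirectory itself comes from the description above.

```python
import tempfile
from pathlib import Path


def process_upload(data: bytes, suffix: str = ".pdf"):
    """Write an upload into a scratch directory that is removed afterwards."""
    with tempfile.TemporaryDirectory(prefix="ocr-") as workdir:
        input_path = Path(workdir) / f"input{suffix}"
        input_path.write_bytes(data)
        size = input_path.stat().st_size  # placeholder for real processing
    # The directory and everything in it is gone once the block exits,
    # even if the processing step raised an exception.
    return size, Path(workdir).exists()
```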

Development

Local Development (without Docker)

  1. Install system dependencies (Ubuntu/Debian):

    sudo apt-get install ocrmypdf tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu tesseract-ocr-fra ghostscript qpdf poppler-utils unpaper
  2. Install Python dependencies:

    pip install -r requirements.txt
  3. Run the service:

    python -m app.main

OpenAPI Documentation

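Once the service is running, visit the following URLs (port 8010 is the Docker Compose default; a local run listens on APP_PORT, 8000 by default):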

  • Swagger UI: http://localhost:8010/docs
  • ReDoc: http://localhost:8010/redoc

License

MIT
