Production-ready microservice for extracting text from PDF files (using OCR) and DOCX documents.
- FastAPI-based HTTP API with automatic OpenAPI documentation
- PDF OCR processing: Uses OCRmyPDF + Tesseract (no external APIs)
- DOCX text extraction: Direct text extraction from Word documents
- Multi-language support: English, German, French (configurable for PDF OCR)
- Health checks: /health and /ready endpoints
- Build and start the service:
  docker compose up --build
- Test the health endpoint:
  curl http://localhost:8010/health
- Test the ready endpoint:
  curl http://localhost:8010/ready
- Process a PDF:
  curl -X POST "http://localhost:8010/v1/ocr" \
    -F "file=@/path/to/your/document.pdf" \
    -F "lang=eng+deu" \
    -F "force_ocr=false"
- Process a DOCX:
  curl -X POST "http://localhost:8010/v1/ocr" \
    -F "file=@/path/to/your/document.docx"
docker build -t ocr-service .
docker run -p 8010:8000 \
  -e OCR_LANGUAGES=eng+deu+fra \
  -e OCR_MAX_CONCURRENCY=2 \
  ocr-service
GET /health
Returns basic health status.
Response:
{
  "status": "ok"
}
GET /ready
Checks that all required binaries (ocrmypdf, tesseract, pdftotext) are available.
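A readiness check of this kind can be sketched with the standard library alone. The snippet below is an illustration, not the service's actual code: the function names `check_binaries` and `readiness_payload` are hypothetical, and it assumes the check is a simple PATH lookup via `shutil.which`.

```python
import shutil

# Binary names taken from the endpoint description above.
REQUIRED_BINARIES = ["ocrmypdf", "tesseract", "pdftotext"]


def check_binaries(required=REQUIRED_BINARIES, which=shutil.which):
    """Return (found, missing) lists for the given binary names."""
    found = [name for name in required if which(name) is not None]
    missing = [name for name in required if name not in found]
    return found, missing


def readiness_payload(required=REQUIRED_BINARIES, which=shutil.which):
    """Build a (status_code, body) pair shaped like the documented responses."""
    found, missing = check_binaries(required, which)
    if missing:
        detail = "Service not ready: missing binaries: " + ", ".join(missing)
        return 503, {"detail": detail}
    return 200, {"status": "ok", "binaries": found}
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` searches the PATH.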
Response:
{
  "status": "ok",
  "binaries": ["ocrmypdf", "tesseract", "pdftotext"]
}
Error Response (503):
{
  "detail": "Service not ready: missing binaries: ocrmypdf"
}
POST /v1/ocr
Extract text from PDF files (using OCR) or DOCX documents.
Supported formats:
- PDF: Performs OCR using OCRmyPDF + Tesseract
- DOCX: Direct text extraction from document structure
Request:
- Content-Type: multipart/form-data
- Field: file (PDF or DOCX file)
- Query parameters:
  - lang (optional): Override OCR languages for PDF (e.g., eng+deu+fra). Ignored for DOCX.
  - force_ocr (optional, bool): Force OCR for PDF even if a text layer exists (default: false). Ignored for DOCX.
Response (PDF):
{
  "text": "Extracted text content...",
  "meta": {
    "pages": 5,
    "lang": "eng+deu+fra",
    "engine": "ocrmypdf+tesseract",
    "timings_ms": {
      "ocr": 1234,
      "extract": 56,
      "total": 1290
    }
  }
}
Response (DOCX):
{
  "text": "Extracted text content...",
  "meta": {
    "pages": 3,
    "lang": null,
    "engine": "python-docx",
    "timings_ms": {
      "ocr": 0,
      "extract": 45,
      "total": 45
    }
  }
}
Error Responses:
- 400: Invalid file (not PDF/DOCX, exceeds size limit, invalid language format)
- 500: Processing failed
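The "invalid language format" case can be illustrated with a small validator. This is a sketch under assumptions: the exact rule the service applies is not documented here, so the pattern below (Tesseract-style codes such as eng or chi_sim, joined by +) and the name `is_valid_lang` are hypothetical.

```python
import re

# Assumed format: one or more lowercase three-letter codes, each with an
# optional "_suffix" (as in chi_sim), joined by "+".
LANG_PATTERN = re.compile(r"[a-z]{3}(_[a-z]+)?(\+[a-z]{3}(_[a-z]+)?)*")


def is_valid_lang(lang: str) -> bool:
    """Return True if lang looks like eng, eng+deu, or chi_sim+eng."""
    return LANG_PATTERN.fullmatch(lang) is not None
```

Rejecting anything outside a strict pattern also keeps unexpected characters out of the arguments passed to the OCR subprocess.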
All configuration is done via environment variables:
- APP_PORT (default: 8000): Port the service listens on inside the container
- APP_HOST (default: 0.0.0.0): Host to bind to
- OCR_LANGUAGES (default: eng+deu+fra): Languages for OCR (format: eng+deu+fra)
- OCR_MAX_CONCURRENCY (default: 2): Maximum concurrent OCR jobs
- OCR_SUBPROCESS_TIMEOUT_SECONDS (default: 300): Timeout for OCR subprocess (5 minutes)
- EXTRACT_SUBPROCESS_TIMEOUT_SECONDS (default: 60): Timeout for text extraction subprocess
- MAX_UPLOAD_MB (default: 100): Maximum upload size in megabytes
- CORS_ORIGINS (default: *): Allowed CORS origins
  - Use * to allow all origins (default for local development)
  - Use a comma-separated list for production: https://example.com,https://app.example.com
- OCR_HOST_PORT (default: 8010): Host port to map to container port 8000
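For illustration, the variables and defaults listed above can be loaded with a plain dataclass. The project layout mentions Pydantic settings in config.py, so treat this as a standard-library stand-in rather than the real implementation; `Settings` and `load_settings` are hypothetical names. OCR_HOST_PORT is left out on the assumption that it only affects the host-side port mapping.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class Settings:
    app_host: str
    app_port: int
    ocr_languages: str
    ocr_max_concurrency: int
    ocr_subprocess_timeout_seconds: int
    extract_subprocess_timeout_seconds: int
    max_upload_mb: int
    cors_origins: List[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from env (os.environ by default), applying documented defaults."""
    env = os.environ if env is None else env
    origins = env.get("CORS_ORIGINS", "*")
    return Settings(
        app_host=env.get("APP_HOST", "0.0.0.0"),
        app_port=int(env.get("APP_PORT", "8000")),
        ocr_languages=env.get("OCR_LANGUAGES", "eng+deu+fra"),
        ocr_max_concurrency=int(env.get("OCR_MAX_CONCURRENCY", "2")),
        ocr_subprocess_timeout_seconds=int(env.get("OCR_SUBPROCESS_TIMEOUT_SECONDS", "300")),
        extract_subprocess_timeout_seconds=int(env.get("EXTRACT_SUBPROCESS_TIMEOUT_SECONDS", "60")),
        max_upload_mb=int(env.get("MAX_UPLOAD_MB", "100")),
        # Comma-separated list; "*" stays a single-element list.
        cors_origins=[o.strip() for o in origins.split(",") if o.strip()],
    )
```

Passing a dictionary instead of relying on os.environ makes the defaults easy to check, for example `load_settings({"OCR_MAX_CONCURRENCY": "4"})`.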
curl -X POST "http://localhost:8010/v1/ocr" \
  -F "file=@document.pdf"

curl -X POST "http://localhost:8010/v1/ocr?force_ocr=true" \
  -F "file=@document.pdf"

curl -X POST "http://localhost:8010/v1/ocr?lang=eng+fra" \
  -F "file=@document.pdf"

curl -X POST "http://localhost:8010/v1/ocr" \
  -F "file=@document.docx"

import requests

url = "http://localhost:8010/v1/ocr"

# Process PDF (the with-block closes the file after the upload)
params = {"lang": "eng+deu", "force_ocr": False}
with open("document.pdf", "rb") as f:
    response = requests.post(url, files={"file": f}, params=params)
response.raise_for_status()  # raise on 400/500 instead of failing on a missing key
result = response.json()
print(result["text"])
print(f"Processing time: {result['meta']['timings_ms']['total']}ms")
print(f"Engine: {result['meta']['engine']}")

# Process DOCX
with open("document.docx", "rb") as f:
    response = requests.post(url, files={"file": f})
response.raise_for_status()
result = response.json()
print(result["text"])

app/
├── main.py # FastAPI application, middleware setup
├── config.py # Pydantic settings from environment
├── routes/
│ ├── health.py # Health and readiness endpoints
│ └── ocr.py # Text extraction endpoint (PDF/DOCX)
├── services/
│ └── ocr_service.py # OCR and text extraction orchestration
└── utils/
├── logging.py # Structured logging and request ID middleware
├── subprocess_runner.py # Subprocess execution with timeouts
└── docx_extractor.py # DOCX text extraction utility
The service uses an asyncio.Semaphore to limit concurrent OCR jobs, preventing CPU exhaustion. Configure via OCR_MAX_CONCURRENCY.
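The semaphore pattern described above can be sketched as follows. This is a minimal illustration, not the service's code: `run_limited` and `fake_ocr_job` are hypothetical names, and the sleep stands in for a real OCR call. The peak counter is there only to show that the limit holds.

```python
import asyncio


async def run_limited(jobs, max_concurrency=2):
    """Run coroutine factories with at most max_concurrency in flight.

    Returns (results, peak), where peak is the highest number of jobs
    that were running at the same time.
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    active = 0
    peak = 0

    async def guarded(job):
        nonlocal active, peak
        async with semaphore:  # waits here while the limit is reached
            active += 1
            peak = max(peak, active)
            try:
                return await job()
            finally:
                active -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_ocr_job():
    """Placeholder for an OCR job; sleeps briefly instead of doing work."""
    await asyncio.sleep(0.01)
    return "done"


def demo():
    """Run five placeholder jobs with a limit of two."""
    return asyncio.run(run_limited([fake_ocr_job] * 5, max_concurrency=2))
```

Requests beyond the limit are not rejected in this sketch; they wait at the semaphore until a slot frees up.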
Each request gets a unique UUID that is:
- Logged with all log messages
- Returned in the X-Request-ID response header
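One way to attach a request ID to every log line is a context variable plus a logging filter. The sketch below uses only the standard library and leaves out the web framework; `request_id_var`, `RequestIdFilter`, and `begin_request` are hypothetical names, and the real middleware may work differently.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being handled ("-" outside a request).
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto each log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def begin_request():
    """Generate a UUID for the request and the header to return with it."""
    request_id = str(uuid.uuid4())
    request_id_var.set(request_id)
    return request_id, {"X-Request-ID": request_id}


def demo():
    """Log one message during a simulated request; return (id, headers, line)."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger = logging.Logger("ocr_service_demo", level=logging.INFO)
    logger.addHandler(handler)

    request_id, headers = begin_request()
    logger.info("processing started")
    return request_id, headers, stream.getvalue().strip()
```

Context variables are isolated per asyncio task, so concurrent requests handled in separate tasks would not overwrite each other's IDs.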
Error messages are sanitized to remove internal paths and temporary directory names, preventing information leakage.
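Sanitization along these lines could look like the sketch below. It assumes POSIX-style paths and a simple regex approach; the function name `sanitize_error` and the `<tmp>` and `<path>` placeholders are hypothetical, and the service's actual rules may differ.

```python
import re
import tempfile
from typing import Optional


def sanitize_error(message: str, temp_root: Optional[str] = None) -> str:
    """Replace temp-directory paths and other absolute paths with placeholders."""
    temp_root = temp_root or tempfile.gettempdir()
    # First hide anything under the temp root, including random directory names.
    cleaned = re.sub(re.escape(temp_root) + r"[^\s'\":]*", "<tmp>", message)
    # Then hide remaining absolute paths that have at least one directory level.
    cleaned = re.sub(r"(?<![\w<])/(?:[\w.\-]+/)+[\w.\-]*", "<path>", cleaned)
    return cleaned
```

For example, `sanitize_error("failed on /tmp/tmpabc123/input.pdf: bad header", temp_root="/tmp")` yields `"failed on <tmp>: bad header"`.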
All subprocess calls have configurable timeouts to prevent hanging processes.
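A timeout-guarded subprocess call can be sketched with `subprocess.run`, which kills the child and raises `subprocess.TimeoutExpired` once the limit passes. The wrapper name `run_with_timeout` and the `SubprocessTimeout` exception are hypothetical, and the demo runs the Python interpreter rather than an OCR binary.

```python
import subprocess
import sys


class SubprocessTimeout(RuntimeError):
    """Raised when a command does not finish within its time limit."""


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return (returncode, stdout, stderr), or raise on timeout."""
    try:
        completed = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
            check=False,  # the caller inspects the return code
        )
    except subprocess.TimeoutExpired as exc:
        raise SubprocessTimeout(f"{cmd[0]} exceeded {timeout_seconds}s") from exc
    return completed.returncode, completed.stdout, completed.stderr


def demo():
    """Run a trivial command with a generous limit."""
    return run_with_timeout([sys.executable, "-c", "print('ok')"], 30)
```

In the service, the limits would presumably come from OCR_SUBPROCESS_TIMEOUT_SECONDS and EXTRACT_SUBPROCESS_TIMEOUT_SECONDS.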
Uses Python's tempfile.TemporaryDirectory for safe temporary file handling with automatic cleanup.
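The cleanup behaviour can be seen in a short sketch. `process_upload` is a hypothetical name, and the size lookup stands in for the real OCR and extraction steps; the point is that the directory and everything in it are gone once the with-block exits, even if an exception is raised inside it.

```python
import tempfile
from pathlib import Path


def process_upload(data: bytes, filename: str = "input.pdf"):
    """Write data into a temporary directory, use it, and let it be cleaned up.

    Returns (size_in_bytes, directory_still_exists).
    """
    with tempfile.TemporaryDirectory(prefix="ocr_") as tmp:
        workdir = Path(tmp)
        input_path = workdir / filename  # fixed name, not a user-supplied one
        input_path.write_bytes(data)
        size = input_path.stat().st_size  # placeholder for the real processing
    # The with-block has exited, so the directory has been removed.
    return size, workdir.exists()
```

Writing the upload under a fixed file name avoids trusting a client-supplied name when building paths.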
- Install system dependencies (Ubuntu/Debian):
  sudo apt-get install ocrmypdf tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu tesseract-ocr-fra ghostscript qpdf poppler-utils unpaper
- Install Python dependencies:
  pip install -r requirements.txt
- Run the service:
  python -m app.main
Once the service is running, visit:
- Swagger UI: http://localhost:8010/docs
- ReDoc: http://localhost:8010/redoc
MIT