2025-07-09
Jerry PDF Service - Custom OCR Microservice
Built a FastAPI microservice for PDF-to-image conversion with intelligent caching for N8N automation workflows. Features SHA256 hashing, translation storage, and API security for OCR pipelines.

FastAPI auto-generated documentation showing OCR service endpoints
📋 Overview
Developed a custom FastAPI microservice to solve PDF translation workflow challenges in N8N automation. The service converts PDF documents to high-quality images with intelligent SHA256-based caching, eliminating redundant processing and enabling seamless integration with external translation services like Google Gemini.
🎯 Challenge
Needed efficient PDF-to-image conversion for N8N translation workflows with intelligent caching to avoid reprocessing identical content. Required secure API access, proper error handling, and optimized performance for automated translation pipelines while maintaining data persistence across container restarts.
💡 Solution
Built a containerized FastAPI microservice with PDF2Image conversion, SQLite caching database, and SHA256 hashing for intelligent duplicate detection. Implemented API key authentication, hot reload development environment, and structured endpoints for N8N workflow integration.
🛠️ Technologies
FastAPI Framework
High-performance Python web framework with automatic API documentation and async support
PDF2Image + Poppler
PDF processing library with system-level Poppler utilities for reliable document conversion
SQLite Database
Lightweight caching database for translation storage with SHA256 hash indexing
Docker Development Environment
Containerized deployment with hot reload capabilities and volume persistence
SHA256 Content Hashing
Cryptographic hashing for intelligent duplicate detection and cache optimization
✨ Key Features
🔄 **Intelligent Caching System**: SHA256 hashing of each rendered page lets the service skip re-translating pages it has already seen
⚡ **High-Performance Conversion**: 150 DPI JPEG output optimized for OCR accuracy
🔐 **API Security**: X-API-Key authentication protecting all endpoints from unauthorized access
📊 **SQLite Storage**: Persistent translation cache with timestamp tracking
🐳 **Development Environment**: Hot reload Docker setup with volume mounting for rapid iteration
📡 **N8N Integration**: Purpose-built endpoints for seamless automation workflow integration
🗃️ **Base64 Encoding**: Direct image data transfer without temporary file storage
💾 **Data Persistence**: Volume-mounted database surviving container restarts
📝 **Auto Documentation**: FastAPI automatic OpenAPI documentation at /docs endpoint
🔍 **Hash-based Lookup**: Efficient cache checking for pre-processed content
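The storage and lookup features listed above can be sketched with the standard-library `sqlite3` module. This is a minimal illustration, not the service's actual code: it reuses the `page_translations` table layout from the Implementation section, while the helper names `setup_cache`, `store_translation`, and `lookup_translation` are hypothetical, and an in-memory database stands in for the volume-mounted file.

```python
import sqlite3


def setup_cache(conn: sqlite3.Connection) -> None:
    """Create the cache table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS page_translations (
            page_hash TEXT PRIMARY KEY,
            translation TEXT,
            translated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    conn.commit()


def store_translation(conn: sqlite3.Connection, page_hash: str, translation: str) -> None:
    """Insert a translation, overwriting any previous entry for the same hash."""
    conn.execute(
        "INSERT OR REPLACE INTO page_translations (page_hash, translation) VALUES (?, ?)",
        (page_hash, translation),
    )
    conn.commit()


def lookup_translation(conn: sqlite3.Connection, page_hash: str):
    """Return the cached translation for a hash, or None on a cache miss."""
    row = conn.execute(
        "SELECT translation FROM page_translations WHERE page_hash = ?",
        (page_hash,),
    ).fetchone()
    return row[0] if row else None
```

Because `page_hash` is the primary key, the lookup is an index search rather than a table scan, which is what keeps the cache check cheap as the table grows.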
🏗️ Technical Highlights
Intelligent Caching Architecture
Implemented SHA256-based content hashing to identify identical pages across different documents. Each page is rendered to JPEG and the hash is computed over those image bytes, so a page that has been seen before returns its cached translation instead of going through the translation step again.
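A small sketch of the deduplication idea, using only `hashlib`. The function names `page_hash` and `split_new_and_repeated` are hypothetical, and plain byte strings stand in for the rendered JPEG data of each page.

```python
import hashlib


def page_hash(image_bytes: bytes) -> str:
    """Return the hex SHA256 digest of one rendered page."""
    return hashlib.sha256(image_bytes).hexdigest()


def split_new_and_repeated(pages, known_hashes=()):
    """Split pages into first-time and already-seen pages.

    Returns two lists of (page_number, hash) tuples; page numbers start at 1.
    """
    seen = set(known_hashes)
    new, repeated = [], []
    for number, data in enumerate(pages, start=1):
        digest = page_hash(data)
        if digest in seen:
            repeated.append((number, digest))
        else:
            new.append((number, digest))
            seen.add(digest)
    return new, repeated
```

Note that the match is exact: two pages deduplicate only when their image bytes are identical, so any difference in rendering or JPEG settings produces a different hash.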
N8N Workflow Integration
Designed API endpoints specifically for N8N automation workflows, enabling seamless PDF upload, image retrieval, and translation caching in automated document processing pipelines.
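On the consuming side, a workflow step might handle the `/process_pdf` response roughly as follows. This is a sketch under assumptions: the response shape follows the `pages` list shown in the Implementation section, and the helper names `partition_pages`, `decode_image`, and `cache_payload` are hypothetical, not part of the service or of N8N.

```python
import base64


def partition_pages(response: dict):
    """Split a /process_pdf response into cached pages and pages still to translate."""
    cached = [p for p in response["pages"] if p["cached"]]
    pending = [p for p in response["pages"] if not p["cached"]]
    return cached, pending


def decode_image(page: dict) -> bytes:
    """Recover the raw JPEG bytes of an uncached page from its base64 field."""
    return base64.b64decode(page["image_base64"])


def cache_payload(page: dict, translation: str) -> dict:
    """Build the JSON body for /cache_translation once a page has been translated."""
    return {"page_hash": page["page_hash"], "translation": translation}
```

Only the pending pages need to be sent to the translation service; each result is then posted back with its `page_hash` so the next identical page is served from the cache.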
Development-Optimized Container Setup
Created Docker Compose configuration with hot reload capabilities, volume mounting for code changes, and proper environment variable management for rapid development and testing cycles.
Production-Ready Security Model
Implemented comprehensive API key authentication across all endpoints with environment-based configuration and conditional documentation exposure for development vs production environments.
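The two ideas in this paragraph, key checking and conditional documentation exposure, can be sketched independently of the web framework. The helper names `is_valid_api_key` and `docs_url_for` are hypothetical, and the assumption that `ENVIRONMENT` takes the value `development` in the dev setup is mine, not stated in the write-up.

```python
import hmac
from typing import Optional


def is_valid_api_key(provided: Optional[str], expected: Optional[str]) -> bool:
    """Compare the X-API-Key header value against the configured secret.

    A missing header or an unset secret is always rejected.
    hmac.compare_digest performs the comparison in constant time.
    """
    if not provided or not expected:
        return False
    return hmac.compare_digest(provided.encode(), expected.encode())


def docs_url_for(environment: Optional[str]) -> Optional[str]:
    """Return the docs path in development and None everywhere else."""
    # Assumption: the dev environment is labelled "development".
    return "/docs" if (environment or "").lower() == "development" else None
```

FastAPI accepts `docs_url=None` to turn off the interactive documentation page, so the second helper's result could be passed straight to the `FastAPI(...)` constructor, with the first helper called from the API key dependency.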
💻 Implementation
FastAPI Service Architecture
import base64
import hashlib
import io
import os
import sqlite3

from fastapi import FastAPI, File, HTTPException, Security, UploadFile
from fastapi.security import APIKeyHeader
from pdf2image import convert_from_bytes

# API Security Setup
API_KEY_NAME = "X-API-Key"
api_key_header = APIKeyHeader(name=API_KEY_NAME, auto_error=False)
SECRET_API_KEY = os.environ.get("API_KEY")

# get_api_key was not part of the original excerpt; this is a minimal assumed version.
async def get_api_key(api_key: str = Security(api_key_header)):
    if not SECRET_API_KEY or api_key != SECRET_API_KEY:
        raise HTTPException(status_code=403, detail="Invalid or missing API key")
    return api_key

app = FastAPI(
    title="PDF Processing API",
    dependencies=[Security(get_api_key)],  # Global protection
)

@app.post("/process_pdf")
async def process_pdf(file: UploadFile = File(...)):
    pdf_bytes = await file.read()
    images = convert_from_bytes(pdf_bytes, dpi=150, fmt="jpeg")
    results = []
    for i, img in enumerate(images):
        # Re-encode the page as JPEG and hash the resulting bytes
        img_bytes_io = io.BytesIO()
        img.save(img_bytes_io, format="JPEG", quality=85)
        img_bytes = img_bytes_io.getvalue()
        page_hash = hashlib.sha256(img_bytes).hexdigest()
        # Check cache for existing translation
        # (check_cache is a helper not shown in this excerpt; it returns the matching row or None)
        cached_result = check_cache(page_hash)
        page_data = {
            "page_number": i + 1,
            "page_hash": page_hash,
            "cached": bool(cached_result),
            "translation": cached_result[0] if cached_result else None,
            # The image is only returned when the page still needs translating
            "image_base64": None if cached_result else base64.b64encode(img_bytes).decode(),
        }
        results.append(page_data)
    return {"pages": results}
Docker Development Configuration
# docker-compose.dev.yml
version: '3.7'
services:
  pdf-service:
    build:
      context: .
    container_name: jerry-pdf-service-dev
    restart: unless-stopped
    ports:
      - "8001:8000"  # Expose on homelab network
    environment:
      - ENVIRONMENT=${ENVIRONMENT}
      - API_KEY=${API_KEY}
    volumes:
      - ./data:/app/data  # Persistent database
      - ./main.py:/app/main.py  # Hot reload capability
    command: uvicorn main:app --host 0.0.0.0 --port 8000 --reload
Intelligent Caching System
# SQLite caching with SHA256 optimization
from pydantic import BaseModel

# DB_FILE and the two request models were not part of the original excerpt.
# The path is hypothetical; the model fields are inferred from how they are used below.
DB_FILE = "data/translations.db"

class CacheEntry(BaseModel):
    page_hash: str
    translation: str

class HashCheck(BaseModel):
    page_hash: str

def setup_database():
    conn = sqlite3.connect(DB_FILE)
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS page_translations (
            page_hash TEXT PRIMARY KEY,  -- SHA256 of image content
            translation TEXT,            -- Cached translation result
            translated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()
    conn.close()

@app.post("/cache_translation")
async def cache_translation(entry: CacheEntry):
    """Store translation result for future identical pages"""
    conn = sqlite3.connect(DB_FILE)
    cursor = conn.cursor()
    cursor.execute(
        "INSERT OR REPLACE INTO page_translations (page_hash, translation) VALUES (?, ?)",
        (entry.page_hash, entry.translation),
    )
    conn.commit()
    conn.close()
    return {"status": "success", "page_hash": entry.page_hash}

@app.post("/check_cache_by_hash")
async def check_cache_by_hash(item: HashCheck):
    """N8N endpoint for checking cached translations"""
    conn = sqlite3.connect(DB_FILE)
    cursor = conn.cursor()
    cursor.execute("SELECT translation FROM page_translations WHERE page_hash = ?", (item.page_hash,))
    cached_result = cursor.fetchone()
    conn.close()
    return {
        "cached": bool(cached_result),
        "translation": cached_result[0] if cached_result else None,
        "page_hash": item.page_hash,
    }
🚀 Deployment Specs
| Spec | Value |
| --- | --- |
| Platform | Ubuntu Server (192.168.1.30) |
| Container | Python 3.10-slim with Poppler utilities |
| Framework | FastAPI with Uvicorn ASGI server |
| Port | 8001:8000 (external:internal) |
| Database | SQLite with volume persistence |
| Development | Hot reload with volume mounting |
| Image Processing | 150 DPI, 85% JPEG quality |
| Security | API key authentication via X-API-Key header |
| Dependencies | pdf2image, Pillow, FastAPI, SQLite3 |
📸 Gallery

N8N automation workflow using Jerry PDF Service for document processing

Development setup showing containerized service with volume mounting

Database structure showing SHA256 hashes and cached translation data
🎓 Key Learnings
- 📚 FastAPI framework architecture and async programming patterns
- 📚 PDF processing with Python libraries and system dependencies
- 📚 Microservice design patterns for automation workflow integration
- 📚 Docker development environments with hot reload capabilities
- 📚 SHA256 hashing for content deduplication and cache optimization
- 📚 SQLite database design for high-performance caching systems
- 📚 API security implementation with environment-based configurations
