
2025-07-09

Jerry PDF Service - Custom OCR Microservice

Built a FastAPI microservice for PDF-to-image conversion with intelligent caching for N8N automation workflows. It features SHA256 hashing, translation storage, and API key security for OCR pipelines.

FastAPI Framework · PDF2Image + Poppler · SQLite Database · Docker Development Environment · SHA256 Content Hashing
Jerry PDF Service API documentation interface

FastAPI auto-generated documentation showing OCR service endpoints

📋 Overview

Developed a custom FastAPI microservice to solve PDF translation workflow challenges in N8N automation. The service converts PDF documents to high-quality images with intelligent SHA256-based caching, eliminating redundant processing and enabling seamless integration with external translation services like Google Gemini.

🎯 Challenge

The N8N translation workflows needed efficient PDF-to-image conversion, with caching so that identical content is never processed twice. The service also had to offer secure API access, proper error handling, and data that persists across container restarts.

💡 Solution

Built a containerized FastAPI microservice with PDF2Image conversion, SQLite caching database, and SHA256 hashing for intelligent duplicate detection. Implemented API key authentication, hot reload development environment, and structured endpoints for N8N workflow integration.

🛠️ Technologies

FastAPI Framework

High-performance Python web framework with automatic API documentation and async support

PDF2Image + Poppler

PDF processing library with system-level Poppler utilities for reliable document conversion

SQLite Database

Lightweight caching database for translation storage with SHA256 hash indexing

Docker Development Environment

Containerized deployment with hot reload capabilities and volume persistence

SHA256 Content Hashing

Cryptographic hashing for intelligent duplicate detection and cache optimization

✨ Key Features

🔄 **Intelligent Caching System**: SHA256 hashing of each rendered page prevents re-translating identical PDF pages

⚡ **High-Performance Conversion**: 150 DPI JPEG output optimized for OCR accuracy

🔐 **API Security**: X-API-Key authentication protecting all endpoints from unauthorized access

📊 **SQLite Storage**: Persistent translation cache with timestamp tracking

🐳 **Development Environment**: Hot reload Docker setup with volume mounting for rapid iteration

📡 **N8N Integration**: Purpose-built endpoints for seamless automation workflow integration

🗃️ **Base64 Encoding**: Direct image data transfer without temporary file storage

💾 **Data Persistence**: Volume-mounted database surviving container restarts

📝 **Auto Documentation**: FastAPI automatic OpenAPI documentation at /docs endpoint

🔍 **Hash-based Lookup**: Efficient cache checking for pre-processed content
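
The Base64 transfer listed above can be illustrated with a minimal sketch. The helper names `encode_image` and `decode_image` are hypothetical and the sample bytes are a stand-in for a real JPEG page; the point is only that the image travels inside the JSON response and is decoded in memory by the consumer, at the cost of roughly one third more payload size.

```python
import base64


def encode_image(img_bytes: bytes) -> str:
    """Return the image bytes as an ASCII base64 string suitable for a JSON field."""
    return base64.b64encode(img_bytes).decode("ascii")


def decode_image(image_base64: str) -> bytes:
    """Recover the original bytes from the base64 string on the consumer side."""
    return base64.b64decode(image_base64)


# Stand-in payload (12 bytes); in the service this would be one page's JPEG bytes.
sample = b"\xff\xd8\xff\xe0" + b"\x00" * 8
encoded = encode_image(sample)
restored = decode_image(encoded)

# Base64 maps every 3 input bytes to 4 output characters.
print(len(sample), len(encoded))
```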

🏗️ Technical Highlights

Intelligent Caching Architecture

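Implemented SHA256-based content hashing of each rendered page image, so identical pages are recognized even when they appear in different documents. A cache hit returns the stored translation instead of the image, which saves the call to the external translation service for repeated content in automated workflows.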
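
A minimal sketch of the deduplication idea, using placeholder byte strings rather than real page images (the function name `page_hash` is hypothetical). Because the hash covers the encoded image bytes, two pages only match when their rendered output is byte-identical, which presumably requires the same DPI, JPEG quality, and rendering stack.

```python
import hashlib


def page_hash(img_bytes: bytes) -> str:
    """Return the SHA256 hex digest used as the cache key for one page image."""
    return hashlib.sha256(img_bytes).hexdigest()


# Placeholder byte strings standing in for rendered page images.
page_a = b"rendered page bytes"
page_b = b"rendered page bytes"          # identical content, e.g. from another PDF
page_c = b"rendered page bytes, edited"  # any change yields a different key

same_key = page_hash(page_a) == page_hash(page_b)
different_key = page_hash(page_a) != page_hash(page_c)
```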

N8N Workflow Integration

Designed API endpoints specifically for N8N automation workflows, enabling seamless PDF upload, image retrieval, and translation caching in automated document processing pipelines.
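
To make the workflow side concrete, here is a hedged sketch of how a consumer might handle a `/process_pdf` response: pages already cached carry a translation, the rest carry a Base64 image that still needs translating and then caching. The helper names `split_pages` and `cache_payload` are illustrative assumptions; in N8N this logic would more likely live in workflow nodes than in Python.

```python
from typing import Any


def split_pages(response: dict[str, Any]) -> tuple[list[dict], list[dict]]:
    """Separate pages with a cached translation from pages that still need one."""
    cached = [p for p in response["pages"] if p["cached"]]
    pending = [p for p in response["pages"] if not p["cached"]]
    return cached, pending


def cache_payload(page: dict, translation: str) -> dict:
    """Build the JSON body for /cache_translation once a page has been translated."""
    return {"page_hash": page["page_hash"], "translation": translation}


# Example response shaped like the /process_pdf output (values are made up).
response = {
    "pages": [
        {"page_number": 1, "page_hash": "aaa", "cached": True,
         "translation": "Hello", "image_base64": None},
        {"page_number": 2, "page_hash": "bbb", "cached": False,
         "translation": None, "image_base64": "/9j/4AAQ"},
    ]
}

cached, pending = split_pages(response)
payload = cache_payload(pending[0], "World")
```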

Development-Optimized Container Setup

Created Docker Compose configuration with hot reload capabilities, volume mounting for code changes, and proper environment variable management for rapid development and testing cycles.

Production-Ready Security Model

Implemented comprehensive API key authentication across all endpoints with environment-based configuration and conditional documentation exposure for development vs production environments.
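
The two ideas in this paragraph can be sketched without the web framework. The function names and the `"development"` value below are assumptions, not the service's actual code. `docs_url`, `redoc_url`, and `openapi_url` are real `FastAPI(...)` constructor arguments, and setting them to `None` disables the corresponding routes.

```python
import hmac
from typing import Optional


def is_valid_api_key(provided: Optional[str], expected: Optional[str]) -> bool:
    """Compare the X-API-Key header value with the configured key.

    A missing header or an unset server-side key is always rejected.
    hmac.compare_digest avoids leaking information through timing.
    """
    if not provided or not expected:
        return False
    return hmac.compare_digest(provided.encode(), expected.encode())


def docs_kwargs(environment: str) -> dict:
    """Return FastAPI constructor kwargs that hide the docs outside development."""
    if environment == "development":  # assumed value of the ENVIRONMENT variable
        return {}
    return {"docs_url": None, "redoc_url": None, "openapi_url": None}


ok = is_valid_api_key("s3cret", "s3cret")
rejected = is_valid_api_key(None, "s3cret")
prod_kwargs = docs_kwargs("production")
```

The dictionary would be passed as `FastAPI(title=..., **docs_kwargs(env))`, and the boolean check would sit inside the dependency that raises an `HTTPException` on failure.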

💻 Implementation

FastAPI Service Architecture

import base64
import hashlib
import io
import os
import sqlite3

from fastapi import FastAPI, UploadFile, File, Security, HTTPException
from fastapi.security import APIKeyHeader
from pdf2image import convert_from_bytes

# API Security Setup
API_KEY_NAME = "X-API-Key"
api_key_header = APIKeyHeader(name=API_KEY_NAME, auto_error=False)
SECRET_API_KEY = os.environ.get("API_KEY")

# get_api_key (the dependency that validates the X-API-Key header) and
# check_cache (the SQLite lookup by page hash) are defined elsewhere in
# the service and are omitted from this excerpt.

app = FastAPI(
    title="PDF Processing API",
    dependencies=[Security(get_api_key)]  # Global protection
)

@app.post("/process_pdf")
async def process_pdf(file: UploadFile = File(...)):
    pdf_bytes = await file.read()
    # Render every PDF page to a PIL image at 150 DPI
    images = convert_from_bytes(pdf_bytes, dpi=150, fmt="jpeg")

    results = []
    for i, img in enumerate(images):
        # Encode the page as JPEG in memory and hash the resulting bytes
        img_bytes_io = io.BytesIO()
        img.save(img_bytes_io, format="JPEG", quality=85)
        img_bytes = img_bytes_io.getvalue()
        page_hash = hashlib.sha256(img_bytes).hexdigest()

        # Check cache for existing translation
        cached_result = check_cache(page_hash)

        # Cached pages return the translation; new pages return the image
        page_data = {
            "page_number": i + 1,
            "page_hash": page_hash,
            "cached": bool(cached_result),
            "translation": cached_result[0] if cached_result else None,
            "image_base64": None if cached_result else base64.b64encode(img_bytes).decode()
        }
        results.append(page_data)

    return {"pages": results}
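
The excerpt above calls a `check_cache` helper that is not shown. One plausible shape for it, offered as an assumption rather than the service's actual code, is a lookup that returns the row tuple or `None`, which matches how `cached_result[0]` is used. The extra `db_file` parameter and the throwaway demo database are additions for this sketch; the original call takes only the hash.

```python
import os
import sqlite3
import tempfile
from contextlib import closing
from typing import Optional, Tuple


def check_cache(page_hash: str, db_file: str) -> Optional[Tuple[str]]:
    """Return (translation,) for a known page hash, or None on a cache miss."""
    with closing(sqlite3.connect(db_file)) as conn:
        return conn.execute(
            "SELECT translation FROM page_translations WHERE page_hash = ?",
            (page_hash,),
        ).fetchone()


# Demo against a temporary database file with one cached page.
with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, "cache.db")
    with closing(sqlite3.connect(db)) as conn:
        conn.execute(
            "CREATE TABLE page_translations ("
            "page_hash TEXT PRIMARY KEY, translation TEXT)"
        )
        conn.execute(
            "INSERT INTO page_translations VALUES (?, ?)", ("abc123", "Hello")
        )
        conn.commit()
    hit = check_cache("abc123", db)
    miss = check_cache("ffff", db)
```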

Docker Development Configuration

# docker-compose.dev.yml
version: '3.7'

services:
  pdf-service:
    build:
      context: .
    container_name: jerry-pdf-service-dev
    restart: unless-stopped
    ports:
      - "8001:8000"  # Expose on homelab network
    environment:
      - ENVIRONMENT=${ENVIRONMENT}
      - API_KEY=${API_KEY}
    volumes:
      - ./data:/app/data          # Persistent database
      - ./main.py:/app/main.py    # Hot reload capability
    command: uvicorn main:app --host 0.0.0.0 --port 8000 --reload

Intelligent Caching System

# SQLite caching with SHA256 optimization
# DB_FILE, CacheEntry, and HashCheck are defined elsewhere in the service (omitted here)
def setup_database():
    conn = sqlite3.connect(DB_FILE)
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS page_translations (
            page_hash TEXT PRIMARY KEY,           -- SHA256 of image content
            translation TEXT,                     -- Cached translation result
            translated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()
    conn.close()

@app.post("/cache_translation")
async def cache_translation(entry: CacheEntry):
    """Store translation result for future identical pages"""
    conn = sqlite3.connect(DB_FILE)
    cursor = conn.cursor()
    cursor.execute(
        "INSERT OR REPLACE INTO page_translations (page_hash, translation) VALUES (?, ?)",
        (entry.page_hash, entry.translation)
    )
    conn.commit()
    conn.close()
    return {"status": "success", "page_hash": entry.page_hash}

@app.post("/check_cache_by_hash")
async def check_cache_by_hash(item: HashCheck):
    """N8N endpoint for checking cached translations"""
    conn = sqlite3.connect(DB_FILE)
    cursor = conn.cursor()
    cursor.execute("SELECT translation FROM page_translations WHERE page_hash = ?", (item.page_hash,))
    cached_result = cursor.fetchone()
    conn.close()
    
    return {
        "cached": bool(cached_result),
        "translation": cached_result[0] if cached_result else None,
        "page_hash": item.page_hash
    }
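
A short sketch of what `INSERT OR REPLACE` does with this schema, using an in-memory database and a hypothetical `store` helper. Writing the same `page_hash` twice leaves a single row holding the latest translation. SQLite implements this by deleting the old row and inserting a new one, so `translated_at` is reset to the time of the latest write.

```python
import sqlite3

# In-memory database with the same table layout as the service's cache.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS page_translations (
        page_hash TEXT PRIMARY KEY,
        translation TEXT,
        translated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")


def store(conn: sqlite3.Connection, page_hash: str, translation: str) -> None:
    """Insert a translation, replacing any existing row with the same hash."""
    conn.execute(
        "INSERT OR REPLACE INTO page_translations (page_hash, translation) VALUES (?, ?)",
        (page_hash, translation),
    )
    conn.commit()


store(conn, "abc123", "first draft")
store(conn, "abc123", "corrected text")  # same key: overwrites the first row

rows = conn.execute(
    "SELECT page_hash, translation FROM page_translations"
).fetchall()
conn.close()
```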

🚀 Deployment Specs

Platform: Ubuntu Server 192.168.1.30
Container: Python 3.10-slim with Poppler utilities
Framework: FastAPI with Uvicorn ASGI server
Port: 8001:8000 (external:internal)
Database: SQLite with volume persistence
Development: Hot reload with volume mounting
Image Processing: 150 DPI, 85% JPEG quality
Security: API key authentication via X-API-Key header
Dependencies: pdf2image, Pillow, FastAPI, SQLite3

📸 Gallery

N8N workflow diagram with PDF service integration

N8N automation workflow using Jerry PDF Service for document processing

Docker development environment with hot reload

Development setup showing containerized service with volume mounting

SQLite database showing cached translations

Database structure showing SHA256 hashes and cached translation data

🎓 Key Learnings

  • 📚 FastAPI framework architecture and async programming patterns
  • 📚 PDF processing with Python libraries and system dependencies
  • 📚 Microservice design patterns for automation workflow integration
  • 📚 Docker development environments with hot reload capabilities
  • 📚 SHA256 hashing for content deduplication and cache optimization
  • 📚 SQLite database design for high-performance caching systems
  • 📚 API security implementation with environment-based configurations