Gratheon / image-splitter

Overview

The image-splitter microservice is a core component of the Gratheon platform responsible for processing images of beehive frames uploaded by users. Its primary function is to orchestrate the analysis of these images by invoking various internal and external machine learning models to detect key elements like bees (workers, drones, queens), brood cells, pollen, honey, queen cups, and varroa mites.

It receives uploaded images, stores them, manages a queue of analysis jobs, calls detection models, aggregates the results, and makes them available via the platform's federated GraphQL API. It also generates resized versions (thumbnails) of the original images.

Key Responsibilities:

Accepting direct image uploads (frame sides).
Storing original images and generated thumbnails in object storage (AWS S3/Minio).
Managing an asynchronous job queue (using MySQL + Redis Pub/Sub) for various detection tasks (resizing, bees, cells, queen cups, varroa, queens).
Orchestrating calls to internal ML models (models-bee-detector, models-frame-resources) and external services (Clarifai) for specific detections.
Storing detection results (bounding boxes, counts, percentages) in the database.
Publishing processing status updates via Redis Pub/Sub for real-time notifications.
Exposing processed data and initiating actions (like AI advice generation) through a federated GraphQL API.

Key Features

Direct, authenticated image upload handling.
Asynchronous, persistent, and retry-capable job processing for image analysis tasks.
Integration with multiple ML models for comprehensive frame analysis.
Generation of image thumbnails.
Storage of raw detection data (e.g., bounding boxes) and summary statistics.
Federated GraphQL endpoint for querying results and managing frame data.
Integration with Sentry for error monitoring.
Configurable storage backend (AWS S3 for production, Minio for development/testing).

Architecture

High-Level Service Diagram

graph LR
    subgraph User Interaction
        direction LR
        WebApp("<a href='https://github.com/Gratheon/web-app'>web-app</a>")
    end

    subgraph API Layer
        direction LR
        Router("<a href='https://github.com/Gratheon/graphql-router'>graphql-router</a>")
    end

    subgraph Image Processing
        direction TB
        ImageSplitter["<b>image-splitter</b><br/>(This Service)"]
        subgraph Storage
            direction TB
            MySQL[(MySQL DB)]
            Redis[(Redis)]
            S3(Object Storage<br/>AWS S3 / Minio)
        end
        subgraph ML Models
            direction TB
            BeeDetector("<a href='https://github.com/Gratheon/models-bee-detector'>models-bee-detector</a>")
            FrameResources("<a href='https://github.com/Gratheon/models-frame-resources'>models-frame-resources</a>")
            Clarifai(Clarifai API)
        end
    end

    subgraph Other Services
        SchemaRegistry("<a href='https://github.com/Gratheon/graphql-schema-registry'>graphql-schema-registry</a>")
    end


    WebApp -- GraphQL Queries/Mutations --> Router
    Router -- Federated GraphQL --> ImageSplitter
    WebApp -- Direct Image Upload --> ImageSplitter

    ImageSplitter -- Register Schema --> SchemaRegistry
    ImageSplitter -- Store/Retrieve Jobs & Results --> MySQL
    ImageSplitter -- Publish Events --> Redis
    ImageSplitter -- Store/Retrieve Images --> S3

    ImageSplitter -- Detect Bees --> BeeDetector
    ImageSplitter -- Detect Cells --> FrameResources
    ImageSplitter -- Detect Varroa/Queens/Cups --> Clarifai

    style ImageSplitter fill:#f9f,stroke:#333,stroke-width:2px

Technology Stack

Language: TypeScript
Framework: Fastify
API: GraphQL (Apollo Server v3, Apollo Federation v1)
Database: MySQL (@databases/mysql)
Job Queue: Hybrid MySQL + Redis Pub/Sub (custom implementation)
Cache/PubSub: Redis (ioredis)
Object Storage: AWS S3 / Minio (@aws-sdk/client-s3)
Image Processing: Sharp, webp-converter
ML Integrations: Clarifai gRPC (clarifai-nodejs-grpc), Internal REST APIs
Containerization: Docker, Docker Compose
Monitoring: Sentry (@sentry/node)
Testing: Jest

API Documentation

The service exposes a GraphQL endpoint, typically accessed via the federated graphql-router. However, for file uploads, clients interact directly with this service due to limitations in GraphQL federation handling multipart/form-data.

URLs

Development: http://localhost:8800/graphql
Production: https://image.gratheon.com/graphql (Note: Uploads go directly here, not via the main router)

Authentication:

Requests via the router are authenticated using an internal signature (internal-router-signature header).
Direct requests (like uploads) must include a JWT in the token header.

Key Mutations

uploadFrameSide(file: Upload!): File: Uploads a frame side image. Initiates the processing pipeline by creating relevant jobs (resize, detections).
addFileToFrameSide(frameSideId: ID!, fileId: ID!, hiveId: ID!): Boolean: Associates an uploaded file with a specific frame side and hive.
filesStrokeEditMutation(files: [FilesUpdateInput]): Boolean: Saves user-drawn annotations on an image.
updateFrameSideCells(cells: FrameSideCellsInput!): Boolean!: Allows manual correction of detected cell percentages.
confirmFrameSideQueen(frameSideId: ID!, isConfirmed: Boolean!): Boolean!: Manually confirms or denies the presence of a queen on a frame side.
generateHiveAdvice(hiveID: ID, adviceContext: JSON, langCode: String): String: Triggers a call to an external AI (e.g., GPT-4 via Clarifai) to generate advice based on hive data.
cloneFramesForInspection(frameSideIDs: [ID], inspectionId: ID!): Boolean!: Creates copies of frame side data associated with a new inspection record.

Key Queries

file(id:ID!): File: Retrieves details about a specific file, including its resizes.
hiveFiles(hiveId:ID!): [FrameSideFile]: Gets all frame side files associated with a hive.
hiveFrameSideFile(frameSideId:ID!): FrameSideFile: Gets detailed information for a specific frame side file, including detection results.
hiveFrameSideCells(frameSideId:ID!): FrameSideCells: Gets the cell composition statistics for a frame side.
frameSidesInspections(frameSideIds: [ID], inspectionId:ID!): [FrameSideInspection]: Retrieves data for specific frame sides within the context of an inspection.

Upload Sequence Diagram

sequenceDiagram
    participant Client
    participant ImageSplitter
    participant S3 [Object Storage]
    participant DB [(MySQL DB)]

    Client->>+ImageSplitter: POST /graphql (multipart/form-data)\nMutation: uploadFrameSide(file: Upload!)
    ImageSplitter->>ImageSplitter: Authenticate (JWT)
    ImageSplitter->>+S3: Store Original Image
    S3-->>-ImageSplitter: Stored URL/Path
    ImageSplitter->>+DB: INSERT INTO files (metadata, url)
    DB-->>-ImageSplitter: file_id
    ImageSplitter->>+DB: INSERT INTO jobs (type='resize', ref_id=file_id)
    DB-->>-ImageSplitter: job_id (resize)
    ImageSplitter-->>-Client: { data: { uploadFrameSide: { id: file_id, url: ... } } }
    Note right of ImageSplitter: Resize job queued for async processing.

(Note: Subsequent detection jobs are typically added after addFileToFrameSide associates the file with a frame side)

Database Schema

The service uses a MySQL database to store file metadata, job queue information, and detection results.

Key Tables:

files: Stores metadata about uploaded original images (filename, dimensions, user_id, S3 URL).
files_resized: Stores information about generated thumbnails/resizes for each file.
files_frame_side_rel: Join table linking files to frame sides. Stores detection results (JSON blobs for bees, cells, cups, varroa, queens), counts, and user annotations (strokeHistory). Also includes inspection_id.
files_frame_side_cells: Stores detailed cell counts and calculated percentages (brood, honey, pollen, etc.) for a frame side. Includes inspection_id.
files_frame_side_queen_cups: (Potentially deprecated/merged into files_frame_side_rel based on migrations) Previously stored queen cup detection status. Includes inspection_id.
jobs: Manages the asynchronous processing queue. Tracks job type (name), status (start/end times), retries (calls), errors, payload, priority (1-5, lower is higher priority), and the reference ID (ref_id) linking to the relevant entity (e.g., file_id).
files_hive_rel: Join table linking files to hives for box-type images (e.g., varroa bottom board images).

Entity-Relationship Diagram (ERD)

erDiagram
    files ||--o{ files_resized : "has"
    files ||--|{ files_frame_side_rel : "associated via"
    files_frame_side_rel }o--|| frame_side : "links to (external)"
    files_frame_side_rel ||--|{ files_frame_side_cells : "has details in"
    files ||--o{ jobs : "processed by (resize)"
    files_frame_side_rel ||--o{ jobs : "processed by (detections)"

    files {
        int id PK
        int user_id
        varchar filename
        varchar hash
        varchar ext
        int width
        int height
        varchar url "Generated S3/Minio URL"
        int url_version
    }

    files_resized {
        int id PK
        int file_id
        int max_dimension_px
        varchar url "Generated S3/Minio URL"
    }

    files_frame_side_rel {
        int frame_side_id PK
        int file_id PK
        int user_id
        int inspection_id
        json strokeHistory
        json detectedObjects "Legacy/Combined?"
        json detectedBees
        json detectedCells
        json detectedQueenCups
        json detectedVarroa
        json detectedQueens
        int workerCount
        int droneCount
        int queen_detected "Boolean flag"
        int varroaCount
        datetime added_time
    }

    files_frame_side_cells {
        int id PK "Auto-increment, links?"
        int frame_side_id
        int file_id
        int user_id
        int inspection_id
        int honey_cell_count
        int brood_cell_count
        int egg_cell_count
        int capped_brood_cell_count
        int pollen_cell_count
        int nectar_cell_count
        int empty_cell_count
        int broodPercent "Calculated/Manual"
        int cappedBroodPercent "Calculated/Manual"
        int eggsPercent "Calculated/Manual"
        int pollenPercent "Calculated/Manual"
        int honeyPercent "Calculated/Manual"
    }

    jobs {
        int id PK
        varchar name "'cells', 'bees', 'drones', 'cups', 'queens', 'varroa', 'varroa_bottom', 'resize', 'notify'"
        datetime process_start_time
        datetime last_retry_time
        int calls "Retry count"
        datetime process_end_time
        int ref_id "Ref files(id) or files_frame_side_rel(file_id)"
        json payload
        text error
        tinyint priority "1=High, 3=Medium, 5=Low (default)"
    }

    frame_side {
        int id PK
        string name "(External Entity)"
    }

Asynchronous Processing

The service utilizes a hybrid MySQL + Redis Pub/Sub job queue for handling time-consuming image analysis tasks asynchronously. This architecture eliminates database polling, reduces database load by 90-95%, and provides near-instant job processing.

Architecture Overview

Job Queue Design:

MySQL Database: Source of truth for job persistence, auditing, and recovery
Redis Pub/Sub: Instant notification system to wake up workers when jobs are added
Workers: Subscribe to Redis channels and process jobs on-demand

Performance Benefits:

~95% reduction in database queries (no more polling every 10ms-10s)
<100ms latency for job processing (vs 5 seconds with polling)
Instant notifications via Redis pub/sub
Job prioritization (high/medium/low) for optimal resource allocation
Rate limiting per job type to prevent API quota exhaustion

Job Types and Priorities

Priority	Job Types	Rate Limit	Use Case
1 (High)	`resize`, `notify`	None	User-blocking operations
3 (Medium)	`bees`, `drones`, `cells`	100ms	Local AI processing
5 (Low)	`varroa`, `varroa_bottom`, `queens`, `cups`	2000ms	External API calls (Clarifai)

Rate Limiting: External API jobs are limited to 30 calls/minute (2000ms between jobs) to prevent quota exhaustion.

Workflow

Job Creation: When an image is uploaded (uploadFrameSide) or associated with a frame (addFileToFrameSide), relevant jobs are inserted into the jobs table with:
- name: Job type (e.g., 'resize', 'bees')
- ref_id: Reference to files.id
- priority: 1-5 (lower number = higher priority)
- payload: Job-specific data
- Redis Notification: A pub/sub message is published to jobs:new:{name} channel
Worker Subscription: Background workers (initiated by orchestrator.ts) subscribe to Redis channels:
- Each worker subscribes to jobs:new:{jobType} (e.g., jobs:new:bees)
- Workers block on Redis, consuming no resources when idle
Job Notification: When a job is added:
- Database INSERT completes
- Redis PUBLISH sent to jobs:new:{name}
- All workers subscribed to that channel are notified instantly (<10ms)
Job Locking: Upon notification, worker:
- Queries database for highest priority pending job of that type
- Locks job by setting process_start_time (prevents concurrent processing)
- Jobs timeout after 1 minute if not completed (allows retry)
Execution: The corresponding handler function executes:
- resizeOriginalToThumbnails for resize jobs
- detectWorkerBees for bee detection
- analyzeCells for cell detection
- detectVarroa, detectQueens, analyzeQueenCups for Clarifai jobs
External Calls: Handlers download images from S3/Minio and call ML services:
- Internal models: models-bee-detector, models-frame-resources
- External API: Clarifai (varroa, queens, cups)
Result Storage: Results stored in database tables:
- files_resized for thumbnails
- files_frame_side_rel for detection results (JSON blobs)
- files_frame_side_cells for cell statistics
Job Completion/Failure:
- Success: Set process_end_time, check for more pending jobs
- Failure: Log error, increment calls, set process_end_time
Retries: Failed jobs automatically retry up to 2 times (total 3 attempts)
Notification: Many jobs create a notify job upon completion:
- Handled by notifyViaRedis worker
- Publishes real-time updates to user-specific Redis channels
- Enables live UI updates in web-app

Startup Recovery

On service startup, the system:

Queries database for incomplete jobs (process_end_time IS NULL)
Publishes Redis notifications for each pending job
Workers pick up and process orphaned jobs
Zero job loss even if Redis or service restarts

Worker Architecture & Concurrency

Worker Model:

Single worker per job type: Each job type (resize, bees, drones, cells, etc.) has ONE dedicated worker loop
Sequential processing: Workers process jobs one at a time within their type
No parallel limit: Jobs queue up but process sequentially, preventing overwhelming system resources

Example: Uploading 20 Images

When 20 images are uploaded simultaneously:

140 jobs created (20 images × 7 job types: resize, bees, drones, cells, cups, queens, varroa)
Each worker processes its queue sequentially
No concurrent job limit - all jobs are queued, not rejected
Processing order: Priority (1-5) → Job ID (FIFO within priority)

Worker Concurrency Model:

Upload 20 images
  ↓
[Priority 1: resize] ──> [Job 1] → [Job 2] → [Job 3] → ... → [Job 20]
[Priority 3: bees]   ──> [Job 1] → [Job 2] → [Job 3] → ... → [Job 20]
[Priority 3: drones] ──> [Job 1] → [Job 2] → [Job 3] → ... → [Job 20]
[Priority 3: cells]  ──> [Job 1] → [Job 2] → [Job 3] → ... → [Job 20]
[Priority 5: varroa] ──> [Job 1] → [Job 2] → [Job 3] → ... → [Job 20]
[Priority 5: queens] ──> [Job 1] → [Job 2] → [Job 3] → ... → [Job 20]
[Priority 5: cups]   ──> [Job 1] → [Job 2] → [Job 3] → ... → [Job 20]

Concurrency Limits:

✅ Between workers: 7 workers run concurrently (one per job type)
❌ Within worker: Sequential processing only (1 job at a time per worker)
❌ Image chunks: Processed sequentially within each job (6-16 chunks per image)
⚠️ No queue depth limit: Jobs queue indefinitely, not rejected (could grow unbounded)

Database Connection Usage:

Connection pool size: 20 connections (configurable in src/models/storage.ts:37)
Active connections: ~7-10 (one per active worker + queries)
Logger connections: 3 separate connections (separate pool for log persistence)
Queue timeout: 10 seconds (prevents indefinite waiting)
Connection recycling: After 5000 uses (prevents memory leaks)
Idle timeout: 60 seconds (closes unused connections)

Bee/Drone Detection Flow

Worker bee and drone detection split images into chunks for processing:

Chunking Strategy:

Images are split into a grid based on dimension threshold (typically 800-1024px per chunk)
Calculation: maxCuts = floor(imageDimension / chunkSize)
Example: 4000×3000px image with 1024px chunks → 3×2 grid = 6 chunks
Chunks are processed sequentially (not in parallel)

Processing Pipeline:

Chunk Processing (each chunk processed sequentially):
- Cut image into chunk using Sharp library
- Send chunk to ML service (models-bee-detector or models-drone-bees)
- Receive detections (bounding boxes, confidence scores)
- Transform coordinates from chunk-space to full-image-space
- Atomically append detections to database using JSON_MERGE_PRESERVE
- Increment worker/drone count counters
Partial Updates (optional, after each chunk):
- Published to {uid}.frame_side.{frameSideId}.bees_partially_detected
- Includes isBeeDetectionComplete: false
- Allows progressive UI updates (not currently implemented)
Final Completion (after ALL chunks processed):
- Query final counts from database
- Create NOTIFY_JOB with priority=1
- Published to {uid}.frame_side.{frameSideId}.bees_detected
- Includes isBeeDetectionComplete: true and final counts

Atomic Updates:

Uses MySQL's JSON_MERGE_PRESERVE to avoid race conditions
Each chunk's detections are appended atomically
Counters incremented with IFNULL(count, 0) + newCount
No read-modify-write cycles that could lose data

Memory & Performance:

Full image loaded once at start (using fs.promises.readFile)
Chunks cut from in-memory buffer (no disk I/O per chunk)
Sequential processing prevents memory explosion
Each chunk ~1MB, full image ~5-15MB in memory during processing

Worker Sequence Diagram (Redis-based Queue)

sequenceDiagram
    participant Client
    participant GraphQL as "GraphQL Mutation"
    participant JobsModel
    participant DB [(MySQL DB)]
    participant Redis
    participant Worker as "Worker (subscribed)"
    participant S3 [Object Storage]
    participant MLService as "ML Service"

    Client->>+GraphQL: addFileToFrameSide(...)
    GraphQL->>+JobsModel: addJob('bees', fileId, priority=3)
    JobsModel->>+DB: INSERT INTO jobs (name='bees', ref_id, priority)
    DB-->>-JobsModel: Success
    JobsModel->>+Redis: PUBLISH jobs:new:bees {refId, priority}
    Redis-->>Worker: Message: new 'bees' job available
    Redis-->>-JobsModel: Published
    GraphQL-->>-Client: Success

    Worker->>+JobsModel: checkAndProcessJob('bees')
    JobsModel->>+DB: Fetch & Lock highest priority 'bees' job
    DB-->>-JobsModel: job (ref_id=fileId)
    JobsModel->>+Worker: execute(ref_id, payload)
    Worker->>+DB: Get file metadata (URL, dimensions)
    DB-->>-Worker: file data
    Worker->>+S3: Download image
    S3-->>-Worker: Image bytes
    Worker->>+MLService: Detect bees (image data)
    MLService-->>-Worker: Detection results (JSON)
    Worker->>+DB: Store results in files_frame_side_rel
    DB-->>-Worker: Success
    Worker->>+Redis: PUBLISH {uid}.frame_side.{id}.bees_detected
    Redis-->>-Worker: Published
    Worker-->>-JobsModel: Complete
    JobsModel->>+DB: UPDATE jobs SET process_end_time=NOW()
    DB-->>-JobsModel: Success
    JobsModel->>JobsModel: Check for more pending jobs

Redis Events / Notifications

Upon successful completion of certain asynchronous detection jobs, the service publishes events to a Redis Pub/Sub channel. This allows downstream services (e.g., event-stream-filter, web-app) to react to processing updates in real-time.

Mechanism:

A detection worker (e.g., detectWorkerBees) finishes processing for a file_id.
It marks its specific job as complete in the jobs table.
Crucially, it enqueues a new job of type notify into the jobs table, often including the original file_id (as ref_id) and potentially context about the completed step in the payload.
The notifyViaRedis worker picks up this notify job.
This worker likely queries the database using the ref_id (file_id) to retrieve associated identifiers like user_id (uid) and frame_side_id.
It constructs an event topic string using these identifiers.
It publishes a message (often just a simple notification, the details are usually fetched via GraphQL later) to the specific Redis channel.

Event Pattern:

The general pattern for the Redis channel is: {uid}.frame_side.{frame_side_id}.<detection_type>_detected

Key Events Published:

{uid}.frame_side.{frame_side_id}.bees_detected
{uid}.frame_side.{frame_side_id}.cells_detected (or frame_resources_detected)
{uid}.frame_side.{frame_side_id}.queen_cups_detected
{uid}.frame_side.{frame_side_id}.queens_detected
{uid}.frame_side.{frame_side_id}.varroa_detected

(Note: The exact event names might vary slightly based on implementation details in notifyViaRedis.ts)

Notification Sequence Diagram

sequenceDiagram
    participant Worker as "Worker (e.g., detectWorkerBees)"
    participant JobsModel
    participant DB [(MySQL DB)]
    participant notifyViaRedis as "notifyViaRedis Worker"
    participant Redis

    Worker-->>+JobsModel: Processing Complete for file_id=X
    JobsModel->>+DB: Mark 'bees' job complete for file_id=X
    DB-->>-JobsModel: Success
    JobsModel->>+DB: Add 'notify' job for file_id=X, payload={event:'bees_detected'}
    DB-->>-JobsModel: Success

    loop Poll for 'notify' jobs
        notifyViaRedis->>+JobsModel: processJobInLoop('notify', ...)
        JobsModel->>+DB: Fetch & Lock 'notify' job for file_id=X
        DB-->>-JobsModel: job (ref_id=X, payload={...})
        JobsModel->>+notifyViaRedis: execute(ref_id=X, payload)
        notifyViaRedis->>+DB: Get uid, frame_side_id for file_id=X
        DB-->>-notifyViaRedis: uid=Y, frame_side_id=Z
        notifyViaRedis->>+Redis: PUBLISH channel="Y.frame_side.Z.bees_detected" message="{...}"
        Redis-->>-notifyViaRedis: Success
        notifyViaRedis-->>-JobsModel: Processing Complete
        JobsModel->>+DB: Mark 'notify' job complete for file_id=X
        DB-->>-JobsModel: Success
    end

Configuration

Configuration is managed via files in src/config/ (config.default.ts, config.dev.ts, config.testing.ts) and environment variables. Key configuration options (refer to config.default.ts for defaults):

routerSignature: Secret shared with graphql-router.
sentryDsn: DSN for Sentry error reporting.
schema_registry_url: URL for the GraphQL Schema Registry.
selfUrl: Publicly accessible URL of this service for the registry.
yolo_v5_url: Endpoint for the internal bee detection model.
models_frame_resources_url: Endpoint for the internal cell detection model.
mysql: Database connection details (host, port, user, password, database).
aws: S3/Minio configuration (bucket, key, secret, endpoint, public URL).
jwt.privateKey: Secret key for JWT verification (must match user-cycle service).
clarifai: API Keys (PATs) for different Clarifai applications (varroa, queen, cups, beekeeper AI).

Environment variables like NATIVE (for local vs. Docker) and ENV_ID (dev, testing, prod) influence which configuration values are used.

Development

Prerequisites

Node.js v18+ (check with node --version)
Docker and Docker Compose (for running services)
Just command runner (optional but recommended: brew install just)
MySQL Client (for manual database operations)
Redis CLI (for debugging pub/sub)

Setup

Clone Repository:

git clone https://github.com/Gratheon/image-splitter.git
cd image-splitter

Configuration:
```
cp src/config/config.default.ts src/config/config.dev.ts
```
Update config.dev.ts with:
- AWS/Minio credentials (use Minio for local dev)
- Clarifai PATs (optional, for testing external APIs)
- MySQL connection (defaults work with Docker Compose)
- Redis connection (defaults work with Docker Compose)
Install Dependencies:
```
pnpm install
```
Start Services:
```
just start
# Or manually:
docker-compose -f docker-compose.dev.yml up --build
```
This starts:
- image-splitter service (port 8800)
- MySQL database (port 5100)
- Minio object storage (port 19000, console: 19001)
- Redis (port 6379)

Verify Services:

# Check health
curl http://localhost:8800/healthz

# Check job stats
curl http://localhost:8800/jobs/stats

# Access GraphQL playground
open http://localhost:8800/graphql

# Access Minio console
open http://localhost:19001
# Login: minio-admin / minio-admin

Development Workflow

Hot Reload: The dev environment uses tsc-watch for automatic recompilation:

# Logs show compilation and restart
docker-compose -f docker-compose.dev.yml logs -f image-splitter

View Logs:

# All services
docker-compose -f docker-compose.dev.yml logs -f

# Specific service
docker-compose -f docker-compose.dev.yml logs -f image-splitter

# Follow job processing
docker-compose -f docker-compose.dev.yml logs -f | grep "Processing job"

Database Access:

# Connect to MySQL
docker exec -it image-splitter-mysql-1 mysql -u root -ptest image-splitter

# Or via local client
mysql -h 127.0.0.1 -P 5100 -u root -ptest image-splitter

Redis Monitoring:

# Connect to Redis CLI
docker exec -it image-splitter-redis-1 redis-cli

# Monitor pub/sub messages
> MONITOR

# Subscribe to job notifications
> PSUBSCRIBE jobs:new:*

# Subscribe to user events
> PSUBSCRIBE *.frame_side.*

Useful Just Commands:

just --list          # Show all available commands
just start           # Start dev environment
just stop            # Stop services
just restart         # Restart services
just test-integration # Run integration tests
just logs            # Tail service logs

Database Migrations

Migrations are plain SQL files located in the migrations/ directory. They are applied automatically on service startup using @databases/mysql.

Adding Migrations: Create a new SQL file in migrations/ following the naming convention (e.g., 020-new-feature.sql). The service will apply it on the next restart.
Schema Generation: You can generate TypeScript types from the DB schema using pnpm run generate-db-schema (ensure DB is running).

Testing

The service includes comprehensive unit and integration tests using Jest.

Unit Tests

Tests individual functions and modules in isolation.

Run unit tests:

pnpm run test:unit

Coverage:

Job model logic (jobs.ts)
Data transformation functions
Frame side models
Detection result processing

Configuration: test/unit/jest.coverage.json

Integration Tests

Tests the entire service stack including database, Redis, and S3 interactions.

Run integration tests:

just test-integration

What it does:

Spins up test environment via docker-compose.test.yml:
- MySQL database (port 5100)
- Redis (port 6379)
- Minio object storage (port 19000)
Runs database migrations
Executes test suite against running services
Tears down test environment

Test coverage:

GraphQL mutations and queries
File upload flow
Job queue processing
Redis pub/sub notifications
S3/Minio integration

Configuration: test/integration/jest.coverage.json

Manual Testing

Upload a test image:

curl -X POST http://localhost:8800/graphql \
  -H "token: YOUR_JWT_TOKEN" \
  -F operations='{"query": "mutation($file: Upload!) { uploadFrameSide(file: $file) { id url } }", "variables": {"file": null}}' \
  -F map='{"0": ["variables.file"]}' \
  -F 0=@test-frame.jpg

Check job queue status:

curl http://localhost:8800/jobs/stats | jq

Monitor Redis events:

docker exec -it image-splitter-redis-1 redis-cli
> SUBSCRIBE jobs:new:*
> PSUBSCRIBE *.frame_side.*.bees_detected

Test Data

Sample test images should be added to test/fixtures/ directory with various scenarios:

Different image sizes (small, medium, large)
Various bee densities (empty, sparse, dense)
Different frame types (brood, honey, mixed)
Edge cases (damaged frames, poor lighting)

Troubleshooting

Jobs Not Processing

Symptoms: Jobs stuck in pending state, no processing activity

Check:

# 1. Verify workers are subscribed to Redis
docker logs image-splitter | grep "subscribed to jobs:new"
# Should see: "Worker for bees subscribed to jobs:new:bees" for each job type

# 2. Check Redis connectivity
docker exec -it image-splitter-redis-1 redis-cli PING
# Should return: PONG

# 3. Check job queue
curl http://localhost:8800/jobs/stats | jq '.jobs.bees'
# Look for high pending count

# 4. Check for errors in logs
docker logs image-splitter | grep -i error

# 5. Manually trigger a Redis notification
docker exec -it image-splitter-redis-1 redis-cli
> PUBLISH jobs:new:bees '{"refId":123,"priority":3}'

Solutions:

Restart the service to re-subscribe workers
Check Redis is running: docker ps | grep redis
Verify database connectivity
Check for stuck jobs (process_start_time < NOW() - 1 MINUTE)

Bee/Drone Detection Not Finalizing

Symptoms: Partial results appear but isBeeDetectionComplete never set to true

Check:

# Check if all 9 chunks completed
docker logs image-splitter | grep "Processing part"
# Should see: "Processing part 1/9" through "Processing part 9/9"

# Check for NOTIFY_JOB creation
curl http://localhost:8800/jobs/stats | jq '.jobs.notify'

# Monitor final completion messages
docker logs image-splitter | grep "all chunks processed"

Solutions:

Check ML service availability (models-bee-detector, models-drone-bees)
Verify enough memory for image processing (4GB+ recommended)
Check for chunk processing errors in logs

Redis Pub/Sub Not Working

Symptoms: No real-time updates in web-app

Check:

# Monitor Redis subscriptions
docker exec -it image-splitter-redis-1 redis-cli
> CLIENT LIST
# Look for subscribers

> PUBSUB CHANNELS
# Should see: jobs:new:bees, jobs:new:cells, etc.

# Test pub/sub manually
> SUBSCRIBE jobs:new:bees
> # In another terminal:
> PUBLISH jobs:new:bees "test"

Solutions:

Restart Redis: docker restart image-splitter-redis-1
Check Redis memory usage
Verify network connectivity between services

Database Performance Issues

Symptoms: Slow job processing, timeouts

Check:

# Check active queries
docker exec -it image-splitter-mysql-1 mysql -u root -ptest -e "SHOW PROCESSLIST;"

# Check slow queries
docker exec -it image-splitter-mysql-1 mysql -u root -ptest -e "SELECT * FROM mysql.slow_log LIMIT 10;"

# Check table sizes
docker exec -it image-splitter-mysql-1 mysql -u root -ptest image-splitter -e "
SELECT 
  table_name, 
  ROUND(data_length/1024/1024, 2) AS 'Data Size (MB)',
  table_rows 
FROM information_schema.tables 
WHERE table_schema='image-splitter';"

Solutions:

Add missing indexes (check migration 023-job-priorities.sql)
Archive old completed jobs
Optimize files_frame_side_rel table (large JSON columns)

Out of Memory (OOM)

Symptoms: Container restarts, image processing fails

Check:

# Check memory usage
docker stats image-splitter

# Check memory limit
docker inspect image-splitter | grep -i memory

Solutions:

Increase container memory limit (4-8GB recommended)
Reduce concurrent job processing
Clear tmp directory: docker exec image-splitter rm -rf /app/tmp/*

ML Service Connectivity

Symptoms: Detection jobs fail with connection errors

Check:

# Test bee detector
curl -X POST http://models-bee-detector:8000/detect -F "file=@test.jpg"

# Test frame resources
curl -X POST http://models-frame-resources:8000/detect -F "file=@test.jpg"

# Check DNS resolution
docker exec image-splitter nslookup models-bee-detector

Solutions:

Verify ML services are running
Check Docker network configuration
Update service URLs in config.dev.ts

Common Error Messages

Error	Cause	Solution
`Redis connection timeout`	Redis not running or unreachable	Start Redis container
`Failed to acquire lock on job`	Concurrent processing attempt	Normal, worker will retry
`File not found in S3`	Missing or deleted file	Check S3/Minio bucket
`clarifai.StatusCodeException`	API quota exceeded or invalid PAT	Check Clarifai dashboard
`Cannot read property 'width' of null`	File metadata missing	Re-upload file
`Out of memory`	Image too large or memory limit too low	Increase memory limit

Deployment

The service is designed to run in Docker containers.
Dockerfile.prod defines the production image build process.
docker-compose.yml provides an example of production deployment configuration (though actual deployment might use Kubernetes or other orchestrators).

Production Requirements

Infrastructure:

MySQL Database: For job persistence and data storage
Redis: For job notifications and real-time pub/sub events
S3-compatible Object Storage: AWS S3 (production) or Minio (dev/test)
Internal ML Services:
- models-bee-detector (bee/drone detection)
- models-frame-resources (cell detection)
External APIs: Clarifai API keys (varroa, queens, cups, AI advice)

Environment Configuration:

ENV_ID=prod to use production configuration
Valid Sentry DSN for error monitoring
Clarifai Personal Access Tokens (PATs)
AWS S3 credentials (or Minio endpoint for self-hosted)
MySQL connection details
Redis connection details
JWT private key (shared with user-cycle service)
Router signature (shared with graphql-router)

Deployment Steps

Database Migration: Run all SQL migrations from migrations/ directory

# Migrations are auto-applied on service startup
# Or manually apply:
for file in migrations/*.sql; do
  mysql -h $DB_HOST -u $DB_USER -p$DB_PASS $DB_NAME < $file
done

Build Docker Image:

docker build -f Dockerfile.prod -t image-splitter:latest .

Deploy Container:
- Listens on port 8800 internally
- Requires network access to MySQL, Redis, S3, and ML services
- Set memory limit to at least 4GB (--memory=4g)
- Set CPU limit based on expected load

Verify Deployment:

# Health check
curl http://your-service:8800/healthz

# Job queue stats
curl http://your-service:8800/jobs/stats

# Check logs for worker subscriptions
docker logs image-splitter | grep "subscribed to jobs:new"

Scaling Considerations

Horizontal Scaling: Currently, the service is designed to run as a single instance due to:

Job locking mechanism (prevents duplicate processing)
In-memory worker subscriptions

Future Improvements for Multi-Instance:

Use Redis-based distributed locking
Implement proper job claim mechanism
Add worker instance coordination

Vertical Scaling:

Increase memory for handling larger images (4-8GB recommended)
More CPU cores improve parallel chunk processing (bee/drone detection)

Production Differences from Development

Storage: AWS S3 instead of Minio
Redis: Production Redis instance with persistence
Database: Production MySQL with replication/backups
Monitoring: Sentry integration enabled
ML Services: Production endpoints with higher quotas
SSL/TLS: Handled by ingress/load balancer
Authentication: Stricter JWT validation

Monitoring in Production

Key Metrics to Track:

Job queue depth (/jobs/stats endpoint)
Job processing time (avg per job type)
Job failure rate (failures / total)
Database connection pool usage
Redis pub/sub message rate
S3 request count and latency
Memory usage (track for OOM issues)
External API quota usage (Clarifai)

Alerts to Configure:

Job queue backed up (>100 pending jobs)
High failure rate (>5% for any job type)
Database connection failures
Redis disconnections
S3 upload failures
ML service timeouts
Memory approaching limit

Monitoring & Logging

Error Reporting: Integrated with Sentry (sentryDsn must be configured).
Logging: Uses custom logger (@gratheon/log-lib) with dual output:
- Console output for container logs (stdout/stderr)
- MySQL persistence for log history and debugging

Database Connection Pool Stability

Recent Improvements (as of 2026-01):

To prevent connection pool exhaustion under high load, several optimizations have been implemented:

Connection Pool Configuration (src/models/storage.ts:33-40):

{
  poolSize: 20,                        // Max 20 connections (was: unlimited)
  queueTimeoutMilliseconds: 10000,     // Wait max 10s for connection (was: 60s)
  idleTimeoutMilliseconds: 60000,      // Close idle connections after 60s
  maxUses: 5000,                       // Recycle connections after 5000 uses
}

Impact:

✅ Prevents connection exhaustion when processing many images simultaneously
✅ Faster failure detection (10s timeout vs 60s)
✅ Automatic connection recycling prevents memory leaks
✅ Idle connection cleanup reduces resource usage

Separate Logger Connection Pool:

Logger has its own connection pool (3 connections max)
Prevents logger from consuming application connections
Graceful fallback: if logger DB fails, logs still go to console

Common Connection Pool Errors:

Error	Meaning	Solution
`Timed out waiting for connection from pool`	All 20 connections in use, queue exhausted	Normal under heavy load; jobs will retry
`Failed to persist log to DB`	Logger connection pool exhausted (dev only)	Increase logger pool size or disable DB logging
`CONNECTION_POOL:QUEUE_TIMEOUT`	Waited 10s for connection, none available	Check for slow queries, increase pool size

Monitoring Connection Health:

# Watch connection count in real-time
docker logs image-splitter -f | grep "Active connections"

# Check for connection timeouts
docker logs image-splitter | grep "Timed out waiting"

# Monitor slow queries
docker exec image-splitter-mysql-1 mysql -u root -ptest -e "SHOW PROCESSLIST;"

Tuning Guidelines:

High load (>50 concurrent uploads): Increase poolSize to 30-50
Slow ML services: Increase queueTimeoutMilliseconds to 20000-30000
Memory constrained: Reduce idleTimeoutMilliseconds to 30000
Connection leaks suspected: Reduce maxUses to 1000-2000

Health & Monitoring Endpoints

The service exposes several HTTP endpoints for monitoring:

GET /healthz - Health check endpoint

{
  "status": "ok",
  "mysql": "connected"  // or "disconnected"
}

GET /jobs/stats - Job queue statistics (NEW)

{
  "status": "ok",
  "jobs": {
    "resize": {
      "total": 1523,
      "pending": 5,
      "completed": 1510,
      "failed": 8,
      "avgProcessingTime": 2.34  // seconds
    },
    "bees": { ... },
    "drones": { ... },
    "cells": { ... },
    "cups": { ... },
    "queens": { ... },
    "varroa": { ... },
    "varroa_bottom": { ... },
    "notify": { ... }
  }
}

Use cases:

Monitor job queue depth (pending jobs)
Track failure rates by job type
Identify performance bottlenecks (avg processing time)
Alert on stuck jobs or high error rates

GET / - API documentation page listing all endpoints
GET /graphql - GraphQL playground (development/testing)

License

This project is dual-licensed:

AGPL v3 - For open source use:
- ✅ You can use, modify, and distribute the software
- ✅ Source code is freely available
- ⚠️ If you modify and host a public instance, you must share your modifications
- ⚠️ Any derivative work must also be licensed under AGPL v3
Commercial License - For enterprise customers who need:
- 🏢 On-premise deployment without source disclosure
- 🔧 Custom modifications without copyleft obligations
- 📞 Priority support and SLA guarantees

Contributing

If you'd like to contribute, please see our Contributing Guide and sign our Contributor License Agreement (CLA).

Please refer to the CODEOWNERS file for ownership details.
Follow standard GitHub flow: Fork the repository, create a feature branch, make changes, and submit a Pull Request.

Name		Name	Last commit message	Last commit date
Latest commit History 328 Commits
.github/workflows		.github/workflows
config		config
migrations		migrations
scripts		scripts
src		src
test		test
.dockerignore		.dockerignore
.gitignore		.gitignore
.nvmrc		.nvmrc
CODEOWNERS		CODEOWNERS
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile.dev		Dockerfile.dev
Dockerfile.prod		Dockerfile.prod
LICENSE		LICENSE
LICENSE-COMMERCIAL.md		LICENSE-COMMERCIAL.md
README.md		README.md
docker-compose.dev.yml		docker-compose.dev.yml
docker-compose.test.yml		docker-compose.test.yml
docker-compose.yml		docker-compose.yml
fix-varroa-image-memory-limit.md		fix-varroa-image-memory-limit.md
justfile		justfile
package.json		package.json
pnpm-lock.yaml		pnpm-lock.yaml
restart.sh		restart.sh
schema.graphql		schema.graphql

Folders and files

Latest commit

History

Repository files navigation

Gratheon / image-splitter

Overview

Key Features

Architecture

High-Level Service Diagram

Technology Stack

API Documentation

URLs

Key Mutations

Key Queries

Upload Sequence Diagram

Database Schema

Entity-Relationship Diagram (ERD)

Asynchronous Processing

Architecture Overview

Job Types and Priorities

Workflow

Startup Recovery

Worker Architecture & Concurrency

Bee/Drone Detection Flow

Worker Sequence Diagram (Redis-based Queue)

Redis Events / Notifications

Notification Sequence Diagram

Configuration

Development

Prerequisites

Setup

Development Workflow

Database Migrations

Testing

Unit Tests

Integration Tests

Manual Testing

Test Data

Troubleshooting

Jobs Not Processing

Bee/Drone Detection Not Finalizing

Redis Pub/Sub Not Working

Database Performance Issues

Out of Memory (OOM)

ML Service Connectivity

Common Error Messages

Deployment

Production Requirements

Deployment Steps

Scaling Considerations

Production Differences from Development

Monitoring in Production

Monitoring & Logging

Database Connection Pool Stability

Health & Monitoring Endpoints

License

Contributing

About

Resources

License

Licenses found

Contributing

Uh oh!

Stars

Watchers

Forks

Uh oh!

Contributors

Uh oh!

Languages