Pipeline Compilation System Architecture
The Problem: Runtime Errors in Image Processing Pipelines
Image processing pipelines often fail at runtime with cryptic errors: “GPU out of memory”, “incompatible array types”, “file not found”. These failures happen after hours of processing, wasting computational resources and researcher time. Without compile-time validation, users can’t catch configuration errors until execution begins. Additionally, resource allocation decisions (which GPU, which backend) are scattered throughout the code, making optimization impossible.
The Solution: Compile-Time Validation and Resource Planning
OpenHCS implements a declarative, compile-time pipeline system that treats configuration as a first-class compilation target. This architecture separates pipeline definition from execution, enabling compile-time validation, resource optimization, and reproducible execution. The compiler catches errors before execution begins and makes optimal resource allocation decisions upfront.
Overview
OpenHCS implements a declarative, compile-time pipeline system that treats configuration as a first-class compilation target. This architecture separates pipeline definition from execution, enabling compile-time validation, resource optimization, and reproducible execution.
Note: Function patterns shown use real OpenHCS API with function objects and parameter tuples, matching TUI-generated script patterns.
Core Philosophy
The system is designed around three fundamental principles:
Declaration Phase: Functions declare their contracts via decorators
Compilation Phase: Multi-pass compiler builds execution plans
Execution Phase: Stateless execution against immutable contexts
This approach is analogous to a programming language compiler, but for data processing pipelines.
Multi-Pass Compiler Architecture
The pipeline compiler operates in five sequential phases, each building upon the previous:
Phase 1: Step Plan Initialization (initialize_step_plans_for_context)
Purpose: Establishes the data flow topology, registers ObjectState layers, and resolves a deterministic, saved-value representation of each step
Input: Step definitions + ProcessingContext (well_id, input_dir)
- Output: Initialized
step_planswith input/output directories and special I/O paths
- Output: Initialized
Responsibilities:
Creates basic step plan structure for each step
Calls
PipelinePathPlanner.prepare_pipeline_paths()for path resolutionRegisters global/orchestrator/step
ObjectStatescopes and resolvestheir saved values before metadata injection
Determines input/output directories for each step
Creates VFS paths for special I/O (cross-step communication)
Links special outputs from one step to special inputs of another
Handles chain breaker logic and input source detection
Key Insight:
The compiler returns (resolved_steps, step_state_map), and the
step_state_map is consumed by later phases (notably memory
validation/streaming config collection) so only the saved configuration
values used during the resolution pass influence execution plans.
Key Error:
"Context step_plans must be initialized before path planning" -
Indicates this phase failed to properly initialize the step_plans
structure
Phase 2: ZARR Store Declaration (declare_zarr_stores_for_context)
Purpose: Declares ZARR stores for steps that will use ZARR backend
Input:
step_plansfrom Phase 1 with path informationOutput: ZARR store declarations in step plans (
zarr_configentries)Responsibilities:
Identifies steps that will use ZARR materialization backend
Declares ZARR stores with appropriate configuration metadata
Sets up well coordination information for multi-well ZARR stores
Provides configuration for runtime store creation
Key Insight: This phase doesn’t create ZARR stores - it declares which steps will need them and provides the metadata for runtime store creation. The actual ZARR stores are created during execution when the first well writes data.
Phase 3: Materialization Planning (plan_materialization_flags_for_context)
Purpose: Decides where data lives (VFS backend strategy)
Input:
step_plansfrom Phase 2 with ZARR store declarationsOutput: Backend selection (disk vs memory) for each step
Strategy:
First step: Always reads from disk (input images)
Last step: Always writes to disk (final outputs)
Middle steps: Can use memory backend for speed
FunctionSteps: Can use intermediate backends
Non-FunctionSteps: Must use persistent backends
Phase 4: Memory Contract Validation (validate_memory_contracts_for_context)
Purpose: Ensures memory type compatibility across the pipeline AND stores function patterns
Input:
step_plans+ function decorators + modified function patterns from path plannerOutput: Memory type validation, function pattern storage, and injection into step_plans
Key Responsibility: Stores the function pattern (potentially modified by path planner) in ``step_plans[‘func’]``
Implementation: Uses
FuncStepContractValidatorinternallyValidation:
All functions must have explicit memory type declarations
Functions in the same step must have consistent memory types
Memory types must be valid (numpy, cupy, torch, tensorflow, jax)
Storage: Returns
{'input_memory_type': ..., 'output_memory_type': ..., 'func': func_pattern}
Phase 5: GPU Resource Assignment (assign_gpu_resources_for_context)
Purpose: Resource allocation for GPU-accelerated steps
Input:
step_planswith memory typesOutput: GPU device assignments
Implementation: Uses
GPUMemoryTypeValidatorinternallyLogic:
Identifies steps requiring GPU memory types
Assigns available GPU devices
Validates GPU resource availability
Function Pattern System
The Sacred Four Patterns
OpenHCS supports four fundamental function execution patterns that provide unified handling of different processing strategies:
1. Single Function Pattern
from openhcs.processing.backends.processors.cupy_processor import tophat
FunctionStep(func=(tophat, {'selem_radius': 50}))
Use Case: Apply function with parameters to all data
Execution:
tophat(image_stack, selem_radius=50)for each pattern group
2. Sequential Function Chain
from openhcs.processing.backends.processors.cupy_processor import stack_percentile_normalize, tophat
FunctionStep(func=[
(stack_percentile_normalize, {'low_percentile': 1.0}),
(tophat, {'selem_radius': 50})
])
Use Case: Apply multiple functions in sequence
Execution:
tophat(stack_percentile_normalize(image_stack, low_percentile=1.0), selem_radius=50)for each pattern group
3. Component-Specific Functions (Dict Pattern)
from openhcs.processing.backends.analysis.cell_counting_cpu import count_cells_single_channel
from openhcs.processing.backends.analysis.skan_axon_analysis import skan_axon_skeletonize_and_analyze
FunctionStep(func={
'1': (count_cells_single_channel, {'min_sigma': 1.0}),
'2': skan_axon_skeletonize_and_analyze # No parameters needed
})
Use Case: Different processing per component (channel, site, etc.)
Execution:
count_cells_single_channelfor channel 1 data,skan_axon_skeletonize_and_analyzefor channel 2 data
4. Bare Function Pattern (Simple Cases)
from openhcs.processing.backends.assemblers.assemble_stack_cupy import assemble_stack_cupy
FunctionStep(func=assemble_stack_cupy)
Use Case: Function with no parameters
Execution:
assemble_stack_cupy(image_stack)for each pattern group
Pattern Resolution Flow
Pattern Detection:
microscope_handler.auto_detect_patterns()finds image files matching well/component criteriaPattern Grouping:
prepare_patterns_and_functions()groups patterns by component and resolves func patternsExecution: For each pattern group: load images → stack → process → unstack → save
Decorator System
Memory Type Decorators
Functions declare their memory interface using decorators:
@torch(input_type="torch", output_type="torch")
def my_function(image_stack):
return processed_stack
@numpy # Shorthand for numpy input/output
def another_function(data):
return result
Supported Memory Types: numpy, cupy, torch,
tensorflow, jax, pyclesperanto
Benefits: - No runtime overhead - pure metadata - Enables compile-time memory type checking - Supports automatic memory type conversion planning
Special I/O Decorators
Functions declare cross-step dependencies:
@special_outputs("positions", "metadata")
def generate_positions(image_stack):
return processed_stack, positions, metadata
@special_inputs("positions")
def stitch_images(image_stack, positions):
return stitched_stack
Compiler Behavior: - Automatically links outputs to inputs - Creates VFS paths for intermediate data - Validates dependency chains at compile time
Chain Breaker Decorator
@chain_breaker
def independent_function(image_stack):
return result
Forces the next step to read from the pipeline’s original input directory rather than the previous step’s output.
Virtual File System (VFS)
Abstraction Layer
The VFS provides a unified interface for all storage operations:
# Same API regardless of backend
filemanager.save(data, "path/to/data", "memory")
filemanager.save(data, "path/to/data", "disk")
data = filemanager.load("path/to/data", "memory")
Backend Types
Memory Backend: Fast intermediate data (numpy arrays, tensors)
Disk Backend: Persistent data (images, final outputs)
Zarr Backend: Chunked array storage (future)
Location Transparency
Data can be moved between backends without changing application code. The materialization planner decides optimal storage locations based on: - Step position in pipeline - Step type (FunctionStep vs others) - Resource constraints - Performance requirements
ProcessingContext Lifecycle
1. Creation
context = ProcessingContext(
global_config=config,
well_id="A01",
filemanager=filemanager
)
2. Population (Compilation)
# Phase 1: Step plan initialization
resolved_steps, step_state_map = PipelineCompiler.initialize_step_plans_for_context(
context,
steps,
orchestrator,
steps_already_resolved=False,
)
# Phase 2: ZARR store declaration
PipelineCompiler.declare_zarr_stores_for_context(context, resolved_steps, orchestrator)
# Phase 3: Materialization planning
PipelineCompiler.plan_materialization_flags_for_context(context, resolved_steps, orchestrator)
# Phase 4: Memory contract validation + function pattern storage
PipelineCompiler.validate_memory_contracts_for_context(
context,
resolved_steps,
orchestrator,
step_state_map=step_state_map,
)
# This phase validates memory types, stores function patterns in step_plans['func'],
# and builds ``context.required_visualizers`` using streaming configs from the ObjectState map.
# Phase 5: GPU resource assignment
PipelineCompiler.assign_gpu_resources_for_context(context, resolved_steps, orchestrator)
3. Freezing
context.freeze() # Makes context immutable
4. Execution
for step in steps:
step.process(context) # Read-only access to frozen context
Step Plans Structure
Each step gets a comprehensive execution plan:
context.step_plans[step_id] = {
# Basic metadata
"step_name": "Z-Stack Flattening",
"step_type": "FunctionStep",
"well_id": "A01",
# I/O configuration
"input_dir": "/path/to/input",
"output_dir": "/path/to/output",
"read_backend": "disk",
"write_backend": "memory",
# Memory configuration
"input_memory_type": "numpy",
"output_memory_type": "torch",
"gpu_id": 0,
# Function pattern (CRITICAL: stored by FuncStepContractValidator)
"func": function_pattern, # The actual function pattern (potentially modified by path planner)
# Special I/O
"special_inputs": {
"positions": {"path": "/vfs/positions.pkl", "backend": "memory"}
},
"special_outputs": {
"metadata": {"path": "/vfs/metadata.pkl", "backend": "memory"}
},
# Flags
"force_disk_output": False,
"visualize": False
}
Execution Model
Stateless Steps
After compilation, step objects become pure templates: - All
configuration lives in context.step_plans[step_id] - Same step
definition reused across wells with different configs - Functional
programming approach to pipeline execution
VFS-Based Data Flow
No direct data passing between steps
All data flows through VFS paths specified in step_plans
Location transparency: data can be in memory or on disk
Automatic serialization/deserialization based on backend
Error Handling
The system is designed to fail fast during compilation rather than during execution:
Missing memory type declarations → Compilation error
Incompatible memory types → Compilation error
Missing special input dependencies → Compilation error
Invalid step plan structure → Compilation error
This approach prevents expensive pipeline failures after processing has begun.
See Also
Core Integration:
The Function Pattern System - Function patterns compiled by this system
Memory Type System and Stack Utils - Memory type validation and conversion
Special I/O System: Cross-Step Communication and Dict Pattern Integration - Cross-step communication compilation
Practical Usage:
Pipeline Compilation Workflow - Complete compilation workflow guide
API Reference - API reference (autogenerated from source code)
Advanced Topics:
OpenHCS Pipeline Compilation System - Complete Architecture - Deep dive into 5-phase compilation
Multiprocessing Coordination System - Multiprocessing coordination
GPU Resource Management System - GPU resource assignment details