Storage and Memory System Architecture
Overview
OpenHCS implements a unified storage and memory system that addresses common challenges of scientific image processing: how to efficiently handle datasets that range from small test images to large experimental plates while maintaining performance, type safety, and integration across different computational backends.
The Scientific Computing Challenge: Traditional image analysis tools can struggle with large high-content screening datasets. A typical experiment might generate many individual TIFF files totaling substantial amounts of data. Common issues include out-of-memory errors, slow processing times, or data corruption during format conversions.
The OpenHCS Solution: A two-layer architecture that combines a Virtual File System (VFS) for storage abstraction with a Memory Type System for computational backend management. This allows the same code to work whether processing small test images or large experimental datasets, while automatically handling conversions between NumPy, PyTorch, CuPy, TensorFlow, JAX, and pyclesperanto arrays.
Key Feature: The system treats storage backends and memory types as orthogonal concerns. You can store data in memory, on disk, or in compressed ZARR format regardless of whether it’s a NumPy array, PyTorch tensor, or CuPy array. This separation enables optimization strategies that would be difficult with tightly coupled systems.
Virtual File System (VFS) Architecture
The VFS layer addresses the “storage backend explosion” problem common in scientific computing: different tools have their own preferred storage formats, leading to many conversion utilities and format compatibility issues.
The Problem: Scientific workflows often involve multiple tools, each with different storage preferences. ImageJ works with TIFF files, deep learning frameworks prefer HDF5 or custom formats, and analysis tools often require CSV or JSON outputs. Managing these format differences manually leads to brittle pipelines and data corruption risks.
The Solution: A unified abstraction layer that provides location-transparent data access across different storage backends. The same logical path works whether data is stored in memory for speed, on disk for persistence, or in compressed ZARR format for large datasets.
Backend Abstraction
The VFS abstracts away the underlying storage mechanism through a common interface that hides complexity while enabling optimization:
# Same API regardless of where data is stored
filemanager.save(data, "path/to/data", Backend.MEMORY)
filemanager.save(data, "path/to/data", Backend.DISK)
filemanager.save(data, "path/to/data", Backend.ZARR)
# Load from any backend
data = filemanager.load("path/to/data", Backend.MEMORY)
data = filemanager.load("path/to/data", Backend.DISK)
Why This Matters: The same processing code works regardless of where data is stored. During development, you might use the memory backend for speed. For production runs, you might use the disk backend for reliability. For large datasets, you might use the ZARR backend for compression. The processing logic never changes.
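To make the dispatch idea concrete, here is a minimal sketch. It is not the OpenHCS implementation: `ToyFileManager`, `ToyMemoryBackend`, `ToyDiskBackend`, and `make_toy_filemanager` are hypothetical names invented for illustration, and pickling to a temporary directory is an assumption standing in for real file formats. Only the idea of one `save`/`load` signature routing to interchangeable backends comes from the text above.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict


class ToyMemoryBackend:
    """Hypothetical backend: keeps objects in a dict keyed by logical path."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def save(self, data: Any, path: str) -> None:
        self._store[path] = data

    def load(self, path: str) -> Any:
        return self._store[path]


class ToyDiskBackend:
    """Hypothetical backend: pickles objects under a root directory."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def _resolve(self, path: str) -> Path:
        # Map the logical path onto a real file below the root directory.
        return self._root / path.lstrip("/")

    def save(self, data: Any, path: str) -> None:
        target = self._resolve(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(pickle.dumps(data))

    def load(self, path: str) -> Any:
        return pickle.loads(self._resolve(path).read_bytes())


class ToyFileManager:
    """Routes save/load calls to whichever backend the caller names."""

    def __init__(self, backends: Dict[str, Any]) -> None:
        self._backends = backends

    def save(self, data: Any, path: str, backend: str) -> None:
        self._backends[backend].save(data, path)

    def load(self, path: str, backend: str) -> Any:
        return self._backends[backend].load(path)


def make_toy_filemanager() -> ToyFileManager:
    """Build a manager with one memory and one temporary-disk backend."""
    root = Path(tempfile.mkdtemp())
    return ToyFileManager({"memory": ToyMemoryBackend(), "disk": ToyDiskBackend(root)})
```

The calling code passes the same logical path to either backend; only the backend name changes.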
Backend Type Hierarchy
OpenHCS defines a hierarchy of backend abstractions to support different storage paradigms:
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, List, Union

# Base interface for all data destinations
class DataSink(ABC):
    """Minimal interface for sending data to any destination."""

    @abstractmethod
    def save(self, data: Any, identifier: Union[str, Path], **kwargs) -> None:
        """Send data to the destination."""
        pass

    @abstractmethod
    def save_batch(self, data_list: List[Any], identifiers: List[Union[str, Path]], **kwargs) -> None:
        """Send multiple data objects in a single operation."""
        pass

# Backends with real filesystems (disk, memory, zarr)
class StorageBackend(DataSink):
    """Persistent storage with file-like semantics and retrieval capabilities."""

    @abstractmethod
    def load(self, file_path: Union[str, Path], **kwargs) -> Any:
        """Load data from storage."""
        pass

# Backends without real filesystems (OMERO, cloud storage)
class VirtualBackend(DataSink):
    """Virtual filesystem semantics - generates file listings on-demand."""

    @abstractmethod
    def list_files(self, directory: str, **kwargs) -> List[str]:
        """Generate file list from metadata (not real filesystem)."""
        pass

    @abstractmethod
    def generate_filename(self, metadata: Dict, **kwargs) -> str:
        """Generate filename from metadata."""
        pass
Key Distinction: StorageBackend provides traditional file operations (load/save), while VirtualBackend generates filenames on-demand from metadata without real filesystem operations. This enables location-transparent processing where the same pipeline code works on disk, zarr, or OMERO without modification.
Examples:
StorageBackend: DiskStorageBackend, MemoryStorageBackend, ZarrStorageBackend
VirtualBackend: OMEROLocalBackend (generates paths from OMERO plate structure), VirtualWorkspaceBackend (metadata-based path mapping)
StreamingBackend: NapariStreamBackend, FijiStreamBackend (real-time visualization)
See OMERO Backend System for detailed VirtualBackend architecture and OMERO integration.
Virtual Workspace Backend
Purpose: Metadata-based workspace initialization without physical file operations
When to Use: Microscope formats with complex directory structures (TimePoint folders, ZStep folders) that need flattening for processing.
The Problem: Some microscope formats (ImageXpress, Opera Phenix) organize images in nested folders (e.g., TimePoint_1/ZStep_2/image.tif). Traditional workspace preparation creates physical symlinks or copies files to flatten the structure, which is slow and wastes disk space.
The Solution: VirtualWorkspaceBackend stores a plate-relative path mapping in openhcs_metadata.json that translates virtual flattened paths to real nested paths without any physical file operations.
Architecture:
# Metadata-based mapping (stored in openhcs_metadata.json)
{
    "workspace_mapping": {
        "images/A01_s001_w1_z001_t001.tif": "TimePoint_1/ZStep_1/A01_s001_w1.tif",
        "images/A01_s001_w1_z001_t002.tif": "TimePoint_2/ZStep_1/A01_s001_w1.tif"
    }
}
# VirtualWorkspaceBackend translates paths on-the-fly
backend.load("images/A01_s001_w1_z001_t001.tif")
# → Actually loads "TimePoint_1/ZStep_1/A01_s001_w1.tif"
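The translation step can be pictured as a plain dictionary lookup. The sketch below is an illustration under that assumption; `ToyVirtualWorkspace` and its `resolve` method are hypothetical names, not part of VirtualWorkspaceBackend.

```python
from typing import Dict


class ToyVirtualWorkspace:
    """Hypothetical helper: resolves flattened virtual paths via a mapping."""

    def __init__(self, workspace_mapping: Dict[str, str]) -> None:
        self._mapping = dict(workspace_mapping)

    def resolve(self, virtual_path: str) -> str:
        """Return the real plate-relative path for a virtual path."""
        try:
            return self._mapping[virtual_path]
        except KeyError:
            raise FileNotFoundError(f"No workspace mapping for {virtual_path!r}") from None


mapping = {
    "images/A01_s001_w1_z001_t001.tif": "TimePoint_1/ZStep_1/A01_s001_w1.tif",
    "images/A01_s001_w1_z001_t002.tif": "TimePoint_2/ZStep_1/A01_s001_w1.tif",
}
workspace = ToyVirtualWorkspace(mapping)
real_path = workspace.resolve("images/A01_s001_w1_z001_t001.tif")
```

A real backend would then open `real_path` relative to the plate directory; unmapped paths fail loudly rather than silently falling through.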
Integration with Microscope Handlers:
Microscope handlers that need virtual workspace mapping (e.g., ImageXpress, Opera Phenix) implement _build_virtual_mapping() to generate the workspace mapping. This method is optional and only needed for handlers that use the base class initialize_workspace() implementation.
Handlers that override initialize_workspace() completely (like OMERO and OpenHCS) don’t need to implement this method because they handle workspace initialization differently:
OMERO: Uses database-backed virtual filesystem, doesn’t need file-based mapping
OpenHCS: Already uses normalized format, doesn’t need directory flattening
Example Implementation (ImageXpress, Opera Phenix):
def _build_virtual_mapping(self, plate_path: Path, filemanager: FileManager) -> Path:
    """Build virtual workspace mapping for nested folder structures."""
    workspace_mapping = {}
    # Scan for TimePoint/ZStep folders and build mappings
    for subdir in subdirs:
        self._flatten_timepoints(subdir, filemanager, workspace_mapping, plate_path)
        self._flatten_zsteps(subdir, filemanager, workspace_mapping, plate_path)
    # Save mapping to metadata
    metadata_writer.update_metadata(
        plate_path,
        {"workspace_mapping": workspace_mapping}
    )
    return plate_path
Backend Selection and Preference Hierarchy:
OpenHCS uses a strict backend preference hierarchy when multiple backends are available:
Preference Order: zarr > virtual_workspace > disk
This hierarchy ensures that:
Zarr backend is preferred when available (optimized for large datasets, compressed storage)
Virtual workspace is used for original nested microscope data (when zarr not available)
Disk backend is the fallback (standard TIFF files)
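As a rough sketch of how such a preference order could be applied, the function below walks the order and returns the first backend flagged as available. `BACKEND_PREFERENCE` and `pick_backend` are hypothetical names; only the order itself is taken from the text.

```python
from typing import Dict, Optional, Sequence

# Preference order taken from the text above; the function itself is a sketch.
BACKEND_PREFERENCE: Sequence[str] = ("zarr", "virtual_workspace", "disk")


def pick_backend(available_backends: Dict[str, bool]) -> Optional[str]:
    """Return the most preferred backend flagged as available, or None."""
    for name in BACKEND_PREFERENCE:
        if available_backends.get(name, False):
            return name
    return None
```

For example, `pick_backend({"disk": True, "virtual_workspace": True})` selects `"virtual_workspace"` because zarr is not flagged as available.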
Main Subdirectory Selection:
When loading metadata with multiple subdirectories, OpenHCS selects the subdirectory marked with "main": true:
{
    "subdirectories": {
        ".": {
            "workspace_mapping": {...},
            "available_backends": {"disk": true, "virtual_workspace": true},
            "main": false
        },
        "zarr": {
            "image_files": [...],
            "available_backends": {"zarr": true},
            "main": true
        }
    }
}
The main subdirectory determines:
Which backend is used by get_primary_backend()
Which image list is returned by get_image_files()
Which metadata is used for pipeline execution
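Selecting the main subdirectory amounts to scanning the metadata for the entry flagged `"main": true`. The sketch below assumes the JSON layout shown above has already been parsed into a Python dict; `find_main_subdirectory` is a hypothetical helper, not an OpenHCS function.

```python
from typing import Any, Dict, Optional


def find_main_subdirectory(metadata: Dict[str, Any]) -> Optional[str]:
    """Return the key of the subdirectory flagged "main": true, if any."""
    for name, entry in metadata.get("subdirectories", {}).items():
        if entry.get("main", False):
            return name
    return None


metadata = {
    "subdirectories": {
        ".": {"available_backends": {"disk": True, "virtual_workspace": True}, "main": False},
        "zarr": {"available_backends": {"zarr": True}, "main": True},
    }
}
main_key = find_main_subdirectory(metadata)
```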
Automatic Backend Selection:
The virtual workspace backend is automatically selected when:
Microscope handler’s get_available_backends() detects workspace_mapping in metadata
FileManager registers VirtualWorkspaceBackend in local registry
get_primary_backend() returns "virtual_workspace" for reading original data (if zarr not available)
Metadata emission (centralized helper):
Virtual workspace handlers now use a shared helper to write openhcs_metadata.json so all metadata fields stay consistent.
The helper records the workspace mapping plus handler/parser names and optional grid/pixel metadata if available.
Orchestrator metadata extraction respects the handler’s chosen primary backend (e.g., virtual workspace for ImageXpress/Opera), instead of forcing disk.
Materialization Compatibility:
Virtual workspace works seamlessly with materialization backends:
Read backend: virtual_workspace (for original nested data)
Write backend: disk or zarr (for flattened outputs)
This enables processing pipelines to read from nested structures while writing to flattened outputs without any physical workspace preparation.
Performance:
Initialization: ~100ms to build mapping for 1000 files (vs ~10s for symlink creation)
Runtime overhead: Negligible (simple dictionary lookup per file access)
Disk space: Zero overhead (no symlinks or copies)
Supported Microscope Formats:
ImageXpress: TimePoint and ZStep folder flattening
Opera Phenix: Nested field/plane folder flattening
See Microscope Handler Integration for microscope-specific implementation details.
Path Virtualization
VFS provides a unified path interface where the same logical path works across all backends:
Unified Path: /pipeline/step1/output/processed_images
Memory Backend: Stores in-memory using the same path as key
Disk Backend: Maps to physical file using the same path structure
Zarr Backend: Creates zarr store using the same path structure
The Design Philosophy: Paths are identical across all backends - the VFS handles the backend-specific storage implementation transparently. This means you can switch storage strategies without changing any processing code.
Storage Backend Types
Memory Backend
Purpose: Fast intermediate data storage for processing pipelines
When to Use: Temporary arrays and tensors between pipeline steps where speed is critical and persistence isn’t required.
Characteristics:
Fast access (direct object access)
Limited by available RAM
Volatile (lost on process exit)
Supports any Python object
Real-World Usage: In image processing pipelines, intermediate results like filtered images or segmentation masks are stored in the memory backend for speed.
Materialization Integration: When steps need to save additional outputs (like analysis results), the memory backend serves as the staging area before materialization to persistent storage.
Disk Backend
Purpose: Persistent data storage with standard file format support
When to Use: Input images, final outputs, checkpoints, and any data that needs to survive process restarts.
Characteristics:
Persistent across runs
Slower than memory but faster than network storage
Capacity limited only by available disk space
Supports standard file formats (TIFF, PNG, NPY, etc.)
Real-World Usage: Original microscopy images are loaded from the disk backend, and final analysis results are saved back to disk for long-term storage.
Zarr Backend
Purpose: Chunked array storage with OME-ZARR support for large datasets
When to Use: Large multidimensional arrays, compressed storage, and datasets that need to be accessed from multiple tools.
Characteristics:
Efficient for large arrays
Supports compression (ZSTD, LZ4) with significant size reduction
Cloud storage compatible
OME-ZARR metadata support for interoperability
Parallel access for multi-threaded processing
Real-World Usage: Final processed datasets from high-content screening experiments are stored in ZARR format for sharing and analysis.
Memory Type System
The Memory Type System addresses the “computational backend fragmentation” problem in scientific Python: different libraries use different array types, leading to conversion overhead and compatibility issues.
The Problem: Modern scientific computing involves multiple specialized libraries. NumPy provides the foundation, but PyTorch is used for deep learning, CuPy for GPU acceleration, pyclesperanto for image processing, and JAX for high-performance computing. Each library has its own array type, and converting between them can be error-prone and slow.
The Solution: A unified memory type system that handles conversions automatically while maintaining type safety and performance. The system knows how to convert between supported array types and can optimize conversions to minimize data copying.
Supported Memory Types
| Memory Type | Library | GPU Support | Use Cases | Image Serialization |
|---|---|---|---|---|
| | NumPy | No | CPU processing, I/O operations | |
| | CuPy | Yes | GPU-accelerated NumPy-like operations | |
| | PyTorch | Yes | Deep learning, neural networks | |
| | TensorFlow | Yes | Machine learning, TensorFlow models | |
| | JAX | Yes | High-performance computing, research | |
| | pyclesperanto | Yes | GPU-accelerated image processing | |
Important Note: Regardless of the memory type used during processing, all image data is converted to NumPy arrays for serialization. The disk backend saves images as standard TIFF files, while the zarr backend saves them as compressed zarr chunks. The memory type only affects computational processing, not storage format.
Design Principle: Each memory type is optimized for specific use cases, but the conversion system ensures they can all work together seamlessly.
Automatic Type Conversion
The system implements intelligent conversion strategies that minimize performance overhead. When functions require specific memory types, the system automatically converts between them while preserving data integrity.
Conversion Strategy: The memory type system detects the required input type for each function and converts the data accordingly. After processing, the result maintains the target memory type for subsequent operations.
Example Conversion Flow (conceptual illustration):
# Conceptual example - not actual OpenHCS API
# Step 1: Load TIFF → process with CuPy function
# System converts numpy → cupy automatically
# Step 2: Process with PyTorch function
# System converts cupy → torch automatically
# Step 3: Process with NumPy function → save to disk
# System converts torch → numpy automatically
Real OpenHCS Pipeline Example:
# Actual OpenHCS FunctionStep API
pipeline = [
    FunctionStep(func="gaussian_filter", sigma=2.0),  # CuPy function
    FunctionStep(func="threshold_otsu"),  # scikit-image function
    FunctionStep(func="binary_opening", footprint=disk(3))  # CuPy function
]
Conversion Optimization: The system uses zero-copy transfers where possible (like CuPy ↔ PyTorch via DLPack) and reduces CPU-GPU transfers by keeping data on the GPU when consecutive operations support it.
Memory Type Declaration System
OpenHCS functions declare their memory interface using decorators that enable automatic type conversion and GPU memory management. This system enforces explicit memory type contracts while providing automatic optimization.
Basic Memory Type Decorators:
from openhcs.core.memory.decorators import numpy, cupy, torch, jax, pyclesperanto

@numpy
def process_cpu(image_stack):
    """CPU processing with NumPy arrays."""
    import numpy as np
    return np.median(image_stack, axis=0, keepdims=True)

@cupy
def process_gpu_cupy(image_stack):
    """GPU processing with CuPy arrays."""
    import cupy as cp
    return cp.median(image_stack, axis=0, keepdims=True)

@torch(oom_recovery=True)
def process_gpu_torch(image_stack):
    """GPU processing with PyTorch tensors and automatic OOM recovery."""
    import torch
    return torch.median(image_stack, dim=0, keepdim=True)[0]

@pyclesperanto(oom_recovery=True)
def process_gpu_opencl(image_stack):
    """GPU processing with pyclesperanto OpenCL arrays."""
    import pyclesperanto_prototype as cle
    return cle.median_sphere(image_stack, radius_x=1, radius_y=1, radius_z=0)
Advanced Memory Type Specification:
from openhcs.core.memory.decorators import memory_types, torch

# Mixed input/output types
@memory_types(input_type="numpy", output_type="torch")
def neural_network_inference(image_stack):
    """Convert NumPy input to PyTorch for GPU inference."""
    import torch
    # Function receives a NumPy array and returns a PyTorch tensor
    model = torch.load('model.pt')
    return model(torch.from_numpy(image_stack))

# Explicit type specification with custom settings
@torch(input_type="torch", output_type="torch", oom_recovery=True)
def memory_intensive_operation(image_stack):
    """GPU operation with automatic OOM recovery."""
    import torch  # local import; at module level the name refers to the decorator
    # Automatic GPU memory management and thread-local CUDA streams
    # kernel: convolution weights, assumed to be defined elsewhere
    return torch.nn.functional.conv3d(image_stack, kernel)
Automatic Features:
Thread-Local CUDA Streams: Each thread gets persistent CUDA streams for true parallelization
OOM Recovery: Automatic out-of-memory recovery with CPU fallback
Device Management: Automatic GPU device placement and management
Type Validation: Runtime validation of input/output memory types
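One way to picture how a decorator can carry a memory type contract is to attach the declared types to the function as attributes that a pipeline planner can later inspect. The sketch below makes that assumption; `toy_memory_types` and `needs_conversion` are hypothetical and do not reproduce the OpenHCS decorators, which also handle streams, OOM recovery, and device placement.

```python
import functools
from typing import Any, Callable


def toy_memory_types(input_type: str, output_type: str) -> Callable:
    """Hypothetical decorator factory: tags a function with its memory contract."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            return func(*args, **kwargs)

        # A pipeline planner could read these attributes to schedule conversions.
        wrapper.input_memory_type = input_type
        wrapper.output_memory_type = output_type
        return wrapper

    return decorator


@toy_memory_types(input_type="numpy", output_type="torch")
def inference(image_stack):
    return image_stack


@toy_memory_types(input_type="torch", output_type="torch")
def postprocess(image_stack):
    return image_stack


def needs_conversion(producer: Callable, consumer: Callable) -> bool:
    """True when the producer's output type differs from the consumer's input type."""
    return producer.output_memory_type != consumer.input_memory_type
```

Because the contract is plain metadata, a planner can decide where conversions are needed before any data is loaded.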
Stack/Unstack Operations
OpenHCS enforces a 3D array discipline to prevent shape-related bugs that are common in image processing pipelines:
The Problem: Scientific image processing often involves mixing 2D and 3D operations, leading to shape mismatches and silent failures. A function expecting a 3D stack might receive a 2D image, or vice versa.
The Solution: All functions must operate on 3D arrays of shape [Z, Y, X], even for single 2D images. The stack/unstack system handles conversions between 2D image lists and 3D arrays while maintaining type safety.
Stack/Unstack API (conceptual - actual implementation may vary):
# Conceptual example of stack/unstack operations
# Convert list of 2D images to 3D array with specified memory type
stack_3d = stack_slices(
    slices=[img1_2d, img2_2d, img3_2d],  # List of 2D arrays (any memory type)
    memory_type="torch",                 # Target memory type
    gpu_id=0                             # GPU device ID
)
# Returns: torch.Tensor of shape [3, Y, X] on GPU 0
# Convert 3D array back to list of 2D slices
slices_2d = unstack_slices(
    array=stack_3d,                      # 3D array (any memory type)
    memory_type="numpy",                 # Target memory type for output slices
    gpu_id=0                             # GPU device ID
)
Validation Benefits: This approach catches shape errors at the boundary between 2D and 3D operations, preventing silent failures that could corrupt scientific results.
Real-World Usage: In practice, the OpenHCS pipeline automatically handles stacking and unstacking as images flow between processing steps, ensuring consistent 3D array format throughout the pipeline.
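The boundary validation can be illustrated without any array library. The sketch below uses nested Python lists as a stand-in for real arrays, which is an assumption made purely to keep the example dependency-free; `shape_2d`, `toy_stack_slices`, and `toy_unstack_slices` are hypothetical names and ignore memory types and GPU placement.

```python
from typing import List, Sequence, Tuple

Image2D = List[List[float]]
Stack3D = List[Image2D]


def shape_2d(image: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Return (Y, X) for a rectangular nested list, rejecting ragged rows."""
    if not image or not all(isinstance(row, (list, tuple)) for row in image):
        raise ValueError("expected a non-empty 2D image (list of rows)")
    width = len(image[0])
    if any(len(row) != width for row in image):
        raise ValueError("rows have inconsistent lengths")
    return (len(image), width)


def toy_stack_slices(slices: Sequence[Image2D]) -> Stack3D:
    """Combine 2D slices into a [Z, Y, X] stack, checking that shapes agree."""
    if not slices:
        raise ValueError("cannot stack an empty list of slices")
    shapes = {shape_2d(s) for s in slices}
    if len(shapes) != 1:
        raise ValueError(f"slices have mismatched shapes: {sorted(shapes)}")
    return [[list(row) for row in s] for s in slices]


def toy_unstack_slices(stack: Stack3D) -> List[Image2D]:
    """Split a [Z, Y, X] stack back into a list of 2D slices."""
    toy_stack_slices(stack)  # reuse the same validation
    return [[list(row) for row in plane] for plane in stack]
```

Mismatched or non-2D inputs raise at the stacking boundary instead of propagating into later steps.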
Memory Conversion System
OpenHCS implements a comprehensive memory conversion system that enables seamless transitions between different array libraries while maintaining GPU efficiency and data integrity.
Conversion Architecture:
# Zero-copy conversions (preferred)
def _cupy_to_torch_dlpack(data, device_id=None):
    """Convert CuPy to PyTorch using DLPack (zero-copy)."""
    import torch
    dlpack = data.toDlpack()
    return torch.from_dlpack(dlpack)

def _torch_to_cupy_dlpack(data, device_id=None):
    """Convert PyTorch to CuPy using DLPack (zero-copy)."""
    import cupy as cp
    dlpack = data.__dlpack__()
    return cp.fromDlpack(dlpack)

# CUDA Array Interface conversions
def _cupy_to_pyclesperanto_cuda(data, device_id=None):
    """Convert CuPy to pyclesperanto using CUDA Array Interface."""
    import pyclesperanto_prototype as cle
    cle.select_device(device_id or 0)
    return cle.asarray(data)  # Uses __cuda_array_interface__
Conversion Strategy Hierarchy:
Zero-Copy GPU-to-GPU: DLPack, CUDA Array Interface (preferred)
CPU Roundtrip: Fallback when direct GPU conversion fails
Error Handling: Detailed error reporting with conversion context
class MemoryConversionError(Exception):
    """Raised when memory type conversion fails."""

    def __init__(self, source_type: str, target_type: str, method: str, reason: str):
        self.source_type = source_type
        self.target_type = target_type
        self.method = method
        self.reason = reason
        super().__init__(f"Failed to convert {source_type} → {target_type} via {method}: {reason}")
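The strategy hierarchy (try the zero-copy path first, fall back to a CPU roundtrip, report context on failure) can be sketched with ordinary callables. Everything here is hypothetical: `ToyConversionError` stands in for MemoryConversionError so the example is self-contained, and the two converter functions only simulate GPU behaviour with plain Python objects.

```python
from typing import Any, Callable, Sequence, Tuple


class ToyConversionError(Exception):
    """Stand-in for MemoryConversionError so the sketch is self-contained."""


Strategy = Tuple[str, Callable[[Any], Any]]


def convert_with_fallback(data: Any, source: str, target: str,
                          strategies: Sequence[Strategy]) -> Tuple[Any, str]:
    """Try each (name, converter) in order; return the result and the method used."""
    failures = []
    for name, converter in strategies:
        try:
            return converter(data), name
        except Exception as exc:  # record the reason and try the next strategy
            failures.append(f"{name}: {exc}")
    raise ToyConversionError(
        f"Failed to convert {source} → {target}; tried " + "; ".join(failures)
    )


def failing_zero_copy(data: Any) -> Any:
    """Simulates a direct GPU-to-GPU path that is unavailable."""
    raise RuntimeError("DLPack not supported")


def cpu_roundtrip(data: Any) -> Any:
    """Simulates copying through host memory."""
    return list(data)


result, method = convert_with_fallback(
    (1, 2, 3), "cupy", "torch",
    [("dlpack", failing_zero_copy), ("cpu_roundtrip", cpu_roundtrip)],
)
```

Returning the method name alongside the result makes it easy to log when a pipeline silently fell back to the slower path.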
GPU Memory Management: The conversion system integrates with OpenHCS GPU cleanup utilities:
def cleanup_gpu_memory_by_framework(memory_type: str, device_id: Optional[int] = None):
    """Clean up GPU memory based on OpenHCS memory type."""
    if memory_type == "torch":
        cleanup_pytorch_gpu(device_id)
    elif memory_type == "cupy":
        cleanup_cupy_gpu(device_id)
    elif memory_type == "pyclesperanto":
        cleanup_pyclesperanto_gpu(device_id)
    # ... other frameworks

def cleanup_all_gpu_frameworks(device_id: Optional[int] = None):
    """Comprehensive GPU cleanup for all frameworks."""
    cleanup_pytorch_gpu(device_id)
    cleanup_cupy_gpu(device_id)
    cleanup_tensorflow_gpu(device_id)
    cleanup_jax_gpu(device_id)
    cleanup_pyclesperanto_gpu(device_id)
Conversion Performance: The system prioritizes GPU-to-GPU transfers and minimizes CPU roundtrips, achieving near-zero overhead for compatible memory types.
Materialization System
The materialization system bridges the gap between computational processing and persistent storage. It handles the conversion of function side effects (analysis results, metadata, derived data) from memory backend staging to persistent storage in appropriate formats.
The Problem: Scientific image processing functions often produce valuable side effects beyond the main image output - cell counts, position coordinates, analysis metrics, segmentation masks. These need to be saved in formats that researchers can use with standard analysis tools (CSV, JSON, TIFF), but the computational functions work with Python objects in memory.
The Solution: A materialization system that automatically converts function side effects to appropriate file formats and saves them using the storage backend system. This provides a clean separation between computational logic and storage concerns.
Special Output Decoration
Functions declare their side effects using the @special_outputs decorator, which can optionally specify materialization specs (resolved via format writers) for converting data to persistent formats.
Basic Special Outputs (memory backend only):
from openhcs.core.pipeline.function_contracts import special_outputs, special_inputs

@special_outputs("positions", "metadata")
def generate_positions(image_stack):
    """Function that produces special outputs stored in memory."""
    positions = calculate_positions(image_stack)
    metadata = extract_metadata(image_stack)
    # Return: (main_output, special_output_1, special_output_2, ...)
    return processed_image, positions, metadata
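The return convention (main output first, then one value per declared special output) can be sketched as follows. `toy_special_outputs` and `split_outputs` are hypothetical helpers written for illustration; they assume the decorator simply records the declared names, which may differ from how OpenHCS stores them.

```python
from typing import Any, Callable, Dict, Tuple


def toy_special_outputs(*names: str) -> Callable:
    """Hypothetical decorator: records the names of a function's extra outputs."""

    def decorator(func: Callable) -> Callable:
        func.special_output_names = names
        return func

    return decorator


def split_outputs(func: Callable, result: Tuple[Any, ...]) -> Tuple[Any, Dict[str, Any]]:
    """Separate the main output from named special outputs."""
    names = getattr(func, "special_output_names", ())
    if len(result) != 1 + len(names):
        raise ValueError(f"expected {1 + len(names)} return values, got {len(result)}")
    main, *extras = result
    return main, dict(zip(names, extras))


@toy_special_outputs("positions", "metadata")
def toy_generate_positions(image_stack):
    return image_stack, [(0, 0), (0, 1)], {"n_tiles": 2}


main, special = split_outputs(toy_generate_positions, toy_generate_positions("stack"))
```

Checking the tuple length against the declaration catches a common mistake: adding a special output to the decorator but forgetting to return it.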
With Materialization Specs (memory + persistent storage):
from openhcs.processing.materialization import MaterializationSpec, CsvOptions, ROIOptions

@special_outputs(("cell_counts", MaterializationSpec(CsvOptions(filename_suffix="_details.csv"))),
                 ("masks", MaterializationSpec(ROIOptions())))
def count_cells_with_materialization(image_stack):
    """Function with materialized special outputs."""
    processed_image, cell_counts, segmentation_masks = analyze_cells(image_stack)
    # cell_counts and masks are automatically materialized to disk
    return processed_image, cell_counts, segmentation_masks
Mixed Declaration (some materialized, some memory-only):
from openhcs.processing.materialization import MaterializationSpec, CsvOptions

@special_outputs("debug_info", ("analysis_results", MaterializationSpec(CsvOptions(filename_suffix=".csv"))))
def analyze_with_mixed_outputs(image_stack):
    """Function with both memory-only and materialized outputs."""
    # debug_info stays in memory, analysis_results gets materialized
    return processed_image, debug_info, analysis_results
Note: Writer dispatch is automatically inferred from options type. No need to specify handler strings.
The fields parameter is optional and defaults to None (auto-extract all fields).
Only specify fields when you need to control column ordering or select a subset.
Writer-Based Materialization
Materialization is implemented as a small set of format writers (CSV/JSON/ROI ZIP/TIFF/TEXT). Call sites never register custom materializers.
Instead, functions declare what they want persisted by providing writer options to MaterializationSpec.
from openhcs.processing.materialization import MaterializationSpec, CsvOptions, JsonOptions

# JSON summary + CSV details, JSON is primary
spec = MaterializationSpec(
    JsonOptions(source="summary", filename_suffix=".json"),
    CsvOptions(source="details", filename_suffix="_details.csv"),
    primary=0,
)
The framework handles:
path normalization (including compound suffix stripping)
directory creation and overwrite/delete semantics
multi-backend iteration
unique_x = sorted(df['x_position_um'].unique())
unique_y = sorted(df['y_position_um'].unique())
df['grid_dimensions'] = f"{len(unique_y)}x{len(unique_x)}"
df['algorithm'] = 'ashlar_cpu'
# Save using FileManager
csv_content = df.to_csv(index=False)
filemanager.save(csv_content, csv_path, "disk")
return csv_path
Configuration Integration
The materialization system integrates with the OpenHCS configuration hierarchy to control when and how materialization occurs. Configuration settings determine materialization behavior at multiple levels.
The key configuration objects are:
VFS Configuration: Controls backend selection (memory for intermediate, disk/zarr for materialization)
Step Materialization Configuration: Controls per-step materialization behavior and directory naming
Configuration Resolution: Follows the standard OpenHCS hierarchy (step → pipeline → global)
For complete configuration details and examples, see Configuration Framework.
Architectural Pattern: The configuration system provides declarative control over materialization behavior without requiring code changes. The same materialization handler can save to different backends based purely on configuration settings.
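Hierarchical resolution of this kind is often implemented as a first-match lookup from the most specific level to the most general. The sketch below assumes that reading of "step → pipeline → global" and additionally assumes that a value of `None` means "not set at this level"; `resolve_setting` is a hypothetical helper, not the OpenHCS configuration API.

```python
from typing import Any, Mapping

_MISSING = object()


def resolve_setting(key: str, step: Mapping[str, Any], pipeline: Mapping[str, Any],
                    global_config: Mapping[str, Any]) -> Any:
    """Return the first value found, checking step, then pipeline, then global."""
    for level in (step, pipeline, global_config):
        value = level.get(key, _MISSING)
        if value is not _MISSING and value is not None:
            return value
    raise KeyError(key)
```

With a global default of `"disk"` and a pipeline-level override of `"zarr"`, a step that sets nothing resolves to `"zarr"`.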
Execution Flow
The materialization system operates in two phases during pipeline execution: memory staging and persistent materialization.
Phase 1: Memory Staging (automatic):
Function Execution: Function runs and produces main output + special outputs
Memory Storage: Special outputs are automatically saved to memory backend using VFS paths
Path Planning: Compiler creates VFS paths for special outputs during compilation
Cross-Step Access: Other steps can load special outputs from memory backend using @special_inputs
Phase 2: Persistent Materialization (conditional):
Materialization Check: System checks if special output has associated materialization spec
Data Loading: Loads special output data from memory backend
Format Conversion: Materialization handler converts data to appropriate file format
Backend Storage: Saves converted data using configured materialization backend (disk/zarr)
Path Return: Returns path to materialized file for logging/reference
Execution Example:
# During pipeline execution:
# 1. Function executes
@special_outputs(("cell_counts", MaterializationSpec(CsvOptions(filename_suffix=".csv"))))
def count_cells(image_stack):
    return processed_image, cell_count_results

# 2. Automatic memory staging
# - processed_image → memory backend (standard pipeline flow)
# - cell_count_results → memory backend at VFS path "/memory/step_output/cell_counts.pkl"
# 3. Materialization execution (if materialization spec exists)
# - Load cell_count_results from memory backend
# - Dispatch to the format writer inferred from the spec's options type (CSV here)
# - Save the resulting file(s) to disk backend
# - Log materialization completion
Configuration-Driven Behavior: The materialization backend (disk vs zarr) is determined by the VFSConfig.materialization_backend setting, allowing the same materialization handler to save to different storage formats based on configuration.
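The two phases can be sketched with dictionaries standing in for the memory and persistent backends. This is an illustration only: `rows_to_csv` and `run_two_phase` are hypothetical, and a real materialization backend would write files rather than store strings.

```python
import csv
import io
from typing import Any, Callable, Dict, List

Writer = Callable[[Any], str]


def rows_to_csv(rows: List[Dict[str, Any]]) -> str:
    """Hypothetical CSV writer: serialises a list of dict rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def run_two_phase(special_outputs: Dict[str, Any],
                  writers: Dict[str, Writer],
                  memory: Dict[str, Any],
                  persistent: Dict[str, str]) -> List[str]:
    """Stage every special output in memory, then materialize those with a writer."""
    materialized = []
    for name, value in special_outputs.items():
        memory[name] = value                  # Phase 1: always staged in memory
        writer = writers.get(name)
        if writer is not None:                # Phase 2: only when a writer exists
            persistent[name] = writer(memory[name])
            materialized.append(name)
    return materialized
```

Outputs without a writer stay available to later steps through the memory store but never reach persistent storage.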
System Integration Patterns
The storage and memory systems work together to provide seamless data flow through complex processing pipelines.
The Integration Challenge: How do you coordinate storage decisions (where to put data) with memory type decisions (what format to use) without creating tight coupling between the systems?
The Solution: The systems are designed as orthogonal layers that can be combined independently. Storage backends handle persistence and location, while memory types handle computational format and device placement.
VFS + Memory Type Coordination
During pipeline execution, the systems coordinate to optimize data flow. The VFS manages storage locations while the memory type system handles computational formats.
Standard Pipeline Flow: OpenHCS uses a consistent pattern where intermediate results are stored in the memory backend for speed, while final results are materialized to either disk or zarr backend based on user preference.
Real OpenHCS Pipeline Example:
# Actual OpenHCS FunctionStep API
pipeline = [
    # Step 1: Load from disk → process → store in memory
    FunctionStep(func="gaussian_filter", sigma=2.0),
    # Step 2: Load from memory → GPU processing → store in memory
    FunctionStep(func="binary_opening", footprint=disk(3)),
    # Step 3: Load from memory → process → materialize to disk/zarr
    FunctionStep(func="label", connectivity=2)
]
Backend Usage Pattern:
Input: Always loaded from disk backend (original TIFF files)
Intermediate: Always stored in memory backend for speed
Output: Materialized to disk backend (.tiff files) or zarr backend (compressed chunks)
Special Outputs: Staged in memory backend, optionally materialized to persistent storage
Coordination Benefits: The VFS handles where data lives, the memory type system handles what format it’s in, and the materialization system handles conversion to persistent formats. The integration layer coordinates between all three to minimize unnecessary conversions and data movement.
FileManager Advanced Operations
The FileManager provides comprehensive file and directory operations beyond basic save/load functionality, with automatic backend selection and natural sorting integration.
Batch Operations:
# Batch loading for performance
file_paths = ["/memory/image_001.pkl", "/memory/image_002.pkl", "/memory/image_003.pkl"]
batch_data = filemanager.load_batch(file_paths, Backend.MEMORY.value)
# Batch saving with backend-specific optimizations
image_arrays = [array1, array2, array3]
output_paths = ["/zarr/output_001", "/zarr/output_002", "/zarr/output_003"]
filemanager.save_batch(image_arrays, output_paths, Backend.ZARR.value,
                       chunk_name="well_A01", zarr_config=zarr_config)
Directory Operations:
# List files with natural sorting (handles numeric sequences correctly)
image_files = filemanager.list_image_files("/disk/plate/", Backend.DISK.value,
                                           extensions=['.tif', '.tiff'],
                                           recursive=True)
# Returns: ['image_001.tif', 'image_002.tif', ..., 'image_010.tif'] (not lexicographic)
# Directory listing with metadata
entries = filemanager.list_dir("/memory/step_outputs/", Backend.MEMORY.value)
# Directory mirroring with symlinks
filemanager.mirror_directory("/disk/source/", "/disk/target/", Backend.DISK.value,
                             overwrite_symlinks_only=True)
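Natural sorting matters most when file names are not zero-padded. A common way to implement it, shown here as a sketch rather than the FileManager's actual code, is to split each name into text and integer chunks and sort on that key; `natural_key` is a hypothetical helper.

```python
import re
from typing import List, Union


def natural_key(name: str) -> List[Union[int, str]]:
    """Split a name into text and integer chunks so numbers compare numerically."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]


files = ["image_10.tif", "image_2.tif", "image_1.tif"]
ordered = sorted(files, key=natural_key)
lexicographic = sorted(files)
```

Here `ordered` places `image_2.tif` before `image_10.tif`, whereas plain `sorted` puts `image_10.tif` second.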
Advanced File Operations:
# Atomic file operations
filemanager.move("/disk/temp/file.tif", "/disk/final/file.tif", Backend.DISK.value,
                 replace_symlinks=False)
# Directory creation with backend validation
filemanager.ensure_directory("/zarr/new_experiment/", Backend.ZARR.value)
# File existence checking across backends
exists = filemanager.exists("/memory/intermediate_result.pkl", Backend.MEMORY.value)
Storage Registry System: The FileManager uses a metaclass-based auto-registration system for backend management:
# Backends are automatically registered when their classes are defined
from typing import Dict
from polystore.backend_registry import StorageBackendMeta, STORAGE_BACKENDS, create_storage_registry
from polystore.base import DataSink
# Example backend with automatic registration
# (StorageBackend is the backend base class and Backend the backend enum; their imports are not shown)
class DiskStorageBackend(StorageBackend, metaclass=StorageBackendMeta):
    """Disk storage backend with automatic metaclass registration."""
    # Backend type from enum for registration
    _backend_type = Backend.DISK.value
    def __init__(self):
        # Backend initialization
        pass
# The metaclass automatically registers the backend in STORAGE_BACKENDS
# STORAGE_BACKENDS: Dict[str, Type[DataSink]] = {}  # Populated by metaclass
# create_storage_registry (imported above) builds the registry from all discovered backends:
def create_storage_registry() -> Dict[str, DataSink]:
    """
    Create storage registry with all registered backends.
    Returns:
        Dictionary mapping backend types to instances
    """
    # Ensure all backends are discovered
    discover_all_backends()
    registry = {}
    for backend_type in STORAGE_BACKENDS:
        registry[backend_type] = get_backend_instance(backend_type)
    return registry
# Global singleton storage registry
from polystore.base import storage_registry # Created at module import
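The registration mechanism described above can be illustrated with a minimal, self-contained sketch. It does not use polystore: `DemoBackendMeta`, `DemoBackend`, `DEMO_BACKENDS`, and `create_demo_registry` are hypothetical stand-ins chosen for this example, and the real classes may differ in detail.

```python
from typing import Any, Dict, Optional

# Hypothetical stand-in for STORAGE_BACKENDS; filled in by the metaclass below.
DEMO_BACKENDS: Dict[str, type] = {}


class DemoBackendMeta(type):
    """Register every class that declares its own _backend_type."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        backend_type = namespace.get("_backend_type")
        if backend_type is not None:
            if backend_type in DEMO_BACKENDS:
                raise ValueError(f"duplicate backend type: {backend_type!r}")
            DEMO_BACKENDS[backend_type] = cls


class DemoBackend(metaclass=DemoBackendMeta):
    """Base class; not registered because its _backend_type is None."""

    _backend_type: Optional[str] = None

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def save(self, data: Any, path: str) -> None:
        self._store[path] = data

    def load(self, path: str) -> Any:
        return self._store[path]


class DemoMemoryBackend(DemoBackend):
    _backend_type = "memory"


class DemoDiskBackend(DemoBackend):
    _backend_type = "disk"


def create_demo_registry() -> Dict[str, DemoBackend]:
    """Instantiate one backend per registered type."""
    return {name: cls() for name, cls in DEMO_BACKENDS.items()}


registry = create_demo_registry()
registry["memory"].save([1, 2, 3], "/memory/a")
```

Defining a subclass is the only step needed to make a backend available, which is why import order matters: a backend whose module was never imported never reaches the registry.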
Lazy Backend Discovery: The backend discovery system uses lazy imports to avoid loading GPU-heavy backends during startup:
# openhcs/io/__init__.py
import importlib
def __getattr__(name: str):
    """Lazy import for GPU-heavy backends."""
    if name in ['zarr', 'napari_stream', 'fiji_stream']:
        # Import the submodule only on first attribute access;
        # discover_all_backends() imports the same modules explicitly
        return importlib.import_module(f"{__name__}.{name}")
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
# Backend registration happens during module import via metaclass
import importlib
def discover_all_backends() -> None:
    """Discover and register all available backends.
    Uses importlib.import_module() to directly import GPU-heavy backend modules,
    ensuring metaclass registration happens even with lazy imports.
    """
    # Import GPU-heavy backends directly from their modules (not via __getattr__)
    # This ensures the module is imported and metaclass registration happens
    importlib.import_module('polystore.zarr')
    importlib.import_module('polystore.napari_stream')
    importlib.import_module('polystore.fiji_stream')
Key Design Points:
Lazy Loading: GPU-heavy backends (zarr, napari_stream, fiji_stream) are not imported during polystore module load
Explicit Discovery: discover_all_backends() uses importlib.import_module() to trigger module imports
Metaclass Registration: Backend classes register themselves via StorageBackendMeta during module import
Startup Performance: Defers GPU library imports (ome-zarr, napari, fiji) until first use
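The lazy-loading point above relies on a module-level `__getattr__` (PEP 562), which Python calls only when normal attribute lookup on the module fails. The following sketch shows the idea in isolation. Standard-library modules stand in for the GPU-heavy backends, and `demo_pkg`, `_LAZY_MODULES`, and `loaded` are hypothetical names used only here.

```python
import importlib
import types

# Hypothetical mapping from lazy attribute name to the module that provides it.
# Standard-library modules stand in for heavy backends so the sketch runs anywhere.
_LAZY_MODULES = {"heavy_json": "json", "heavy_csv": "csv"}

loaded = []  # records which lazy names were actually resolved


def _lazy_getattr(name: str):
    """Module-level __getattr__: invoked only when normal lookup fails."""
    if name in _LAZY_MODULES:
        module = importlib.import_module(_LAZY_MODULES[name])
        loaded.append(name)
        # Cache on the package so later accesses bypass __getattr__.
        setattr(demo_pkg, name, module)
        return module
    raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}")


# Build a throwaway module object and attach the hook to it.
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__getattr__ = _lazy_getattr

json_mod = demo_pkg.heavy_json   # first access triggers the import
json_again = demo_pkg.heavy_json  # second access hits the cached attribute
```

Only the names that are actually touched get imported, so a process that never asks for `heavy_csv` never pays for it.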
Subprocess Runner Mode: The backend discovery system supports a special subprocess runner mode that avoids importing GPU-heavy backends:
import os
# Check if we're in subprocess runner mode
if os.getenv('OPENHCS_SUBPROCESS_NO_GPU') == '1':
    # Subprocess runner mode - only import essential backends
    from polystore import disk, memory
    # Skips: zarr, napari_stream, fiji_stream (avoid GPU library imports)
else:
    # Normal mode - discover all backends (including GPU-heavy ones)
    from polystore.backend_registry import discover_all_backends
    discover_all_backends()
This prevents GPU library imports in subprocess workers, which is critical for multiprocessing execution where worker processes don’t need GPU access.
Natural Sorting Integration: All file listing operations use natural sorting to handle numeric sequences correctly, preventing issues with lexicographic ordering of scientific image sequences.
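To make the difference from lexicographic ordering concrete, here is a minimal natural-sort key built on the standard library. It is a sketch of the general technique, not the FileManager implementation, and `natural_sort_key` is a hypothetical name.

```python
import re
from typing import List, Union

_DIGITS = re.compile(r"(\d+)")


def natural_sort_key(name: str) -> List[Union[int, str]]:
    """Split a name into text and number runs so numbers compare by value."""
    parts = _DIGITS.split(name)
    # With a capturing group, re.split alternates text (even index)
    # and digit runs (odd index), so compared positions always share a type.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


files = ["image_10.tif", "image_2.tif", "image_1.tif"]

lexicographic = sorted(files)
natural = sorted(files, key=natural_sort_key)
# lexicographic: ['image_1.tif', 'image_10.tif', 'image_2.tif']
# natural:       ['image_1.tif', 'image_2.tif', 'image_10.tif']
```

The distinction matters most for filenames without zero padding; zero-padded sequences such as image_001 to image_010 already sort correctly either way.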
Performance Optimization Strategies
The integrated system enables various optimization strategies based on the standard OpenHCS backend usage pattern:
Standard Backend Strategy:
- Memory Backend: Always used for intermediate results during pipeline execution
- Materialization Choice: Disk vs Zarr backend for final outputs based on use case
Backend Selection Criteria:
Disk Backend (Standard TIFF):
- Easy to use with standard image analysis tools
- Compatible with ImageJ, napari, Fiji out-of-the-box
- Familiar format for researchers
- Best for smaller datasets and standard workflows
Zarr Backend (Compressed Chunks):
- Cutting-edge format with significant compression benefits
- Requires custom plugins for viewing (napari-zarr, Fiji plugins)
- Better for large datasets and advanced users
- OME-ZARR compliance for interoperability
Additional Optimizations:
- GPU Memory Management: Keep data on GPU across multiple processing steps to avoid CPU-GPU transfer overhead
- Lazy Loading: Load data only when needed and in the target memory type
- Conversion Minimization: Plan conversion paths to minimize the number of format changes
Real-World Usage Patterns
Standard Scientific Workflow:
- Input images loaded from disk backend (standard TIFF files)
- All intermediate processing uses memory backend for speed
- Final results materialized to disk or zarr backend based on requirements
Disk Backend Use Cases:
- Standard research workflows
- Compatibility with existing tools (ImageJ, Fiji, napari)
- Sharing results with collaborators using standard tools
- Smaller datasets where compression isn’t critical
Zarr Backend Use Cases:
- Large-scale experiments requiring compression
- Advanced users comfortable with cutting-edge formats
- Workflows requiring OME-ZARR compliance
- Long-term archival with compression benefits
Zarr Chunking Strategies
OpenHCS provides two chunking strategies for zarr storage, configurable via the ZarrChunkStrategy enum:
WELL Mode (Default):
zarr_config = ZarrConfig(
    chunk_strategy=ZarrChunkStrategy.WELL,
    compressor=ZarrCompressor.ZSTD,
    compression_level=1
)
Chunk shape: (fields, channels, z, y, x) - entire 5D array in one chunk
Performance: 40x improvement for batch operations
Use case: Loading entire wells or many files from same well
Example: 9 fields × 2 channels × 5 z-planes = 1 chunk (~180MB compressed)
Benefits: Optimal for sequential processing, minimal I/O overhead
FILE Mode:
zarr_config = ZarrConfig(
    chunk_strategy=ZarrChunkStrategy.FILE,
    compressor=ZarrCompressor.ZSTD,
    compression_level=1
)
Chunk shape: (1, 1, 1, y, x) - each original file is a separate chunk
Performance: Better for random access to individual files
Use case: Sparse access patterns, individual file retrieval
Example: 9 fields × 2 channels × 5 z-planes = 90 chunks (~2MB each)
Benefits: Lower memory footprint, granular access
Choosing a Strategy:
Use WELL mode (default) for:
- Processing entire wells sequentially
- Batch operations on multiple images
- Maximum I/O performance
- Standard high-content screening workflows
Use FILE mode for:
- Random access to individual images
- Memory-constrained environments
- Sparse sampling across wells
- Interactive exploration workflows
OME-ZARR Structure:
Both strategies maintain OME-ZARR HCS compliance with the same 5D array structure:
/plate_openhcs/images/
├── .zgroup # Root group metadata
├── .zattrs # Plate-level OME-ZARR metadata
├── A/ # Row A
│ ├── 01/ # Column 01 (well A01)
│ │ ├── .zgroup
│ │ └── 0/ # Field group (5D array)
│ │     ├── .zarray # Array metadata
│ │     ├── .zattrs # Well metadata + filename mapping
│ │     └── 0.0.0.0.0 # Chunk file(s)
The difference is internal: WELL mode creates one large chunk file, FILE mode creates many small chunk files.
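The chunk counts quoted for the two modes follow directly from the chunk shapes. The sketch below reproduces that arithmetic for the 9 fields × 2 channels × 5 z-planes example and generates chunk keys in the dot-separated style shown in the tree above. `ChunkStrategy`, `chunk_shape`, `chunk_count`, and `chunk_keys` are hypothetical helpers for illustration, and the 2048 × 2048 image size is an assumption.

```python
import itertools
import math
from enum import Enum
from typing import List, Tuple

Shape = Tuple[int, int, int, int, int]  # (fields, channels, z, y, x)


class ChunkStrategy(Enum):
    """Hypothetical stand-in for ZarrChunkStrategy."""
    WELL = "well"
    FILE = "file"


def chunk_shape(array_shape: Shape, strategy: ChunkStrategy) -> Shape:
    """WELL keeps the whole 5D array in one chunk; FILE uses one chunk per 2D plane."""
    if strategy is ChunkStrategy.WELL:
        return array_shape
    _, _, _, y, x = array_shape
    return (1, 1, 1, y, x)


def chunk_count(array_shape: Shape, chunks: Shape) -> int:
    """Number of chunks is the product of per-axis chunk counts."""
    return math.prod(math.ceil(n / c) for n, c in zip(array_shape, chunks))


def chunk_keys(array_shape: Shape, chunks: Shape) -> List[str]:
    """Dot-separated chunk indices, e.g. '0.0.0.0.0'."""
    per_axis = [range(math.ceil(n / c)) for n, c in zip(array_shape, chunks)]
    return [".".join(map(str, idx)) for idx in itertools.product(*per_axis)]


well_shape: Shape = (9, 2, 5, 2048, 2048)  # image size is an assumed value

well_chunks = chunk_shape(well_shape, ChunkStrategy.WELL)
file_chunks = chunk_shape(well_shape, ChunkStrategy.FILE)

n_well = chunk_count(well_shape, well_chunks)  # 1
n_file = chunk_count(well_shape, file_chunks)  # 90
file_keys = chunk_keys(well_shape, file_chunks)
```

In FILE mode the first three indices of each key identify the field, channel, and z-plane, which is what makes retrieving a single image cheap.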
Performance Benefits:
- Automatic format handling between memory types during processing
- Optimized GPU memory management across pipeline steps
- Coordinated storage decisions based on data size and use case
- Consistent performance regardless of memory type used for computation
- Configurable chunking for optimal I/O patterns
Benefits and Design Principles
System Benefits:
Location Transparency: Same code works with any storage backend
Type Safety: Automatic conversion with validation prevents silent failures
Performance Optimization: Zero-copy GPU transfers and intelligent conversion strategies
Scalability: Handles datasets ranging from megabyte-scale test images to large experimental plates with the same code
Interoperability: Works with all major scientific Python libraries (NumPy, PyTorch, CuPy, JAX, TensorFlow, pyclesperanto)
Fail-Loud Philosophy: Errors surface immediately rather than corrupting data
Automatic Materialization: Function side effects are automatically converted to appropriate file formats
Configuration-Driven Storage: Backend selection controlled by configuration hierarchy
GPU Memory Management: Automatic cleanup and optimization across all GPU frameworks
Thread-Safe Operations: Thread-local CUDA streams and device management
Natural Sorting: Correct handling of numeric sequences in scientific datasets
Batch Operations: Optimized bulk operations for high-throughput processing
Design Principles:
Orthogonal Concerns: Storage, memory type, and materialization decisions are independent
Zero-Copy Optimization: Prioritize GPU-to-GPU transfers using DLPack and CUDA Array Interface
Explicit Device Management: GPU placement is explicit and validated with automatic cleanup
Immutable Data Flow: Data transformations create new objects rather than modifying existing ones
Scientific Reproducibility: All operations are deterministic and traceable
Declarative Materialization: Functions declare their side effects, system handles storage automatically
Configuration Hierarchy: Storage behavior follows the standard OpenHCS configuration resolution
Thread-Local Isolation: Each thread maintains independent GPU contexts and memory management
Fail-Fast Validation: Memory type contracts are enforced at runtime with detailed error reporting
Natural Data Ordering: File operations respect scientific naming conventions and numeric sequences
Backend Abstraction: Unified API across memory, disk, and compressed storage backends
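The thread-local isolation principle listed above can be sketched with `threading.local` from the standard library. A plain dictionary stands in for a per-thread GPU context such as a CUDA stream or device handle. `get_context` and `worker` are hypothetical names, and this is an illustration of the pattern, not the OpenHCS implementation.

```python
import threading
from typing import Dict, List

_local = threading.local()  # each thread sees its own attributes on this object


def get_context() -> Dict[str, object]:
    """Return the calling thread's private context, creating it on first use."""
    ctx = getattr(_local, "ctx", None)
    if ctx is None:
        ctx = {"thread": threading.current_thread().name, "allocations": []}
        _local.ctx = ctx
    return ctx


def worker(n: int, results: Dict[str, List[int]], lock: threading.Lock) -> None:
    """Record an allocation in this thread's context and report what it sees."""
    ctx = get_context()
    ctx["allocations"].append(n)
    with lock:
        results[ctx["thread"]] = list(ctx["allocations"])


results: Dict[str, List[int]] = {}
lock = threading.Lock()
threads = [
    threading.Thread(target=worker, args=(i, results, lock), name=f"worker-{i}")
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

main_allocations = get_context()["allocations"]  # untouched by the workers
```

Each worker sees only its own allocation, and the main thread's context stays empty, so no locking is needed around the per-thread state itself.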
See Also
Configuration Framework - Configuration hierarchy that controls storage behavior
Pipeline Compilation System Architecture - How storage and memory decisions are made during compilation
The Function Pattern System - How functions declare memory type requirements
Special I/O System: Cross-Step Communication and Dict Pattern Integration - Special input/output system that uses materialization
GPU Resource Management System - GPU device management and allocation
Memory Type System and Stack Utils - Memory type system and conversions
API Reference - API reference (autogenerated from source code)
Archived Documentation
The following documents were consolidated into this unified architecture document:
memory_backend_system.rst - VFS backends and storage registry
vfs_system.rst - VFS architecture and backend abstraction
memory_type_system.rst - Memory type conversion and GPU management
The materialization system content was integrated from existing documentation:
Special I/O System: Cross-Step Communication and Dict Pattern Integration - Cross-step communication patterns using materialization
Various function implementation files - Real materialization handler examples
Additional content was integrated from:
openhcs/core/memory/decorators.py - Memory type declaration system
openhcs/core/memory/conversion_functions.py - Memory conversion implementation
openhcs/core/memory/gpu_cleanup.py - GPU memory management
openhcs/io/filemanager.py - Advanced FileManager operations
openhcs/io/base.py - Storage registry and backend architecture
These archived documents are available in docs/source/architecture/archive/ for reference.