Building Intuition

Understanding when and how to use different OpenHCS features requires developing mental models for common patterns and use cases. This section provides practical guidance for building effective analysis workflows.

Mental Models for OpenHCS

Pipeline as Assembly Line

Think of a pipeline as an assembly line where data flows through processing stations:

Raw Images → [Normalize] → [Filter] → [Segment] → [Analyze] → Results
                 ↓           ↓          ↓          ↓
              Station 1   Station 2  Station 3  Station 4

Key insights:
- Each station (step) does one specific job
- Data flows automatically between stations
- Multiple items (wells/sites) processed in parallel
- Quality control can happen at any station
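The assembly-line picture can be mimicked in a few lines of plain Python. This is only an analogy, not OpenHCS code: `run_assembly_line` and the toy stations are hypothetical names, and each "image" is just a list of numbers.

```python
from typing import Callable, Iterable, List

Image = List[float]
Station = Callable[[Image], Image]


def normalize(image: Image) -> Image:
    """Scale values by the maximum (toy stand-in for a normalization step)."""
    peak = max(image) or 1.0  # avoid dividing by zero on an all-zero image
    return [value / peak for value in image]


def threshold(image: Image) -> Image:
    """Mark values above 0.5 as foreground (toy stand-in for segmentation)."""
    return [1.0 if value > 0.5 else 0.0 for value in image]


def run_assembly_line(items: Iterable[Image], stations: List[Station]) -> List[Image]:
    """Send every item through every station, in order."""
    results = []
    for item in items:
        for station in stations:
            item = station(item)  # output of one station feeds the next
        results.append(item)
    return results


sites = [[0.0, 2.0, 4.0], [1.0, 1.0, 10.0]]
print(run_assembly_line(sites, [normalize, threshold]))
# [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
```

Each item is handled independently of the others, which is what makes it possible to process wells or sites in parallel.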

Steps as Specialized Workers

Each FunctionStep is like a specialized worker that knows how to process specific types of data:

# Worker that specializes in channel-specific analysis
channel_specialist = FunctionStep(
    func={
        '1': analyze_nuclei,     # Knows how to handle DAPI
        '2': analyze_neurites    # Knows how to handle GFP
    },
    group_by=GroupBy.CHANNEL
)

Key insights:
- Workers have specific skills (function patterns)
- Workers know what data they can handle (variable_components)
- Complex jobs can be broken down into specialized workers
- Workers can collaborate (function chains)
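Conceptually, a dictionary pattern routes each group of data to the worker registered under its key. A rough plain-Python analogy follows. `dispatch_by_channel` and both toy analyses are hypothetical, and the real routing inside OpenHCS may differ in detail.

```python
from typing import Callable, Dict, List


def count_bright(pixels: List[int]) -> int:
    """Toy 'nuclei' analysis: count pixels above a fixed threshold."""
    return sum(1 for p in pixels if p > 100)


def total_signal(pixels: List[int]) -> int:
    """Toy 'neurite' analysis: sum all pixel values."""
    return sum(pixels)


def dispatch_by_channel(
    images: Dict[str, List[int]],
    workers: Dict[str, Callable[[List[int]], int]],
) -> Dict[str, int]:
    """Apply to each channel the worker registered under the same key."""
    return {channel: workers[channel](pixels) for channel, pixels in images.items()}


workers = {"1": count_bright, "2": total_signal}
images = {"1": [50, 150, 200], "2": [10, 20, 30]}
print(dispatch_by_channel(images, workers))
# {'1': 2, '2': 60}
```

A channel without a registered worker raises a `KeyError` in this sketch, which makes a mismatched key visible immediately.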

VFS as Smart Storage Manager

The Virtual File System acts like a smart storage manager that automatically decides where to put data:

Processing: Memory (fast access)
     ↓
Intermediate: Memory (temporary)
     ↓
Final Results: Disk/Zarr (persistent)

Key insights:
- Fast storage for active work (memory)
- Persistent storage for important results (disk/zarr)
- Automatic optimization based on usage patterns
- Transparent to analysis code
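The "transparent to analysis code" idea can be illustrated with a toy storage class. This is a sketch of the concept only: `ToyStorage` is a hypothetical name, it uses JSON files instead of Zarr, and it does not reflect how the OpenHCS VFS is implemented.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


class ToyStorage:
    """Keeps intermediates in a dict and writes final results to disk as JSON."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.memory: Dict[str, Any] = {}

    def save(self, key: str, data: Any, final: bool = False) -> None:
        if final:
            (self.root / f"{key}.json").write_text(json.dumps(data))
        else:
            self.memory[key] = data

    def load(self, key: str) -> Any:
        # The caller uses the same call wherever the data lives.
        if key in self.memory:
            return self.memory[key]
        return json.loads((self.root / f"{key}.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    storage = ToyStorage(Path(tmp))
    storage.save("filtered", [1, 2, 3])                 # intermediate: memory
    storage.save("summary", {"cells": 42}, final=True)  # final: disk
    on_disk = sorted(p.name for p in Path(tmp).iterdir())
    loaded = storage.load("summary")

print(on_disk, loaded)
```

Only the final result reaches the disk, while the calling code never has to know which location was used.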

Common Usage Patterns

Site-by-Site Image Processing

The most common pattern for standard image analysis:

# Process each imaging site independently
pipeline = Pipeline([
    FunctionStep(
        func=stack_percentile_normalize,
        variable_components=[VariableComponents.SITE],
        name="normalize"
    ),
    FunctionStep(
        func=(gaussian_filter, {'sigma': 2.0}),  # arguments travel with the function
        variable_components=[VariableComponents.SITE],
        name="filter"
    ),
    FunctionStep(
        func=segment_cells,
        variable_components=[VariableComponents.SITE],
        name="segment"
    )
])

When to use: Standard image processing where each site is analyzed independently.

Mental model: Each imaging position gets the same treatment, processed in parallel.

Multi-Channel Analysis Workflows

Different analysis for different fluorescent markers:

# Channel-specific analysis after common preprocessing
pipeline = Pipeline([
    # Common preprocessing for all channels
    FunctionStep(
        func=stack_percentile_normalize,
        variable_components=[VariableComponents.SITE],
        name="normalize"
    ),

    # Channel-specific analysis
    FunctionStep(
        func={
            '1': count_cells_single_channel,      # DAPI → nuclei count
            '2': skan_axon_skeletonize_and_analyze # GFP → neurite analysis
        },
        group_by=GroupBy.CHANNEL,
        variable_components=[VariableComponents.SITE],
        name="analyze"
    )
])

When to use: Multi-marker experiments where each channel represents different biological features.

Mental model: Common preparation followed by specialized analysis based on what each channel shows.

Multi-Channel Processing Workflows

Different preprocessing and analysis for each fluorescent marker:

# Different preprocessing for different channels
pipeline = Pipeline([
    FunctionStep(
        func={
            '1': [  # DAPI channel
                (gaussian_filter, {'sigma': 1.0}),
                (tophat, {'selem_radius': 25})
            ],
            '2': [  # GFP channel
                (gaussian_filter, {'sigma': 1.5}),
                (enhance_contrast, {'percentile_range': (2, 98)}),
                (tophat, {'selem_radius': 30})
            ]
        },
        group_by=GroupBy.CHANNEL,
        variable_components=[VariableComponents.SITE],
        name="channel_preprocessing"
    ),

    # Channel-specific analysis
    FunctionStep(
        func={
            '1': (count_nuclei, {}),      # DAPI analysis
            '2': (trace_neurites, {})     # GFP analysis
        },
        group_by=GroupBy.CHANNEL,
        variable_components=[VariableComponents.SITE],
        name="analyze"
    )
])

When to use: Multi-marker experiments where each channel requires different processing and analysis.

Mental model: Channel-specific preprocessing and analysis pipelines that run in parallel.

Memory-to-Disk Materialization

Keep processing fast while saving important results:

pipeline = Pipeline([
    # Fast processing in memory
    FunctionStep(func=preprocess, name="preprocess"),
    FunctionStep(func=filter_images, name="filter"),

    # Save important intermediate results
    FunctionStep(
        func=segment_cells,
        name="segment",
        force_disk_output=True  # Save segmentation for inspection
    ),

    # Continue processing in memory
    FunctionStep(func=measure_features, name="measure"),

    # Final results automatically saved to configured backend
    FunctionStep(func=generate_summary, name="summary")
])

When to use: Long pipelines where you want to checkpoint important intermediate results.

Mental model: Fast processing with strategic checkpoints for important results.

Decision Trees for Common Scenarios

Choosing Variable Components

Do you need to process individual images?
├─ Yes → variable_components=[SITE, CHANNEL]
└─ No → Do you need channel-specific processing?
        ├─ Yes → variable_components=[SITE] + dictionary pattern
        └─ No → Do you need to combine across sites?
                ├─ Yes → variable_components=[CHANNEL]
                └─ No → variable_components=[SITE]

Choosing Function Patterns

Do different data types need different processing?
├─ Yes → Dictionary pattern with group_by
└─ No → Do you need multiple sequential operations?
        ├─ Yes → Function chain pattern
        └─ No → Single function pattern
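The three patterns can be told apart by the shape of the value passed as `func`. The helper below is a hypothetical illustration of that idea (it is not part of OpenHCS); it assumes, based on the examples in this guide, that a `(function, kwargs)` tuple counts as a single function with arguments.

```python
from typing import Any


def classify_pattern(func: Any) -> str:
    """Name the pattern that a `func` value corresponds to in this guide."""
    if isinstance(func, dict):
        return "dictionary"  # key -> function, used together with group_by
    if isinstance(func, list):
        return "chain"       # functions applied one after another
    if callable(func) or isinstance(func, tuple):
        return "single"      # bare function or (function, kwargs) tuple
    raise TypeError(f"unrecognized pattern: {func!r}")


print(classify_pattern(abs))                     # single
print(classify_pattern((round, {"ndigits": 1}))) # single
print(classify_pattern([abs, round]))            # chain
print(classify_pattern({"1": abs, "2": [abs, round]}))  # dictionary
```

Only the outermost shape is inspected, so a dictionary whose values are chains is still reported as a dictionary pattern.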

Choosing Storage Strategy

How large is your dataset?
├─ Small (<10GB) → Memory backend for speed
├─ Medium (10-100GB) → Mixed strategy (memory + disk checkpoints)
└─ Large (>100GB) → Zarr backend with compression
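The same rule of thumb written as a small function. `choose_storage` is a hypothetical helper and the returned labels are plain strings, not OpenHCS configuration values; the sketch assumes that sizes of exactly 10 GB and 100 GB fall into the medium band.

```python
def choose_storage(dataset_gb: float) -> str:
    """Map a dataset size in GB to the strategy suggested by the tree above."""
    if dataset_gb < 0:
        raise ValueError("dataset size cannot be negative")
    if dataset_gb < 10:
        return "memory"            # small: keep everything in memory
    if dataset_gb <= 100:
        return "memory + disk"     # medium: memory with disk checkpoints
    return "zarr (compressed)"     # large: compressed Zarr backend


for size in (2, 50, 500):
    print(size, "GB ->", choose_storage(size))
```

The thresholds are guidelines; available RAM on the machine is what ultimately decides which band a dataset belongs to.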

Performance Optimization Patterns

GPU Memory Management

# Efficient GPU processing pattern
pipeline = Pipeline([
    # Group GPU operations together
    FunctionStep(
        func=[
            gaussian_filter,    # CuPy
            tophat,            # CuPy
            threshold_otsu     # CuPy
        ],
        name="gpu_preprocessing"
    ),

    # CPU analysis (automatic memory conversion)
    FunctionStep(
        func=count_cells_single_channel,  # NumPy
        name="cpu_analysis"
    )
])

Pattern: Group operations by memory type to minimize conversions.
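To see why grouping helps, count how often the memory type changes between consecutive operations. The helper below is a hypothetical back-of-the-envelope check in plain Python; the memory-type labels are just strings chosen for illustration.

```python
from typing import List


def count_conversions(memory_types: List[str]) -> int:
    """Count how often consecutive operations use different memory types."""
    return sum(
        1
        for current, following in zip(memory_types, memory_types[1:])
        if current != following
    )


grouped = ["cupy", "cupy", "cupy", "numpy"]       # GPU work first, then CPU
interleaved = ["cupy", "numpy", "cupy", "numpy"]  # same mix, alternating

print(count_conversions(grouped))      # 1
print(count_conversions(interleaved))  # 3
```

Both orderings contain the same number of GPU and CPU operations, but the grouped one crosses the GPU/CPU boundary once instead of three times.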

Parallel Processing Optimization

# Maximize parallelization
step = FunctionStep(
    func=expensive_analysis,
    variable_components=[VariableComponents.SITE],  # More parallel groups
    name="parallel_analysis"
)

Pattern: Use fine-grained variable components for CPU-intensive operations to maximize parallel processing.
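The effect of having more independent groups can be sketched with the standard library. This is an analogy rather than a description of OpenHCS scheduling: `analyze_sites` is a hypothetical helper, and `expensive_analysis` here is a toy function standing in for real work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def expensive_analysis(pixels: List[int]) -> int:
    """Toy stand-in for a CPU-heavy measurement."""
    return sum(p * p for p in pixels)


def analyze_sites(sites: Dict[str, List[int]], max_workers: int = 4) -> Dict[str, int]:
    """Run the analysis once per site; each site is an independent unit of work."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the inputs.
        results = pool.map(expensive_analysis, sites.values())
        return dict(zip(sites.keys(), results))


sites = {"s1": [1, 2], "s2": [3], "s3": []}
print(analyze_sites(sites))
# {'s1': 5, 's2': 9, 's3': 0}
```

Threads keep the sketch short. For CPU-bound pure-Python functions, `ProcessPoolExecutor` is the usual choice because threads share the interpreter lock.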

Memory Usage Optimization

# Manage memory usage in large datasets
pipeline = Pipeline([
    FunctionStep(func=large_preprocessing, name="preprocess"),

    # Free memory by saving to disk
    FunctionStep(
        func=memory_intensive_analysis,
        name="analysis",
        force_disk_output=True
    ),

    # Continue with freed memory
    FunctionStep(func=final_processing, name="final")
])

Pattern: Use strategic disk output to manage memory usage in long pipelines.

Troubleshooting Common Issues

“Out of Memory” Errors

Symptoms: GPU or CPU out of memory errors during processing.

Solutions:
- Use force_disk_output=True for large intermediate results
- Process fewer sites simultaneously (adjust variable_components)
- Switch to CPU backend for memory-intensive operations
- Use Zarr backend with compression for large datasets
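A quick size estimate often shows whether a stack can fit in memory at all. The arithmetic below is generic (images × height × width × bytes per pixel); the helper names and the example plate layout are assumptions made for illustration.

```python
def estimate_stack_bytes(n_images: int, height: int, width: int, bytes_per_pixel: int) -> int:
    """Raw size of an uncompressed image stack in bytes."""
    return n_images * height * width * bytes_per_pixel


def to_gib(n_bytes: int) -> float:
    """Convert bytes to gibibytes."""
    return n_bytes / 1024**3


# Example: 9 sites x 3 channels of 2048 x 2048 pixels, 16-bit (2 bytes per pixel)
raw = estimate_stack_bytes(9 * 3, 2048, 2048, 2)
as_float32 = estimate_stack_bytes(9 * 3, 2048, 2048, 4)

print(f"uint16 stack:  {to_gib(raw):.2f} GiB")
print(f"float32 copy:  {to_gib(as_float32):.2f} GiB")
```

Intermediate results are frequently stored as floating-point arrays, so the working footprint of a step can be a multiple of the raw input size.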

Slow Processing

Symptoms: Processing takes much longer than expected.

Solutions:
- Use GPU backends (CuPy, PyTorch, pyclesperanto) for large images
- Group operations by memory type to minimize conversions
- Use appropriate variable_components for parallelization
- Check storage backend performance (SSD vs HDD)

Incorrect Results

Symptoms: Analysis produces unexpected or inconsistent results.

Solutions:
- Check variable_components match your analysis intent
- Verify group_by parameter for dictionary patterns
- Use force_disk_output=True to inspect intermediate results
- Test with small datasets first

Building Effective Workflows

Start Simple

Begin with basic patterns and add complexity gradually:

1. Single function steps with site-by-site processing
2. Add function chains for sequential operations
3. Introduce dictionary patterns for multi-channel analysis
4. Optimize storage and memory for performance

Iterate and Refine

Use OpenHCS features to iteratively improve workflows:

- Add checkpoints with force_disk_output for debugging
- Optimize memory usage by adjusting variable_components
- Improve performance by grouping operations by backend
- Add condition-specific processing as experiments become more complex

Test at Scale

Validate workflows with realistic datasets:

- Test with full-size images to identify memory issues
- Process multiple wells to verify parallel execution
- Use representative data to catch edge cases
- Monitor resource usage to optimize performance
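For monitoring, the Python standard library already offers a starting point. The sketch below wraps any callable and reports wall-clock time and peak traced memory; `measure` is a hypothetical helper, and the workload is a toy computation rather than an OpenHCS step.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def measure(step: Callable[[], Any]) -> Tuple[Any, float, int]:
    """Run `step` and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = step()
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


result, elapsed, peak = measure(lambda: sum([i * i for i in range(100_000)]))
print(f"result={result}, time={elapsed:.4f}s, peak={peak / 1024:.0f} KiB")
```

`tracemalloc` only sees allocations made through Python's memory allocator, so GPU memory has to be checked with the tooling of the respective GPU library.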

These patterns and mental models provide a foundation for building effective OpenHCS workflows that scale from simple image processing to complex multi-dimensional analysis pipelines.