Building Intuition

Understanding when and how to use different OpenHCS features requires developing mental models for common patterns and use cases. This section provides practical guidance for building effective analysis workflows.

Mental Models for OpenHCS

Pipeline as Assembly Line

Think of a pipeline as an assembly line where data flows through processing stations:

Raw Images → [Normalize] → [Filter] → [Segment] → [Analyze] → Results
                 ↓           ↓          ↓          ↓
              Station 1   Station 2  Station 3  Station 4

Key insights:

  • Each station (step) does one specific job

  • Data flows automatically between stations

  • Multiple items (wells/sites) processed in parallel

  • Quality control can happen at any station
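The assembly-line idea can be sketched in plain Python. This is a toy illustration only: run_assembly_line and the three station functions are hypothetical names invented here, not part of OpenHCS, and images are stood in for by short lists of pixel values.

```python
def run_assembly_line(items, stations):
    """Pass every item through each station in order and collect the results."""
    results = []
    for item in items:
        for station in stations:
            item = station(item)  # output of one station feeds the next
        results.append(item)
    return results


# Hypothetical stations: each one does exactly one job.
def normalize(pixels):
    peak = max(pixels)
    return [p / peak for p in pixels]


def threshold(pixels):
    return [1 if p >= 0.5 else 0 for p in pixels]


def count_foreground(mask):
    return sum(mask)


# Two "sites", each a tiny list of pixel intensities.
sites = [[10, 40, 80, 100], [5, 5, 50, 10]]
counts = run_assembly_line(sites, [normalize, threshold, count_foreground])
print(counts)  # [2, 1]
```

The sketch processes the sites one after another for clarity; the point is only that each station sees the output of the previous one and nothing else.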

Steps as Specialized Workers

Each FunctionStep is like a specialized worker that knows how to process specific types of data:

# Worker that specializes in channel-specific analysis
channel_specialist = FunctionStep(
    func={
        '1': analyze_nuclei,     # Knows how to handle DAPI
        '2': analyze_neurites    # Knows how to handle GFP
    },
    group_by=GroupBy.CHANNEL
)

Key insights:

  • Workers have specific skills (function patterns)

  • Workers know what data they can handle (variable_components)

  • Complex jobs can be broken down into specialized workers

  • Workers can collaborate (function chains)
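One way to picture what a dictionary pattern does is a dispatch table keyed by channel. The sketch below is a simplified stand-in written for this guide, not the OpenHCS implementation: apply_by_group and the record layout are assumptions, and the two analysis functions only report how many images they received.

```python
from collections import defaultdict


def apply_by_group(records, funcs_by_key, key):
    """Group records by `key`, then run the function registered for each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    # Groups without a registered function are skipped in this sketch.
    return {k: funcs_by_key[k](g) for k, g in groups.items() if k in funcs_by_key}


# Hypothetical workers; real ones would analyze pixel data.
def analyze_nuclei(group):
    return {"kind": "nuclei", "n_images": len(group)}


def analyze_neurites(group):
    return {"kind": "neurites", "n_images": len(group)}


records = [
    {"channel": "1", "site": "001"},
    {"channel": "2", "site": "001"},
    {"channel": "1", "site": "002"},
]
results = apply_by_group(
    records, {"1": analyze_nuclei, "2": analyze_neurites}, key="channel"
)
print(results)
```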

VFS as Smart Storage Manager

The Virtual File System acts like a smart storage manager that automatically decides where to put data:

Processing: Memory (fast access)
     ↓
Intermediate: Memory (temporary)
     ↓
Final Results: Disk/Zarr (persistent)

Key insights:

  • Fast storage for active work (memory)

  • Persistent storage for important results (disk/zarr)

  • Automatic optimization based on usage patterns

  • Transparent to analysis code
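The two-tier idea can be illustrated with a very small storage class. This is a toy model under stated assumptions: TinyStorage, its save/load methods, and the JSON file layout are invented for the example and do not reflect the actual VFS API.

```python
import json
import tempfile
from pathlib import Path


class TinyStorage:
    """Toy stand-in for a storage manager with a memory tier and a disk tier."""

    def __init__(self, root):
        self.root = Path(root)
        self.memory = {}

    def save(self, key, data, persistent=False):
        if persistent:
            # Important results go to disk so they outlive the process.
            (self.root / f"{key}.json").write_text(json.dumps(data))
        else:
            # Intermediates stay in memory for fast access.
            self.memory[key] = data

    def load(self, key):
        # Callers do not need to know which tier holds the data.
        if key in self.memory:
            return self.memory[key]
        return json.loads((self.root / f"{key}.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    storage = TinyStorage(tmp)
    storage.save("normalized", [0.1, 0.9])                   # intermediate
    storage.save("summary", {"cells": 2}, persistent=True)   # final result
    on_disk = sorted(p.name for p in Path(tmp).iterdir())
    loaded = storage.load("summary")

print(on_disk, loaded)
```

The load method is what makes the storage "transparent to analysis code": the caller asks for a key and never names a backend.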

Common Usage Patterns

Site-by-Site Image Processing

This is the most common pattern for standard image analysis:

# Process each imaging site independently
pipeline = Pipeline([
    FunctionStep(
        func=stack_percentile_normalize,
        variable_components=[VariableComponents.SITE],
        name="normalize"
    ),
    FunctionStep(
        # Parameters travel with the function as a (function, kwargs) tuple
        func=(gaussian_filter, {'sigma': 2.0}),
        variable_components=[VariableComponents.SITE],
        name="filter"
    ),
    FunctionStep(
        func=segment_cells,
        variable_components=[VariableComponents.SITE],
        name="segment"
    )
])

When to use: Standard image processing where each site is analyzed independently.

Mental model: Each imaging position gets the same treatment, processed in parallel.

Multi-Channel Analysis Workflows

Different analysis for different fluorescent markers:

# Channel-specific analysis after common preprocessing
pipeline = Pipeline([
    # Common preprocessing for all channels
    FunctionStep(
        func=stack_percentile_normalize,
        variable_components=[VariableComponents.SITE],
        name="normalize"
    ),

    # Channel-specific analysis
    FunctionStep(
        func={
            '1': count_cells_single_channel,      # DAPI → nuclei count
            '2': skan_axon_skeletonize_and_analyze # GFP → neurite analysis
        },
        group_by=GroupBy.CHANNEL,
        variable_components=[VariableComponents.SITE],
        name="analyze"
    )
])

When to use: Multi-marker experiments where each channel represents different biological features.

Mental model: Common preparation followed by specialized analysis based on what each channel shows.

Multi-Channel Processing Workflows

Different preprocessing chains, followed by different analysis, for each fluorescent marker:

# Different preprocessing for different channels
pipeline = Pipeline([
    FunctionStep(
        func={
            '1': [  # DAPI channel
                (gaussian_filter, {'sigma': 1.0}),
                (tophat, {'selem_radius': 25})
            ],
            '2': [  # GFP channel
                (gaussian_filter, {'sigma': 1.5}),
                (enhance_contrast, {'percentile_range': (2, 98)}),
                (tophat, {'selem_radius': 30})
            ]
        },
        group_by=GroupBy.CHANNEL,
        variable_components=[VariableComponents.SITE],
        name="channel_preprocessing"
    ),

    # Channel-specific analysis
    FunctionStep(
        func={
            '1': (count_nuclei, {}),      # DAPI analysis
            '2': (trace_neurites, {})     # GFP analysis
        },
        group_by=GroupBy.CHANNEL,
        variable_components=[VariableComponents.SITE],
        name="analyze"
    )
])

When to use: Multi-marker experiments where each channel requires different processing and analysis.

Mental model: Channel-specific preprocessing and analysis pipelines that run in parallel.
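The function patterns used above (a bare function, a (function, kwargs) tuple, and a list of either) can be read as a small recursive interpreter. The sketch below is one possible reading, not the OpenHCS implementation; apply_pattern, scale, and clip are hypothetical names, and images are again plain lists.

```python
def apply_pattern(pattern, data):
    """Apply a function, a (function, kwargs) tuple, or a list of either."""
    if isinstance(pattern, list):
        # A list is a chain: each element receives the previous output.
        for item in pattern:
            data = apply_pattern(item, data)
        return data
    if isinstance(pattern, tuple):
        func, kwargs = pattern
        return func(data, **kwargs)
    return pattern(data)


def scale(values, factor=1.0):
    return [v * factor for v in values]


def clip(values, upper=1.0):
    return [min(v, upper) for v in values]


# One chain per channel key, mirroring the dictionary pattern.
chains = {
    "1": [(scale, {"factor": 2.0}), (clip, {"upper": 5.0})],
    "2": [(scale, {"factor": 0.5}), clip],  # bare function uses its defaults
}
images = {"1": [1.0, 2.0, 4.0], "2": [1.0, 2.0, 4.0]}
processed = {ch: apply_pattern(chains[ch], images[ch]) for ch in images}
print(processed)
```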

Memory-to-Disk Materialization

Keep processing fast while saving important results:

pipeline = Pipeline([
    # Fast processing in memory
    FunctionStep(func=preprocess, name="preprocess"),
    FunctionStep(func=filter_images, name="filter"),

    # Save important intermediate results
    FunctionStep(
        func=segment_cells,
        name="segment",
        force_disk_output=True  # Save segmentation for inspection
    ),

    # Continue processing in memory
    FunctionStep(func=measure_features, name="measure"),

    # Final results automatically saved to configured backend
    FunctionStep(func=generate_summary, name="summary")
])

When to use: Long pipelines where you want to checkpoint important intermediate results.

Mental model: Fast processing with strategic checkpoints for important results.

Decision Trees for Common Scenarios

Choosing Variable Components

Do you need to process individual images?
├─ Yes → variable_components=[SITE, CHANNEL]
└─ No → Do you need channel-specific processing?
        ├─ Yes → variable_components=[SITE] + dictionary pattern
        └─ No → Do you need to combine across sites?
                ├─ Yes → variable_components=[CHANNEL]
                └─ No → variable_components=[SITE]
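The same decision tree, written as a function so the order of the questions is explicit. choose_variable_components is a hypothetical helper for illustration; it returns component names as strings plus a flag saying whether a dictionary pattern is also needed.

```python
def choose_variable_components(per_image, channel_specific, combine_across_sites):
    """Walk the decision tree top to bottom; the first 'yes' wins.

    Returns (components, needs_dictionary_pattern).
    """
    if per_image:
        return ["SITE", "CHANNEL"], False
    if channel_specific:
        return ["SITE"], True
    if combine_across_sites:
        return ["CHANNEL"], False
    return ["SITE"], False


# Channel-specific analysis without per-image processing:
components, needs_dict = choose_variable_components(
    per_image=False, channel_specific=True, combine_across_sites=False
)
print(components, needs_dict)  # ['SITE'] True
```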

Choosing Function Patterns

Do different data types need different processing?
├─ Yes → Dictionary pattern with group_by
└─ No → Do you need multiple sequential operations?
        ├─ Yes → Function chain pattern
        └─ No → Single function pattern

Choosing Storage Strategy

How large is your dataset?
├─ Small (<10 GB) → Memory backend for speed
├─ Medium (10-100 GB) → Mixed strategy (memory + disk checkpoints)
└─ Large (>100 GB) → Zarr backend with compression
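As a sketch, the size thresholds translate into a small helper. choose_storage_strategy and its return strings are made up for this example; the boundaries follow the tree above, with exactly 10 GB and exactly 100 GB both treated as "medium".

```python
def choose_storage_strategy(dataset_gb):
    """Map a dataset size in gigabytes to one of the three strategies."""
    if dataset_gb < 10:
        return "memory"
    if dataset_gb <= 100:
        return "memory + disk checkpoints"
    return "zarr with compression"


strategies = {size: choose_storage_strategy(size) for size in (2, 10, 100, 250)}
print(strategies)
```

These cut-offs are rules of thumb; the available RAM on the machine matters as much as the dataset size.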

Performance Optimization Patterns

GPU Memory Management

# Efficient GPU processing pattern
pipeline = Pipeline([
    # Group GPU operations together
    FunctionStep(
        func=[
            gaussian_filter,    # CuPy
            tophat,            # CuPy
            threshold_otsu     # CuPy
        ],
        name="gpu_preprocessing"
    ),

    # CPU analysis (automatic memory conversion)
    FunctionStep(
        func=count_cells_single_channel,  # NumPy
        name="cpu_analysis"
    )
])

Pattern: Group operations by memory type to minimize conversions.
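To see why grouping helps, count how often adjacent operations disagree on memory type. The helper below is a hypothetical illustration that treats every change of memory type as one conversion; it does not measure real transfer cost.

```python
def count_conversions(memory_types):
    """Count adjacent pairs of steps whose memory types differ."""
    return sum(1 for a, b in zip(memory_types, memory_types[1:]) if a != b)


# Same four operations, two different orderings.
interleaved = ["cupy", "numpy", "cupy", "numpy"]
grouped = ["cupy", "cupy", "cupy", "numpy"]

print(count_conversions(interleaved))  # 3
print(count_conversions(grouped))      # 1
```

The grouped ordering matches the pipeline above: three CuPy operations in a row, then a single hand-off to NumPy.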

Parallel Processing Optimization

# Maximize parallelization
step = FunctionStep(
    func=expensive_analysis,
    variable_components=[VariableComponents.SITE],  # More parallel groups
    name="parallel_analysis"
)

Pattern: Use fine-grained variable components for CPU-intensive operations to maximize parallel processing.
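A toy model of the trade-off, following the description in this section: each distinct value of the chosen components becomes one unit of work, so finer components yield more units to spread across workers. group_by_components and the record layout are assumptions made for the example, not OpenHCS internals, and threads are used only to keep the sketch short (CPU-bound work would normally call for processes).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_components(records, components):
    """Split records into groups keyed by the values of the given components."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[c] for c in components)].append(record)
    return dict(groups)


def expensive_analysis(group):
    # Placeholder for real work: sum a per-image value.
    return sum(r["value"] for r in group)


# Three sites times two channels gives six image records.
records = [
    {"site": s, "channel": c, "value": int(s) * 10 + int(c)}
    for s in ("1", "2", "3")
    for c in ("1", "2")
]

coarse = group_by_components(records, ["channel"])  # 2 units of work
fine = group_by_components(records, ["site"])       # 3 units of work

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves input order, so results line up with the group keys.
    totals = dict(zip(fine, pool.map(expensive_analysis, fine.values())))

print(len(coarse), len(fine), totals)
```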

Memory Usage Optimization

# Manage memory usage in large datasets
pipeline = Pipeline([
    FunctionStep(func=large_preprocessing, name="preprocess"),

    # Free memory by saving to disk
    FunctionStep(
        func=memory_intensive_analysis,
        name="analysis",
        force_disk_output=True
    ),

    # Continue with freed memory
    FunctionStep(func=final_processing, name="final")
])

Pattern: Use strategic disk output to manage memory usage in long pipelines.

Troubleshooting Common Issues

“Out of Memory” Errors

Symptoms: GPU or CPU out-of-memory errors during processing.

Solutions:

  • Use force_disk_output=True for large intermediate results

  • Process fewer sites simultaneously (adjust variable_components)

  • Switch to CPU backend for memory-intensive operations

  • Use Zarr backend with compression for large datasets

Slow Processing

Symptoms: Processing takes much longer than expected.

Solutions:

  • Use GPU backends (CuPy, PyTorch, pyclesperanto) for large images

  • Group operations by memory type to minimize conversions

  • Use appropriate variable_components for parallelization

  • Check storage backend performance (SSD vs HDD)

Incorrect Results

Symptoms: Analysis produces unexpected or inconsistent results.

Solutions:

  • Check variable_components match your analysis intent

  • Verify group_by parameter for dictionary patterns

  • Use force_disk_output=True to inspect intermediate results

  • Test with small datasets first

Building Effective Workflows

Start Simple

Begin with basic patterns and add complexity gradually:

  1. Single function steps with site-by-site processing

  2. Add function chains for sequential operations

  3. Introduce dictionary patterns for multi-channel analysis

  4. Optimize storage and memory for performance

Iterate and Refine

Use OpenHCS features to iteratively improve workflows:

  • Add checkpoints with force_disk_output for debugging

  • Optimize memory usage by adjusting variable_components

  • Improve performance by grouping operations by backend

  • Add condition-specific processing as experiments become more complex

Test at Scale

Validate workflows with realistic datasets:

  • Test with full-size images to identify memory issues

  • Process multiple wells to verify parallel execution

  • Use representative data to catch edge cases

  • Monitor resource usage to optimize performance

These patterns and mental models provide a foundation for building effective OpenHCS workflows that scale from simple image processing to complex multi-dimensional analysis pipelines.