Building Intuition
Understanding when and how to use different OpenHCS features requires developing mental models for common patterns and use cases. This section provides practical guidance for building effective analysis workflows.
Mental Models for OpenHCS
Pipeline as Assembly Line
Think of a pipeline as an assembly line where data flows through processing stations:
Raw Images → [Normalize] → [Filter] → [Segment] → [Analyze] → Results
                  ↓            ↓          ↓           ↓
              Station 1    Station 2  Station 3   Station 4
Key insights:
- Each station (step) does one specific job
- Data flows automatically between stations
- Multiple items (wells/sites) processed in parallel
- Quality control can happen at any station
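The assembly-line picture can be sketched in plain Python without OpenHCS itself. Everything here (`run_assembly_line` and the toy stations) is hypothetical and operates on lists of numbers rather than images; it only illustrates that every item passes through the same ordered stations, with each station's output feeding the next.

```python
from typing import Callable, Dict, List


# Hypothetical stand-ins for processing stations; real steps would work on image arrays.
def normalize(values: List[float]) -> List[float]:
    """Scale values to the 0..1 range."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant input
    return [(v - lo) / span for v in values]


def threshold(values: List[float]) -> List[int]:
    """Mark values at or above 0.5 as foreground (1)."""
    return [1 if v >= 0.5 else 0 for v in values]


def count_foreground(mask: List[int]) -> int:
    """Count foreground entries."""
    return sum(mask)


def run_assembly_line(items: Dict[str, List[float]], stations: List[Callable]) -> Dict[str, object]:
    """Send every item (for example a site) through the same ordered stations."""
    results = {}
    for key, data in items.items():
        for station in stations:
            data = station(data)  # output of one station feeds the next
        results[key] = data
    return results


sites = {"A01_s1": [10.0, 20.0, 30.0, 40.0], "A01_s2": [5.0, 5.0, 50.0, 5.0]}
counts = run_assembly_line(sites, [normalize, threshold, count_foreground])
print(counts)
```

In this toy version the items are processed one after another; the point is only the data flow, not the parallel execution.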
Steps as Specialized Workers
Each FunctionStep is like a specialized worker that knows how to process specific types of data:
# Worker that specializes in channel-specific analysis
channel_specialist = FunctionStep(
    func={
        '1': analyze_nuclei,   # Knows how to handle DAPI
        '2': analyze_neurites  # Knows how to handle GFP
    },
    group_by=GroupBy.CHANNEL
)
Key insights:
- Workers have specific skills (function patterns)
- Workers know what data they can handle (variable_components)
- Complex jobs can be broken down into specialized workers
- Workers can collaborate (function chains)
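To make the "specialized worker" idea concrete, here is a minimal sketch of how a dictionary of functions keyed by channel could be dispatched. It is not how OpenHCS implements `group_by`; `dispatch_by_channel`, the record layout, and the toy analysis functions are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical image record: (channel key, pixel values). Not an OpenHCS type.
Record = Tuple[str, List[int]]


def analyze_nuclei(pixels: List[int]) -> str:
    """Toy nuclei 'analysis': count bright pixels."""
    return f"nuclei:{sum(1 for p in pixels if p > 100)}"


def analyze_neurites(pixels: List[int]) -> str:
    """Toy neurite 'analysis': sum of intensities."""
    return f"neurite_length:{sum(pixels)}"


def dispatch_by_channel(records: List[Record],
                        func_by_channel: Dict[str, Callable]) -> Dict[str, List[str]]:
    """Group records by channel key, then apply the worker registered for that key."""
    grouped = defaultdict(list)
    for channel, pixels in records:
        grouped[channel].append(pixels)

    results = {}
    for channel, images in grouped.items():
        if channel not in func_by_channel:
            raise KeyError(f"no function registered for channel {channel!r}")
        worker = func_by_channel[channel]
        results[channel] = [worker(img) for img in images]
    return results


records = [("1", [50, 150, 200]), ("2", [1, 2, 3]), ("1", [120, 90])]
out = dispatch_by_channel(records, {"1": analyze_nuclei, "2": analyze_neurites})
print(out)
```

Failing loudly on an unregistered channel is a design choice of this sketch: a silent fallback would hide a mismatch between the dictionary keys and the data.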
VFS as Smart Storage Manager
The Virtual File System acts like a smart storage manager that automatically decides where to put data:
Processing: Memory (fast access)
↓
Intermediate: Memory (temporary)
↓
Final Results: Disk/Zarr (persistent)
Key insights:
- Fast storage for active work (memory)
- Persistent storage for important results (disk/zarr)
- Automatic optimization based on usage patterns
- Transparent to analysis code
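The "transparent to analysis code" point can be illustrated with a tiny two-backend store. This is a sketch under stated assumptions, not the OpenHCS VFS: `TinyStorage` is a hypothetical class that keeps intermediates in a dictionary and writes persistent results as JSON files, while callers use one `load` call for both.

```python
import json
import tempfile
from pathlib import Path


class TinyStorage:
    """Hypothetical two-backend store: a dict for memory, JSON files for disk."""

    def __init__(self, disk_root):
        self._memory = {}
        self._disk_root = Path(disk_root)

    def save(self, key, data, persistent=False):
        """Keep data in memory unless it is marked persistent."""
        if persistent:
            (self._disk_root / f"{key}.json").write_text(json.dumps(data))
        else:
            self._memory[key] = data

    def load(self, key):
        """Callers do not need to know which backend holds the data."""
        if key in self._memory:
            return self._memory[key]
        return json.loads((self._disk_root / f"{key}.json").read_text())

    def clear_memory(self):
        """Drop all in-memory intermediates."""
        self._memory.clear()


with tempfile.TemporaryDirectory() as tmp:
    store = TinyStorage(tmp)
    store.save("normalized", [0.1, 0.9])                       # intermediate, memory only
    store.save("final_counts", {"A01": 42}, persistent=True)   # result, written to disk
    store.clear_memory()                                       # simulate the end of a run
    survived = store.load("final_counts")
    intermediate_kept = (Path(tmp) / "normalized.json").exists()

print(survived, intermediate_kept)
```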
Common Usage Patterns
Site-by-Site Image Processing
Most common pattern for standard image analysis:
# Process each imaging site independently
pipeline = Pipeline([
    FunctionStep(
        func=stack_percentile_normalize,
        variable_components=[VariableComponents.SITE],
        name="normalize"
    ),
    FunctionStep(
        # Function parameters travel with the function as a (function, kwargs) tuple
        func=(gaussian_filter, {'sigma': 2.0}),
        variable_components=[VariableComponents.SITE],
        name="filter"
    ),
    FunctionStep(
        func=segment_cells,
        variable_components=[VariableComponents.SITE],
        name="segment"
    )
])
When to use: Standard image processing where each site is analyzed independently.
Mental model: Each imaging position gets the same treatment, processed in parallel.
Multi-Channel Analysis Workflows
Different analysis for different fluorescent markers:
# Channel-specific analysis after common preprocessing
pipeline = Pipeline([
    # Common preprocessing for all channels
    FunctionStep(
        func=stack_percentile_normalize,
        variable_components=[VariableComponents.SITE],
        name="normalize"
    ),
    # Channel-specific analysis
    FunctionStep(
        func={
            '1': count_cells_single_channel,         # DAPI → nuclei count
            '2': skan_axon_skeletonize_and_analyze   # GFP → neurite analysis
        },
        group_by=GroupBy.CHANNEL,
        variable_components=[VariableComponents.SITE],
        name="analyze"
    )
])
When to use: Multi-marker experiments where each channel represents different biological features.
Mental model: Common preparation followed by specialized analysis based on what each channel shows.
Multi-Channel Processing Workflows
Different processing for different fluorescent markers:
# Different preprocessing for different channels
pipeline = Pipeline([
    FunctionStep(
        func={
            '1': [  # DAPI channel
                (gaussian_filter, {'sigma': 1.0}),
                (tophat, {'selem_radius': 25})
            ],
            '2': [  # GFP channel
                (gaussian_filter, {'sigma': 1.5}),
                (enhance_contrast, {'percentile_range': (2, 98)}),
                (tophat, {'selem_radius': 30})
            ]
        },
        group_by=GroupBy.CHANNEL,
        variable_components=[VariableComponents.SITE],
        name="channel_preprocessing"
    ),
    # Channel-specific analysis
    FunctionStep(
        func={
            '1': (count_nuclei, {}),    # DAPI analysis
            '2': (trace_neurites, {})   # GFP analysis
        },
        group_by=GroupBy.CHANNEL,
        variable_components=[VariableComponents.SITE],
        name="analyze"
    )
])
When to use: Multi-marker experiments where each channel requires different processing and analysis.
Mental model: Channel-specific preprocessing and analysis pipelines that run in parallel.
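The per-channel lists of `(function, kwargs)` tuples used in this pattern can be read as "apply each stage in order, passing its keyword arguments". A minimal sketch of that reading follows; `apply_chain` and the toy `scale` and `clip` operations are hypothetical and stand in for real image filters.

```python
from typing import Any, Callable, Dict, List, Tuple, Union

# A stage is either a bare callable or a (callable, kwargs) tuple.
Stage = Union[Callable, Tuple[Callable, Dict[str, Any]]]


def apply_chain(data, chain: List[Stage]):
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in chain:
        func, kwargs = stage if isinstance(stage, tuple) else (stage, {})
        data = func(data, **kwargs)
    return data


# Hypothetical toy operations standing in for image filters.
def scale(values, factor=1.0):
    """Multiply every value by a factor."""
    return [v * factor for v in values]


def clip(values, upper=255):
    """Cap every value at an upper bound."""
    return [min(v, upper) for v in values]


# One chain per channel key, mirroring the dictionary-of-chains layout.
chains = {
    "1": [(scale, {"factor": 2.0}), (clip, {"upper": 100})],
    "2": [(scale, {"factor": 0.5}), clip],
}
processed = {channel: apply_chain([40, 80], chain) for channel, chain in chains.items()}
print(processed)
```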
Memory-to-Disk Materialization
Keep processing fast while saving important results:
pipeline = Pipeline([
    # Fast processing in memory
    FunctionStep(func=preprocess, name="preprocess"),
    FunctionStep(func=filter_images, name="filter"),
    # Save important intermediate results
    FunctionStep(
        func=segment_cells,
        name="segment",
        force_disk_output=True  # Save segmentation for inspection
    ),
    # Continue processing in memory
    FunctionStep(func=measure_features, name="measure"),
    # Final results automatically saved to configured backend
    FunctionStep(func=generate_summary, name="summary")
])
When to use: Long pipelines where you want to checkpoint important intermediate results.
Mental model: Fast processing with strategic checkpoints for important results.
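The checkpoint idea can be sketched as a runner that writes a step's output to disk only when the step asks for it. This is an illustration, not OpenHCS code: `ToyStep` and `run_with_checkpoints` are hypothetical, and JSON files stand in for whatever format a real backend would use.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class ToyStep:
    """Hypothetical stand-in for a pipeline step; not the OpenHCS FunctionStep."""
    func: Callable
    name: str
    force_disk_output: bool = False


def run_with_checkpoints(data, steps, checkpoint_dir):
    """Run steps in order and write flagged outputs to disk as JSON."""
    written = []
    for step in steps:
        data = step.func(data)
        if step.force_disk_output:
            path = Path(checkpoint_dir) / f"{step.name}.json"
            path.write_text(json.dumps(data))
            written.append(path.name)
    return data, written


steps = [
    ToyStep(lambda xs: [x - min(xs) for x in xs], "preprocess"),
    ToyStep(lambda xs: [int(x > 5) for x in xs], "segment", force_disk_output=True),
    ToyStep(sum, "measure"),
]

with tempfile.TemporaryDirectory() as tmp:
    result, written = run_with_checkpoints([3, 10, 12], steps, tmp)
    # The checkpoint can be inspected independently of the final result.
    saved_mask = json.loads((Path(tmp) / "segment.json").read_text())

print(result, written, saved_mask)
```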
Decision Trees for Common Scenarios
Choosing Variable Components
Do you need to process individual images?
├─ Yes → variable_components=[SITE, CHANNEL]
└─ No → Do you need channel-specific processing?
   ├─ Yes → variable_components=[SITE] + dictionary pattern
   └─ No → Do you need to combine across sites?
      ├─ Yes → variable_components=[CHANNEL]
      └─ No → variable_components=[SITE]
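The same decision tree can be written as a small function, which makes the order of the questions explicit. The function name and its boolean parameters are hypothetical, and the strings stand in for `VariableComponents` members.

```python
from typing import List, Tuple


def choose_variable_components(per_image: bool,
                               channel_specific: bool,
                               combine_sites: bool) -> Tuple[List[str], bool]:
    """Return (components, needs_dictionary_pattern), asking the questions in tree order."""
    if per_image:
        return ["SITE", "CHANNEL"], False
    if channel_specific:
        return ["SITE"], True       # pair with a dictionary pattern
    if combine_sites:
        return ["CHANNEL"], False
    return ["SITE"], False          # default: site-by-site processing


print(choose_variable_components(False, True, False))
```

Because the questions are checked top to bottom, an earlier "yes" wins over later ones, exactly as in the tree.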
Choosing Function Patterns
Do different data types need different processing?
├─ Yes → Dictionary pattern with group_by
└─ No → Do you need multiple sequential operations?
   ├─ Yes → Function chain pattern
   └─ No → Single function pattern
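Read the other way around, the shape of a `func` argument tells you which pattern it expresses. The helper below is a hypothetical sketch of that mapping, based only on the shapes shown in this guide (dictionary, list, bare function, or `(function, kwargs)` tuple).

```python
def classify_function_pattern(func) -> str:
    """Name the pattern implied by the shape of a func argument."""
    if isinstance(func, dict):
        return "dictionary"   # different processing per group key
    if isinstance(func, list):
        return "chain"        # several operations applied in sequence
    is_func_with_kwargs = (
        isinstance(func, tuple) and len(func) == 2 and callable(func[0])
    )
    if callable(func) or is_func_with_kwargs:
        return "single"       # one function, optionally with keyword arguments
    raise TypeError(f"unsupported pattern: {type(func).__name__}")


print(classify_function_pattern({"1": abs, "2": round}))
print(classify_function_pattern([abs, (round, {"ndigits": 1})]))
print(classify_function_pattern(abs))
```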
Choosing Storage Strategy
How large is your dataset?
├─ Small (<10GB) → Memory backend for speed
├─ Medium (10-100GB) → Mixed strategy (memory + disk checkpoints)
└─ Large (>100GB) → Zarr backend with compression
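As a sketch, the size thresholds above translate into a small helper. The function name and the returned labels are hypothetical, and the boundaries follow the tree literally: 10 GB and 100 GB both fall in the medium range.

```python
def choose_storage_strategy(dataset_gb: float) -> str:
    """Map a dataset size in GB to the strategy suggested by the tree above."""
    if dataset_gb < 0:
        raise ValueError("dataset size cannot be negative")
    if dataset_gb < 10:
        return "memory"   # small: keep everything in memory for speed
    if dataset_gb <= 100:
        return "mixed"    # medium: memory plus disk checkpoints
    return "zarr"         # large: Zarr backend with compression


for size in (2, 10, 100, 250):
    print(size, choose_storage_strategy(size))
```

These cutoffs are rules of thumb; available RAM on the machine matters at least as much as the raw dataset size.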
Performance Optimization Patterns
GPU Memory Management
# Efficient GPU processing pattern
pipeline = Pipeline([
    # Group GPU operations together
    FunctionStep(
        func=[
            gaussian_filter,  # CuPy
            tophat,           # CuPy
            threshold_otsu    # CuPy
        ],
        name="gpu_preprocessing"
    ),
    # CPU analysis (automatic memory conversion)
    FunctionStep(
        func=count_cells_single_channel,  # NumPy
        name="cpu_analysis"
    )
])
Pattern: Group operations by memory type to minimize conversions.
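Why grouping helps can be seen by counting how often the memory type changes between consecutive operations, since each change implies a conversion. The sketch below is plain Python with hypothetical labels; it does not measure real transfer costs.

```python
from typing import List


def count_conversions(memory_types: List[str]) -> int:
    """Count the places where consecutive operations use different memory types."""
    return sum(1 for a, b in zip(memory_types, memory_types[1:]) if a != b)


# The same four operations, ordered two different ways.
interleaved = ["cupy", "numpy", "cupy", "numpy"]
grouped = ["cupy", "cupy", "numpy", "numpy"]

print(count_conversions(interleaved))  # a conversion at every boundary
print(count_conversions(grouped))      # a single GPU-to-CPU hand-off
```

Reordering is only valid when the operations do not depend on each other's order, so this is a planning aid rather than an automatic optimization.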
Parallel Processing Optimization
# Maximize parallelization
step = FunctionStep(
    func=expensive_analysis,
    variable_components=[VariableComponents.SITE],  # More parallel groups
    name="parallel_analysis"
)
Pattern: Use fine-grained variable components for CPU-intensive operations to maximize parallel processing.
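The link between granularity and parallelism can be sketched by grouping file records and handing each group to a worker. This follows the description in this section, where each distinct combination of the listed components is processed as its own unit; it is an illustration of the grouping idea, not of OpenHCS internals. `group_files`, the record layout, and `expensive_analysis` are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical file records: (well, site, channel).
files = [("A01", site, channel) for site in (1, 2, 3) for channel in (1, 2)]

FIELD_INDEX = {"well": 0, "site": 1, "channel": 2}


def group_files(records, components):
    """Group records so each distinct combination of the components is one unit of work."""
    positions = [FIELD_INDEX[name] for name in components]
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[i] for i in positions)].append(record)
    return dict(groups)


def expensive_analysis(group):
    """Placeholder for real work; returns the number of files in the group."""
    return len(group)


by_site = group_files(files, ["site"])                      # fewer, larger units
by_site_channel = group_files(files, ["site", "channel"])   # more, smaller units

# Independent groups can be handed to a pool of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    sizes = list(pool.map(expensive_analysis, by_site.values()))

print(len(by_site), len(by_site_channel), sizes)
```

A thread pool keeps the sketch short; CPU-bound Python code would normally need processes or native libraries that release the GIL to see a real speedup.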
Memory Usage Optimization
# Manage memory usage in large datasets
pipeline = Pipeline([
    FunctionStep(func=large_preprocessing, name="preprocess"),
    # Free memory by saving to disk
    FunctionStep(
        func=memory_intensive_analysis,
        name="analysis",
        force_disk_output=True
    ),
    # Continue with freed memory
    FunctionStep(func=final_processing, name="final")
])
Pattern: Use strategic disk output to manage memory usage in long pipelines.
Troubleshooting Common Issues
“Out of Memory” Errors
Symptoms: GPU or CPU out of memory errors during processing.
Solutions:
- Use force_disk_output=True for large intermediate results
- Process fewer sites simultaneously (adjust variable_components)
- Switch to CPU backend for memory-intensive operations
- Use Zarr backend with compression for large datasets
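Before changing a pipeline, it can help to estimate how much memory one stack of images needs. The sketch below is plain arithmetic with hypothetical helper names; it counts raw pixel data only, so real usage will be higher once filters allocate temporary arrays.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def stack_bytes(n_images: int, height: int, width: int, bytes_per_pixel: int = 4) -> int:
    """Raw size of a stack of 2D images; 4 bytes per pixel corresponds to float32."""
    return n_images * height * width * bytes_per_pixel


def max_images_in_budget(budget_bytes: int, height: int, width: int,
                         bytes_per_pixel: int = 4) -> int:
    """Largest number of images whose raw data fits in the given budget."""
    return budget_bytes // (height * width * bytes_per_pixel)


# Nine float32 images of 2048 x 2048 pixels.
nine_sites = stack_bytes(9, 2048, 2048)
print(nine_sites / MIB)                            # stack size in MiB
print(max_images_in_budget(1 * GIB, 2048, 2048))   # images that fit in 1 GiB
```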
Slow Processing
Symptoms: Processing takes much longer than expected.
Solutions:
- Use GPU backends (CuPy, PyTorch, pyclesperanto) for large images
- Group operations by memory type to minimize conversions
- Use appropriate variable_components for parallelization
- Check storage backend performance (SSD vs HDD)
Incorrect Results
Symptoms: Analysis produces unexpected or inconsistent results.
Solutions:
- Check variable_components match your analysis intent
- Verify group_by parameter for dictionary patterns
- Use force_disk_output=True to inspect intermediate results
- Test with small datasets first
Building Effective Workflows
Start Simple
Begin with basic patterns and add complexity gradually:
- Single function steps with site-by-site processing
- Add function chains for sequential operations
- Introduce dictionary patterns for multi-channel analysis
- Optimize storage and memory for performance
Iterate and Refine
Use OpenHCS features to iteratively improve workflows:
- Add checkpoints with force_disk_output for debugging
- Optimize memory usage by adjusting variable_components
- Improve performance by grouping operations by backend
- Add condition-specific processing as experiments become more complex
Test at Scale
Validate workflows with realistic datasets:
- Test with full-size images to identify memory issues
- Process multiple wells to verify parallel execution
- Use representative data to catch edge cases
- Monitor resource usage to optimize performance
These patterns and mental models provide a foundation for building effective OpenHCS workflows that scale from simple image processing to complex multi-dimensional analysis pipelines.