Pattern Detection and Microscope Integration System

The Problem: Microscope Format Diversity

High-content screening involves diverse microscope platforms (Opera Phenix, ImageXpress, MetaXpress, etc.), each with unique directory structures, filename patterns, and metadata formats. Without automatic pattern detection, users must manually specify how to find images for each microscope type, creating brittle pipelines that break when directory structures change or when switching between instruments.

The Solution: Automatic Pattern Discovery

OpenHCS implements a pattern detection system that automatically discovers image file patterns across different microscope formats. This system coordinates filename parsing, directory structure analysis, and pattern grouping to enable flexible pipeline processing without manual configuration.

See also

Pattern Grouping and Special Output Path Resolution

Detailed explanation of how pattern grouping interacts with special outputs and the dual purpose of group_by

Overview

The system works by analyzing directory structures, extracting component information from filenames, and automatically grouping images into logical units (wells, sites, channels) that match the pipeline’s component configuration.

Architecture Components

Core Components

┌─────────────────────────────────────────────────────────────┐
│                    MicroscopeHandler                        │
│  • Format-specific directory flattening                    │
│  • Filename parsing and pattern detection                  │
│  • Metadata extraction and validation                      │
│  • Post-workspace processing coordination                  │
└─────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────┐
│                PatternDiscoveryEngine                       │
│  • Auto-detection of file patterns                         │
│  • Pattern grouping by components                          │
│  • Pattern validation and instantiation                    │
│  • Cross-well pattern coordination                         │
└─────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────┐
│                 FilenameParser                              │
│  • Format-specific regex patterns                          │
│  • Component extraction (well, site, channel, z_index)     │
│  • Filename construction and validation                    │
│  • Pattern template generation                             │
└─────────────────────────────────────────────────────────────┘

Pattern Detection Flow

Phase 1: Directory Structure Analysis

Each microscope handler implements format-specific directory processing:

ImageXpress Handler

def _build_virtual_mapping(self, plate_path: Path, filemanager: FileManager) -> Path:
    """Build virtual workspace mapping for nested folder structures."""
    workspace_mapping = {}

    # Flatten TimePoint and ZStep folders virtually (no physical file operations)
    self._flatten_timepoints(plate_path, filemanager, workspace_mapping, plate_path)
    self._flatten_zsteps(plate_path, filemanager, workspace_mapping, plate_path)

    # Save virtual workspace mapping to metadata
    writer.merge_subdirectory_metadata(metadata_path, {
        self.root_dir: {
            "workspace_mapping": workspace_mapping,
            "available_backends": {"disk": True, "virtual_workspace": True}
        }
    })

    return plate_path
    subdirs = filemanager.list_dir(directory, "disk")

    for subdir in subdirs:
        match = zstep_pattern.match(subdir.name)
        if match:
            z_index = int(match.group(1))

            # 2. Move images to parent with updated z_index
            img_files = filemanager.list_image_files(subdir, "disk")
            for img_file in img_files:
                # Parse original filename
                components = self.parser.parse_filename(img_file.name)

                # Update z_index component
                components['z_index'] = z_index

                # Construct new filename with correct z_index
                new_name = self.parser.construct_filename(**components)

                # Move file to parent directory
                new_path = directory / new_name
                filemanager.move(img_file, new_path, "disk")

Opera Phenix Handler

def _prepare_workspace(self, workspace_path, filemanager):
    """Rename Opera Phenix images based on spatial layout."""

    # 1. Find Index.xml for spatial mapping
    index_xml = self.metadata_handler.find_metadata_file(workspace_path)
    spatial_mapping = self._parse_spatial_layout(index_xml)

    # 2. Find image directory
    image_dir = workspace_path / "Images"
    if not image_dir.exists():
        # Look for other common image directories
        image_dir = self._find_image_directory(workspace_path)

    # 3. Rename files based on spatial layout
    img_files = filemanager.list_image_files(image_dir, "disk")
    for img_file in img_files:
        # Parse original filename
        components = self.parser.parse_filename(img_file.name)

        # Apply spatial remapping
        if components['site'] in spatial_mapping:
            components['site'] = spatial_mapping[components['site']]

        # Construct new filename
        new_name = self.parser.construct_filename(**components)
        new_path = img_file.parent / new_name

        if new_path != img_file:
            filemanager.move(img_file, new_path, "disk")

    return image_dir

Phase 2: Pattern Discovery

The PatternDiscoveryEngine analyzes the flattened directory structure:

def auto_detect_patterns(
    self,
    folder_path: Union[str, Path],
    well_filter: List[str],
    extensions: List[str],
    group_by: Optional[str],
    variable_components: List[str],
    backend: str
) -> Dict[str, Any]:
    """Automatically detect image patterns in a folder."""

    # 1. Find and filter images by well
    files_by_well = self._find_and_filter_images(
        folder_path, well_filter, extensions, True, backend
    )

    if not files_by_well:
        return {}

    # 2. Generate patterns for each well
    result = {}
    for well, files in files_by_well.items():
        # Generate patterns from file list
        patterns = self._generate_patterns_for_files(files, variable_components)

        # Group patterns by component if requested
        result[well] = (
            self.group_patterns_by_component(patterns, component=group_by)
            if group_by else patterns
        )

    return result

def _find_and_filter_images(self, folder_path, well_filter, extensions,
                           recursive, backend):
    """Find all image files and filter by well."""

    # 1. Get all image files from directory
    image_paths = self.filemanager.list_image_files(
        folder_path, backend, extensions=extensions, recursive=recursive
    )

    # 2. Parse filenames and group by well
    files_by_well = defaultdict(list)
    for img_path in image_paths:
        filename = os.path.basename(img_path)

        # Parse filename to extract metadata
        metadata = self.parser.parse_filename(filename)
        if not metadata:
            continue

        # Filter by well
        well = metadata['well']
        if well not in well_filter:
            continue

        files_by_well[well].append(img_path)

    return files_by_well

Phase 3: Pattern Grouping

Patterns are grouped by components for processing:

def group_patterns_by_component(self, patterns, component):
    """Group patterns by a specific component (channel, site, etc.)."""

    grouped_patterns = defaultdict(list)

    for pattern in patterns:
        # Extract pattern template
        pattern_str = pattern.get_pattern()
        pattern_template = pattern_str.replace(self.PLACEHOLDER_PATTERN, '001')

        # Parse template to get component value
        metadata = self.parser.parse_filename(pattern_template)
        if not metadata or component not in metadata:
            raise ValueError(f"Missing component '{component}' in pattern: {pattern_str}")

        # Group by component value
        value = str(metadata[component])
        grouped_patterns[value].append(pattern)

    return grouped_patterns

Filename Parsing System

Parser Architecture

Each microscope format has a specialized parser:

ImageXpress Parser

class ImageXpressFilenameParser(FilenameParser):
    """Parser for ImageXpress filename format."""

    def __init__(self, filemanager, pattern_format=None):
        # Default ImageXpress pattern
        self._pattern = re.compile(
            r"([A-Z]\d{2})_s(\d+)_w(\d+)_z(\d+)\.(\w+)$"
        )
        # Groups: well, site, channel, z_index, extension

    def parse_filename(self, filename):
        """Parse ImageXpress filename into components."""
        basename = Path(filename).name
        match = self._pattern.match(basename)

        if match:
            well, site_str, channel_str, z_str, ext = match.groups()

            # Handle placeholder components
            parse_comp = lambda s: None if not s or '{' in s else int(s)

            return {
                'well': well,
                'site': parse_comp(site_str),
                'channel': parse_comp(channel_str),
                'z_index': parse_comp(z_str),
                'extension': ext or '.tif'
            }

        return None

    def construct_filename(self, well, site, channel, z_index, extension):
        """Construct filename from components."""
        return f"{well}_s{site:03d}_w{channel}_z{z_index:03d}{extension}"

Opera Phenix Parser

class OperaPhenixFilenameParser(FilenameParser):
    """Parser for Opera Phenix filename format."""

    def __init__(self, filemanager, pattern_format=None):
        # Opera Phenix pattern: r(\d+)c(\d+)f(\d+)p(\d+)-ch(\d+)sk(\d+)fk(\d+)fl(\d+)\.(\w+)
        self._pattern = re.compile(
            r"r(\d+)c(\d+)f(\d+)p(\d+)-ch(\d+)sk(\d+)fk(\d+)fl(\d+)\.(\w+)$"
        )

    def parse_filename(self, filename):
        """Parse Opera Phenix filename into components."""
        basename = Path(filename).name
        match = self._pattern.match(basename)

        if match:
            row, col, site_str, z_str, channel_str, ext = match.groups()

            # Create well ID from row and column
            well = f"R{int(row):02d}C{int(col):02d}"

            # Parse components
            parse_comp = lambda s: None if not s or '{' in s else int(s)

            return {
                'well': well,
                'site': parse_comp(site_str),
                'channel': parse_comp(channel_str),
                'wavelength': parse_comp(channel_str),  # Backward compatibility
                'z_index': parse_comp(z_str),
                'extension': ext or '.tif'
            }

        return None

Integration with Pipeline System

Post-Workspace Processing

The orchestrator calls post_workspace() after creating symlinks:

# In orchestrator.compile_pipelines()
def compile_pipelines(self):
    """Compile pipelines for all detected wells."""

    # 1. Create workspace symlinks
    self.create_workspace_symlinks()

    # 2. Process workspace with microscope handler
    actual_input_dir = self.microscope_handler.post_workspace(
        workspace_path=self.workspace_path,
        filemanager=self.filemanager
    )

    # 3. Update input directory to flattened location
    self.input_dir = actual_input_dir

    # 4. Detect patterns in processed directory
    patterns_by_well = self.microscope_handler.auto_detect_patterns(
        folder_path=self.input_dir,
        well_filter=self.wells,
        extensions=DEFAULT_IMAGE_EXTENSIONS,
        group_by="channel",
        variable_components=["site"],
        backend="disk"
    )

    # 5. Compile pipeline for each well
    for well_id in self.wells:
        if well_id in patterns_by_well:
            context = self.create_context(well_id)
            # ... compilation continues

Pattern Usage in FunctionStep

Patterns are used during step execution:

# In FunctionStep.process()
def process(self, context: 'ProcessingContext', step_index: int) -> None:
    """Execute function step using detected patterns."""

    # 1. Get step plan for this step
    step_plan = context.step_plans[step_index]
    patterns_by_well = step_plan.get('patterns_by_well', {})
        group_by, variable_components, read_backend
    )

    # 2. Resolve function patterns
    grouped_patterns, comp_to_funcs, comp_to_base_args = prepare_patterns_and_functions(
        patterns_by_well[well_id], self.func, component=group_by
    )

    # 3. Process each component group
    for comp_val, current_pattern_list in grouped_patterns.items():
        exec_func_or_chain = comp_to_funcs[comp_val]
        base_kwargs = comp_to_base_args[comp_val]

        for pattern_item in current_pattern_list:
            # Get matching files for this pattern
            matching_files = context.microscope_handler.path_list_from_pattern(
                str(step_input_dir), pattern_item, read_backend
            )

            # Load, stack, process, unstack, save
            _process_single_pattern_group(...)

Pattern Data Structures

PatternPath Objects

Patterns are represented as PatternPath objects:

class PatternPath:
    """Represents a file pattern with component placeholders."""

    def __init__(self, pattern_string):
        self.pattern = pattern_string

    def get_pattern(self):
        """Get the pattern string."""
        return self.pattern

    def is_fully_instantiated(self):
        """Check if pattern has no uninstantiated placeholders."""
        return '{' not in self.pattern and '}' not in self.pattern

Pattern Grouping Results

Pattern detection returns nested dictionaries:

# Example result structure
patterns_by_well = {
    'A01': {
        'channel_1': [
            PatternPath("A01_s{site}_w1_z{z_index}.tif"),
            # ... more patterns for channel 1
        ],
        'channel_2': [
            PatternPath("A01_s{site}_w2_z{z_index}.tif"),
            # ... more patterns for channel 2
        ]
    },
    'D02': {
        # ... patterns for well D02
    }
}

Error Handling and Validation

Pattern Validation

def validate_patterns(patterns):
    """Validate pattern structure and instantiation."""

    for pattern in patterns:
        # Check type
        if not isinstance(pattern, PatternPath):
            raise TypeError(f"Invalid pattern type: {type(pattern)}")

        # Check instantiation
        if not pattern.is_fully_instantiated():
            raise ValueError(f"Pattern contains placeholders: {pattern}")

        # Check pattern syntax
        pattern_str = pattern.get_pattern()
        if not _is_valid_pattern_syntax(pattern_str):
            raise ValueError(f"Invalid pattern syntax: {pattern_str}")

Directory Structure Validation

def validate_directory_structure(workspace_path, microscope_type):
    """Validate directory structure matches expected format."""

    if microscope_type == "imagexpress":
        # Check for TimePoint directories
        timepoint_dirs = list(workspace_path.glob("*TimePoint*"))
        if not timepoint_dirs:
            raise ValueError("ImageXpress format requires TimePoint directories")

    elif microscope_type == "opera_phenix":
        # Check for Index.xml
        index_files = list(workspace_path.glob("**/Index.xml"))
        if not index_files:
            raise ValueError("Opera Phenix format requires Index.xml file")

Performance Considerations

Performance Characteristics

# Pattern detection performance considerations:
class PatternDiscoveryEngine:
    """Pattern discovery engine with performance optimizations."""

    def __init__(self, parser: FilenameParser, filemanager: FileManager):
        self.parser = parser
        self.filemanager = filemanager

    def auto_detect_patterns(self, folder_path, **kwargs):
        """Auto-detect patterns with efficient file operations."""

        # Use FileManager for efficient directory listing
        # Breadth-first traversal for consistent ordering
        # Filter files by extension early to reduce parsing overhead

        return self._detect_patterns_optimized(folder_path, **kwargs)

Current Implementation Status

Implemented Features

  • ✅ MicroscopeHandler architecture with format-specific processing

  • ✅ PatternDiscoveryEngine for automatic pattern detection

  • ✅ FilenameParser interface with ImageXpress and Opera Phenix implementations

  • ✅ Directory structure flattening (ImageXpress Z-steps, Opera Phenix spatial remapping)

  • ✅ Pattern grouping by components (channel, site, z_index)

  • ✅ Integration with pipeline orchestrator and FunctionStep execution

  • ✅ post_workspace workflow for microscope-specific preprocessing

Future Enhancements

  1. Pattern Caching: Cache pattern detection results for performance

  2. Dynamic Parser Registration: Runtime registration of new microscope formats

  3. Parallel Pattern Detection: Multi-threaded pattern discovery for large datasets

  4. Advanced Pattern Validation: Enhanced validation of pattern consistency

  5. Lazy Pattern Loading: On-demand pattern detection for large datasets