=================================== Orchestrator Cleanup Guarantees =================================== *Module: openhcs.core.orchestrator.orchestrator* *Status: STABLE* --- Overview ======== Worker processes accumulate memory and GPU resources when processing multiple wells sequentially. The orchestrator cleanup system guarantees resource cleanup before error returns, preventing memory leaks and zombie workers. Problem Context =============== Without guaranteed cleanup, worker processes leak resources: .. code-block:: python # Worker processes well A process_well_A() # Allocates GPU memory, creates VFS mappings # Worker processes well B process_well_B() # Allocates MORE GPU memory, MORE VFS mappings # Worker processes well C → ERROR process_well_C() # Raises exception # Without cleanup: GPU memory and VFS from A, B, C still allocated # Worker process is now a zombie with leaked resources This causes: - **Memory exhaustion**: GPU memory fills up across sequential wells - **VFS pollution**: Memory backend accumulates stale file mappings - **Zombie workers**: Processes hang with unreleased resources Solution: Cleanup Before Error Return ====================================== The orchestrator guarantees cleanup REGARDLESS of success or failure: .. code-block:: python def _execute_axis_with_sequential_combinations( pipeline_definition, axis_contexts, visualizer ): """Execute sequential combinations with guaranteed cleanup.""" for combo_idx, (context_key, frozen_context) in enumerate(axis_contexts): # Execute this combination result = _execute_single_axis_static( pipeline_definition, frozen_context, visualizer ) # CRITICAL: Clear VFS and GPU REGARDLESS of success/failure # This must happen BEFORE checking result status from polystore.base import reset_memory_backend from openhcs.core.memory.gpu_cleanup import cleanup_all_gpu_frameworks reset_memory_backend() if cleanup_all_gpu_frameworks: cleanup_all_gpu_frameworks() # NOW check if combination failed (after cleanup) if not result.success: return result # Return error with clean state **Key insight**: Cleanup happens BEFORE error return, ensuring worker process is in clean state even when errors occur. Cleanup Components ================== VFS Reset --------- The memory backend maintains a virtual file system mapping real paths to in-memory data: .. code-block:: python from polystore.base import reset_memory_backend # Clear all VFS mappings reset_memory_backend() **What this clears**: - In-memory file mappings - Cached metadata - Virtual path registrations - Backend state **Why this matters**: Without reset, VFS accumulates mappings from previous wells, causing memory growth and stale data access. GPU Framework Cleanup ---------------------- GPU frameworks (CuPy, PyTorch, TensorFlow) cache memory pools and kernels: .. code-block:: python from openhcs.core.memory.gpu_cleanup import cleanup_all_gpu_frameworks # Clear GPU memory pools and caches if cleanup_all_gpu_frameworks: cleanup_all_gpu_frameworks() **What this clears**: - CuPy memory pools - PyTorch cached allocations - TensorFlow graph caches - CUDA context state **Why this matters**: GPU memory pools grow across sequential wells. Cleanup releases memory back to GPU. Cleanup Timing Guarantees ========================== After Each Sequential Combination ---------------------------------- .. code-block:: python # Sequential processing: A → B → C for combination in sequential_combinations: result = process_combination(combination) # Cleanup AFTER each combination reset_memory_backend() cleanup_all_gpu_frameworks() # Check result AFTER cleanup if not result.success: return result # Clean state guaranteed **Guarantee**: Each combination starts with clean VFS and GPU state. Before Error Return ------------------- .. code-block:: python try: result = process_combination(combination) except Exception as e: # Cleanup BEFORE raising exception reset_memory_backend() cleanup_all_gpu_frameworks() raise # Exception raised with clean state **Guarantee**: Exceptions don't prevent cleanup. Worker process is clean even on errors. After Successful Completion ---------------------------- .. code-block:: python # All combinations successful for combination in combinations: result = process_combination(combination) reset_memory_backend() cleanup_all_gpu_frameworks() # Final cleanup after all combinations reset_memory_backend() cleanup_all_gpu_frameworks() return success_result **Guarantee**: Worker process is clean after successful completion, ready for next task. Operational Implications ======================== Memory Stability ---------------- Cleanup guarantees prevent memory growth across sequential processing: .. code-block:: python # Without cleanup # Well 1: 2GB GPU memory used # Well 2: 4GB GPU memory used (2GB + 2GB leaked) # Well 3: 6GB GPU memory used (4GB + 2GB leaked) # Well 4: OOM error (out of memory) # With cleanup # Well 1: 2GB GPU memory used → cleanup → 0GB # Well 2: 2GB GPU memory used → cleanup → 0GB # Well 3: 2GB GPU memory used → cleanup → 0GB # Well 4: 2GB GPU memory used → cleanup → 0GB **Operational benefit**: Stable memory usage enables processing large plate datasets without OOM errors. Worker Process Reuse --------------------- Clean workers can be reused for subsequent tasks: .. code-block:: python # Worker pool with 4 processes with ProcessPoolExecutor(max_workers=4) as executor: # Process 96-well plate for well in wells: future = executor.submit(process_well, well) # Worker cleanup guaranteed after each well # Same worker can process next well without restart **Operational benefit**: Worker process reuse reduces overhead from process creation/destruction. Zombie Worker Prevention ------------------------- Cleanup before error return prevents zombie workers: .. code-block:: python # Worker processes well with error try: process_well(well) except Exception as e: # Cleanup guaranteed before exception propagates reset_memory_backend() cleanup_all_gpu_frameworks() raise # Worker is clean, can be terminated or reused **Operational benefit**: No zombie workers with leaked resources. Clean shutdown guaranteed. Implementation Patterns ======================= Sequential Combination Processing ---------------------------------- .. code-block:: python def _execute_axis_with_sequential_combinations( pipeline_definition, axis_contexts, visualizer ): """Execute sequential combinations with cleanup guarantees.""" for combo_idx, (context_key, frozen_context) in enumerate(axis_contexts): # Execute combination result = _execute_single_axis_static( pipeline_definition, frozen_context, visualizer ) # CRITICAL: Cleanup BEFORE checking result reset_memory_backend() if cleanup_all_gpu_frameworks: cleanup_all_gpu_frameworks() # Check result AFTER cleanup if not result.success: return result return ExecutionResult.success(axis_id=axis_id) Multiprocessing Worker Function -------------------------------- .. code-block:: python def _worker_process_well(well_context): """Worker function with cleanup guarantees.""" try: # Process well result = process_pipeline(well_context) # Cleanup on success reset_memory_backend() cleanup_all_gpu_frameworks() return result except Exception as e: # Cleanup on error reset_memory_backend() cleanup_all_gpu_frameworks() # Return error result (don't raise to prevent worker crash) return ExecutionResult.failure(error=str(e)) ZMQ Execution Server -------------------- The ZMQ execution server ensures cleanup even for remote execution: .. code-block:: python class ZMQExecutionServer: def execute_pipeline(self, request): """Execute pipeline with cleanup guarantees.""" try: result = self.orchestrator.execute_compiled_plate(...) # Cleanup after execution reset_memory_backend() cleanup_all_gpu_frameworks() return result except Exception as e: # Cleanup on error reset_memory_backend() cleanup_all_gpu_frameworks() raise **Key insight**: Cleanup guarantees extend to remote execution via ZMQ server. Implementation Notes ==================== **🔬 Source Code**: - Sequential cleanup: ``openhcs/core/orchestrator/orchestrator.py`` (line 199) - Worker cleanup: ``openhcs/core/orchestrator/orchestrator.py`` (line 200-208) - ZMQ server cleanup: ``openhcs/runtime/zmq_execution_server.py`` (line 448) **🏗️ Architecture**: - :doc:`../concepts/memory-backend-system` - VFS architecture - :doc:`gpu-resource-management` - GPU cleanup system - :doc:`concurrency-model` - Multiprocessing architecture **📊 Performance**: - VFS reset: < 1ms (clears dictionary mappings) - GPU cleanup: 10-100ms (depends on allocated memory) - Total overhead: < 1% of well processing time Key Design Decisions ==================== **Why cleanup before error return?** Ensures worker process is in clean state even when errors occur. Prevents zombie workers with leaked resources. **Why cleanup after each sequential combination?** Sequential processing accumulates resources across combinations. Per-combination cleanup prevents memory growth. **Why separate VFS and GPU cleanup?** VFS cleanup is always safe. GPU cleanup is conditional (only if GPU frameworks are loaded). Separation enables flexibility. Common Gotchas ============== - **Don't skip cleanup on errors**: Cleanup must happen BEFORE error return - **Don't assume cleanup is automatic**: Explicit cleanup calls required - **GPU cleanup is conditional**: Check if ``cleanup_all_gpu_frameworks`` exists before calling - **VFS reset is global**: Affects all threads in worker process Debugging Cleanup Issues ======================== Symptom: Memory Growth Across Wells ------------------------------------ **Cause**: Cleanup not called or failing silently **Diagnosis**: .. code-block:: python # Add logging to verify cleanup logger.debug("Before cleanup: VFS size = ...") reset_memory_backend() logger.debug("After cleanup: VFS size = 0") **Fix**: Ensure cleanup is called after each well Symptom: Zombie Workers ------------------------ **Cause**: Exception preventing cleanup **Diagnosis**: .. code-block:: python # Check worker process state ps aux | grep python # Look for zombie processes **Fix**: Wrap cleanup in try/finally to guarantee execution Symptom: GPU OOM Errors ------------------------ **Cause**: GPU cleanup not called or ineffective **Diagnosis**: .. code-block:: python # Check GPU memory usage nvidia-smi # Monitor GPU memory across wells **Fix**: Verify ``cleanup_all_gpu_frameworks`` is called and working