OSSS.ai.dependencies.failure_manager¶
OSSS.ai.dependencies.failure_manager
¶
Advanced failure management with cascade prevention and retry strategies.
This module provides sophisticated failure handling capabilities including cascade prevention, intelligent retry strategies, failure impact analysis, and recovery mechanisms for agent execution failures.
FailureType
¶
Bases: Enum
Types of failures that can occur during agent execution.
CascadePreventionStrategy
¶
Bases: Enum
Strategies for preventing failure cascades.
RetryStrategy
¶
Bases: Enum
Retry strategies for failed agents.
RetryConfiguration
¶
Bases: BaseModel
Configuration for retry behavior.
Migrated from dataclass to Pydantic BaseModel for enhanced validation, serialization, and integration with the OSSS Pydantic ecosystem.
calculate_delay(attempt, recent_failures=0)
¶
Calculate delay for the given attempt number.
FailureRecord
¶
Bases: BaseModel
Record of a failure event.
Migrated from dataclass to Pydantic BaseModel for enhanced validation, serialization, and integration with the OSSS Pydantic ecosystem.
to_dict()
¶
Convert to dictionary representation.
FailureImpactAnalysis
¶
Bases: BaseModel
Analysis of failure impact on the execution graph.
Migrated from dataclass to Pydantic BaseModel for enhanced validation, serialization, and integration with the OSSS Pydantic ecosystem.
DependencyCircuitBreaker
¶
FailureManager
¶
Advanced failure manager with cascade prevention and recovery strategies.
Provides comprehensive failure handling including impact analysis, cascade prevention, retry management, and recovery coordination.
configure_retry(agent_id, config)
¶
Configure retry behavior for a specific agent.
configure_circuit_breaker(agent_id, failure_threshold=5, recovery_timeout_ms=60000)
¶
Configure circuit breaker for a specific agent.
set_cascade_prevention(strategy)
¶
Set the cascade prevention strategy.
add_fallback_chain(agent_id, fallback_agents)
¶
Add fallback chain for an agent.
can_execute_agent(agent_id, context)
¶
Check if an agent can execute given current failure state.
Returns¶
Tuple[bool, str] (can_execute, reason)
handle_agent_failure(agent_id, error, context, attempt_number=1)
async
¶
Handle an agent failure with comprehensive recovery strategies.
Parameters¶
agent_id : str ID of the failed agent error : Exception The error that occurred context : AgentContext Current execution context attempt_number : int Current attempt number
Returns¶
Tuple[bool, Optional[str]] (should_retry, recovery_action)
analyze_failure_impact(failed_agent, context)
¶
Analyze the impact of an agent failure on the execution graph.
analyze_dependency_failures(agent_id, context)
¶
Analyze how dependency failures affect a specific agent.
attempt_recovery(agent_id, recovery_action, context)
async
¶
Attempt to recover from a failure using the specified recovery action.
create_checkpoint(checkpoint_id, context)
¶
Create a checkpoint for recovery purposes.
get_failure_statistics()
¶
Get comprehensive failure statistics.