Skip to content

OSSS.ai.dependencies.failure_manager

OSSS.ai.dependencies.failure_manager

Advanced failure management with cascade prevention and retry strategies.

This module provides sophisticated failure handling capabilities including cascade prevention, intelligent retry strategies, failure impact analysis, and recovery mechanisms for agent execution failures.

FailureType

Bases: Enum

Types of failures that can occur during agent execution.

CascadePreventionStrategy

Bases: Enum

Strategies for preventing failure cascades.

RetryStrategy

Bases: Enum

Retry strategies for failed agents.

RetryConfiguration

Bases: BaseModel

Configuration for retry behavior.

Migrated from dataclass to Pydantic BaseModel for enhanced validation, serialization, and integration with the OSSS Pydantic ecosystem.

calculate_delay(attempt, recent_failures=0)

Calculate delay for the given attempt number.

FailureRecord

Bases: BaseModel

Record of a failure event.

Migrated from dataclass to Pydantic BaseModel for enhanced validation, serialization, and integration with the OSSS Pydantic ecosystem.

to_dict()

Convert to dictionary representation.

FailureImpactAnalysis

Bases: BaseModel

Analysis of failure impact on the execution graph.

Migrated from dataclass to Pydantic BaseModel for enhanced validation, serialization, and integration with the OSSS Pydantic ecosystem.

get_total_affected_count()

Get total number of affected agents.

has_recovery_options()

Check if recovery options are available.

DependencyCircuitBreaker

Circuit breaker for preventing cascade failures.

can_execute()

Check if execution is allowed through the circuit breaker.

record_success()

Record a successful execution.

record_failure()

Record a failed execution.

get_state()

Get current circuit breaker state.

FailureManager

Advanced failure manager with cascade prevention and recovery strategies.

Provides comprehensive failure handling including impact analysis, cascade prevention, retry management, and recovery coordination.

configure_retry(agent_id, config)

Configure retry behavior for a specific agent.

configure_circuit_breaker(agent_id, failure_threshold=5, recovery_timeout_ms=60000)

Configure circuit breaker for a specific agent.

set_cascade_prevention(strategy)

Set the cascade prevention strategy.

add_fallback_chain(agent_id, fallback_agents)

Add fallback chain for an agent.

can_execute_agent(agent_id, context)

Check if an agent can execute given current failure state.

Returns

Tuple[bool, str] (can_execute, reason)

handle_agent_failure(agent_id, error, context, attempt_number=1) async

Handle an agent failure with comprehensive recovery strategies.

Parameters

agent_id : str ID of the failed agent error : Exception The error that occurred context : AgentContext Current execution context attempt_number : int Current attempt number

Returns

Tuple[bool, Optional[str]] (should_retry, recovery_action)

analyze_failure_impact(failed_agent, context)

Analyze the impact of an agent failure on the execution graph.

analyze_dependency_failures(agent_id, context)

Analyze how dependency failures affect a specific agent.

attempt_recovery(agent_id, recovery_action, context) async

Attempt to recover from a failure using the specified recovery action.

create_checkpoint(checkpoint_id, context)

Create a checkpoint for recovery purposes.

get_failure_statistics()

Get comprehensive failure statistics.