FAILOVER_FAILURE_MODEL.md — Failover & Promotion

Status

  • Phase: 6
  • Authority: Normative
  • Depends on:
      • FAILOVER_VISION.md
      • FAILOVER_SCOPE.md
      • FAILOVER_INVARIANTS.md
      • FAILOVER_ARCHITECTURE.md
  • Frozen Dependencies: Phases 0–5

1. Purpose

This document defines the failure model for Phase 6.

It specifies:

  • What failures are assumed possible
  • Where failures may occur
  • What outcomes are required
  • How promotion must behave under failure

Phase 6 inherits all failure assumptions from Phases 0–5. No assumptions are weakened, removed, or narrowed.


2. Failure Model Continuity

Phase 6 assumes exactly the same failure surface as earlier phases:

  • Process may crash at any instruction
  • Power loss may occur at any time
  • Disk writes may be partial or reordered
  • fsync is the only durability boundary
  • Memory state is volatile
  • Time is not monotonic across crashes
  • No graceful shutdown is assumed

Phase 6 introduces no new reliability assumptions.


3. Promotion-Specific Failure Points

Promotion introduces new logical failure boundaries.

These boundaries must be explicitly handled.


3.1 Failure Before Promotion Validation

Examples

  • Crash before validation begins
  • Validation request rejected immediately
  • Operator aborts request

Required Outcome

  • No authority change
  • System remains in pre-promotion state
  • No recovery ambiguity


3.2 Failure During Promotion Validation

Examples

  • Crash while evaluating WAL position
  • Crash while checking replica safety
  • Partial validation execution

Required Outcome

  • Promotion is considered not attempted
  • No authority transition occurs
  • On recovery, system behaves as if promotion never started

Validation has no durable effect.


3.3 Failure After Validation, Before Authority Transition

Examples

  • Crash after validation succeeds
  • Crash before authority transition is applied

Required Outcome

  • Promotion is not applied
  • Authority remains unchanged
  • Validation must be re-run on retry

There is no “validated but pending” state after recovery.


3.4 Failure During Authority Transition

Examples

  • Crash while updating replication role
  • Crash while rebinding authority
  • Crash during atomic transition

Required Outcome

  • Authority transition MUST be atomic
  • Recovery MUST observe:
      • Either the old authority state
      • Or the new authority state
      • Never a mixed or ambiguous state

Partial authority transitions are forbidden.
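
One common way to obtain an old-or-new guarantee on a POSIX filesystem is to write the new record to a temporary file, fsync it, rename it over the live record, and then fsync the containing directory. The sketch below is illustrative only: the JSON record layout, the file name, and the function names are assumptions that do not come from this document, and the atomicity of the rename depends on the underlying filesystem.

```python
import json
import os
import tempfile


def write_authority_atomically(path: str, state: dict) -> None:
    """Replace the authority record so a reader sees the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".authority-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # new contents are durable before they become visible
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself survives power loss (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def read_authority(path: str) -> dict:
    """Recovery reads only the durable record; no in-memory state is consulted."""
    with open(path) as f:
        return json.load(f)
```

A crash before `os.replace` leaves the old record in place (plus a stray temporary file that recovery can ignore or delete). A crash after it leaves the new record. There is no point at which the live path holds a partially written record.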


3.5 Failure After Authority Transition Completes

Examples

  • Crash immediately after promotion completes
  • Crash before observability events are flushed

Required Outcome

  • New authority state is authoritative
  • Recovery MUST reestablish the promoted primary
  • Observability gaps are acceptable; authority gaps are not


4. Primary Failure Scenarios

4.1 Primary Crash Before Promotion

Scenario

  • Primary crashes
  • No promotion attempted yet

Required Outcome

  • System has no writable primary
  • Replicas remain replicas
  • Reads obey replica read rules
  • Writes are rejected

Availability loss is acceptable; unsafe promotion is not.


4.2 Primary Crash During Promotion Attempt

Scenario

  • Promotion is attempted
  • Primary crashes mid-process

Required Outcome

  • Promotion outcome depends solely on authority transition completion
  • No reliance on primary liveness
  • No automatic retry

Promotion safety must be provable without the primary.


5. Replica Failure Scenarios

5.1 Replica Crash Before Promotion

Required Outcome

  • Promotion request fails explicitly
  • No authority change
  • No recovery ambiguity


5.2 Replica Crash During Promotion

Required Outcome

  • Promotion fails or is rolled back
  • Authority remains unchanged
  • Recovery is deterministic


6. Split-Brain Risk Handling

Phase 6 explicitly does not tolerate split-brain.

If any of the following holds:

  • Authority safety cannot be proven
  • Primary liveness is ambiguous
  • Replica divergence is suspected

Required Outcome

  • Promotion MUST be rejected
  • System MUST fail closed

No best-effort behavior is allowed.
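
The fail-closed rule can be sketched as a decision function in which "unknown" is treated exactly like "disproven". The names below (`PromotionEvidence`, its fields, `decide_promotion`) are hypothetical and serve only as an illustration. How each piece of evidence is actually obtained is outside the scope of this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    ALLOW = "allow"
    REJECT = "reject"


@dataclass(frozen=True)
class PromotionEvidence:
    # Each field is True (proven), False (disproven), or None (unknown).
    authority_safety_proven: Optional[bool]
    primary_liveness_resolved: Optional[bool]
    replica_divergence_ruled_out: Optional[bool]


def decide_promotion(evidence: PromotionEvidence) -> Decision:
    """Allow promotion only when every condition is positively proven."""
    checks = (
        evidence.authority_safety_proven,
        evidence.primary_liveness_resolved,
        evidence.replica_divergence_ruled_out,
    )
    # `is True` makes both False and None (unknown) a reason to reject.
    if all(check is True for check in checks):
        return Decision.ALLOW
    return Decision.REJECT
```

For example, `decide_promotion(PromotionEvidence(True, None, True))` returns `Decision.REJECT`: an ambiguous primary is sufficient to reject, and there is no "probably safe" branch.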


7. Network Failure Considerations

Phase 6 assumes:

  • Network partitions are possible
  • Messages may be delayed or lost
  • No reliable membership service exists

Phase 6 MUST NOT:

  • Infer authority from network reachability
  • Infer primary death from timeouts
  • Use network health as a promotion signal


8. Deterministic Failure Outcomes

For any failure scenario:

  • Outcome MUST be deterministic
  • Outcome MUST be explainable
  • Outcome MUST map to Phase 6 invariants

Failure handling MUST NOT depend on:

  • Timing
  • Retry count
  • Environmental factors
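
One way to read these rules is that the post-recovery authority state is a pure function of where the failure occurred and whether the authority transition became durable, with no timing or retry inputs. The sketch below encodes the boundaries of Sections 3.1 to 3.5 under that reading. The enum, the function name, and the `"old"` / `"new"` labels are illustrative assumptions.

```python
from enum import Enum


class Boundary(Enum):
    BEFORE_VALIDATION = "3.1"
    DURING_VALIDATION = "3.2"
    AFTER_VALIDATION = "3.3"
    DURING_TRANSITION = "3.4"
    AFTER_TRANSITION = "3.5"


# Boundaries at which the transition cannot be durable yet.
_MUST_BE_OLD = {
    Boundary.BEFORE_VALIDATION,
    Boundary.DURING_VALIDATION,
    Boundary.AFTER_VALIDATION,
}
# Boundaries at which the transition must already be durable.
_MUST_BE_NEW = {Boundary.AFTER_TRANSITION}


def authority_after_recovery(boundary: Boundary, transition_durable: bool) -> str:
    """Return "old" or "new"; inconsistent inputs are rejected, never guessed."""
    if boundary in _MUST_BE_OLD and transition_durable:
        raise ValueError(f"transition cannot be durable at boundary {boundary.value}")
    if boundary in _MUST_BE_NEW and not transition_durable:
        raise ValueError(f"transition must be durable at boundary {boundary.value}")
    # The outcome depends solely on whether the transition completed.
    return "new" if transition_durable else "old"
```

Only a failure during the transition (3.4) admits both outcomes, and even there the result is fully determined by the durable state that recovery observes.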


9. Forbidden Failure Handling

Phase 6 MUST NOT:

  • Retry promotion automatically
  • Mask failures
  • Escalate privileges implicitly
  • Enter degraded authority modes
  • “Guess” the safest outcome

Explicit failure is always preferred.


10. Testing Obligations

Phase 6 failure handling MUST be validated by:

  • Crash tests at every promotion boundary
  • Recovery tests for each failure scenario
  • Deterministic replay verification
  • Invariant enforcement tests

All Phase 0–5 tests MUST pass unmodified.
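
A crash test matrix over the promotion boundaries might look like the following in-memory sketch. It is a toy model, not the project's test harness: `Node`, `SimulatedCrash`, and the boundary names are hypothetical, a crash is simulated by raising an exception, and the authority transition is modeled as a single assignment, so a crash "during" the transition collapses to either just before or just after it.

```python
BOUNDARIES = (
    "before_validation",
    "during_validation",
    "after_validation",
    "after_transition",
)


class SimulatedCrash(Exception):
    """Stands in for a process crash at a chosen boundary."""


class Node:
    def __init__(self) -> None:
        self.durable_role = "replica"  # modeled as surviving a crash
        self.validated = False         # volatile; discarded on recovery

    def _maybe_crash(self, boundary: str, crash_at: str) -> None:
        if boundary == crash_at:
            raise SimulatedCrash(boundary)

    def promote(self, crash_at: str = "") -> None:
        self._maybe_crash("before_validation", crash_at)
        self._maybe_crash("during_validation", crash_at)
        self.validated = True          # in memory only, no durable effect
        self._maybe_crash("after_validation", crash_at)
        self.durable_role = "primary"  # the single atomic durable step
        self._maybe_crash("after_transition", crash_at)

    def recover(self) -> str:
        self.validated = False         # no "validated but pending" state survives
        return self.durable_role


def run_crash_matrix() -> dict:
    """Crash once at every boundary and record the role seen after recovery."""
    results = {}
    for boundary in BOUNDARIES:
        node = Node()
        try:
            node.promote(crash_at=boundary)
        except SimulatedCrash:
            pass
        results[boundary] = node.recover()
    return results
```

In this model every crash before the transition recovers as `"replica"`, and a crash after it recovers as `"primary"`. A real harness would replace the exception with process termination and read the role back from durable storage.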


11. Failure Model Completeness Rule

The failure model is complete when:

  • Every promotion boundary is a failure boundary
  • Every failure produces a single valid outcome
  • No recovery ambiguity exists
  • No silent behavior exists

END OF DOCUMENT