CRASH_TESTING.md: AeroDB Crash Injection & Failure Validation (Phase 1)

This document defines the authoritative crash testing methodology for AeroDB Phase 1.

Crash testing validates:

WAL durability
storage atomicity
recovery determinism
checkpoint correctness
snapshot safety
backup/restore survivability

If implementation behavior conflicts with this document, the implementation is wrong.

Crash testing is mandatory before production usage.

1. Principles

Crash testing must obey:

Deterministic reproduction
Explicit kill points
Zero partial recovery
Exact post-crash validation
No silent failures

Crashes are intentional and controlled.

2. Crash Types

The following crash modes MUST be tested:

| Crash Type | Description |
|------------|-------------|
| SIGKILL | Immediate process termination |
| Power Loss | Simulated abrupt filesystem stop |
| Panic | Rust panic |
| Disk Error | Forced IO failure |

Each crash type must be validated independently.

3. Kill Points

Crash injection must be supported at the following points:

WAL

after record append
before fsync
after fsync

Storage

before document write
after write, before checksum
after checksum

Index

during rebuild
during update

Snapshot

during storage copy
before manifest write
after manifest write

Checkpoint

after snapshot
before WAL truncation
after WAL truncation

Restore

after extraction
before directory swap
after directory swap
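One way to keep the kill points above enumerable is a single Rust enum that maps each point to its symbolic name. The sketch below is illustrative only: `wal_after_fsync` is the one name this document gives, and every other name, along with the `CrashPoint` type itself, is a hypothetical placeholder.

```rust
/// Kill points from Section 3. Only `wal_after_fsync` is named in this
/// document; every other symbolic name below is a hypothetical placeholder.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CrashPoint {
    WalAfterAppend,
    WalBeforeFsync,
    WalAfterFsync,
    StorageBeforeWrite,
    StorageAfterWrite, // "after write, before checksum"
    StorageAfterChecksum,
    IndexDuringRebuild,
    IndexDuringUpdate,
    SnapshotDuringCopy,
    SnapshotBeforeManifest,
    SnapshotAfterManifest,
    CheckpointAfterSnapshot,
    CheckpointBeforeTruncate,
    CheckpointAfterTruncate,
    RestoreAfterExtract,
    RestoreBeforeSwap,
    RestoreAfterSwap,
}

impl CrashPoint {
    /// Every kill point, in the order Section 3 lists them.
    pub const ALL: [CrashPoint; 17] = [
        CrashPoint::WalAfterAppend, CrashPoint::WalBeforeFsync, CrashPoint::WalAfterFsync,
        CrashPoint::StorageBeforeWrite, CrashPoint::StorageAfterWrite,
        CrashPoint::StorageAfterChecksum,
        CrashPoint::IndexDuringRebuild, CrashPoint::IndexDuringUpdate,
        CrashPoint::SnapshotDuringCopy, CrashPoint::SnapshotBeforeManifest,
        CrashPoint::SnapshotAfterManifest,
        CrashPoint::CheckpointAfterSnapshot, CrashPoint::CheckpointBeforeTruncate,
        CrashPoint::CheckpointAfterTruncate,
        CrashPoint::RestoreAfterExtract, CrashPoint::RestoreBeforeSwap,
        CrashPoint::RestoreAfterSwap,
    ];

    /// Symbolic name, as it would appear in `AERODB_CRASH_POINT`.
    pub fn name(self) -> &'static str {
        match self {
            CrashPoint::WalAfterAppend => "wal_after_append",
            CrashPoint::WalBeforeFsync => "wal_before_fsync",
            CrashPoint::WalAfterFsync => "wal_after_fsync",
            CrashPoint::StorageBeforeWrite => "storage_before_write",
            CrashPoint::StorageAfterWrite => "storage_after_write",
            CrashPoint::StorageAfterChecksum => "storage_after_checksum",
            CrashPoint::IndexDuringRebuild => "index_during_rebuild",
            CrashPoint::IndexDuringUpdate => "index_during_update",
            CrashPoint::SnapshotDuringCopy => "snapshot_during_copy",
            CrashPoint::SnapshotBeforeManifest => "snapshot_before_manifest",
            CrashPoint::SnapshotAfterManifest => "snapshot_after_manifest",
            CrashPoint::CheckpointAfterSnapshot => "checkpoint_after_snapshot",
            CrashPoint::CheckpointBeforeTruncate => "checkpoint_before_truncate",
            CrashPoint::CheckpointAfterTruncate => "checkpoint_after_truncate",
            CrashPoint::RestoreAfterExtract => "restore_after_extract",
            CrashPoint::RestoreBeforeSwap => "restore_before_swap",
            CrashPoint::RestoreAfterSwap => "restore_after_swap",
        }
    }

    /// Look a kill point up by its symbolic name.
    pub fn from_name(name: &str) -> Option<CrashPoint> {
        CrashPoint::ALL.iter().copied().find(|p| p.name() == name)
    }
}
```

Iterating over `CrashPoint::ALL` would let a test suite confirm that every listed kill point has at least one scenario.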

4. Crash Injection Mechanism

The testing harness must support the following environment variable:

AERODB_CRASH_POINT=<symbolic_name>

Example:

AERODB_CRASH_POINT=wal_after_fsync

When set, the process terminates immediately on reaching that point.

Crash points must be deterministic and reproducible.
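A minimal sketch of the in-process side of this mechanism, assuming a hypothetical hook function `crash_point` that the engine would call at each kill point. It uses `std::process::abort`, which ends the process without unwinding or running destructors. This is only a stand-in for a crash: a real SIGKILL would be delivered by the harness from outside the process.

```rust
use std::env;
use std::process;

/// Environment variable defined in Section 4.
pub const CRASH_POINT_VAR: &str = "AERODB_CRASH_POINT";

/// Pure decision: crash only when the configured value names exactly this point.
pub fn should_crash(configured: Option<&str>, point: &str) -> bool {
    configured == Some(point)
}

/// Hypothetical hook, called by the engine at each kill point.
pub fn crash_point(point: &str) {
    let configured = env::var(CRASH_POINT_VAR).ok();
    if should_crash(configured.as_deref(), point) {
        // abort() terminates immediately: no unwinding, no destructors,
        // no flushing of user-space buffers.
        process::abort();
    }
}
```

Keeping the decision in a pure function such as `should_crash` makes the trigger itself testable without killing the test process.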

5. Required Test Scenarios

5.1 WAL Durability

Procedure:

Insert document
Crash after WAL fsync
Restart

Expected:

document exists

5.2 WAL Pre-Fsync Crash

Procedure:

Insert document
Crash before WAL fsync
Restart

Expected:

document does NOT exist
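Scenarios 5.1 and 5.2 differ only in which side of the fsync the crash lands on. The sketch below shows where the WAL kill points could sit in an append path. The record layout (a 4-byte length prefix), the `append_record` function, and the `hook` callback are assumptions for illustration, not AeroDB's actual WAL format or API. Of the names passed to `hook`, only `wal_after_fsync` comes from this document.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Append one length-prefixed record and make it durable.
/// `hook` receives a symbolic kill-point name at each step; in a real
/// build it would be the crash-injection hook.
pub fn append_record(
    path: &Path,
    payload: &[u8],
    hook: &mut dyn FnMut(&str),
) -> io::Result<()> {
    let mut wal: File = OpenOptions::new().create(true).append(true).open(path)?;

    // Hypothetical framing: 4-byte little-endian length, then the payload.
    let len = payload.len() as u32;
    wal.write_all(&len.to_le_bytes())?;
    wal.write_all(payload)?;
    hook("wal_after_append");

    // Scenario 5.2 crashes here: the record has not been fsynced.
    hook("wal_before_fsync");
    wal.sync_all()?;

    // Scenario 5.1 crashes here: the record has been fsynced.
    hook("wal_after_fsync");
    Ok(())
}
```

In a test, `hook` can record the names it sees. In the engine, it would terminate the process when the name matches `AERODB_CRASH_POINT`.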

5.3 Storage Crash

Procedure:

Insert document
Crash during storage write
Restart

Expected:

either old or new document
never corrupted state
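The "old or new, never corrupted" expectation can be checked with a checksum-framed record. The sketch below assumes a record is the payload followed by a 4-byte checksum. It uses FNV-1a purely as a stand-in, since this document does not specify the checksum algorithm or the on-disk layout. All function names are hypothetical.

```rust
/// FNV-1a (32-bit), used here only as a stand-in checksum.
fn fnv1a(data: &[u8]) -> u32 {
    let mut h: u32 = 0x811c_9dc5;
    for &b in data {
        h ^= b as u32;
        h = h.wrapping_mul(0x0100_0193);
    }
    h
}

/// Encode a document as payload followed by a little-endian checksum.
pub fn encode(payload: &[u8]) -> Vec<u8> {
    let mut record = payload.to_vec();
    record.extend_from_slice(&fnv1a(payload).to_le_bytes());
    record
}

/// Return the payload only if the trailing checksum matches it.
pub fn decode(record: &[u8]) -> Option<&[u8]> {
    if record.len() < 4 {
        return None;
    }
    let (payload, tail) = record.split_at(record.len() - 4);
    let stored = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    if fnv1a(payload) == stored {
        Some(payload)
    } else {
        None
    }
}

/// What a reader should observe after a crash: the new document if its
/// record is intact, otherwise the old one. Never a torn record.
pub fn visible_after_crash<'a>(old: &'a [u8], new_record_on_disk: &'a [u8]) -> &'a [u8] {
    decode(new_record_on_disk).unwrap_or(old)
}
```

A crash test would compare the document read after restart against both the old and the new payload and fail on anything else.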

5.4 Index Rebuild Crash

Procedure:

Populate data
Crash during index rebuild
Restart

Expected:

recovery completes
indexes rebuilt cleanly

5.5 Snapshot Crash

Crash during snapshot creation.

Expected:

snapshot ignored
WAL recovery used
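One plausible way to tell a complete snapshot from an interrupted one is to treat the manifest as the commit marker, which fits the "before manifest write" and "after manifest write" kill points. The sketch below rests on that assumption. The file name `MANIFEST` and both function names are hypothetical.

```rust
use std::path::Path;

/// Hypothetical manifest file name; the real name is not given here.
pub const MANIFEST_NAME: &str = "MANIFEST";

/// Where recovery should read its starting state from.
#[derive(Debug, PartialEq, Eq)]
pub enum RecoverySource {
    Snapshot,
    WalOnly,
}

/// A snapshot directory counts as complete only if its manifest exists.
pub fn snapshot_is_complete(snapshot_dir: &Path) -> bool {
    snapshot_dir.join(MANIFEST_NAME).is_file()
}

/// Ignore an incomplete snapshot and fall back to WAL recovery.
pub fn choose_recovery_source(snapshot_dir: &Path) -> RecoverySource {
    if snapshot_is_complete(snapshot_dir) {
        RecoverySource::Snapshot
    } else {
        RecoverySource::WalOnly
    }
}
```

A crash test for this scenario would assert that `choose_recovery_source` (or its real equivalent) reports WAL-only recovery when the crash happened before the manifest was written.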

5.6 Checkpoint Crash

Crash after snapshot but before WAL truncation.

Expected:

snapshot used
WAL replayed
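When the crash lands between the snapshot and WAL truncation, the WAL still holds entries the snapshot already covers. The sketch below shows one way recovery could combine the two, assuming each WAL entry carries a sequence number and the snapshot records the last sequence it includes. These fields and types are illustrative assumptions, not AeroDB's actual format.

```rust
use std::collections::BTreeMap;

/// Hypothetical WAL entry: `value == None` means the key was deleted.
#[derive(Debug, Clone)]
pub struct WalEntry {
    pub seq: u64,
    pub key: String,
    pub value: Option<String>,
}

/// Hypothetical snapshot: document state plus the last WAL sequence it covers.
pub struct Snapshot {
    pub last_seq: u64,
    pub docs: BTreeMap<String, String>,
}

/// Start from the snapshot, then replay only the WAL entries it does not cover.
pub fn recover(snapshot: &Snapshot, wal: &[WalEntry]) -> BTreeMap<String, String> {
    let mut docs = snapshot.docs.clone();
    for entry in wal.iter().filter(|e| e.seq > snapshot.last_seq) {
        match &entry.value {
            Some(v) => {
                docs.insert(entry.key.clone(), v.clone());
            }
            None => {
                docs.remove(&entry.key);
            }
        }
    }
    docs
}
```

Because the untruncated WAL overlaps the snapshot, the test should confirm that the recovered state equals the state before the crash, with no entry applied twice in a way that changes the result.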

5.7 Restore Crash

Crash during restore.

Expected:

either original or restored data_dir exists
never partial mix
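The "never partial mix" expectation can be checked by comparing the observed data_dir against the two acceptable states. The sketch below assumes a flat directory of files for simplicity. The helper names and the `RestoreOutcome` type are hypothetical, and a real check would probably need to recurse into subdirectories.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::Path;

/// File name mapped to file contents, for one flat directory.
pub type DirContents = BTreeMap<String, Vec<u8>>;

#[derive(Debug, PartialEq, Eq)]
pub enum RestoreOutcome {
    Original,
    Restored,
    PartialMix,
}

/// Read every regular file directly inside `dir` (non-recursive).
pub fn read_dir_flat(dir: &Path) -> io::Result<DirContents> {
    let mut out = DirContents::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        if entry.file_type()?.is_file() {
            let name = entry.file_name().to_string_lossy().into_owned();
            out.insert(name, fs::read(entry.path())?);
        }
    }
    Ok(out)
}

/// Anything that is neither exactly the original nor exactly the restored
/// contents is treated as a partial mix.
pub fn classify(
    observed: &DirContents,
    original: &DirContents,
    restored: &DirContents,
) -> RestoreOutcome {
    if observed == restored {
        RestoreOutcome::Restored
    } else if observed == original {
        RestoreOutcome::Original
    } else {
        RestoreOutcome::PartialMix
    }
}
```

A restore crash test would fail whenever `classify` returns `PartialMix`, whichever restore kill point was used.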

6. Post-Crash Validation

After each crash, the harness must verify:

storage checksums valid
schemas loaded
indexes rebuilt
queries deterministic
no partial documents

All invariants must hold.
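The checks above could be gathered into one report so that a failing invariant is always named rather than silently dropped. This is a sketch under the assumption that each check reduces to a boolean. The `PostCrashReport` type and its field names are hypothetical.

```rust
/// One boolean per post-crash check from Section 6.
#[derive(Debug, Default)]
pub struct PostCrashReport {
    pub checksums_valid: bool,
    pub schemas_loaded: bool,
    pub indexes_rebuilt: bool,
    pub queries_deterministic: bool,
    pub no_partial_documents: bool,
}

impl PostCrashReport {
    /// Names of the checks that failed, in the order Section 6 lists them.
    pub fn failures(&self) -> Vec<&'static str> {
        let checks = [
            ("storage checksums valid", self.checksums_valid),
            ("schemas loaded", self.schemas_loaded),
            ("indexes rebuilt", self.indexes_rebuilt),
            ("queries deterministic", self.queries_deterministic),
            ("no partial documents", self.no_partial_documents),
        ];
        let mut failed = Vec::new();
        for (name, ok) in checks.iter() {
            if !*ok {
                failed.push(*name);
            }
        }
        failed
    }

    /// True only when every invariant holds.
    pub fn all_hold(&self) -> bool {
        self.failures().is_empty()
    }
}
```

A test would typically assert `report.all_hold()` and print `report.failures()` in the assertion message.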

7. Automation

Crash tests must be automated via:

subprocess execution
environment variables
filesystem inspection

Tests belong in:

tests/crash/
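A harness test in `tests/crash/` might launch the database as a subprocess with the crash point set in its environment. The helper below is a minimal sketch using `std::process::Command`. The function name is hypothetical, and the program path and arguments are left to the caller because this document does not name the AeroDB binary or its command line.

```rust
use std::io;
use std::path::Path;
use std::process::{Command, ExitStatus};

/// Run `program` with `AERODB_CRASH_POINT` set to `crash_point`, or with
/// the variable removed when `crash_point` is `None` (a clean restart).
pub fn run_with_crash_point(
    program: &Path,
    args: &[&str],
    crash_point: Option<&str>,
) -> io::Result<ExitStatus> {
    let mut cmd = Command::new(program);
    cmd.args(args);
    match crash_point {
        Some(point) => {
            cmd.env("AERODB_CRASH_POINT", point);
        }
        None => {
            // Make sure a value inherited from the test runner does not leak in.
            cmd.env_remove("AERODB_CRASH_POINT");
        }
    }
    // Blocks until the child exits; a crashed child reports a non-success status.
    cmd.status()
}
```

A scenario would call this once with a crash point and expect a non-success status, then again with `None` to restart, and then inspect the filesystem and query results.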

8. Failure Criteria

Any of the following is unacceptable:

corrupted document state
missing acknowledged writes
partial records
inconsistent indexes
silent recovery (data repaired or discarded without being reported)
non-deterministic results

Any violation is a blocking defect.

9. Phase-1 Limitations

Crash testing does NOT include:

network failures
distributed faults
replica divergence

These belong to Phase 2.

10. Authority

This document governs:

crash injection
recovery validation
checkpoint correctness
snapshot reliability
restore survivability

Violations are correctness bugs.