CRASH_TESTING.md: AeroDB Crash Injection & Failure Validation (Phase 1)
This document defines the authoritative crash testing methodology for AeroDB Phase 1.
Crash testing validates:
- WAL durability
- storage atomicity
- recovery determinism
- checkpoint correctness
- snapshot safety
- backup/restore survivability
If implementation behavior conflicts with this document, the implementation is wrong.
Crash testing is mandatory before production usage.
1. Principles
Crash testing must obey:
- Deterministic reproduction
- Explicit kill points
- Zero partial recovery
- Exact post-crash validation
- No silent failures
Crashes are intentional and controlled.
2. Crash Types
The following crash modes MUST be tested:
| Crash Type | Description |
|---|---|
| SIGKILL | Immediate process termination |
| Power Loss | Simulated abrupt filesystem stop |
| Panic | Rust panic |
| Disk Error | Forced IO failure |
Each crash type must be validated independently.
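The "Disk Error" mode can be exercised without real hardware faults by wrapping a writer so that it fails on demand. The following is a minimal sketch, not AeroDB code: `FailingWriter` and its byte budget are hypothetical names introduced here for illustration.

```rust
use std::io::{self, Write};

/// Hypothetical test helper: wraps any writer and starts returning an
/// I/O error once `budget` bytes have been written through it.
pub struct FailingWriter<W: Write> {
    inner: W,
    budget: usize,
}

impl<W: Write> FailingWriter<W> {
    pub fn new(inner: W, budget: usize) -> Self {
        FailingWriter { inner, budget }
    }

    /// Returns the wrapped writer so a test can inspect what reached it.
    pub fn into_inner(self) -> W {
        self.inner
    }
}

impl<W: Write> Write for FailingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if buf.is_empty() {
            return Ok(0);
        }
        if self.budget == 0 {
            // Budget exhausted: simulate a forced I/O failure.
            return Err(io::Error::new(io::ErrorKind::Other, "injected disk error"));
        }
        // Write at most `budget` bytes, which produces a short (torn) write.
        let n = buf.len().min(self.budget);
        let written = self.inner.write(&buf[..n])?;
        self.budget -= written;
        Ok(written)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```

A test can set the budget to land in the middle of a record, then check that recovery treats the torn record as absent rather than as valid data.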
3. Kill Points
Crash injection must be supported at the following points:
WAL
- after record append
- before fsync
- after fsync
Storage
- before document write
- after write, before checksum
- after checksum
Index
- during rebuild
- during update
Snapshot
- during storage copy
- before manifest write
- after manifest write
Checkpoint
- after snapshot
- before WAL truncation
- after WAL truncation
Restore
- after extraction
- before directory swap
- after directory swap
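One way to keep the kill points above explicit is to enumerate them in a single type that both the engine and the test harness share. The sketch below assumes this approach; the `KillPoint` type and its string names are hypothetical and are not defined by this document.

```rust
/// Hypothetical enumeration of the kill points listed in this section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KillPoint {
    WalAfterAppend,
    WalBeforeFsync,
    WalAfterFsync,
    StorageBeforeWrite,
    StorageAfterWriteBeforeChecksum,
    StorageAfterChecksum,
    IndexDuringRebuild,
    IndexDuringUpdate,
    SnapshotDuringStorageCopy,
    SnapshotBeforeManifest,
    SnapshotAfterManifest,
    CheckpointAfterSnapshot,
    CheckpointBeforeWalTruncation,
    CheckpointAfterWalTruncation,
    RestoreAfterExtraction,
    RestoreBeforeSwap,
    RestoreAfterSwap,
}

impl KillPoint {
    /// Every kill point, so tests can iterate over all of them.
    pub const ALL: [KillPoint; 17] = [
        Self::WalAfterAppend, Self::WalBeforeFsync, Self::WalAfterFsync,
        Self::StorageBeforeWrite, Self::StorageAfterWriteBeforeChecksum,
        Self::StorageAfterChecksum,
        Self::IndexDuringRebuild, Self::IndexDuringUpdate,
        Self::SnapshotDuringStorageCopy, Self::SnapshotBeforeManifest,
        Self::SnapshotAfterManifest,
        Self::CheckpointAfterSnapshot, Self::CheckpointBeforeWalTruncation,
        Self::CheckpointAfterWalTruncation,
        Self::RestoreAfterExtraction, Self::RestoreBeforeSwap,
        Self::RestoreAfterSwap,
    ];

    /// Stable string form (names are illustrative assumptions).
    pub fn name(self) -> &'static str {
        match self {
            Self::WalAfterAppend => "wal_after_append",
            Self::WalBeforeFsync => "wal_before_fsync",
            Self::WalAfterFsync => "wal_after_fsync",
            Self::StorageBeforeWrite => "storage_before_write",
            Self::StorageAfterWriteBeforeChecksum => "storage_after_write_before_checksum",
            Self::StorageAfterChecksum => "storage_after_checksum",
            Self::IndexDuringRebuild => "index_during_rebuild",
            Self::IndexDuringUpdate => "index_during_update",
            Self::SnapshotDuringStorageCopy => "snapshot_during_storage_copy",
            Self::SnapshotBeforeManifest => "snapshot_before_manifest",
            Self::SnapshotAfterManifest => "snapshot_after_manifest",
            Self::CheckpointAfterSnapshot => "checkpoint_after_snapshot",
            Self::CheckpointBeforeWalTruncation => "checkpoint_before_wal_truncation",
            Self::CheckpointAfterWalTruncation => "checkpoint_after_wal_truncation",
            Self::RestoreAfterExtraction => "restore_after_extraction",
            Self::RestoreBeforeSwap => "restore_before_swap",
            Self::RestoreAfterSwap => "restore_after_swap",
        }
    }

    /// Parses a name back into a kill point; unknown names yield `None`.
    pub fn parse(s: &str) -> Option<KillPoint> {
        KillPoint::ALL.iter().copied().find(|p| p.name() == s)
    }
}
```

Because `match` is exhaustive, adding a variant without giving it a name fails to compile, which helps keep the code and this list in step.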
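4. Crash Injection Mechanism
The testing harness must support selecting the crash point for a run. Section 7 names environment variables as the control channel.
When a crash point is set:
- process terminates immediately at that point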
Crash points must be deterministic and reproducible.
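The mechanism described above could be sketched as a small hook that the engine calls at each kill point. This is an illustration under stated assumptions: the variable name `AERODB_CRASH_POINT`, the function names, and the crash-point strings are hypothetical, since this document does not fix them.

```rust
use std::env;
use std::process;

/// Hypothetical environment variable naming the crash point for this run.
pub const CRASH_POINT_ENV: &str = "AERODB_CRASH_POINT";

/// Pure decision function: crash only when the configured point matches
/// exactly. Keeping this separate makes the rule unit-testable.
pub fn should_crash(configured: Option<&str>, point: &str) -> bool {
    configured == Some(point)
}

/// Call at a kill point, for example `maybe_crash("wal_before_fsync")`.
/// If the environment selects this point, the process ends at once.
pub fn maybe_crash(point: &str) {
    let configured = env::var(CRASH_POINT_ENV).ok();
    if should_crash(configured.as_deref(), point) {
        // `abort` terminates without unwinding or running destructors,
        // so no buffered state is flushed on the way out.
        process::abort();
    }
}
```

Using `std::process::abort` instead of `std::process::exit` or a panic is a deliberate choice in this sketch: it avoids cleanup code that could mask a durability bug. The "Panic" crash type would need a separate path that calls `panic!`.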
5. Required Test Scenarios
5.1 WAL Durability
Procedure:
- Insert document
- Crash after WAL fsync
- Restart
Expected:
- document exists
5.2 WAL Pre-Fsync Crash
Procedure:
- Insert document
- Crash before WAL fsync
- Restart
Expected:
- document does NOT exist
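The expectations in 5.1 and 5.2 can be stated as a toy model before testing the real engine. The sketch below is an in-memory model only, assuming power-loss semantics in which bytes not covered by a completed fsync are lost; `ModelWal` is a hypothetical name and does not reflect the AeroDB WAL format.

```rust
/// Toy in-memory model of a WAL (hypothetical, for illustration only).
#[derive(Default)]
pub struct ModelWal {
    /// Records covered by a completed fsync.
    durable: Vec<String>,
    /// Records appended but not yet fsynced.
    buffered: Vec<String>,
}

impl ModelWal {
    /// Appends a record; it is not durable until `fsync` is called.
    pub fn append(&mut self, record: &str) {
        self.buffered.push(record.to_string());
    }

    /// Moves every buffered record into the durable set.
    pub fn fsync(&mut self) {
        self.durable.append(&mut self.buffered);
    }

    /// Simulates a crash followed by restart: only durable records
    /// survive, and anything still buffered is discarded.
    pub fn crash_and_recover(self) -> Vec<String> {
        self.durable
    }
}
```

In this model, appending `"doc-1"`, calling `fsync`, and crashing returns the record (scenario 5.1), while crashing before `fsync` returns an empty list (scenario 5.2).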
5.3 Storage Crash
Procedure:
- Insert document
- Crash during storage write
- Restart
Expected:
- either old or new document
- never corrupted state
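The "never corrupted state" requirement implies that a torn record must be detectable on restart. A minimal sketch of that check follows, assuming a record layout of length, payload, then checksum. The layout, the function names, and the use of FNV-1a as the checksum are all assumptions made for illustration; this document does not specify the AeroDB storage format or checksum algorithm.

```rust
use std::convert::TryInto;

/// FNV-1a 64-bit hash, used here as a stand-in checksum.
pub fn fnv1a64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &byte in data {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Encodes a record as [len: u32 LE][payload][checksum: u64 LE].
/// The checksum is written last, matching the kill point
/// "after write, before checksum".
pub fn encode(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + 12);
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    out.extend_from_slice(&fnv1a64(payload).to_le_bytes());
    out
}

/// Returns the payload only if the record is complete and the checksum
/// matches; truncated or altered records yield `None`.
pub fn decode(bytes: &[u8]) -> Option<&[u8]> {
    let len_bytes: [u8; 4] = bytes.get(..4)?.try_into().ok()?;
    let len = u32::from_le_bytes(len_bytes) as usize;
    let payload = bytes.get(4..4 + len)?;
    let sum_bytes: [u8; 8] = bytes.get(4 + len..4 + len + 8)?.try_into().ok()?;
    if u64::from_le_bytes(sum_bytes) == fnv1a64(payload) {
        Some(payload)
    } else {
        None
    }
}
```

With a check of this kind, a crash between the payload write and the checksum write leaves a record that `decode` rejects, so recovery can fall back to the old document instead of exposing a partial one.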
5.4 Index Rebuild Crash
Procedure:
- Populate data
- Crash during index rebuild
- Restart
Expected:
- recovery completes
- indexes rebuilt cleanly
5.5 Snapshot Crash
Crash during snapshot creation.
Expected:
- snapshot ignored
- WAL recovery used
5.6 Checkpoint Crash
Crash after snapshot but before WAL truncation.
Expected:
- snapshot used
- WAL replayed
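Scenarios 5.5 and 5.6 together describe a recovery decision: ignore an incomplete snapshot, otherwise start from the snapshot and replay the WAL. The sketch below models that decision under two assumptions that this document does not state: WAL records carry increasing sequence numbers starting at 1, and a snapshot counts as complete only when its manifest is present. All names are hypothetical.

```rust
/// Hypothetical description of a snapshot found on disk at startup.
pub struct Snapshot {
    /// Sequence number of the last WAL record the snapshot contains.
    pub last_seq: u64,
    /// Whether the manifest was written (assumed completion marker).
    pub manifest_present: bool,
}

/// Returns the WAL sequence numbers that recovery should replay.
pub fn wal_to_replay(snapshot: Option<&Snapshot>, wal_seqs: &[u64]) -> Vec<u64> {
    let base = match snapshot {
        // Complete snapshot: replay only records newer than it.
        Some(s) if s.manifest_present => s.last_seq,
        // Missing or incomplete snapshot: ignore it, replay everything.
        _ => 0,
    };
    wal_seqs.iter().copied().filter(|&seq| seq > base).collect()
}
```

In the checkpoint case the WAL has not been truncated, so it still holds records already captured by the snapshot. Filtering by sequence number is one way to avoid applying them twice; making replay idempotent would be another.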
5.7 Restore Crash
Crash during restore.
Expected:
- either original or restored data_dir exists
- never partial mix
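The "original or restored, never a mix" outcome can be obtained with whole-directory renames plus a startup repair step. The following is a sketch of one such scheme, not the AeroDB restore implementation: the function names and the backup-directory convention are hypothetical, all three paths are assumed to be on the same filesystem, and fsync of the parent directory is omitted for brevity.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Replaces `data_dir` with the fully extracted `staged` directory.
pub fn swap_in(data_dir: &Path, staged: &Path, backup: &Path) -> io::Result<()> {
    // Crash after this line: data_dir is missing, backup holds the original.
    fs::rename(data_dir, backup)?;
    // Crash after this line: data_dir is the restored copy, backup remains.
    fs::rename(staged, data_dir)?;
    fs::remove_dir_all(backup)
}

/// Runs at startup, before data_dir is opened, to finish or undo a swap
/// that a crash interrupted.
pub fn recover_swap(data_dir: &Path, backup: &Path) -> io::Result<()> {
    if backup.exists() {
        if data_dir.exists() {
            // The swap completed; the stale original can be dropped.
            fs::remove_dir_all(backup)?;
        } else {
            // The swap was interrupted; roll back to the original.
            fs::rename(backup, data_dir)?;
        }
    }
    Ok(())
}
```

Because each step moves an entire directory, no state exists in which `data_dir` contains files from both versions. A crash test for the restore kill points can stop after each rename and assert that `recover_swap` leaves exactly one complete version in place.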
6. Post-Crash Validation
After each crash, the test must verify:
- storage checksums valid
- schemas loaded
- indexes rebuilt
- queries deterministic
- no partial documents
All invariants must hold.
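To keep validation exact and avoid silent failures, the checks above can be collected into one report that names every invariant that did not hold. This is a minimal sketch; `PostCrashReport` is a hypothetical type, and how each flag is computed is left to the real test suite.

```rust
/// Hypothetical result of the post-crash checks listed in this section.
#[derive(Debug, Default, Clone, Copy)]
pub struct PostCrashReport {
    pub checksums_valid: bool,
    pub schemas_loaded: bool,
    pub indexes_rebuilt: bool,
    pub queries_deterministic: bool,
    pub no_partial_documents: bool,
}

impl PostCrashReport {
    /// Names of all invariants that failed; empty means every check passed.
    pub fn failures(&self) -> Vec<&'static str> {
        let checks = [
            (self.checksums_valid, "storage checksums valid"),
            (self.schemas_loaded, "schemas loaded"),
            (self.indexes_rebuilt, "indexes rebuilt"),
            (self.queries_deterministic, "queries deterministic"),
            (self.no_partial_documents, "no partial documents"),
        ];
        checks
            .iter()
            .filter(|(ok, _)| !*ok)
            .map(|(_, name)| *name)
            .collect()
    }
}
```

A test would then assert that `failures()` is empty and print the returned names on failure. The default value has every flag set to `false`, so a check that was never run is reported as a failure rather than passing by omission.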
7. Automation
Crash tests must be automated via:
- subprocess execution
- environment variables
- filesystem inspection
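The first two items above can be combined in a harness that launches the database as a child process with the crash point passed through the environment. The sketch below uses only `std::process::Command`; the variable name `AERODB_CRASH_POINT`, the `--data-dir` flag, and the function names are hypothetical placeholders, not a documented AeroDB interface.

```rust
use std::io;
use std::process::{Command, ExitStatus};

/// Hypothetical environment variable naming the crash point for a run.
pub const CRASH_POINT_ENV: &str = "AERODB_CRASH_POINT";

/// Builds a command that runs `binary` against `data_dir` with the
/// given crash point selected. The `--data-dir` flag is an assumption.
pub fn crash_command(binary: &str, data_dir: &str, crash_point: &str) -> Command {
    let mut cmd = Command::new(binary);
    cmd.arg("--data-dir")
        .arg(data_dir)
        .env(CRASH_POINT_ENV, crash_point);
    cmd
}

/// Runs the child to completion and returns how it ended.
pub fn run_until_crash(mut cmd: Command) -> io::Result<ExitStatus> {
    cmd.status()
}

/// A run counts as crashed when the child did not exit successfully,
/// which includes termination by a signal.
pub fn crashed(status: &ExitStatus) -> bool {
    !status.success()
}
```

After `run_until_crash` returns, the harness would inspect the data directory, restart the database without the variable set, and run the post-crash validation. For the SIGKILL crash type, a harness could instead spawn the child and call `Child::kill`, which on Unix sends SIGKILL.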
Tests belong in:
8. Failure Criteria
Any of the following is unacceptable:
- corrupted document state
- missing acknowledged writes
- partial records
- inconsistent indexes
- silent recovery
- non-deterministic results
Any violation is a blocking defect.
9. Phase-1 Limitations
Crash testing does NOT include:
- network failures
- distributed faults
- replica divergence
These belong to Phase 2.
10. Authority
This document governs:
- crash injection
- recovery validation
- checkpoint correctness
- snapshot reliability
- restore survivability
Violations are correctness bugs.