# DX_OBSERVABILITY_PRINCIPLES.md: Phase 4 Observability Principles
This document defines the authoritative observability surface of AeroDB Phase 1.
Observability exists to ensure that:
- operators can understand system state
- failures are diagnosable
- recovery is auditable
- performance regressions are visible
If implementation behavior conflicts with this document, the implementation is wrong.
Observability is explicit and synchronous.
No background telemetry.
No opaque metrics.
## 1. Principles
AeroDB observability follows strict rules:
- Explicit over implicit
- Deterministic over heuristic
- Operator-visible over internal-only
- Failures are loud
- No hidden background reporting
- No sampling
- No aggregation that hides raw values
All metrics are exact.
All events are logged.
## 2. Observability Surfaces
Phase 1 exposes observability through:
- Startup logs (stdout/stderr)
- Runtime event logs (stdout)
- `aerodb stats` command
- Exit codes
- Explicit corruption messages
There is no HTTP metrics endpoint in Phase 1.
## 3. Startup Logging
On every startup, AeroDB MUST log:
Followed by:
### Configuration
```
CONFIG_LOADED
data_dir=<path>
wal_sync_mode=fsync
max_wal_size_bytes=<value>
max_memory_bytes=<value>
```
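A minimal sketch of rendering this block, assuming one `key=value` field per line in the fixed order listed here. The `Config` dataclass and `format_config_loaded` helper are hypothetical names for illustration, not AeroDB APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Hypothetical container; field names mirror the logged keys.
    data_dir: str
    wal_sync_mode: str
    max_wal_size_bytes: int
    max_memory_bytes: int


def format_config_loaded(cfg: Config) -> str:
    """Render the CONFIG_LOADED block with fields in a fixed order."""
    lines = [
        "CONFIG_LOADED",
        f"data_dir={cfg.data_dir}",
        f"wal_sync_mode={cfg.wal_sync_mode}",
        f"max_wal_size_bytes={cfg.max_wal_size_bytes}",
        f"max_memory_bytes={cfg.max_memory_bytes}",
    ]
    return "\n".join(lines)
```

Keeping the field order fixed means two startups with the same configuration produce byte-identical output, which makes the logs easy to diff.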
### Schema Load
### Recovery
Then:
After replay:
### Index Rebuild
Then:
### Verification
Then:
### Serving
Only after this log may the system accept requests.
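One way to enforce this ordering is a gate that opens only after the serving log line has been written and flushed. The sketch below is an assumption-laden illustration: the `StartupGate` class and the `SERVING` event name are hypothetical, since the actual log line is not shown in this section.

```python
import sys


class StartupGate:
    """Refuses requests until the serving log line has been written."""

    def __init__(self, out=sys.stdout):
        self._out = out
        self._serving = False

    def mark_serving(self) -> None:
        # Write and flush the log line first; only then open the gate.
        self._out.write("SERVING\n")  # hypothetical event name
        self._out.flush()
        self._serving = True

    def handle(self, request: str) -> str:
        if not self._serving:
            raise RuntimeError("request rejected: startup not complete")
        return f"accepted {request}"
```

Because the flag is set after the flush, a request can never be accepted before the log line is visible to the operator.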
## 4. Runtime Event Logging
Each operation MUST log:
### Writes
### Queries
### Explain
Logs are synchronous.
No buffering.
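A minimal sketch of unbuffered, synchronous logging, assuming log lines go straight to a file descriptor. The `log_event` helper is a hypothetical name; it shows the technique only and is not AeroDB's implementation.

```python
import os


def log_event(line: str, fd: int = 1) -> None:
    """Write one log line directly to a file descriptor (1 = stdout).

    os.write bypasses Python's userspace buffers, so by the time this
    function returns the bytes have been handed to the operating system.
    """
    data = (line + "\n").encode("utf-8")
    # os.write may write fewer bytes than requested; loop until done.
    while data:
        written = os.write(fd, data)
        data = data[written:]
```

The caller does not proceed until the write completes, so the log order matches the operation order. Note that this only covers userspace buffering; it says nothing about durability of a redirected log file.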
## 5. Stats Command
Operators may query live state:
```
aerodb stats --config aerodb.json
```
Returns JSON:
```json
{
"documents": 1234,
"schemas": 2,
"indexes": 3,
"wal_bytes": 45678,
"snapshot_count": 1,
"last_checkpoint": "2026-02-04T11:30:00Z",
"recovery_duration_ms": 812,
"uptime_seconds": 120
}
````
---
### Metric Definitions
| Field                | Meaning                                         |
| -------------------- | ----------------------------------------------- |
| documents            | Live document count                             |
| schemas              | Number of registered schemas                    |
| indexes              | Number of active indexes                        |
| wal_bytes            | Current WAL size, in bytes                      |
| snapshot_count       | Number of valid snapshots                       |
| last_checkpoint      | Timestamp of the most recent checkpoint         |
| recovery_duration_ms | Duration of the last boot recovery, in ms       |
| uptime_seconds       | Seconds elapsed since the last start            |
All values are exact.
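A consumer of the stats output might check that every counter is a plain integer before using it. The sketch below assumes the field set listed in this section, and assumes that `last_checkpoint` is always a string; `parse_stats` and `INT_FIELDS` are hypothetical names.

```python
import json

# Integer-valued fields from this section; last_checkpoint is a string.
INT_FIELDS = (
    "documents", "schemas", "indexes", "wal_bytes",
    "snapshot_count", "recovery_duration_ms", "uptime_seconds",
)


def parse_stats(raw: str) -> dict:
    """Parse stats JSON and reject missing or non-integer counters."""
    stats = json.loads(raw)
    for name in INT_FIELDS:
        value = stats.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            raise ValueError(f"{name}: expected integer, got {value!r}")
    if not isinstance(stats.get("last_checkpoint"), str):
        raise ValueError("last_checkpoint: expected timestamp string")
    return stats
```

Rejecting floats outright fits the rule that values are exact: a fractional or rounded counter would indicate aggregation somewhere upstream.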
---
## 6. Corruption Visibility
On any corruption:
* an explicit log message is printed
* an error code is returned
* the process exits
Example:
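The sequence above could be sketched as follows. The message format, the `EXIT_CORRUPTION` value, and both function names are assumptions for illustration; this section does not specify them.

```python
import sys

# Hypothetical exit code; this section does not specify the real value.
EXIT_CORRUPTION = 2


def corruption_message(component: str, detail: str) -> str:
    """Build an explicit, single-line corruption message (assumed format)."""
    return f"CORRUPTION component={component} detail={detail}"


def fail_on_corruption(component: str, detail: str) -> None:
    # Print the explicit log and flush so it is not lost on exit.
    sys.stderr.write(corruption_message(component, detail) + "\n")
    sys.stderr.flush()
    # Surface the error code by exiting the process with it.
    sys.exit(EXIT_CORRUPTION)
```

The message is flushed before the exit call so that the operator always sees the cause alongside the non-zero exit status.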
## 10. Determinism
Observability must not introduce:
- timestamps inside internal state
- nondeterministic ordering
- randomness
Logs may include wall-clock timestamps, but those timestamps must never affect execution.
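One way to keep wall-clock time out of internal state is to read the clock only where a log line is formatted. The sketch below is illustrative; `apply_write`, `log_line`, and the `ts=... event=...` format are hypothetical, not AeroDB's actual API or log format.

```python
import time
from typing import Callable


def apply_write(state: dict, key: str, value: str) -> dict:
    """Pure state transition: no clock reads, no randomness."""
    new_state = dict(state)
    new_state[key] = value
    return new_state


def log_line(event: str, clock: Callable[[], float] = time.time) -> str:
    """The timestamp is attached at the logging edge only."""
    return f"ts={clock():.3f} event={event}"
```

Replaying the same writes yields the same state regardless of when they run; only the rendered log lines differ. Passing the clock as a parameter also lets tests pin the timestamp.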
## 11. Phase-1 Limitations
Phase 1 does NOT include:
- Prometheus
- OpenTelemetry
- structured logging frameworks
- tracing
These belong to Phase 2+.
## 12. Authority
This document governs:
- startup logging
- runtime logs
- stats output
- corruption visibility
- checkpoint reporting
- backup reporting
- restore reporting
Violations are correctness bugs.