Disaster Recovery & Incident Response Guide
Project: bloqr-backend
Stack: Cloudflare Workers · Neon PostgreSQL · Better Auth
Last Updated: 2025-07-15
Table of Contents
- 1. Architecture Overview
- 2. Recovery Targets
- 3. Neon Database Recovery
- 4. Failure Scenarios & Runbooks
- 5. Backup Strategy
- 6. Communication Plan
- 7. Post-Incident Review Template
- 8. Emergency Contacts & Links
1. Architecture Overview
The bloqr-backend production deployment runs on Cloudflare’s edge network with Neon PostgreSQL as the primary data store. All authentication flows use Better Auth, with sessions and user data persisted to Neon via Prisma.
| Component | Role |
|---|---|
| Cloudflare Workers | Serverless compute — handles API requests, compilation, and routing |
| Neon PostgreSQL | Primary database — user accounts, API keys, compilation metadata |
| Cloudflare Hyperdrive | Connection pooler — proxies and caches TCP connections to Neon |
| Cloudflare D1 | Edge-local SQLite cache — low-latency reads for hot data |
| Cloudflare KV | Distributed key-value store — rate limits, feature flags, config |
| Cloudflare R2 | Object storage — compiled filter list artifacts |
| Better Auth | Authentication framework — session management, OAuth, API keys |
flowchart TD
Client([Client / Browser]) -->|HTTPS| CF[Cloudflare Edge]
CF --> WAF[WAF / Turnstile / CF Access]
WAF --> Worker[Cloudflare Worker]
Worker -->|Auth| BA[Better Auth]
BA -->|Sessions & Users| Neon
Worker -->|Read/Write via Prisma| HD[Hyperdrive]
HD -->|Pooled TCP| Neon[(Neon PostgreSQL)]
Worker -->|Edge Cache| D1[(D1 SQLite)]
Worker -->|Config & Rate Limits| KV[(KV Store)]
Worker -->|Compiled Artifacts| R2[(R2 Storage)]
Neon -->|Continuous WAL| NeonBackup[Neon PITR Backups]
style Neon fill:#2e7d32,color:#fff
style Worker fill:#1565c0,color:#fff
style CF fill:#f57c00,color:#fff
style NeonBackup fill:#6a1b9a,color:#fff
Data Flow Summary
- Requests arrive at the Cloudflare edge and pass through WAF, Turnstile, and CF Access policies.
- The Worker authenticates via Better Auth (Clerk JWT or API key), enforces rate limits from KV, and routes the request.
- Database reads/writes go through Hyperdrive to Neon PostgreSQL. Hot data is cached in D1.
- Compiled filter list output is written to R2 for durable storage and CDN delivery.
- Neon continuously streams WAL records for point-in-time recovery.
2. Recovery Targets
| Component | RPO | RTO | Recovery Method | Notes |
|---|---|---|---|---|
| Neon PostgreSQL | ~0 (continuous WAL) | < 5 min | Point-in-time recovery | Branch from any LSN within retention window |
| D1 Edge Cache | N/A (rebuildable) | < 1 min | Auto-rebuilds from Neon | No user data — purely a read cache |
| KV Config | Last backup | < 2 min | wrangler kv bulk put | Export/import via Wrangler CLI |
| R2 Artifacts | N/A (regeneratable) | < 10 min | Recompile from filter sources | Artifacts are deterministic outputs |
| Auth Sessions | Last DB state | < 5 min | Restore Neon → sessions restored | Users re-login if session table lost |
RPO = Recovery Point Objective (max acceptable data loss)
RTO = Recovery Time Objective (max acceptable downtime)
3. Neon Database Recovery
Neon provides native point-in-time recovery (PITR) via continuous WAL archiving. This is the primary disaster recovery mechanism for all persistent data.
3.1 Point-in-Time Recovery (PITR)
Neon continuously archives WAL records. You can restore to any point within the retention window (7 days on Free, 30 days on Pro).
sequenceDiagram
participant Op as Operator
participant Console as Neon Console
participant Neon as Neon Platform
participant Worker as CF Worker
Op->>Console: Open project → Branches
Op->>Console: Create branch from timestamp
Console->>Neon: Fork WAL at specified LSN
Neon-->>Console: New branch ready
Op->>Op: Verify data on new branch
Op->>Op: Update Hyperdrive connection string
Op->>Worker: Redeploy with new DATABASE_URL
Worker-->>Op: Service restored
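When picking the restore point for the flow above, it can help to compute the timestamp rather than type it by hand. The following is a minimal sketch, assuming GNU `date` (the `-d` parsing used here is not available in BSD/macOS `date`). The function names `pitr_timestamp` and `within_retention` are hypothetical helpers and not part of Neon's or Wrangler's tooling.

```shell
# Print an RFC 3339 UTC timestamp N minutes before the given incident time.
# Going through epoch seconds avoids GNU date misreading "- 5 minutes" as a UTC offset.
pitr_timestamp() {
  local incident="$1" minutes_before="$2" epoch
  epoch="$(date -u -d "$incident" +%s)" || return 1
  date -u -d "@$((epoch - minutes_before * 60))" +%Y-%m-%dT%H:%M:%SZ
}

# Succeed only if the target is no older than the retention window (in days).
# The third argument (current time) is optional and defaults to now.
within_retention() {
  local target="$1" retention_days="$2" now="${3:-}" target_epoch now_epoch
  target_epoch="$(date -u -d "$target" +%s)" || return 1
  if [ -n "$now" ]; then
    now_epoch="$(date -u -d "$now" +%s)" || return 1
  else
    now_epoch="$(date -u +%s)"
  fi
  [ $((now_epoch - target_epoch)) -le $((retention_days * 86400)) ]
}
```

For example, `pitr_timestamp 2025-07-15T10:30:00Z 5` prints `2025-07-15T10:25:00Z`, which can be used as the branch timestamp. Running `within_retention` first gives an early warning if the incident predates the retention window configured for the project.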
3.2 Restore via Neon Console
- Navigate to Neon Console → select project.
- Go to Branches → Create Branch.
- Select Point in time and choose the target timestamp (before the incident).
- Name the branch (e.g., recovery-2025-07-15).
- Copy the new branch connection string.
- Verify data integrity by connecting with psql or Prisma Studio.
- Update the Hyperdrive binding or DATABASE_URL secret:

      wrangler secret put DATABASE_URL
      # Paste the new Neon branch connection string

- Redeploy the Worker:

      wrangler deploy

3.3 Restore via Neon API

    # Create a recovery branch from a specific timestamp
    curl -X POST "https://console.neon.tech/api/v2/projects/${NEON_PROJECT_ID}/branches" \
      -H "Authorization: Bearer ${NEON_API_KEY}" \
      -H "Content-Type: application/json" \
      -d '{
        "branch": {
          "name": "recovery-'"$(date +%Y%m%d-%H%M)"'",
          "parent_timestamp": "2025-07-15T10:30:00Z"
        },
        "endpoints": [{ "type": "read_write" }]
      }'

3.4 Post-Restore Integrity Checks
After restoring, run these validation queries:

    -- Verify user count is reasonable
    SELECT COUNT(*) FROM "user";

    -- Check for orphaned sessions
    SELECT COUNT(*)
    FROM "session" s
    LEFT JOIN "user" u ON s."userId" = u."id"
    WHERE u."id" IS NULL;

    -- Verify API key integrity
    SELECT COUNT(*) FROM "apiKey" WHERE "userId" IS NULL;

    -- Check migration state
    SELECT * FROM "_prisma_migrations" ORDER BY "finished_at" DESC LIMIT 5;

4. Failure Scenarios & Runbooks
Scenario 1: Neon Database Unavailable
| Field | Details |
|---|---|
| Symptoms | 500 errors on all DB-dependent endpoints; /health returns database: unhealthy |
| Detection | Hyperdrive health check fails; error rate spike in Analytics Engine; external uptime monitor alerts |
| Severity | Critical: all write operations and authenticated reads are blocked |
Immediate Actions:
- Check Neon Status for ongoing incidents.
- Check Hyperdrive binding health in the Cloudflare dashboard.
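- Verify the DATABASE_URL secret is present: wrangler secret list. This lists secret names only, so if the value itself is in doubt, set it again with wrangler secret put DATABASE_URL.
- If D1 cache is populated, confirm read-only mode is serving cached data.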
Recovery:
- If Neon-side: wait for Neon recovery. D1 edge cache continues serving stale reads.
- If Hyperdrive-side: recreate the Hyperdrive config via Wrangler or dashboard.
- If connection string changed: update DATABASE_URL and redeploy.
Post-Incident:
- Verify data consistency with the integrity checks in Section 3.4.
- Confirm D1 cache re-syncs from Neon.
- Review error logs for any requests that need replay.
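The Section 3.4 checks referenced above lend themselves to a scripted pass/fail gate before traffic is pointed back at the database. The sketch below is illustrative only: `evaluate_integrity` and `run_integrity_checks` are hypothetical names, the connection URL argument is whatever branch you are validating, and the thresholds (at least one user, zero orphans) are placeholder assumptions to adapt.

```shell
# Decide pass/fail from the three counts produced by the Section 3.4 queries.
# Arguments: user_count orphaned_sessions api_keys_without_user
evaluate_integrity() {
  local users="$1" orphans="$2" null_keys="$3" failed=0
  [ "$users" -gt 0 ] || { echo "FAIL: user table is empty"; failed=1; }
  [ "$orphans" -eq 0 ] || { echo "FAIL: $orphans orphaned session(s)"; failed=1; }
  [ "$null_keys" -eq 0 ] || { echo "FAIL: $null_keys API key(s) without a user"; failed=1; }
  [ "$failed" -eq 0 ] && echo "PASS: integrity checks look sane"
  return "$failed"
}

# Run the count queries with psql (-A unaligned, -t tuples only) and evaluate them.
# Argument: connection URL of the branch to validate.
run_integrity_checks() {
  local url="$1" users orphans null_keys
  users="$(psql "$url" -At -c 'SELECT COUNT(*) FROM "user";')" || return 2
  orphans="$(psql "$url" -At -c 'SELECT COUNT(*) FROM "session" s LEFT JOIN "user" u ON s."userId" = u."id" WHERE u."id" IS NULL;')" || return 2
  null_keys="$(psql "$url" -At -c 'SELECT COUNT(*) FROM "apiKey" WHERE "userId" IS NULL;')" || return 2
  evaluate_integrity "$users" "$orphans" "$null_keys"
}
```

Keeping the decision logic in `evaluate_integrity`, separate from the `psql` calls, makes the thresholds easy to review and to exercise without a database.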
Scenario 2: Hyperdrive Connection Pool Exhausted
| Field | Details |
|---|---|
| Symptoms | Intermittent 500s with "connection pool exhausted" or "timeout acquiring connection" in Worker logs |
| Detection | Worker logs via wrangler tail; latency p99 spike in Analytics Engine |
| Severity | High: partial outage, intermittent failures |
Immediate Actions:
- Check for connection leaks: ensure every request calls prisma.$disconnect() in a finally block.
- Inspect recent deploys for changes to database middleware or connection handling.
- Reduce inbound traffic if possible (enable maintenance mode via KV flag).
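To judge whether pool exhaustion is still happening while you work through these steps, the log stream can be filtered for the two error strings named in the symptoms. This is a rough sketch: `count_pool_errors` is a hypothetical helper, and it does a plain case-insensitive text match over whatever lines it receives rather than parsing the `wrangler tail` log format.

```shell
# Count log lines on stdin that mention either pool-exhaustion symptom.
# grep -c exits non-zero when the count is 0, so normalize the exit status.
count_pool_errors() {
  grep -c -i -E 'connection pool exhausted|timeout acquiring connection' || true
}

# Example (assumes GNU timeout): sample one minute of live logs.
#   timeout 60 wrangler tail --format json | count_pool_errors
```

A count that drops to zero after a redeploy suggests the reset helped. A count that climbs again points to a leak in request handling rather than a one-off spike.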
Recovery:

    # Redeploy to reset all Worker instances and their connection states
    wrangler deploy

Prevention:
- Ensure PrismaClient is scoped per-request, not global.
- Verify the getPrismaClient() middleware creates and disposes clients correctly.
- Set a connection timeout on the database connection URL: connect_timeout=10.
Scenario 3: Better Auth Secret Compromised
| Field | Details |
|---|---|
| Symptoms | Unauthorized session creation; session tokens appearing for non-existent users; anomalous login patterns |
| Detection | Security event telemetry in Analytics Engine; audit log anomalies; user reports of account access |
| Severity | Critical: potential data breach |
Immediate Actions (within 15 minutes):
- Rotate the compromised secret immediately:

      # Generate a new secret
      openssl rand -base64 32

      # Update the Worker secret
      wrangler secret put BETTER_AUTH_SECRET
      # Paste the new value

      # Redeploy to pick up the new secret
      wrangler deploy

- All existing sessions are now invalidated, so users must re-login.
- If ADMIN_KEY was also exposed, rotate it: wrangler secret put ADMIN_KEY.
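The rotation steps above can also be run non-interactively, which avoids pasting the secret into a terminal where it may end up in scrollback. This is a sketch under assumptions: `secret_byte_length` and `rotate_worker_secret` are hypothetical helper names, and the approach relies on `wrangler secret put` reading the value from standard input when input is piped.

```shell
# Number of raw bytes a base64 string decodes to.
secret_byte_length() {
  printf '%s' "$1" | base64 -d 2>/dev/null | wc -c | tr -d ' '
}

# Generate a fresh 32-byte secret and store it under the given secret name.
rotate_worker_secret() {
  local name="$1" value
  value="$(openssl rand -base64 32)" || return 1
  # Refuse to upload anything that decodes to fewer than 32 bytes.
  [ "$(secret_byte_length "$value")" -ge 32 ] || return 1
  printf '%s' "$value" | wrangler secret put "$name"
}

# Example:
#   rotate_worker_secret BETTER_AUTH_SECRET && wrangler deploy
```

The length check is a cheap guard against uploading an empty or truncated value if `openssl` fails in an unexpected way.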
Recovery:
- Audit session table for sessions created during the compromise window.
- Check for unauthorized API key creation or data exfiltration.
- Review Analytics Engine security events for the compromise timeframe.
Post-Incident:
- Notify affected users if data access is confirmed.
- Review how the secret was leaked (logs, repo, env file).
- Enable secret rotation schedule.
Scenario 4: Cloudflare Workers Outage
| Field | Details |
|---|---|
| Symptoms | Entire service is unreachable; all endpoints return 5xx or timeout |
| Detection | Cloudflare Status; external uptime monitors; user reports |
| Severity | Critical: total outage, but no operator action can fix it |
Immediate Actions:
- Confirm the outage via Cloudflare Status.
- There is no failover — Workers is the only compute layer. Wait for Cloudflare to resolve.
- Post a status update to users (see Section 6).
Recovery:
- Automatic when Workers recover — no redeployment needed.
- Verify the /health endpoint returns healthy after recovery.
- Check for any queued or in-flight requests that may have been dropped.
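Rather than refreshing /health by hand while Cloudflare recovers, a small polling loop can report when the endpoint is back. This is a sketch only: `poll_until` and `health_ok` are hypothetical helpers, and treating any response that is not an HTTP error as healthy is an assumption about how /health behaves.

```shell
# Run a command up to max_attempts times, sleeping delay_seconds between tries.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
poll_until() {
  local max_attempts="$1" delay_seconds="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      echo "succeeded on attempt $attempt"
      return 0
    fi
    attempt=$((attempt + 1))
    [ "$attempt" -le "$max_attempts" ] && sleep "$delay_seconds"
  done
  echo "gave up after $max_attempts attempts"
  return 1
}

# Succeeds when the URL responds without an HTTP error (-f makes curl fail on 4xx/5xx).
health_ok() {
  curl -fsS --max-time 10 -o /dev/null "$1"
}

# Example: check every 10 seconds for up to 5 minutes.
#   poll_until 30 10 health_ok https://bloqr-backend.workers.dev/health
```

Because `poll_until` accepts any command, the same loop can wrap a stricter check later, for instance one that inspects the response body.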
Post-Incident:
- Review request logs for the outage window.
- Assess whether a multi-provider failover strategy is warranted.
Scenario 5: Data Corruption / Bad Migration
| Field | Details |
|---|---|
| Symptoms | Incorrect data returned; constraint violations; Prisma query errors referencing missing columns |
| Detection | Application errors in logs; user reports; failed health checks |
| Severity | High to Critical, depending on scope of corruption |
Immediate Actions:
- Identify the bad migration in prisma/migrations/.
- Check migration history:

      SELECT "migration_name", "finished_at", "applied_steps_count"
      FROM "_prisma_migrations"
      ORDER BY "finished_at" DESC
      LIMIT 10;

- If the migration just ran, use Neon PITR to restore to the timestamp before it.
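When the restore has to happen quickly, a parameterized wrapper around the Neon API call from Section 3.3 saves editing JSON by hand. This is a minimal sketch: `neon_branch_payload` and `create_recovery_branch` are hypothetical names, and `NEON_PROJECT_ID` and `NEON_API_KEY` are the same environment variables the Section 3.3 example relies on. The inputs are interpolated without JSON escaping, so they must not contain quotes or backslashes.

```shell
# Build the request body for a point-in-time branch.
# Arguments: branch_name parent_timestamp (RFC 3339, UTC)
neon_branch_payload() {
  local name="$1" timestamp="$2"
  printf '{"branch":{"name":"%s","parent_timestamp":"%s"},"endpoints":[{"type":"read_write"}]}' \
    "$name" "$timestamp"
}

# POST the payload to the same endpoint used in Section 3.3.
create_recovery_branch() {
  curl -fsS -X POST \
    "https://console.neon.tech/api/v2/projects/${NEON_PROJECT_ID}/branches" \
    -H "Authorization: Bearer ${NEON_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "$(neon_branch_payload "$1" "$2")"
}

# Example: branch from just before the finished_at time of the last good migration.
#   create_recovery_branch recovery-pre-migration 2025-07-15T10:25:00Z
```

Printing the payload with `neon_branch_payload` before sending it is a quick way to double-check the timestamp under pressure.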
Recovery:
flowchart TD
A[Bad migration detected] --> B{Can rollback with SQL?}
B -->|Yes| C[Write corrective migration]
B -->|No| D[Use Neon PITR]
C --> E[Test on Neon branch]
D --> F[Create branch at pre-migration timestamp]
E --> G[Apply to production]
F --> G
G --> H[Verify data integrity]
H --> I[Redeploy Worker]
Prevention:
- Always test migrations on a Neon branch first: run npx prisma migrate deploy against a branch endpoint.
- Use prisma migrate diff to preview changes before applying.
- Keep rollback SQL scripts alongside each migration.
Scenario 6: ADMIN_KEY or API Token Leak
| Field | Details |
|---|---|
| Symptoms | Unauthorized admin operations; unknown IPs in audit logs; unexpected data modifications |
| Detection | Analytics Engine security events; audit log review; GitHub secret scanning alerts |
| Severity | Critical: full admin access compromised |
Immediate Actions (within 10 minutes):
- Rotate the leaked credential:

      # Rotate ADMIN_KEY
      wrangler secret put ADMIN_KEY

      # If Cloudflare API token leaked, revoke in dashboard:
      # dash.cloudflare.com → My Profile → API Tokens → Revoke

      # If Neon API key leaked, revoke in Neon Console:
      # console.neon.tech → Account → API Keys → Revoke

      # Redeploy
      wrangler deploy

- Audit all actions performed with the compromised credential.
- Check for created/modified API keys, user records, or admin settings.
Recovery:
- Revert any unauthorized changes identified in the audit.
- If data integrity is uncertain, use Neon PITR to restore to before the compromise.
Post-Incident:
- Determine the leak vector (committed to repo, logged, shared insecurely).
- Enable GitHub secret scanning if not already active.
- Review and tighten the secret rotation policy.
5. Backup Strategy
| Component | Method | Frequency | Retention | Restore Command |
|---|---|---|---|---|
| Neon PostgreSQL | Continuous WAL archiving | Continuous | 7 days (Free) / 30 days (Pro) | Neon Console → Create Branch from timestamp |
| D1 Edge Cache | Wrangler export | On-demand | N/A (rebuildable) | wrangler d1 execute <DB_NAME> --remote --file backup.sql (export taken with wrangler d1 export <DB_NAME> --output backup.sql) |
| KV Config | Wrangler bulk export | Weekly (scripted) | Last 4 exports | wrangler kv bulk put --namespace-id <ID> backup.json |
| R2 Artifacts | Regeneratable from source | N/A | N/A | Recompile: POST /compile with source URLs |
| Prisma Schema | Git version control | Every commit | Full history | git checkout <ref> -- prisma/schema.prisma |
| Worker Secrets | Manual record (encrypted) | On rotation | Current + previous | wrangler secret put <NAME> |
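The "Last 4 exports" retention listed for KV in the table above needs something to enforce it. One possible approach is sketched here, assuming backups live in date-named directories (YYYY-MM-DD) under a single parent directory, which is the layout the backup script in this section creates. `prune_backups` is a hypothetical helper.

```shell
# Delete all but the newest `keep` date-named subdirectories of backup_root.
# ISO dates sort chronologically as plain strings, so a reverse sort puts newest first.
# Entries that do not look like YYYY-MM-DD are left untouched.
prune_backups() {
  local backup_root="$1" keep="${2:-4}" dir
  ls -1 "$backup_root" \
    | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' \
    | sort -r \
    | tail -n "+$((keep + 1))" \
    | while read -r dir; do
        rm -rf -- "${backup_root:?}/$dir"
        echo "removed $dir"
      done
}

# Example: keep the four most recent backups.
#   prune_backups ./backups 4
```

Running this at the end of the weekly backup job keeps disk usage bounded without touching unrelated files in the backup directory.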
Automated Backup Script (KV + D1)
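
    #!/usr/bin/env bash
    set -euo pipefail

    BACKUP_DIR="./backups/$(date +%Y-%m-%d)"
    mkdir -p "$BACKUP_DIR"

    # Export D1
    wrangler d1 export bloqr-backend-db --output "$BACKUP_DIR/d1-export.sql"

    # Export KV (requires namespace ID from wrangler.toml).
    # Each entry is written as a {key, value} object and the results are collected into
    # one JSON array, the shape wrangler kv bulk put expects. Text values only.
    wrangler kv key list --namespace-id "$KV_NAMESPACE_ID" \
      | jq -r '.[].name' \
      | while read -r key; do
          value="$(wrangler kv key get "$key" --namespace-id "$KV_NAMESPACE_ID" </dev/null)"
          jq -n --arg key "$key" --arg value "$value" '{key: $key, value: $value}'
        done \
      | jq -s '.' > "$BACKUP_DIR/kv-export.json"

    echo "Backup completed: $BACKUP_DIR"

6. Communication Plan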
Notification Channels
| Audience | Channel | Responsibility |
|---|---|---|
| Engineering team | Slack / Discord #incidents | On-call engineer |
| Stakeholders | Email summary | Incident commander |
| End users | GitHub Discussions / Status page | Incident commander |
Status Update Templates
Investigating:
[Investigating] We are aware of issues affecting [component]. Our team is investigating. We will provide an update within 30 minutes.
Identified:
[Identified] The issue has been identified as [brief description]. We are working on a fix. ETA for resolution: [time estimate].
Monitoring:
[Monitoring] A fix has been deployed. We are monitoring the system for stability. Service should be fully restored.
Resolved:
[Resolved] The incident affecting [component] has been resolved. Duration: [X hours/minutes]. Root cause: [brief summary]. A full post-incident review will follow.
7. Post-Incident Review Template
Complete this template within 48 hours of incident resolution.

    # Post-Incident Review: [INCIDENT-TITLE]

    **Date**: YYYY-MM-DD
    **Duration**: HH:MM
    **Severity**: Critical / High / Medium / Low
    **Author**: [Name]

    ## Timeline
    | Time (UTC) | Event |
    |------------|-------|
    | HH:MM | First alert / detection |
    | HH:MM | Investigation started |
    | HH:MM | Root cause identified |
    | HH:MM | Fix deployed |
    | HH:MM | Service confirmed restored |

    ## Root Cause
    [Detailed description of what went wrong and why.]

    ## Impact
    - **Users affected**: [number or percentage]
    - **Data loss**: [none / description]
    - **Duration of degradation**: [time]
    - **Revenue impact**: [if applicable]

    ## What Went Well
    - [Item]

    ## What Could Be Improved
    - [Item]

    ## Corrective Actions
    | Action | Owner | Due Date | Status |
    |--------|-------|----------|--------|
    | [Action item] | [Name] | YYYY-MM-DD | Pending |

    ## Lessons Learned
    [Key takeaways for the team.]

8. Emergency Contacts & Links
Status Pages
| Service | URL |
|---|---|
| Neon Status | https://neonstatus.com |
| Cloudflare Status | https://www.cloudflarestatus.com |
Management Consoles
| Service | URL |
|---|---|
| Neon Console | https://console.neon.tech |
| Cloudflare Dashboard | https://dash.cloudflare.com |
| GitHub Repository | https://github.com/jaypatrick/bloqr-compiler |
| GitHub Issues | https://github.com/jaypatrick/bloqr-compiler/issues |
Quick Reference Commands

    # Check Worker health
    curl -s https://bloqr-backend.workers.dev/health | jq .

    # Tail Worker logs
    wrangler tail

    # List current secrets (names only)
    wrangler secret list

    # Rotate a secret
    wrangler secret put SECRET_NAME

    # Redeploy the Worker
    wrangler deploy

    # Check D1 database status
    wrangler d1 info bloqr-backend-db

    # Check KV namespace
    wrangler kv namespace list

Remember: In a real incident, communicate early and often. A wrong estimate updated frequently is better than silence. Follow the runbooks, but use judgment: every incident is unique.