Disaster Recovery
Plan for the worst case: fire, earthquake, region-wide outage.
In July 2022, a fire broke out in the data center of a large ISP in Bangladesh. Thousands of websites were down for hours. Companies that had prepared failed over to a backup region and stayed online. Those that had not went down hard. That is why Disaster Recovery matters.
What is Disaster Recovery (DR)?
Disaster Recovery is a pre-planned strategy for restoring business continuity after a catastrophic failure (fire, flood, earthquake, region-wide outage, cyber attack).
Disaster Types
- Natural: Earthquake, flood, hurricane.
- Hardware: Server failure, disk crash.
- Software: Bug, corrupt deployment.
- Network: ISP outage, BGP misconfiguration.
- Human: Accidental deletion, misconfiguration.
- Cyber: Ransomware, DDoS, breach.
- Power: Grid failure.
RPO and RTO: the two key metrics
RPO (Recovery Point Objective)
"How much data loss is acceptable?" It is the maximum tolerable window between the last good backup and the disaster.
- RPO 1 hour = at most 1 hour of data loss.
- RPO 5 minutes = near-real-time backup.
- RPO 0 = no data loss (requires synchronous replication).
RTO (Recovery Time Objective)
"How quickly must the service be back up?" It is the maximum tolerable time from the disaster until operations resume.
- RTO 24 hours = back up by the next day.
- RTO 1 hour = back up within 1 hour.
- RTO 0 = effectively instant (active-active).
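To make the two metrics concrete, the sketch below measures the data-loss window and the downtime for a single incident and compares them with the targets. The function name `evaluate_recovery` and all timestamps are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta

def evaluate_recovery(last_backup, disaster, service_restored, rpo_target, rto_target):
    """Compare the achieved data-loss window and downtime with RPO/RTO targets."""
    data_loss_window = disaster - last_backup   # the quantity RPO bounds
    downtime = service_restored - disaster      # the quantity RTO bounds
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo_target,
        "rto_met": downtime <= rto_target,
    }

# Hypothetical incident: the disaster hits 40 minutes after the last backup,
# and the service is back 90 minutes after the disaster.
result = evaluate_recovery(
    last_backup=datetime(2024, 1, 1, 10, 0),
    disaster=datetime(2024, 1, 1, 10, 40),
    service_restored=datetime(2024, 1, 1, 12, 10),
    rpo_target=timedelta(hours=1),
    rto_target=timedelta(hours=1),
)
```

In this made-up incident the RPO of 1 hour is met (40 minutes of data lost) but the RTO of 1 hour is missed (90 minutes of downtime), which shows that the two targets are independent of each other.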
DR Strategies
1. Backup & Restore (Cold)
- Periodic backups stored off-site or in the cloud.
- On disaster: provision new infrastructure, then restore the data.
- RPO: hours to days. RTO: hours to days.
- Cheapest option. Common for SMBs.
2. Pilot Light
- Minimal infrastructure keeps running at the DR site (for example, a DB replica).
- On disaster: start and scale up the application servers.
- RPO: minutes. RTO: tens of minutes.
- Moderate cost.
3. Warm Standby
- A scaled-down copy of production runs at the DR site.
- On disaster: scale up and fail over.
- RPO: seconds. RTO: minutes.
- Higher cost.
4. Hot Standby / Active-Active
- All regions serve full production load.
- On disaster: redirect traffic (DNS/load balancer).
- RPO: 0 (with synchronous replication). RTO: seconds.
- Highest cost: roughly 2× infrastructure.
Strategy Comparison
Backup & Restore
- Cheapest
- Hours-days RPO/RTO
- Manual recovery
- Small business
Pilot Light
- Moderate cost
- Tens of minutes RTO
- Database replica + minimal infra
- Medium business
Warm Standby
- Higher cost
- Minutes RTO
- Scaled-down running copy
- Critical apps
Active-Active
- Highest cost
- Seconds RTO, RPO 0
- Full multi-region
- Mission-critical
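The comparison can be read as a lookup: pick the cheapest strategy whose typical RPO and RTO still fit the targets. The sketch below does that, assuming concrete numbers for each strategy. Those numbers, the `STRATEGIES` table, and the `pick_strategy` function are illustrative assumptions that turn ranges such as "hours to days" into single values; they do not come from any vendor.

```python
from datetime import timedelta

# (name, assumed achievable RPO, assumed achievable RTO), ordered cheapest first.
# Illustrative values only; real figures depend on the workload and tooling.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=1), timedelta(hours=4)),
    ("Pilot Light", timedelta(minutes=1), timedelta(minutes=10)),
    ("Warm Standby", timedelta(seconds=5), timedelta(minutes=1)),
    ("Active-Active", timedelta(0), timedelta(seconds=5)),  # RPO 0 assumes synchronous replication
]

def pick_strategy(rpo_target, rto_target):
    """Return the cheapest strategy that meets both targets, or None if none does."""
    for name, rpo, rto in STRATEGIES:
        if rpo <= rpo_target and rto <= rto_target:
            return name
    return None

choice = pick_strategy(timedelta(minutes=5), timedelta(minutes=30))
```

Because the list is ordered by cost, the first match is also the cheapest one, so a service with relaxed targets never lands on a more expensive strategy than it needs.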
Backup Strategies
3-2-1 Rule
- 3 copies of the data.
- 2 different media types.
- 1 copy off-site.
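The rule is simple enough to check mechanically. The sketch below validates an inventory of backup copies against it; the `BackupCopy` type, the `satisfies_3_2_1` function, and the location names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BackupCopy:
    location: str   # hypothetical label, e.g. "office-nas"
    media: str      # e.g. "disk", "tape", "object-storage"
    offsite: bool

def satisfies_3_2_1(copies):
    """True if there are >= 3 copies, on >= 2 media types, with >= 1 off-site."""
    media_types = {c.media for c in copies}
    has_offsite = any(c.offsite for c in copies)
    return len(copies) >= 3 and len(media_types) >= 2 and has_offsite

inventory = [
    BackupCopy("production-server", "disk", offsite=False),
    BackupCopy("office-nas", "disk", offsite=False),
    BackupCopy("cloud-bucket", "object-storage", offsite=True),
]
compliant = satisfies_3_2_1(inventory)
```

Dropping the cloud copy from this inventory fails all three conditions at once, which is a common situation for teams that keep backups only on a second local disk.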
Backup Types
- Full: Complete copy of the data. Slow and large.
- Incremental: Changes since the last backup. Fast, but a restore depends on the whole chain.
- Differential: Changes since the last full backup. A middle ground.
- Snapshot: Point-in-time view (DB, filesystem).
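The practical difference between incremental and differential backups shows up at restore time. The sketch below works out which backups are needed to restore the newest one, using the definitions above. The `restore_chain` function and the day names are hypothetical, and real backup tools may define mixed schedules differently.

```python
def restore_chain(backups):
    """Return the backups needed, oldest first, to restore the newest one.

    `backups` is a chronological list of (name, kind) tuples where kind is
    "full", "incremental" (changes since the previous backup of any kind),
    or "differential" (changes since the last full backup).
    """
    chain = []
    i = len(backups) - 1
    while i >= 0:
        name, kind = backups[i]
        chain.append(name)
        if kind == "full":
            return list(reversed(chain))
        if kind == "differential":
            # A differential only needs the most recent full backup before it.
            i -= 1
            while i >= 0 and backups[i][1] != "full":
                i -= 1
        elif kind == "incremental":
            # An incremental needs the backup immediately before it.
            i -= 1
        else:
            raise ValueError(f"unknown backup kind: {kind}")
    raise ValueError("no full backup found to anchor the chain")

incremental_week = [("sun", "full"), ("mon", "incremental"),
                    ("tue", "incremental"), ("wed", "incremental")]
differential_week = [("sun", "full"), ("mon", "differential"),
                     ("tue", "differential"), ("wed", "differential")]

incremental_chain = restore_chain(incremental_week)     # every backup since the full
differential_chain = restore_chain(differential_week)   # the full plus the latest differential
```

The incremental schedule needs all four backups to be intact, while the differential schedule needs only two. That is the chain dependency the list above refers to.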
Best Practices
- Encrypt backups.
- Automate backups and test them.
- Keep copies geographically separated.
- Define a retention policy.
- Test restoration: taking backups is not enough.
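One way to test a restoration is to compare checksums of the restored files against the originals. The sketch below does this with SHA-256 from Python's standard library. The `verify_restore` helper and the file names are hypothetical, and `shutil.copytree` merely stands in for a real backup-and-restore cycle.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(source_dir, restored_dir):
    """Return relative paths of files that are missing or differ after a restore."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file() or sha256_of(src) != sha256_of(restored):
            problems.append(rel.as_posix())
    return problems

# Demo in a temporary directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    restored = Path(tmp) / "restored"
    source.mkdir()
    (source / "orders.csv").write_text("id,total\n1,250\n")
    (source / "users.csv").write_text("id,name\n1,Rahim\n")
    shutil.copytree(source, restored)

    clean_result = verify_restore(source, restored)    # restore matches the source
    (restored / "users.csv").write_text("corrupted")   # simulate a corrupted file
    (restored / "orders.csv").unlink()                 # simulate a missing file
    damaged_result = verify_restore(source, restored)
```

In a real disaster the source may no longer exist, so a production version would more likely compare against checksums recorded when the backup was taken.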
Multi-Region Architecture
Active-Passive
The primary region serves traffic; the secondary stands by. On failover, DNS or the load balancer switches to the secondary.
Active-Active
Both regions handle traffic. Keeping state in sync is challenging.
Geo-routing
Route each user to the nearest region to reduce latency.
Failover Mechanisms
- DNS-based: Route 53 health checks with failover routing.
- BGP: Anycast IP, so routing shifts automatically.
- Application-level: The code retries against the secondary.
- Manual: Operator-triggered.
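Automated failover usually rests on a small decision rule: switch only after several consecutive failed health checks, so that a single blip does not trigger it. The sketch below models that rule in plain Python. The `FailoverController` class, the region names, and the threshold are hypothetical, and the code does not call Route 53 or any other real service.

```python
class FailoverController:
    """Switch from primary to secondary after N consecutive failed health checks."""

    def __init__(self, primary, secondary, failure_threshold=3):
        self.primary = primary
        self.secondary = secondary
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.active = primary

    def record_check(self, primary_healthy):
        """Record one health-check result and return the region that should serve traffic."""
        if primary_healthy:
            # Simplification: fail back as soon as the primary passes one check.
            self.consecutive_failures = 0
            self.active = self.primary
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = self.secondary
        return self.active

controller = FailoverController("region-a", "region-b", failure_threshold=3)
checks = [True, False, False, True, False, False, False, True]
history = [controller.record_check(ok) for ok in checks]
```

Two failures followed by a success do not trigger failover here; only the run of three does. Many real setups also delay failback or leave it to an operator, to avoid flapping between regions.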
DR Testing
An untested DR plan is no DR plan.
- Tabletop exercise: Discussion-based scenario.
- Walkthrough: Verify the steps one by one.
- Simulation: Run the plan in a test environment.
- Game day: A controlled disaster in production (for example, Netflix Chaos Monkey).
Real-World Examples
- Netflix Chaos Engineering: Injects random failures in production to verify resilience.
- AWS Multi-Region: Active-active across us-east + us-west.
- Banks: Multi-DC deployment is mandated by regulation.
- Cloudflare: Global anycast, so a regional failure is invisible to users.
- 2017 AWS S3 outage: Many services went down; multi-region later became standard practice.
Business Continuity Plan (BCP)
DR = technical recovery. BCP = the broader plan covering people, communication, customer notification, and regulatory reporting.
- Communication tree.
- Status page updates.
- Customer notification.
- Regulatory reporting.
- Post-mortem.
Common Misconceptions
- "Backup = DR": A backup is only the data; DR is the entire recovery process.
- "The cloud is automatically disaster-proof": Region outages happen; multi-region is needed.
- "Set it up once and it works forever": The plan needs quarterly testing and updates.
- "RPO 0 is always good": Synchronous replication carries a massive cost.
Best Practices
- Define RPO/RTO per service according to criticality.
- Follow the 3-2-1 backup rule.
- Test restoration quarterly.
- Use multi-region for critical services.
- Document the runbook.
- Chaos engineering: test proactively.
- Communication plan: keep a status page.
- Review insurance and legal aspects.
📌 Chapter Summary
- DR = business continuity in the face of catastrophic failure.
- RPO = data loss tolerance; RTO = downtime tolerance.
- Strategies: Backup → Pilot Light → Warm → Active-Active.
- 3-2-1 backup rule.
- An untested plan is no plan; practice chaos engineering.