Disaster Recovery

Planning for the worst case: fire, earthquake, region-wide outage.

In July 2022, a fire broke out at the data center of a major ISP in Bangladesh. Thousands of websites were down for hours. Companies that had prepared failed over to a backup region and stayed online. Those that had not went down hard. That is why Disaster Recovery matters.

What is Disaster Recovery (DR)?

Disaster Recovery = a pre-planned strategy for restoring business continuity after a catastrophic failure (fire, flood, earthquake, region-wide outage, cyber attack).

Disaster Types

Natural: Earthquake, flood, hurricane.
Hardware: Server failure, disk crash.
Software: Bug, corrupt deployment.
Network: ISP outage, BGP misconfiguration.
Human: Accidental deletion, misconfiguration.
Cyber: Ransomware, DDoS, breach.
Power: Grid failure.

RPO and RTO: the two key metrics

RPO (Recovery Point Objective)

"How much data loss is acceptable?" RPO is the maximum tolerable gap between the last good backup and the disaster.

RPO 1 hour = at most 1 hour of data lost.
RPO 5 minutes = near-real-time backup.
RPO 0 = no data loss (requires synchronous replication).

RTO (Recovery Time Objective)

"How quickly must the service be back up?" RTO is the maximum tolerable time from the disaster until operations resume.

RTO 24 hours = back up by the next day.
RTO 1 hour = up within 1 hour.
RTO 0 = instant failover (active-active).

[Last backup]──────[DISASTER]──────[Recovered]
      ←───── RPO ─────→ ←───── RTO ─────→
          data lost         downtime
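
The timeline can be turned into a quick check after an incident or a drill. This is a minimal sketch, not the method of any specific tool: the function name `measure_recovery`, the timestamps, and the one-hour targets are all hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta


def measure_recovery(last_backup, disaster, recovered):
    """Return (data_loss_window, downtime) for one incident."""
    if not (last_backup <= disaster <= recovered):
        raise ValueError("expected last_backup <= disaster <= recovered")
    data_loss_window = disaster - last_backup   # compared against the RPO
    downtime = recovered - disaster             # compared against the RTO
    return data_loss_window, downtime


# Hypothetical incident and targets, for illustration only.
last_backup = datetime(2024, 1, 1, 2, 0)
disaster = datetime(2024, 1, 1, 2, 45)
recovered = datetime(2024, 1, 1, 4, 15)
rpo_target = timedelta(hours=1)
rto_target = timedelta(hours=1)

data_loss_window, downtime = measure_recovery(last_backup, disaster, recovered)
rpo_met = data_loss_window <= rpo_target   # 45 min lost vs 1 h allowed
rto_met = downtime <= rto_target           # 90 min down vs 1 h allowed
print(f"data loss window: {data_loss_window}, RPO met: {rpo_met}")
print(f"downtime: {downtime}, RTO met: {rto_met}")
```

In this made-up incident the RPO is met but the RTO is missed, which is the kind of gap a DR drill is meant to surface.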

DR Strategies

1. Backup & Restore (Cold)

Periodic backups to off-site or cloud storage.
On disaster: provision new infrastructure, then restore the data.
RPO: hours to days. RTO: hours to days.
Cheapest. Common for small and medium businesses (SMBs).

2. Pilot Light

Minimal infrastructure keeps running at the DR site (for example, a DB replica).
On disaster: start and scale up the application servers.
RPO: minutes. RTO: tens of minutes.
Moderate cost.

3. Warm Standby

A scaled-down copy of production runs at the DR site.
On disaster: scale up and fail over.
RPO: seconds. RTO: minutes.
Higher cost.

4. Hot Standby / Active-Active

Every region runs at full production capacity.
On disaster: redirect traffic (DNS or load balancer).
RPO: near 0 (exactly 0 only with synchronous replication). RTO: seconds.
Highest cost: roughly 2× infrastructure.

Strategy Comparison

Backup & Restore: Cheapest. RPO/RTO of hours to days. Manual recovery. Suits small businesses.
Pilot Light: Moderate cost. RTO of tens of minutes. Database replica plus minimal infrastructure. Suits medium-sized businesses.
Warm Standby: Higher cost. RTO of minutes. Scaled-down running copy. Suits critical apps.
Active-Active: Highest cost. RTO of seconds, RPO 0. Full multi-region deployment. Suits mission-critical systems.
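
The comparison amounts to picking the cheapest strategy that still meets a service's targets. The sketch below illustrates that selection. The numeric RPO/RTO figures are assumptions standing in for the qualitative ranges ("hours to days", "tens of minutes", "minutes", "seconds"), and `cheapest_strategy` is a hypothetical helper, so treat the numbers as placeholders to replace with your own measurements.

```python
from datetime import timedelta

# Strategies ordered cheapest first. The figures are illustrative
# assumptions, not measured values. Active-Active is given RPO 0,
# which assumes synchronous replication.
STRATEGIES = [
    # (name, achievable RPO, achievable RTO)
    ("Backup & Restore", timedelta(hours=24), timedelta(hours=24)),
    ("Pilot Light", timedelta(minutes=15), timedelta(minutes=30)),
    ("Warm Standby", timedelta(seconds=30), timedelta(minutes=5)),
    ("Active-Active", timedelta(0), timedelta(seconds=30)),
]


def cheapest_strategy(rpo_target, rto_target):
    """Return the first (cheapest) strategy that meets both targets."""
    for name, rpo, rto in STRATEGIES:
        if rpo <= rpo_target and rto <= rto_target:
            return name
    return None  # no listed strategy can meet these targets


print(cheapest_strategy(timedelta(days=1), timedelta(days=2)))
print(cheapest_strategy(timedelta(minutes=15), timedelta(hours=1)))
print(cheapest_strategy(timedelta(minutes=1), timedelta(minutes=10)))
print(cheapest_strategy(timedelta(0), timedelta(minutes=1)))
```

Ordering the list by cost keeps the rule simple: the first match is the least expensive option that satisfies both targets.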

Backup Strategies

3-2-1 Rule

3 copies of the data.
2 different media types.
1 copy off-site.
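
The rule is easy to check mechanically against a backup inventory. A minimal sketch, assuming the production copy counts as one of the three copies (a common reading of the rule); the `BackupCopy` type, the `satisfies_3_2_1` helper, and the inventory entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str       # e.g. "disk", "tape", "object-storage"
    offsite: bool    # stored away from the primary site?


def satisfies_3_2_1(copies):
    """True if there are >= 3 copies, on >= 2 media types, with >= 1 off-site."""
    media_types = {c.media for c in copies}
    offsite_count = sum(1 for c in copies if c.offsite)
    return len(copies) >= 3 and len(media_types) >= 2 and offsite_count >= 1


# Hypothetical inventory, for illustration only.
copies = [
    BackupCopy("production volume", "disk", offsite=False),
    BackupCopy("local NAS backup", "disk", offsite=False),
    BackupCopy("cloud bucket backup", "object-storage", offsite=True),
]
print(satisfies_3_2_1(copies))       # all three conditions hold
print(satisfies_3_2_1(copies[:2]))   # 2 copies, 1 media type, none off-site
```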

Backup Types

Full: Complete copy of the data. Slow and large.
Incremental: Changes since the last backup of any type. Fast, but a restore depends on the whole chain.
Differential: Changes since the last full backup. A middle ground.
Snapshot: Point-in-time view (DB, filesystem).
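
The practical difference between incremental and differential backups shows up at restore time. The sketch below lists which backups a restore needs. It assumes a schedule that uses only one of the two kinds after each full backup; `restore_chain` and the backup labels are hypothetical.

```python
def restore_chain(backups):
    """Return the labels needed, in order, to restore to the latest backup.

    `backups` is a chronological list of (label, kind) pairs where kind is
    "full", "incremental", or "differential".
    """
    full_indexes = [i for i, (_, kind) in enumerate(backups) if kind == "full"]
    if not full_indexes:
        raise ValueError("no full backup to restore from")
    last_full = full_indexes[-1]          # anything older is not needed
    chain = [backups[last_full][0]]
    later = backups[last_full + 1:]
    if not later:
        return chain
    label, kind = later[-1]
    if kind == "differential":
        # A differential holds every change since the last full backup.
        chain.append(label)
    else:
        # Each incremental holds changes since the previous backup,
        # so all of them since the full backup are required.
        chain.extend(lbl for lbl, _ in later)
    return chain


incremental_week = [("sun-full", "full"), ("mon-inc", "incremental"),
                    ("tue-inc", "incremental"), ("wed-inc", "incremental")]
differential_week = [("sun-full", "full"), ("mon-diff", "differential"),
                     ("tue-diff", "differential"), ("wed-diff", "differential")]
print(restore_chain(incremental_week))
print(restore_chain(differential_week))
```

The incremental schedule needs every backup since Sunday, so one damaged file breaks the restore; the differential schedule needs only two.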

Best Practices

Encrypted backups.
Automated and tested.
Geographic separation.
Retention policy.
Test restoration: taking backups is not enough.
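
One simple way to test a restore is to compare checksums of the original and the restored data. A minimal sketch using the standard library; `shutil.copy` stands in for a real backup-and-restore cycle, and the helper names and file names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original, restored):
    """True if the restored file is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(restored)


# Demo with temporary files; a real test would restore through the
# actual backup tool instead of copying the file.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "data.db"
    restored = Path(tmp) / "restored.db"
    original.write_bytes(b"customer records" * 1000)
    shutil.copy(original, restored)
    ok = restore_matches(original, restored)

    restored.write_bytes(b"truncated")   # simulate a corrupt restore
    corrupt_ok = restore_matches(original, restored)

print(ok, corrupt_ok)
```

A checksum match only proves the bytes survived. For a database, a fuller test would also start the service on the restored data and run a few queries.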

Multi-Region Architecture

Active-Passive

The primary region serves traffic; the secondary stands by. On failover, DNS or the load balancer switches traffic to the secondary.

Active-Active

Both regions handle traffic. Keeping state in sync between them is challenging.

Geo-routing

Route each user to the nearest region for lower latency.
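
The routing decision can be sketched as "nearest healthy region". This example uses great-circle distance as a stand-in for latency, which is an assumption: real geo-routing usually relies on DNS-level geolocation or measured latency. The region names, approximate coordinates, and the `nearest_healthy_region` helper are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_healthy_region(user_lat, user_lon, regions):
    """Return the name of the closest healthy region, or None if all are down."""
    healthy = [r for r in regions if r["healthy"]]
    if not healthy:
        return None
    closest = min(
        healthy,
        key=lambda r: haversine_km(user_lat, user_lon, r["lat"], r["lon"]),
    )
    return closest["name"]


# Approximate city coordinates, for illustration only.
regions = [
    {"name": "singapore", "lat": 1.35, "lon": 103.82, "healthy": True},
    {"name": "frankfurt", "lat": 50.11, "lon": 8.68, "healthy": True},
    {"name": "virginia", "lat": 38.90, "lon": -77.04, "healthy": True},
]
dhaka = (23.81, 90.41)

normal = nearest_healthy_region(*dhaka, regions)
regions[0]["healthy"] = False            # simulate a regional outage
during_outage = nearest_healthy_region(*dhaka, regions)
print(normal, during_outage)
```

Filtering by health before choosing the nearest region is what ties geo-routing to failover: users move to the next-closest region when theirs goes down.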

Failover Mechanisms

DNS-based: Route 53 health checks with failover routing.
BGP: Anycast IP with automatic routing.
Application-level: The code retries against the secondary.
Manual: Operator-triggered.
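
Of these mechanisms, application-level failover is the easiest to show in a few lines. A minimal sketch, assuming each endpoint is a plain callable that raises `ConnectionError` when its region is unreachable; `call_with_failover`, `primary`, and `secondary` are hypothetical stand-ins for real client calls.

```python
import time


def call_with_failover(endpoints, attempts_per_endpoint=2, delay_seconds=0.0):
    """Try each endpoint in order; return (name, result) from the first success."""
    last_error = None
    for name, call in endpoints:
        for _ in range(attempts_per_endpoint):
            try:
                return name, call()
            except ConnectionError as exc:
                last_error = exc
                time.sleep(delay_seconds)   # pause before the next attempt
    raise RuntimeError("all endpoints failed") from last_error


def primary():
    # Stand-in for a request to the primary region, which is down here.
    raise ConnectionError("primary region unreachable")


def secondary():
    # Stand-in for a request to the secondary region.
    return "ok from secondary"


served_by, result = call_with_failover(
    [("primary", primary), ("secondary", secondary)]
)
print(served_by, result)
```

Retrying the primary a bounded number of times before moving on avoids failing over on a single transient error, while still capping how long a caller waits.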

DR Testing

Untested DR plan = no DR plan।

Tabletop exercise: Discussion-based scenario.
Walkthrough: Verify the steps one by one.
Simulation: Run the plan in a test environment.
Game day: A controlled disaster in production (for example, Netflix Chaos Monkey).

Real-world examples

Netflix Chaos Engineering: Injects random failures in production to verify resilience.
AWS Multi-Region: Active-active across us-east and us-west.
Banks: Regulation often mandates multiple data centers.
Cloudflare: Global anycast makes a regional failure largely invisible to users.
2017 AWS S3 outage: Many services went down; multi-region designs became far more common afterwards.

Business Continuity Plan (BCP)

DR = technical recovery. BCP = a broader plan covering people, communication, customer notification, and regulatory reporting.

Communication tree.
Status page updates.
Customer notification.
Regulatory reporting.
Post-mortem.

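Common misconceptions

"Backup = DR": A backup is just the data; DR is the entire recovery process.
"The cloud is automatically disaster-proof": Region outages do happen; critical services need multi-region.
"Set it up once and it works forever": The plan needs quarterly testing and updates.
"RPO 0 is always good": Synchronous replication carries a heavy cost in both money and write latency.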

Best Practices

Define RPO/RTO per service, according to its criticality.
Follow the 3-2-1 backup rule.
Test restoration quarterly.
Use multi-region for critical services.
Keep a documented runbook.
Chaos engineering: test proactively.
Communication plan: keep a status page.
Review insurance and legal aspects.

📌 Chapter summary

DR = business continuity after a catastrophic failure.
RPO = data loss tolerance; RTO = downtime tolerance.
Strategies: Backup → Pilot Light → Warm → Active-Active.
3-2-1 backup rule.
An untested plan is no plan; practice chaos engineering.