Part 4 · Reliability and Security

SLA, SLO, SLI

Making reliability measurable: the language of SRE.

"Our service delivers 99.9% uptime." What does that mean? Who is promising what to whom? How will it be measured? The answers to these three questions are the SLA, the SLO, and the SLI.

The three terms at a glance

SLI (Service Level Indicator): what is measured (e.g., availability %).
SLO (Service Level Objective): the internal target (e.g., 99.95%).
SLA (Service Level Agreement): the contract with the customer (e.g., 99.9%, with a refund if the service falls below it).

SLI - Service Level Indicator

An SLI is a measurable metric: it defines what gets tracked.

Common SLIs

Availability: uptime %.
Latency: response time (P50, P95, P99).
Throughput: requests per second.
Error rate: % of failed requests.
Durability: the share of stored data that is retained without loss.
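The request-based SLIs in this list can be computed directly from request records. The sketch below is a minimal illustration; the `Request` record shape, the rule that a 5xx status counts as a failure, and the sample data are all assumptions made for the example, not the API of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One served request (hypothetical record shape)."""
    status: int        # HTTP status code
    latency_ms: float  # time to respond, in milliseconds


def error_rate(requests: list[Request]) -> float:
    """Fraction of requests that failed (assumed here to mean a 5xx status)."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if r.status >= 500)
    return failed / len(requests)


def availability(requests: list[Request]) -> float:
    """Request-based availability: the share of requests that did not fail."""
    return 1.0 - error_rate(requests)


def throughput(requests: list[Request], window_seconds: float) -> float:
    """Requests per second over the observation window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return len(requests) / window_seconds


# Illustrative sample: four requests observed over two seconds, one of them failed.
sample = [
    Request(200, 120.0),
    Request(200, 95.0),
    Request(503, 30.0),
    Request(200, 180.0),
]

print(f"availability: {availability(sample):.2%}")
print(f"error rate:   {error_rate(sample):.2%}")
print(f"throughput:   {throughput(sample, window_seconds=2.0):.1f} req/s")
```

Availability is derived from requests rather than from server uptime here, so a server that is up but returning errors still counts against the SLI.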

Good SLI properties

Reflects user experience (not server CPU).
Measurable.
Aggregatable.

SLO - Service Level Objective

An SLO is a target set on an SLI. It is an internal commitment.
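Because an SLO is just a target on a measured value, it can be checked mechanically. The sketch below evaluates a latency SLO of the form "X% of requests finish under a threshold over a window"; the `LatencySLO` type, its field names, and the sample numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    """A latency objective (hypothetical shape)."""
    target: float        # required share of good requests, e.g. 0.9995 for 99.95%
    threshold_ms: float  # a request is "good" if it finishes faster than this
    window_days: int     # the window the samples are assumed to cover


def compliance(latencies_ms: list[float], slo: LatencySLO) -> float:
    """Share of requests in the window that met the latency threshold."""
    if not latencies_ms:
        raise ValueError("no samples in the window")
    good = sum(1 for ms in latencies_ms if ms < slo.threshold_ms)
    return good / len(latencies_ms)


def is_met(latencies_ms: list[float], slo: LatencySLO) -> bool:
    """True when the measured compliance reaches the SLO target."""
    return compliance(latencies_ms, slo) >= slo.target


slo = LatencySLO(target=0.9995, threshold_ms=200.0, window_days=30)

# 10,000 illustrative samples: 4 slow requests, the rest fast.
samples = [120.0] * 9_996 + [850.0] * 4
print(f"compliance: {compliance(samples, slo):.4%}, met: {is_met(samples, slo)}")
```

With 4 slow requests in 10,000 the compliance is 99.96%, which clears a 99.95% target; 6 slow requests would not.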

Examples

"99.95% of requests complete in under 200 ms."
"99.9% availability per month."
"Error rate below 0.1%."

Format

"X% of [SLI] meets [threshold] over [time window]"

SLA - Service Level Agreement

An SLA is a formal contract with the customer. It is a looser version of the SLO (it leaves a buffer), and violating it has financial consequences.

Examples

AWS S3: 99.9% monthly availability; service credits if it falls below.
Google Cloud: 99.5% (the figure varies by service); financial credits if it falls below.
Stripe: API uptime SLA; credits in case of a breach.

💡 Remember: SLA < SLO. An SLO miss is an internal alarm; an SLA miss is a financial loss. The SLO sits above the SLA so that there is a buffer between them.
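The buffer between the two thresholds can be made concrete with a small classifier. This is a sketch under assumptions: the function name `classify`, the three status labels, and the use of the chapter's example figures (SLO 99.95%, SLA 99.9%) are illustrative choices.

```python
def classify(measured: float, slo: float, sla: float) -> str:
    """Place a measured availability relative to the SLO and the SLA.

    All three values are fractions, e.g. 0.999 for 99.9%.
    """
    if sla >= slo:
        raise ValueError("the SLA should be looser (lower) than the SLO")
    if measured >= slo:
        return "ok"
    if measured >= sla:
        # Inside the buffer: an internal alarm, but no contractual breach.
        return "slo_missed"
    # Below the contractual level: credits or refunds become due.
    return "sla_breached"


SLO = 0.9995  # internal target
SLA = 0.999   # customer-facing commitment

for measured in (0.9997, 0.9992, 0.9980):
    print(f"{measured:.2%}: {classify(measured, SLO, SLA)}")
```

The middle case is the one the buffer exists for: the team is alerted while the customer contract is still intact.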

"Nines" - Uptime Math

99% (2 nines)

3.65 days/year downtime
About 7.2 hours/month
Hobby project

99.9% (3 nines)

8.76 hours/year
About 43 minutes/month
Standard SaaS

99.99% (4 nines)

About 52.6 minutes/year
About 4.3 minutes/month
Enterprise

99.999% (5 nines)

About 5.26 minutes/year
About 26 seconds/month
Telecom, banking
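The downtime figures above follow from one multiplication: allowed downtime = period × (1 − availability). The sketch below reproduces them, assuming a 365-day year and a 30-day month; the function and constant names are illustrative.

```python
from datetime import timedelta

YEAR = timedelta(days=365)   # assumption: non-leap year
MONTH = timedelta(days=30)   # assumption: 30-day month


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Downtime permitted in `period` at the given availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    per_year = allowed_downtime(pct, YEAR)
    per_month = allowed_downtime(pct, MONTH)
    print(
        f"{pct}%: {per_year.total_seconds() / 3600:.2f} h/year, "
        f"{per_month.total_seconds() / 60:.2f} min/month"
    )
```

Each additional nine divides the allowed downtime by ten, which is why the monthly allowance drops from hours to seconds across the four levels.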

Error Budget

An SLO of 99.9% means 0.1% failure is allowed. That 0.1% is the error budget.

Example

An SLO of 99.9% over 30 days allows about 43 minutes of downtime (0.1% of 43,200 minutes).

What can the budget be spent on?

Deploying a risky feature, while budget remains.
Maintenance or migration, within the available budget.
Budget exhausted: freeze new deployments and focus on reliability.
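The budget arithmetic and the freeze rule can be sketched in a few lines. This is a simplified, time-based model that treats the budget as minutes of downtime; the function names and the "deploy only while budget remains" policy check are illustrative assumptions, not a prescribed implementation.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total downtime allowed in the window; `slo` is a fraction such as 0.999."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Minutes of budget left after the downtime already incurred (negative if overspent)."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def can_deploy_risky_change(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """Hypothetical policy: risky deployments are allowed only while budget remains."""
    return budget_remaining(slo, downtime_minutes, window_days) > 0


print(f"budget:    {error_budget_minutes(0.999):.1f} min")
print(f"remaining: {budget_remaining(0.999, downtime_minutes=20):.1f} min")
print(f"deploy?    {can_deploy_risky_change(0.999, downtime_minutes=50)}")
```

With 20 minutes already spent, roughly 23 minutes remain for risky work; at 50 minutes the budget is overspent and the policy switches to a freeze.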

SRE practice (Google)

The error budget is how engineering balances reliability against innovation.

User-Centric SLI

Server uptime ≠ user happiness. Measure the user's experience instead:

"Is the page loading?" (from the browser's perspective).
"Is checkout completing?"
"Is search returning results within 500 ms?"

Collect these signals with synthetic monitoring together with real user monitoring (RUM).

Composite SLO

A system has multiple components:

Frontend 99.95%
API 99.9%
DB 99.99%
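End-to-end (multiplied): 0.9995 × 0.999 × 0.9999 ≈ 0.9984, i.e. about 99.84%

When every component sits on the request path, the composite is always lower than each individual figure. Optimize the critical path.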
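The multiplication can be sketched as follows. It assumes the components fail independently and that a request needs all of them (a serial dependency chain); the function name and the dictionary layout are illustrative.

```python
import math


def composite_availability(components: dict[str, float]) -> float:
    """End-to-end availability of a serial chain of independent components.

    Each value is a fraction, e.g. 0.999 for 99.9%.
    """
    if not components:
        raise ValueError("at least one component is required")
    return math.prod(components.values())


components = {
    "frontend": 0.9995,
    "api": 0.999,
    "db": 0.9999,
}

end_to_end = composite_availability(components)
weakest = min(components, key=components.get)

print(f"end-to-end: {end_to_end:.4%}")
print(f"weakest component: {weakest} ({components[weakest]:.2%})")
```

The result sits below even the weakest component (the API at 99.9%), which is why improving that component tends to move the end-to-end figure the most.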

Real-world examples

AWS S3: designed for 99.99% availability and 99.999999999% durability (11 nines); these are design targets, not the SLA figure.
Google Compute Engine: 99.99% for multi-zone deployments.
Cloudflare: claims 100% uptime (with caveats).
Stripe API: 99.99%, with a detailed status page.

SLO setup process

Identify the user journeys (signup, checkout, search).
Choose the critical SLIs (latency, availability).
Set a realistic SLO target, based on historical data.
Calculate the error budget.
Set up monitoring and alerting.
Review quarterly.

Common misconceptions

"A higher SLO is always better": 100% is impossible, and chasing it is expensive. Choose the right level.
"SLA = SLO": the SLA is a legal commitment; the SLO is an internal target.
"Server uptime = SLI": what matters is the user's experience, not the server.
"100% SLO": not realistic, and it contradicts the SRE error budget concept because it leaves no budget at all.

Best Practices

Keep SLA < SLO (leave a buffer).
Choose user-centric SLIs.
Track P50, P95, and P99 latency; averages are misleading.
Enforce the error budget.
Maintain a status page and write post-mortems.
Keep SLOs realistic; avoid over-promising.
Do a cost-benefit analysis: going from 99.9% to 99.99% costs a great deal.
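The point about averages in the list above is easy to see with numbers. The sketch below compares the mean with nearest-rank percentiles on a made-up latency sample in which 5% of requests are very slow; the `percentile` helper and the data are illustrative, and production systems usually rely on their metrics backend for this calculation.

```python
import math
from statistics import mean


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up sample: 95 fast requests and 5 very slow ones, in milliseconds.
latencies = [100.0] * 95 + [3000.0] * 5

print(f"mean: {mean(latencies):.0f} ms")
for p in (50, 95, 99):
    print(f"P{p}:  {percentile(latencies, p):.0f} ms")
```

Here the mean is 245 ms, a latency that no request actually experienced, while P50 and P95 are 100 ms and P99 is 3000 ms. Only the high percentile exposes the slow tail that some users hit.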

📌 Chapter summary

SLI = a measurable metric (availability, latency).
SLO = an internal target on an SLI.
SLA = the customer contract; set lower than the SLO, with financial consequences.
Error budget = the "allowed failure" under the SLO.
Each additional nine cuts the allowed downtime tenfold and costs far more to achieve. Choose the right level.