Part 4 · Reliability ও Security 📖 ১২ মিনিট পড়া 📝 ২০টি কুইজ

SLA, SLO, SLI

Reliability-কে measurable করা — SRE-র ভাষা।

📝 কুইজে যান

"আমাদের service ৯৯.৯% uptime দেয়" — এর মানে কী? কে কাকে কী promise করছে? কীভাবে measure হবে? এই তিনটি প্রশ্নের উত্তরে — SLA, SLO, SLI

তিন term এক নজরে

  • SLI (Service Level Indicator): কী measure (e.g., availability %)।
  • SLO (Service Level Objective): Internal target (e.g., 99.95%)।
  • SLA (Service Level Agreement): Customer-এর সাথে contract (e.g., 99.9% — কম হলে refund)।

SLI — Service Level Indicator

SLI = measurable metric। কী track করা হবে?

Common SLIs

  • Availability: Uptime %।
  • Latency: Response time (P50, P95, P99)।
  • Throughput: Requests/second।
  • Error rate: Failed requests %।
  • Durability: Data persistence rate।

Good SLI properties

  • User experience reflect করে (server CPU না)।
  • Measurable।
  • Aggregatable।

SLO — Service Level Objective

SLO = SLI-এর উপর target। Internal commitment।

উদাহরণ

  • "99.95% requests শেষ হবে <200ms-এ"।
  • "99.9% availability per month"।
  • "Error rate <0.1%"।

Format

"X% of [SLI] meets [threshold] over [time window]"

SLA — Service Level Agreement

SLA = customer-এর সাথে formal contract। SLO-র loose version (buffer রেখে)। Violation-এ financial consequence।

উদাহরণ

  • AWS S3: 99.9% availability/month — কম হলে service credit।
  • Google Cloud: 99.5% — কম হলে refund।
  • Stripe: API uptime SLA — credits in case of breach।
💡 মনে রাখুন: SLA < SLO। কারণ SLO miss = internal alarm; SLA miss = financial loss। SLO buffer রাখে SLA-র উপরে।

"Nines" — Uptime Math

৯৯% (২ nines)

  • ৩.৬৫ days/year downtime
  • ৭ hours/month
  • Hobby project

৯৯.৯% (৩ nines)

  • ৮.৭৬ hours/year
  • ৪৩ minutes/month
  • Standard SaaS

৯৯.৯৯% (৪ nines)

  • ৫২ minutes/year
  • ৪.৩ minutes/month
  • Enterprise

৯৯.৯৯৯% (৫ nines)

  • ৫.২৬ minutes/year
  • ২৬ seconds/month
  • Telecom, banking

Error Budget

SLO 99.9% মানে — 0.1% failure allowed। এই 0.1%-ই error budget

উদাহরণ

SLO 99.9% over 30 days = 43 minutes downtime allowed।

কী করতে পারে?

  • Risky feature deploy — budget থাকলে।
  • Maintenance/migration — budget allowable।
  • Budget exhausted = freeze new deployment, focus reliability।

SRE practice (Google)

Error budget — engineering reliability-এর সাথে innovation balance।

User-Centric SLI

Server uptime ≠ user happiness। User-এর experience-কে measure করুন:

  • "Is page loading?" (browser-perspective)।
  • "Is checkout completing?"।
  • "Is search returning results within 500ms?"।

Synthetic monitoring + real user monitoring (RUM)।

Composite SLO

System-এ multiple component:

  • Frontend 99.95%
  • API 99.9%
  • DB 99.99%
  • End-to-end multiplied: 99.95 × 99.9 × 99.99 ≈ 99.84%

Composite always < individual। Critical path optimize।

বাস্তব উদাহরণ

  • AWS S3: 99.99% availability, 99.999999999% durability (11 nines)।
  • Google Compute Engine: Multi-zone 99.99%।
  • Cloudflare: 100% historical uptime claim (with caveats)।
  • Stripe API: 99.99% with detailed status page।

SLO setup process

  1. User journey identify (signup, checkout, search)।
  2. Critical SLI choose (latency, availability)।
  3. Realistic SLO target — historical data দেখুন।
  4. Error budget calculate।
  5. Monitoring + alerting।
  6. Quarterly review।

সাধারণ ভুল ধারণা

  1. "Higher SLO always better": 100% impossible + expensive। Right level choose।
  2. "SLA = SLO": SLA legal commitment; SLO internal target।
  3. "Server uptime = SLI": User-experience matter, not server।
  4. "100% SLO": Reality না — SRE error budget concept-এর বিপরীত।

Best Practices

  • SLA < SLO (buffer রাখুন)।
  • User-centric SLI choose।
  • P50, P95, P99 latency track — average misleading।
  • Error budget enforce।
  • Status page + post-mortem।
  • SLO realistic — over-promise এড়ান।
  • Cost-benefit analysis: 99.9% → 99.99% massive cost।

📌 চ্যাপ্টার সারমর্ম

  • SLI = measurable metric (availability, latency)।
  • SLO = internal target on SLI।
  • SLA = customer contract; SLO-এর কম, financial consequence।
  • Error budget = SLO-র "allowed failure"।
  • Higher 9s = exponentially expensive। Right level choose।