SLA, SLO, SLI
Making reliability measurable: the language of SRE.
"Our service delivers 99.9% uptime." What does that mean? Who is promising what to whom? How will it be measured? The answers to these three questions are SLA, SLO, and SLI.
The three terms at a glance
- SLI (Service Level Indicator): what is measured (e.g., availability %).
- SLO (Service Level Objective): the internal target (e.g., 99.95%).
- SLA (Service Level Agreement): the contract with the customer (e.g., 99.9%, with a refund if it is missed).
SLI — Service Level Indicator
An SLI is a measurable metric. It answers the question: what will be tracked?
Common SLIs
- Availability: uptime %.
- Latency: response time (P50, P95, P99).
- Throughput: requests per second.
- Error rate: percentage of failed requests.
- Durability: fraction of stored data that is retained without loss.
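The indicators listed above can be computed from raw request records. The following is a minimal sketch, assuming a hypothetical `Request` record with a status code and a latency, and assuming that only 5xx responses count as failures; real systems often define "good" requests differently.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int          # HTTP status code returned to the client
    latency_ms: float    # time taken to serve the request


def availability(requests: list[Request]) -> float:
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        return 1.0  # assumption: no traffic counts as fully available
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def error_rate(requests: list[Request]) -> float:
    """Fraction of requests that failed; the complement of availability."""
    return 1.0 - availability(requests)


def throughput(requests: list[Request], window_seconds: float) -> float:
    """Average requests per second over the observation window."""
    return len(requests) / window_seconds


# Hypothetical sample: 997 successful requests and 3 server errors in 60 s.
sample = [Request(200, 120.0)] * 997 + [Request(500, 900.0)] * 3
print(f"availability: {availability(sample):.3%}")
print(f"error rate:   {error_rate(sample):.3%}")
print(f"throughput:   {throughput(sample, 60):.2f} req/s")
```

Counting requests rather than wall-clock time is one possible choice here; a time-based availability SLI would need uptime intervals instead of request records.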
Good SLI properties
- Reflects user experience (not server CPU).
- Measurable.
- Aggregatable.
SLO — Service Level Objective
An SLO is a target set on an SLI. It is an internal commitment.
Examples
- "99.95% of requests complete in under 200 ms."
- "99.9% availability per month."
- "Error rate below 0.1%."
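Targets like the ones above share a common shape: a required fraction, a threshold on the SLI, and a time window. As a rough sketch, the first example could be modelled like this; the `SLO` class and the name `checkout-latency` are hypothetical, and the sample data is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target: float        # required fraction of good events, e.g. 0.9995
    threshold_ms: float  # latency threshold a request must beat to be "good"
    window_days: int     # time window the target applies to

    def compliance(self, latencies_ms: list[float]) -> float:
        """Fraction of requests that met the latency threshold."""
        if not latencies_ms:
            return 1.0  # assumption: no traffic means nothing was violated
        good = sum(1 for v in latencies_ms if v < self.threshold_ms)
        return good / len(latencies_ms)

    def is_met(self, latencies_ms: list[float]) -> bool:
        """True when the observed compliance reaches the target."""
        return self.compliance(latencies_ms) >= self.target


latency_slo = SLO("checkout-latency", target=0.9995, threshold_ms=200, window_days=30)

# Hypothetical window: 4 slow requests out of 10,000.
observed = [150.0] * 9996 + [450.0] * 4
print(latency_slo.compliance(observed))
print(latency_slo.is_met(observed))
```

The caller is assumed to pass only the measurements that fall inside the window; the sketch does not filter by timestamp.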
Format
"X% of [SLI] meets [threshold] over [time window]"
SLA — Service Level Agreement
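An SLA is a formal contract with the customer. It is a looser version of the SLO (it leaves a buffer), and violating it carries financial consequences.
Examples
- AWS S3: 99.9% availability per month; service credits if it falls short.
- Google Cloud: product-specific SLAs (e.g., 99.5% for some offerings); financial credits if it falls short.
- Stripe: API uptime SLA, with credits in case of a breach.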
"Nines" — Uptime Math
99% (2 nines)
- 3.65 days/year downtime
- ~7.2 hours/month
- Hobby project
99.9% (3 nines)
- 8.76 hours/year
- ~43 minutes/month
- Standard SaaS
99.99% (4 nines)
- ~52.6 minutes/year
- ~4.3 minutes/month
- Enterprise
99.999% (5 nines)
- ~5.26 minutes/year
- ~26 seconds/month
- Telecom, banking
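The downtime figures above follow from one multiplication: the unavailable fraction times the length of the period. A small sketch, assuming a 365-day year and a 30-day month (the function name is illustrative):

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float) -> float:
    """Downtime allowance in minutes for an availability target over a period."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_days * 24 * 60


for pct in (99.0, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_minutes(pct, 365)
    per_month = allowed_downtime_minutes(pct, 30)
    print(f"{pct}%: {per_year:.2f} min/year, {per_month:.2f} min/month")
```

Each extra nine divides the allowance by ten, which is why the monthly figure drops from hours to seconds across the four levels.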
Error Budget
An SLO of 99.9% means 0.1% failure is allowed. That 0.1% is the error budget.
Example
An SLO of 99.9% over 30 days allows about 43 minutes of downtime (0.001 × 43,200 minutes = 43.2).
What can it be spent on?
- Deploying risky features, while budget remains.
- Maintenance and migrations, within the budget.
- Budget exhausted: freeze new deployments and focus on reliability.
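The budget arithmetic and the freeze rule can be sketched in a few lines. This is an illustration only: the function names are made up, and the "ship" versus "freeze" policy is a simplified stand-in for whatever release policy a team actually agrees on.

```python
def error_budget_minutes(slo_pct: float, window_days: int) -> float:
    """Total allowed downtime for the window, in minutes."""
    return (1 - slo_pct / 100) * window_days * 24 * 60


def budget_remaining(slo_pct: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_pct, window_days)
    return (budget - downtime_minutes) / budget


def release_decision(remaining: float) -> str:
    # Hypothetical policy: freeze feature releases once the budget is gone.
    return "ship" if remaining > 0 else "freeze"


# 30 minutes of downtime so far against a 99.9% / 30-day SLO (43.2 min budget).
remaining = budget_remaining(99.9, 30, downtime_minutes=30)
print(f"{remaining:.1%} of budget left -> {release_decision(remaining)}")
```

With 30 of the 43.2 minutes already spent, roughly 30% of the budget is left, so risky deployments are still permitted under this policy.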
SRE practice (Google)
The error budget balances reliability engineering against innovation.
User-Centric SLI
Server uptime ≠ user happiness. Measure the user's experience:
- "Is the page loading?" (from the browser's perspective).
- "Is checkout completing?"
- "Is search returning results within 500 ms?"
Collect these with synthetic monitoring plus real user monitoring (RUM).
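A synthetic check for a question such as "is search returning results within 500 ms?" might look roughly like the sketch below. The `probe` and `journey_sli` helpers are hypothetical, and `fake_search` is a stand-in for a real request; an actual probe would call the service over the network from outside it.

```python
import time
from typing import Callable


def probe(check: Callable[[], bool], threshold_ms: float) -> bool:
    """Run one synthetic check; good only if it succeeds within the threshold."""
    start = time.perf_counter()
    try:
        ok = check()
    except Exception:
        ok = False  # an exception counts as a failed user journey
    elapsed_ms = (time.perf_counter() - start) * 1000
    return bool(ok) and elapsed_ms <= threshold_ms


def journey_sli(results: list[bool]) -> float:
    """Fraction of probes in which the user journey succeeded in time."""
    return sum(results) / len(results) if results else 1.0


def fake_search() -> bool:
    # Stand-in for a real request such as "search returns results".
    time.sleep(0.01)
    return True


results = [probe(fake_search, threshold_ms=500) for _ in range(5)]
print(journey_sli(results))
```

A slow success and a fast failure are both counted as bad, which matches the user's point of view: the journey either worked in time or it did not.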
Composite SLO
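A system has multiple components:
- Frontend 99.95%
- API 99.9%
- DB 99.99%
- End-to-end (multiplied): 0.9995 × 0.999 × 0.9999 ≈ 0.9984, i.e. 99.84%
When a request depends on every component in series, the composite is always lower than any individual component. Optimize the critical path.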
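The multiplication above can be checked with a short sketch. It assumes that component failures are independent and that every request needs all components, which real systems only approximate; the function name is illustrative.

```python
import math


def serial_availability(components: dict[str, float]) -> float:
    """Availability of a request path that needs every component to work.

    Assumes failures are independent of each other.
    """
    return math.prod(components.values())


path = {"frontend": 0.9995, "api": 0.999, "db": 0.9999}
end_to_end = serial_availability(path)
print(f"end-to-end: {end_to_end:.4%}")
print(end_to_end < min(path.values()))
```

Because every factor is below 1, adding another serial dependency can only lower the product, so shortening the critical path raises the achievable end-to-end target.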
Real-world examples
- AWS S3: designed for 99.99% availability and 99.999999999% durability (11 nines).
- Google Compute Engine: 99.99% for multi-zone deployments.
- Cloudflare: claims 100% historical uptime (with caveats).
- Stripe API: 99.99% with a detailed status page.
SLO setup process
- Identify the user journeys (signup, checkout, search).
- Choose the critical SLIs (latency, availability).
- Set a realistic SLO target based on historical data.
- Calculate the error budget.
- Set up monitoring and alerting.
- Review quarterly.
Common misconceptions
- "A higher SLO is always better": 100% is impossible and expensive. Choose the right level.
- "SLA = SLO": the SLA is a legal commitment; the SLO is an internal target.
- "Server uptime = SLI": what matters is the user's experience, not the server.
- "100% SLO": not realistic, and it contradicts the SRE error budget concept.
Best Practices
- Keep the SLA looser than the SLO (SLA < SLO) to leave a buffer.
- Choose user-centric SLIs.
- Track P50, P95, and P99 latency; averages are misleading.
- Enforce the error budget.
- Maintain a status page and write post-mortems.
- Keep SLOs realistic; avoid over-promising.
- Do a cost-benefit analysis: going from 99.9% to 99.99% costs massively more.
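The advice to track percentiles rather than averages can be illustrated with made-up numbers. The sketch below uses a simple nearest-rank percentile (one of several common definitions) on a hypothetical sample in which 10% of requests are very slow.

```python
import math
import statistics


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 90 fast requests (100 ms) and 10 slow ones (2000 ms).
latencies = [100.0] * 90 + [2000.0] * 10

print("mean:", statistics.mean(latencies))
print("P50: ", percentile(latencies, 50))
print("P95: ", percentile(latencies, 95))
print("P99: ", percentile(latencies, 99))
```

Here the mean is 290 ms and the median is 100 ms, yet P95 and P99 are both 2000 ms: one user in ten waits two seconds, which neither the mean nor the median reveals.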
📌 Chapter summary
- SLI = a measurable metric (availability, latency).
- SLO = an internal target on an SLI.
- SLA = a customer contract; looser than the SLO, with financial consequences.
- Error budget = the "allowed failure" implied by the SLO.
- More 9s = exponentially more expensive. Choose the right level.