207 lines
3.5 KiB
Markdown
207 lines
3.5 KiB
Markdown
---
|
|
author: Joseph OBrien
|
|
status: unpublished
|
|
updated: '2025-12-23'
|
|
version: 1.0.1
|
|
tag: skill
|
|
type: reference
|
|
parent: debugging
|
|
---
|
|
|
|
# Incident Postmortem: {{INCIDENT_TITLE}}
|
|
|
|
**Incident ID:** {{INC-XXXX}}
|
|
**Date:** {{YYYY-MM-DD}}
|
|
**Duration:** {{START_TIME}} - {{END_TIME}} ({{DURATION}})
|
|
**Severity:** {{SEV1|SEV2|SEV3|SEV4}}
|
|
**Status:** {{RESOLVED|MONITORING}}
|
|
|
|
---
|
|
|
|
## Summary
|
|
|
|
{{ONE_PARAGRAPH_SUMMARY}}
|
|
|
|
### Impact
|
|
|
|
| Metric | Value |
|
|
|--------|-------|
|
|
| Users Affected | {{N}} |
|
|
| Revenue Impact | ${{N}} |
|
|
| Requests Failed | {{N}} |
|
|
| Error Rate | {{N}}% |
|
|
| Downtime | {{DURATION}} |
|
|
|
|
---
|
|
|
|
## Timeline
|
|
|
|
| Time (UTC) | Event |
|
|
|------------|-------|
|
|
| {{HH:MM}} | {{TRIGGER_EVENT}} |
|
|
| {{HH:MM}} | Alert fired: {{ALERT_NAME}} |
|
|
| {{HH:MM}} | On-call paged |
|
|
| {{HH:MM}} | Investigation started |
|
|
| {{HH:MM}} | Root cause identified |
|
|
| {{HH:MM}} | Mitigation applied |
|
|
| {{HH:MM}} | Service recovered |
|
|
| {{HH:MM}} | Incident closed |
|
|
|
|
---
|
|
|
|
## Root Cause
|
|
|
|
{{DETAILED_ROOT_CAUSE_ANALYSIS}}
|
|
|
|
### Contributing Factors
|
|
|
|
1. {{FACTOR_1}}
|
|
2. {{FACTOR_2}}
|
|
3. {{FACTOR_3}}
|
|
|
|
### What Failed
|
|
|
|
- **Detection:** {{HOW_WAS_IT_DETECTED}}
|
|
- **Prevention:** {{WHY_WASNT_IT_PREVENTED}}
|
|
- **Response:** {{RESPONSE_GAPS}}
|
|
|
|
---
|
|
|
|
## Resolution
|
|
|
|
### Immediate Actions
|
|
|
|
1. {{ACTION_1}}
|
|
2. {{ACTION_2}}
|
|
|
|
### Mitigation Steps
|
|
|
|
```bash
|
|
{{COMMANDS_OR_STEPS_TAKEN}}
|
|
```
|
|
|
|
### Verification
|
|
|
|
- [ ] Service health restored
|
|
- [ ] Error rates normalized
|
|
- [ ] No recurring alerts
|
|
|
|
---
|
|
|
|
## Lessons Learned
|
|
|
|
### What Went Well
|
|
|
|
- {{POSITIVE_1}}
|
|
- {{POSITIVE_2}}
|
|
|
|
### What Went Wrong
|
|
|
|
- {{NEGATIVE_1}}
|
|
- {{NEGATIVE_2}}
|
|
|
|
### Where We Got Lucky
|
|
|
|
- {{LUCKY_1}}
|
|
|
|
---
|
|
|
|
## Action Items
|
|
|
|
| ID | Action | Owner | Priority | Due Date | Status |
|
|
|----|--------|-------|----------|----------|--------|
|
|
| 1 | {{ACTION}} | {{OWNER}} | {{P1-4}} | {{DATE}} | {{STATUS}} |
|
|
| 2 | {{ACTION}} | {{OWNER}} | {{P1-4}} | {{DATE}} | {{STATUS}} |
|
|
| 3 | {{ACTION}} | {{OWNER}} | {{P1-4}} | {{DATE}} | {{STATUS}} |
|
|
|
|
### Prevention
|
|
|
|
- [ ] {{PREVENTIVE_MEASURE_1}}
|
|
- [ ] {{PREVENTIVE_MEASURE_2}}
|
|
|
|
### Detection
|
|
|
|
- [ ] {{DETECTION_IMPROVEMENT_1}}
|
|
- [ ] {{DETECTION_IMPROVEMENT_2}}
|
|
|
|
### Response
|
|
|
|
- [ ] {{RESPONSE_IMPROVEMENT_1}}
|
|
- [ ] {{RESPONSE_IMPROVEMENT_2}}
|
|
|
|
---
|
|
|
|
## Technical Details
|
|
|
|
### Affected Systems
|
|
|
|
| System | Impact | Recovery |
|
|
|--------|--------|----------|
|
|
| {{SYSTEM}} | {{DESCRIPTION}} | {{TIME}} |
|
|
|
|
### Metrics During Incident
|
|
|
|
| Metric | Normal | During Incident | Peak |
|
|
|--------|--------|-----------------|------|
|
|
| Latency (p99) | {{MS}} | {{MS}} | {{MS}} |
|
|
| Error Rate | {{N}}% | {{N}}% | {{N}}% |
|
|
| CPU Usage | {{N}}% | {{N}}% | {{N}}% |
|
|
| Memory | {{N}}GB | {{N}}GB | {{N}}GB |
|
|
|
|
### Logs
|
|
|
|
```
|
|
{{RELEVANT_LOG_SNIPPETS}}
|
|
```
|
|
|
|
---
|
|
|
|
## Communication
|
|
|
|
### Internal
|
|
|
|
| Time | Channel | Message |
|
|
|------|---------|---------|
|
|
| {{TIME}} | {{SLACK/EMAIL}} | {{SUMMARY}} |
|
|
|
|
### External
|
|
|
|
| Time | Channel | Audience | Message |
|
|
|------|---------|----------|---------|
|
|
| {{TIME}} | Status Page | Customers | {{MESSAGE}} |
|
|
|
|
---
|
|
|
|
## Related Incidents
|
|
|
|
| ID | Date | Similarity |
|
|
|----|------|------------|
|
|
| {{INC-XXXX}} | {{DATE}} | {{DESCRIPTION}} |
|
|
|
|
---
|
|
|
|
## Appendix
|
|
|
|
### A. Alert Configuration
|
|
|
|
```yaml
|
|
{{ALERT_CONFIG}}
|
|
```
|
|
|
|
### B. Runbook Updates Needed
|
|
|
|
- {{RUNBOOK_UPDATE_1}}
|
|
- {{RUNBOOK_UPDATE_2}}
|
|
|
|
---
|
|
|
|
## Quality Checklist
|
|
|
|
- [ ] Timeline is complete and accurate
|
|
- [ ] Root cause clearly identified
|
|
- [ ] Impact quantified
|
|
- [ ] Action items have owners and due dates
|
|
- [ ] Lessons learned documented
|
|
- [ ] Prevention measures identified
|
|
- [ ] Related incidents linked
|