Files
read_book/.trae/skills/debugging/references/incident-postmortem.template.md
寒寒 455dd1f4cd feat(desktop): 实现一些功能
1. 实现任务暂停功能

2. 实现页面的国际化功能

3.优化项目的结构以及BUG

4. 优化系统架构

5. 实现一大堆的功能
2026-01-25 03:30:23 +08:00

207 lines
3.5 KiB
Markdown

---
author: Joseph OBrien
status: unpublished
updated: '2025-12-23'
version: 1.0.1
tag: skill
type: reference
parent: debugging
---
# Incident Postmortem: {{INCIDENT_TITLE}}
**Incident ID:** {{INC-XXXX}}
**Date:** {{YYYY-MM-DD}}
**Duration:** {{START_TIME}} - {{END_TIME}} ({{DURATION}})
**Severity:** {{SEV1|SEV2|SEV3|SEV4}}
**Status:** {{RESOLVED|MONITORING}}
---
## Summary
{{ONE_PARAGRAPH_SUMMARY}}
### Impact
| Metric | Value |
|--------|-------|
| Users Affected | {{N}} |
| Revenue Impact | ${{N}} |
| Requests Failed | {{N}} |
| Error Rate | {{N}}% |
| Downtime | {{DURATION}} |
---
## Timeline
| Time (UTC) | Event |
|------------|-------|
| {{HH:MM}} | {{TRIGGER_EVENT}} |
| {{HH:MM}} | Alert fired: {{ALERT_NAME}} |
| {{HH:MM}} | On-call paged |
| {{HH:MM}} | Investigation started |
| {{HH:MM}} | Root cause identified |
| {{HH:MM}} | Mitigation applied |
| {{HH:MM}} | Service recovered |
| {{HH:MM}} | Incident closed |
---
## Root Cause
{{DETAILED_ROOT_CAUSE_ANALYSIS}}
### Contributing Factors
1. {{FACTOR_1}}
2. {{FACTOR_2}}
3. {{FACTOR_3}}
### What Failed
- **Detection:** {{HOW_WAS_IT_DETECTED}}
- **Prevention:** {{WHY_WASNT_IT_PREVENTED}}
- **Response:** {{RESPONSE_GAPS}}
---
## Resolution
### Immediate Actions
1. {{ACTION_1}}
2. {{ACTION_2}}
### Mitigation Steps
```bash
{{COMMANDS_OR_STEPS_TAKEN}}
```
### Verification
- [ ] Service health restored
- [ ] Error rates normalized
- [ ] No recurring alerts
---
## Lessons Learned
### What Went Well
- {{POSITIVE_1}}
- {{POSITIVE_2}}
### What Went Wrong
- {{NEGATIVE_1}}
- {{NEGATIVE_2}}
### Where We Got Lucky
- {{LUCKY_1}}
---
## Action Items
| ID | Action | Owner | Priority | Due Date | Status |
|----|--------|-------|----------|----------|--------|
| 1 | {{ACTION}} | {{OWNER}} | {{P1-4}} | {{DATE}} | {{STATUS}} |
| 2 | {{ACTION}} | {{OWNER}} | {{P1-4}} | {{DATE}} | {{STATUS}} |
| 3 | {{ACTION}} | {{OWNER}} | {{P1-4}} | {{DATE}} | {{STATUS}} |
### Prevention
- [ ] {{PREVENTIVE_MEASURE_1}}
- [ ] {{PREVENTIVE_MEASURE_2}}
### Detection
- [ ] {{DETECTION_IMPROVEMENT_1}}
- [ ] {{DETECTION_IMPROVEMENT_2}}
### Response
- [ ] {{RESPONSE_IMPROVEMENT_1}}
- [ ] {{RESPONSE_IMPROVEMENT_2}}
---
## Technical Details
### Affected Systems
| System | Impact | Recovery |
|--------|--------|----------|
| {{SYSTEM}} | {{DESCRIPTION}} | {{TIME}} |
### Metrics During Incident
| Metric | Normal | During Incident | Peak |
|--------|--------|-----------------|------|
| Latency (p99) | {{MS}} | {{MS}} | {{MS}} |
| Error Rate | {{N}}% | {{N}}% | {{N}}% |
| CPU Usage | {{N}}% | {{N}}% | {{N}}% |
| Memory | {{N}}GB | {{N}}GB | {{N}}GB |
### Logs
```
{{RELEVANT_LOG_SNIPPETS}}
```
---
## Communication
### Internal
| Time | Channel | Message |
|------|---------|---------|
| {{TIME}} | {{SLACK/EMAIL}} | {{SUMMARY}} |
### External
| Time | Channel | Audience | Message |
|------|---------|----------|---------|
| {{TIME}} | Status Page | Customers | {{MESSAGE}} |
---
## Related Incidents
| ID | Date | Similarity |
|----|------|------------|
| {{INC-XXXX}} | {{DATE}} | {{DESCRIPTION}} |
---
## Appendix
### A. Alert Configuration
```yaml
{{ALERT_CONFIG}}
```
### B. Runbook Updates Needed
- {{RUNBOOK_UPDATE_1}}
- {{RUNBOOK_UPDATE_2}}
---
## Quality Checklist
- [ ] Timeline is complete and accurate
- [ ] Root cause clearly identified
- [ ] Impact quantified
- [ ] Action items have owners and due dates
- [ ] Lessons learned documented
- [ ] Prevention measures identified
- [ ] Related incidents linked