read_book/.trae/skills/debugging/references/incident-postmortem.template.md
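寒寒 455dd1f4cd feat(desktop): implement several features
1. Implement task pausing

2. Implement page internationalization

3. Improve the project structure and fix bugs

4. Improve the system architecture

5. Implement a large number of features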
2026-01-25 03:30:23 +08:00


| author | status | updated | version | tag | type | parent |
| --- | --- | --- | --- | --- | --- | --- |
| Joseph OBrien | unpublished | 2025-12-23 | 1.0.1 | skill | reference | debugging |

Incident Postmortem: {{INCIDENT_TITLE}}

Incident ID: {{INC-XXXX}}
Date: {{YYYY-MM-DD}}
Duration: {{START_TIME}} - {{END_TIME}} ({{DURATION}})
Severity: {{SEV1|SEV2|SEV3|SEV4}}
Status: {{RESOLVED|MONITORING}}


Summary

{{ONE_PARAGRAPH_SUMMARY}}

Impact

| Metric | Value |
| --- | --- |
| Users Affected | {{N}} |
| Revenue Impact | ${{N}} |
| Requests Failed | {{N}} |
| Error Rate | {{N}}% |
| Downtime | {{DURATION}} |
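
The Error Rate row can be derived from request counts for the incident window. The following is a minimal sketch, assuming a total request count is available from your metrics system; `total_requests`, the function name, and the sample numbers are hypothetical and not part of this template.

```python
def error_rate_percent(failed_requests: int, total_requests: int) -> float:
    """Return failed requests as a percentage of all requests in the incident window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    return 100.0 * failed_requests / total_requests


# Illustrative numbers only, not from a real incident.
print(f"{error_rate_percent(1250, 50000):.2f}%")  # prints "2.50%"
```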

Timeline

| Time (UTC) | Event |
| --- | --- |
| {{HH:MM}} | {{TRIGGER_EVENT}} |
| {{HH:MM}} | Alert fired: {{ALERT_NAME}} |
| {{HH:MM}} | On-call paged |
| {{HH:MM}} | Investigation started |
| {{HH:MM}} | Root cause identified |
| {{HH:MM}} | Mitigation applied |
| {{HH:MM}} | Service recovered |
| {{HH:MM}} | Incident closed |
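
Intervals such as time to detect, time to mitigate, and total downtime can be computed from the timeline entries. The sketch below assumes `HH:MM` timestamps in UTC and an incident shorter than 24 hours; the dictionary keys, helper name, and sample times are hypothetical.

```python
from datetime import datetime, timedelta


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both "HH:MM" in UTC.

    Assumes the interval is shorter than 24 hours; an end time earlier
    than the start is treated as falling on the next day.
    """
    t0 = datetime.strptime(start, "%H:%M")
    t1 = datetime.strptime(end, "%H:%M")
    if t1 < t0:
        t1 += timedelta(days=1)
    return int((t1 - t0).total_seconds() // 60)


# Illustrative timestamps only, not from a real incident.
timeline = {
    "trigger": "23:40",
    "alert_fired": "23:52",
    "mitigation_applied": "00:35",
    "service_recovered": "00:50",
}

time_to_detect = minutes_between(timeline["trigger"], timeline["alert_fired"])
time_to_mitigate = minutes_between(timeline["trigger"], timeline["mitigation_applied"])
downtime = minutes_between(timeline["trigger"], timeline["service_recovered"])
print(time_to_detect, time_to_mitigate, downtime)  # prints "12 55 70"
```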

Root Cause

{{DETAILED_ROOT_CAUSE_ANALYSIS}}

Contributing Factors

  1. {{FACTOR_1}}
  2. {{FACTOR_2}}
  3. {{FACTOR_3}}

What Failed

  • Detection: {{HOW_WAS_IT_DETECTED}}
  • Prevention: {{WHY_WASNT_IT_PREVENTED}}
  • Response: {{RESPONSE_GAPS}}

Resolution

Immediate Actions

  1. {{ACTION_1}}
  2. {{ACTION_2}}

Mitigation Steps

{{COMMANDS_OR_STEPS_TAKEN}}

Verification

  • Service health restored
  • Error rates normalized
  • No recurring alerts

Lessons Learned

What Went Well

  • {{POSITIVE_1}}
  • {{POSITIVE_2}}

What Went Wrong

  • {{NEGATIVE_1}}
  • {{NEGATIVE_2}}

Where We Got Lucky

  • {{LUCKY_1}}

Action Items

| ID | Action | Owner | Priority | Due Date | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | {{ACTION}} | {{OWNER}} | {{P1-4}} | {{DATE}} | {{STATUS}} |
| 2 | {{ACTION}} | {{OWNER}} | {{P1-4}} | {{DATE}} | {{STATUS}} |
| 3 | {{ACTION}} | {{OWNER}} | {{P1-4}} | {{DATE}} | {{STATUS}} |

Prevention

  • {{PREVENTIVE_MEASURE_1}}
  • {{PREVENTIVE_MEASURE_2}}

Detection

  • {{DETECTION_IMPROVEMENT_1}}
  • {{DETECTION_IMPROVEMENT_2}}

Response

  • {{RESPONSE_IMPROVEMENT_1}}
  • {{RESPONSE_IMPROVEMENT_2}}

Technical Details

Affected Systems

| System | Impact | Recovery |
| --- | --- | --- |
| {{SYSTEM}} | {{DESCRIPTION}} | {{TIME}} |

Metrics During Incident

| Metric | Normal | During Incident | Peak |
| --- | --- | --- | --- |
| Latency (p99) | {{MS}} | {{MS}} | {{MS}} |
| Error Rate | {{N}}% | {{N}}% | {{N}}% |
| CPU Usage | {{N}}% | {{N}}% | {{N}}% |
| Memory | {{N}}GB | {{N}}GB | {{N}}GB |
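
If the p99 latency has to be computed from raw samples rather than read off a dashboard, one common definition is the nearest-rank percentile. This is a sketch under that assumption; monitoring systems often interpolate or use histogram buckets, so their reported p99 may differ slightly. The function name and sample data are hypothetical.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


latencies_ms = [float(v) for v in range(1, 101)]  # illustrative: 1 ms .. 100 ms
print(percentile_nearest_rank(latencies_ms, 99))  # prints "99.0"
```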

Logs

{{RELEVANT_LOG_SNIPPETS}}

Communication

Internal

| Time | Channel | Message |
| --- | --- | --- |
| {{TIME}} | {{SLACK/EMAIL}} | {{SUMMARY}} |

External

| Time | Channel | Audience | Message |
| --- | --- | --- | --- |
| {{TIME}} | Status Page | Customers | {{MESSAGE}} |

Related Incidents

| ID | Date | Similarity |
| --- | --- | --- |
| {{INC-XXXX}} | {{DATE}} | {{DESCRIPTION}} |

Appendix

A. Alert Configuration

{{ALERT_CONFIG}}

B. Runbook Updates Needed

  • {{RUNBOOK_UPDATE_1}}
  • {{RUNBOOK_UPDATE_2}}

Quality Checklist

  • Timeline is complete and accurate
  • Root cause clearly identified
  • Impact quantified
  • Action items have owners and due dates
  • Lessons learned documented
  • Prevention measures identified
  • Related incidents linked
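
Part of this checklist can be automated by scanning a filled-in postmortem for leftover `{{...}}` placeholders. The sketch below assumes the document is available as plain text; the function name and sample draft are hypothetical.

```python
import re

# Matches any remaining {{...}} token, e.g. {{OWNER}} or {{SEV1|SEV2|SEV3|SEV4}}.
PLACEHOLDER = re.compile(r"\{\{[^{}]+\}\}")


def unfilled_placeholders(text: str) -> list[tuple[int, str]]:
    """Return (line_number, placeholder) pairs for every token still in the text."""
    found = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            found.append((lineno, match.group(0)))
    return found


# Illustrative draft with two lines that still contain placeholders.
draft = "Severity: SEV2\nOwner: {{OWNER}}\nDue: {{DATE}} Status: {{STATUS}}"
print(unfilled_placeholders(draft))
# prints "[(2, '{{OWNER}}'), (3, '{{DATE}}'), (3, '{{STATUS}}')]"
```

An empty result means every placeholder has been replaced; it does not verify that the replacement text is accurate.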