From a8dae187f93fb3b3fcbaa2d5cac4e1591bff30b3 Mon Sep 17 00:00:00 2001
From: Carlo <1778532+cdcme@users.noreply.github.com>
Date: Tue, 7 Apr 2026 11:26:54 -0400
Subject: [PATCH] Update handbook incident response sections (#43049)

---
 handbook/company/product-groups.md | 2 ++
 handbook/engineering/README.md     | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/handbook/company/product-groups.md b/handbook/company/product-groups.md
index 535dffd459..6c45e28013 100644
--- a/handbook/company/product-groups.md
+++ b/handbook/company/product-groups.md
@@ -825,6 +825,8 @@ Each product group maintains two engineers assigned to incident on-call. Enginee
 
 #### Incident on-call responsibilities
 
+The incident on-call engineer leads the incident from acknowledgment through resolution and owns internal communication. Don't assume anyone is already aware. Mention the right people in the incident channel to pull them in: someone from CS (the reporter, if CS reported the issue), the relevant engineering manager, and any engineers or QA needed for investigation. Post regular status updates in the incident channel and keep the incident response issue up to date.
+
 **Outside of business hours**
 
 The incident on-call engineer is responsible for stabilizing the issue well enough to pick it back up in the morning. They should file P1 issues for any immediate follow-up items that need to be addressed during the next business day.
diff --git a/handbook/engineering/README.md b/handbook/engineering/README.md
index fb57e34d56..a21eb0c34f 100644
--- a/handbook/engineering/README.md
+++ b/handbook/engineering/README.md
@@ -176,6 +176,7 @@ The incident on-call engineer is responsible for:
 - Knowing [the incident on-call rotation](https://fleetdm.com/handbook/company/product-groups#incident-on-call-engineer).
 - Completing the [incident.io on-call engineer onboarding steps](https://help.incident.io/articles/3472064049-get-started-as-an-on-call-responder) sent via email when invited to incident.io.
 - Confirming incident pages push through Do Not Disturb.
+- Assuming the incident lead in incident.io.
 - Performing the [incident on-call responsibilities](https://fleetdm.com/handbook/company/product-groups#incident-on-call-responsibilities).
 
 
@@ -199,7 +200,7 @@ Incident notifications are sent 24/7/365 via incident.io, triggered by creating
 
 Mitigating the outage may require writing and merging code. The current infrastructure on-call engineer is first line for all reviews and QA required to deploy a hot-fix. If additional code review or engineering support is needed, the responding engineer should escalate to their manager.
 
-> If outside of business hours, the incident on-call engineer is responsible for stabilizing the issue well enough to pick it back up in the morning, and should file P1 issues for any immediate follow-up items. During business hours, the incident on-call engineer triages the incident and coordinates a response across engineering, QA, CS, and infrastructure until the incident has been resolved.
+> If outside of business hours, the incident on-call engineer is responsible for stabilizing the issue well enough to pick it back up in the morning, and should file P1 issues for any immediate follow-up items. During business hours, the incident on-call engineer triages the incident and coordinates a response across engineering, QA, CS, and infrastructure until the incident has been resolved. See [incident on-call responsibilities](https://fleetdm.com/handbook/company/product-groups#incident-on-call-responsibilities) for details.
 
 ### Participate in QA Day
 