Mirror of https://github.com/fleetdm/fleet, synced 2026-05-23 08:58:41 +00:00
Updates on how to request infrastructure oncall assistance (#36205)
Co-authored-by: Sam Pfluger <108141731+Sampfluger88@users.noreply.github.com>
parent 5850840d04
commit 261724f6a5
1 changed file with 14 additions and 5 deletions
@@ -207,6 +207,7 @@ During the window of time available to investigate an issue, use the resources a
- Escalate to other CSEs or CSAs.
- Contact the developer on-call.
Note: For customer requests where no CSA is engaged, CSEs are responsible for escalating to a CSA as needed.
### Contact the developer on-call
@@ -216,6 +217,14 @@ The acting developer on-call rotation is reflected in the [📈KPIs spreadsheet
- An automated weekly [on-call handoff](https://fleetdm.com/handbook/engineering#handoff) Slack thread in #g-engineering provides the opportunity to discuss highlights, improvements, and hand off ongoing issues.
### Contact the infrastructure engineer on-call
The acting infrastructure engineer on-call rotation is reflected in the [📈KPIs spreadsheet (confidential Google sheet)](https://docs.google.com/spreadsheets/d/1Hso0LxqwrRVINCyW_n436bNHmoqhoLhC8bcbvLPOs9A/edit#gid=0&range=F2). The individual on-call is responsible for responding to infrastructure-related Slack comments, Slack threads, and GitHub issues raised by customers and the community that the CSE team cannot address. These may be self-hosted or Fleet Managed Cloud bugs or performance issues that are suspected to be infrastructure-related.
- To reach the infrastructure engineer on-call for assistance, a CSE or developer should mention them in Slack using `@infrastructure-oncall` in the [#help-infrastructure](https://fleetdm.slack.com/archives/C051QJU3D0V) channel or in the customer channel where the original request lives.
- Support issues must be handled in the relevant customer or internal Slack channel rather than in direct messages (DMs). This ensures that questions and solutions can be easily referenced in the future and helps the infrastructure engineering team stay focused on their planned work.
- A CSE or CSA must always triage and process suspected infrastructure issues before tagging in the infrastructure engineer on-call.
- If your request for infrastructure is not urgent and/or not related to a suspected bug or performance issue impacting a customer, please create an issue on the [#help-customers kanban board](https://github.com/orgs/fleetdm/projects/79/views/1?filterQuery=) and @ mention the SVP of Customer Success to request prioritization.
### Onboard a customer success team member
@@ -250,20 +259,20 @@ The acting developer on-call rotation is reflected in the [📈KPIs spreadsheet
### Respond to messages and alerts
Customer Support and 24/7 on-call Engineers are responsible for the first response to Slack messages in the [#fleet channel](https://osquery.slack.com/archives/C01DXJL16D8) of osquery Slack, and other public Slacks.
- The 24/7 on-call is responsible for alarms related to fleetdm.com and Fleet Managed Cloud, as well as delivering 24/7 support for Fleet Premium customers. Use [on-call runbooks](https://github.com/fleetdm/confidential/tree/main/infrastructure/runbooks#readme) to guide your response. Runbooks provided detailed, step-by-step instructions to quickly and effectively respond to and resolve most 24/7 on-call alerts.
- We respond within 1-hour during business hours and 4 hours outside business hours. Note that we do not need to have answers within 1 hour -- we need to at least acknowledge and collect any additional necessary information while researching/escalating to find answers internally.
Customer Support Engineers (CSEs) are responsible for the first response to Slack messages in the [#fleet channel](https://osquery.slack.com/archives/C01DXJL16D8) of osquery Slack, MacAdmins Slack and dedicated customer Slack channels.
- The 24/7 infrastructure on-call engineer is responsible for alarms related to fleetdm.com and Fleet Managed Cloud, as well as delivering 24/7 support for Fleet Premium customers when tagged in for assistance. Use [on-call runbooks](https://github.com/fleetdm/confidential/tree/main/infrastructure/runbooks#readme) to guide your response. Runbooks provide detailed, step-by-step instructions to quickly and effectively respond to and resolve most 24/7 on-call alerts.
- We respond within 1 hour during business hours and within 4 hours outside business hours. Note that we do not need to have answers within 1 hour -- we need to at least acknowledge the message and collect any additional necessary information while researching/escalating to find answers internally.
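
The response windows above can be sketched as a small deadline calculation. This is only an illustration, not part of Fleet's tooling: the handbook text does not define business hours, so the 09:00 to 17:00, Monday to Friday window and the function names below are assumptions.

```python
from datetime import datetime, time, timedelta

# Assumed business hours; the handbook text does not define them.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def is_business_hours(ts: datetime) -> bool:
    """Return True if ts is on a weekday within the assumed business hours."""
    return ts.weekday() < 5 and BUSINESS_START <= ts.time() < BUSINESS_END


def first_response_deadline(received: datetime) -> datetime:
    """Latest time to acknowledge a message received at `received`.

    1 hour if the message arrives during business hours, 4 hours otherwise.
    """
    window = timedelta(hours=1) if is_business_hours(received) else timedelta(hours=4)
    return received + window


if __name__ == "__main__":
    # A message received on a Monday morning must be acknowledged within 1 hour.
    print(first_response_deadline(datetime(2025, 1, 6, 10, 0)))
```

The deadline covers acknowledgement only; as noted above, a full answer is not required within the window.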
### Maintain first responder SLA
The first responder on-call for Managed Cloud will take ownership of the @infrastructure-oncall alias in Slack first thing Monday morning. The previous week's on-call will provide a summary in the #help-customers Slack channel with an update on alarms that came up the week before, open issues with or without direct end-user impact, and other issues to keep an eye out for.
- **First responders:** Robert Fairburn, Kathy Satterlee
- **First responders:** Robert Fairburn, Jorge Falcon
Escalation of alarms will be done manually by the first responder according to the escalation contacts mentioned above. A [suspected outage issue](https://github.com/fleetdm/confidential/issues/new?assignees=&labels=%23outage%2C%23g-cx%2C%3Arelease&projects=&template=outage.md&title=Suspected+outage%3A+YYYY-MM-DD) should be created to track the escalation and determine the root cause.
- **Escalations (in order):** » Eric Shaw (fleetdm.com) » Zay Hanlon » Luke Heath » Mike McNeil
All infrastructure alarms (fleetdm.com and Managed Cloud) will go to #help-p1. When the current 24/7 on-call engineer is unable to meet the response time SLAs, it is their responsibility to arrange and designate a replacement who will assume the @oncall-infrastructure Slack alias.
All infrastructure alarms (fleetdm.com and Managed Cloud) will go to #help-p1. When the current 24/7 on-call engineer is unable to meet the response time SLAs, it is their responsibility to arrange and designate a replacement who will assume the @infrastructure-oncall Slack alias.
### Communicate feedback on prioritized customer requests