fleet/server/service/setup_experience.go
Scott Gress 61970118e9
Stop setup experience on software install failure (#34173)
**Related issue:** Resolves #33173
**Related issue:** Resolves #33111 

# Details

This is the remaining work to implement the "Stop the setup experience
when required software fails to install" feature. This turned out to be
less straightforward than expected, so I ended up doing a bit of
design-by-code and expect some feedback on the approach. I tried to keep
the changes as low-touch as possible. The general design is:

1. In the `maybeUpdateSetupExperienceStatus` function, which is called in
various places when a setup experience step is marked as completed, call
a new `maybeCancelPendingSetupExperienceSteps` function if a software or
VPP install step was marked as failed. Similarly, call
`maybeCancelPendingSetupExperienceSteps` if a VPP app install fails to
enqueue.
2. In `maybeCancelPendingSetupExperienceSteps`, check whether the
specified host is macOS and whether the `RequireAllSoftware` flag under
`MDM.MacOSSetup` is set in the host's team config (or in the global
config, for hosts with no team). If so, mark the remaining setup
experience items as canceled and cancel any upcoming activities
(software installs, VPP installs, or script runs) related to pending or
running steps.
3. On the front-end, if `require_all_software_macos` is enabled and a
software step is marked as failed, show a new failure page indicating
that setup has failed, with details of the failed software.
4. On the agent side, when checking the setup experience status, send a
`reset_after_failure` flag _only the first time_. If this flag is set,
the `/orbit/setup_experience/status` handler clears and re-queues any
failed setup experience steps (but leaves successful steps alone, to
avoid reinstalling software that is already installed). This allows the
setup experience to restart when the host is rebooted.
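To make the control flow of steps 1, 2 and 4 concrete, here is a minimal in-memory sketch. Everything in it is hypothetical: `Status`, `Step`, `Host`, `markStep`, `maybeCancelPending`, `agent` and `handleStatus` are illustrative names, not Fleet's datastore or service API, and the real code cancels activities through the datastore rather than by mutating a slice. It also simplifies two points. Here any failed step triggers cancellation, whereas in `maybeUpdateSetupExperienceStatus` only software and VPP install failures do. It also assumes that canceled steps are re-queued along with failed ones on reset.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for a setup experience step status; the
// names below are illustrative and are not Fleet's fleet package constants.
type Status string

const (
	Pending  Status = "pending"
	Running  Status = "running"
	Success  Status = "success"
	Failure  Status = "failure"
	Canceled Status = "canceled"
)

// Step is a simplified, hypothetical setup experience step.
type Step struct {
	Name   string
	Status Status
}

// Host is a hypothetical host: its platform, the effective "require all
// software" setting (team config, or global config if the host has no team),
// and its queued setup steps.
type Host struct {
	Platform           string
	RequireAllSoftware bool
	Steps              []Step
}

// markStep records a step's new status. On failure it tries to halt the rest
// of the setup experience (steps 1 and 2 above). Simplification: the real code
// only does this for software and VPP install failures.
func markStep(h *Host, i int, s Status) {
	h.Steps[i].Status = s
	if s == Failure {
		maybeCancelPending(h)
	}
}

// maybeCancelPending cancels pending or running steps, but only for macOS
// hosts whose config requires all software to install.
func maybeCancelPending(h *Host) {
	if h.Platform != "darwin" || !h.RequireAllSoftware {
		return
	}
	for i := range h.Steps {
		if h.Steps[i].Status == Pending || h.Steps[i].Status == Running {
			h.Steps[i].Status = Canceled
		}
	}
}

// agent models the fleetd side: it sets reset_after_failure only on its
// first status check (for example, the first one after a reboot).
type agent struct{ checked bool }

func (a *agent) resetAfterFailure() bool {
	first := !a.checked
	a.checked = true
	return first
}

// handleStatus models the status handler: when the flag is set, every step
// that did not succeed is re-queued (assumption: canceled steps included),
// and successful steps are left alone.
func handleStatus(h *Host, resetAfterFailure bool) {
	if !resetAfterFailure {
		return
	}
	for i := range h.Steps {
		if h.Steps[i].Status != Success {
			h.Steps[i].Status = Pending
		}
	}
}

func main() {
	h := &Host{Platform: "darwin", RequireAllSoftware: true, Steps: []Step{
		{"App A", Success}, {"App B", Running}, {"App C", Pending}, {"script", Pending},
	}}
	markStep(h, 1, Failure)
	fmt.Println(h.Steps) // [{App A success} {App B failure} {App C canceled} {script canceled}]

	a := &agent{}
	handleStatus(h, a.resetAfterFailure()) // first check after reboot: reset
	fmt.Println(h.Steps)                   // [{App A success} {App B pending} {App C pending} {script pending}]
	handleStatus(h, a.resetAfterFailure()) // later checks: no reset
}
```

Keeping the platform and "require all software" checks inside the cancel helper, rather than at each call site, matches the design above: callers only report that a step failed, and the helper decides whether anything should be halted.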

I also changed how software (packages and VPP apps) is queued up for
the setup experience so that it is ordered alphabetically, both to make
it easier to test _and_ because this is a desired outcome for a future
story. Since the order is currently non-deterministic, nothing should
depend on it, so this change shouldn't cause any problems (aside from a
couple of test updates), but I'm OK taking it out if desired.
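Here is a minimal sketch of the alphabetical ordering. It assumes that software packages and VPP apps are merged into a single list and compared by name case-insensitively. The description does not specify either detail, and `queuedSoftware` and `orderForSetup` are hypothetical names. A stable sort keeps the original relative order of items whose names compare equal.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// queuedSoftware is a hypothetical item awaiting installation during setup.
type queuedSoftware struct {
	Name string
	VPP  bool // true for a VPP app, false for a software package
}

// orderForSetup merges packages and VPP apps into one list sorted by name.
// Assumption: names are compared case-insensitively.
func orderForSetup(packages, vppApps []queuedSoftware) []queuedSoftware {
	all := make([]queuedSoftware, 0, len(packages)+len(vppApps))
	all = append(all, packages...)
	all = append(all, vppApps...)
	sort.SliceStable(all, func(i, j int) bool {
		return strings.ToLower(all[i].Name) < strings.ToLower(all[j].Name)
	})
	return all
}

func main() {
	pkgs := []queuedSoftware{{Name: "Zoom"}, {Name: "firefox"}}
	vpp := []queuedSoftware{{Name: "Slack", VPP: true}}
	for _, s := range orderForSetup(pkgs, vpp) {
		fmt.Println(s.Name) // prints firefox, Slack, Zoom (one per line)
	}
}
```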

# Checklist for submitter


- [X] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
See [Changes
files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files)
for more information.

- [X] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)

## Testing

- [X] Added/updated automated tests
  * Added a new integration test for software packages, testing that a
    failed software package causes the remaining setup experience steps
    to be canceled when `require_all_software_macos` is set, and testing
    that the "reset after failure" code works.
  * Added a new integration test for VPP packages, testing that a failed
    VPP enqueue halts the setup experience in the same way.

  I _don't_ have a test for a failure _during_ a VPP install. It should
  go through the same code path as a software package failure, so it's
  not a huge gap.

- [ ] QA'd all new/changed functionality manually
Working on it 

## fleetd/orbit/Fleet Desktop

- [X] Verified compatibility with the latest released version of Fleet
(see [Must
rule](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/workflows/fleetd-development-and-release-strategy.md))
- [X] If the change applies to only one platform, confirmed that
`runtime.GOOS` is used as needed to isolate changes
- [X] Verified that fleetd runs on macOS, Linux and Windows



## Summary by CodeRabbit

- New Features
  - Configurable option to halt macOS device setup if any software install
    fails.
  - Device setup page now shows a clear “Device setup failed” state with
    expandable error details when all software is required on macOS.
- Improvements
  - Setup status now includes per-step error messages for better
    troubleshooting.
  - Pending setup steps are automatically canceled after a failure when
    applicable, with support to reset and retry the setup flow as
    configured.


---------

Co-authored-by: Ian Littman <iansltx@gmail.com>
2025-10-17 08:38:53 -05:00


package service

import (
	"context"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"path/filepath"
	"strconv"
	"time"

	"github.com/docker/go-units"
	"github.com/fleetdm/fleet/v4/server/contexts/ctxerr"
	"github.com/fleetdm/fleet/v4/server/fleet"
	"github.com/fleetdm/fleet/v4/server/ptr"
)

type putSetupExperienceSoftwareRequest struct {
	Platform string `json:"platform"`
	TeamID   uint   `json:"team_id"`
	TitleIDs []uint `json:"software_title_ids"`
}

func (r *putSetupExperienceSoftwareRequest) ValidateRequest() error {
	return validateSetupExperiencePlatform(r.Platform)
}

type putSetupExperienceSoftwareResponse struct {
	Err error `json:"error,omitempty"`
}

func (r putSetupExperienceSoftwareResponse) Error() error { return r.Err }

func putSetupExperienceSoftware(ctx context.Context, request interface{}, svc fleet.Service) (fleet.Errorer, error) {
	req := request.(*putSetupExperienceSoftwareRequest)
	platform := transformPlatformForSetupExperience(req.Platform)

	err := svc.SetSetupExperienceSoftware(ctx, platform, req.TeamID, req.TitleIDs)
	if err != nil {
		return &putSetupExperienceSoftwareResponse{Err: err}, nil
	}

	return &putSetupExperienceSoftwareResponse{}, nil
}

func (svc *Service) SetSetupExperienceSoftware(ctx context.Context, platform string, teamID uint, titleIDs []uint) error {
	// skipauth: No authorization check needed due to implementation returning
	// only license error.
	svc.authz.SkipAuthorization(ctx)

	return fleet.ErrMissingLicense
}

type getSetupExperienceSoftwareRequest struct {
	Platform string `query:"platform,optional"`
	fleet.ListOptions
	TeamID uint `query:"team_id"`
}

func (r *getSetupExperienceSoftwareRequest) ValidateRequest() error {
	return validateSetupExperiencePlatform(r.Platform)
}

type getSetupExperienceSoftwareResponse struct {
	SoftwareTitles []fleet.SoftwareTitleListResult `json:"software_titles"`
	Count          int                             `json:"count"`
	Meta           *fleet.PaginationMetadata       `json:"meta"`
	Err            error                           `json:"error,omitempty"`
}

func (r getSetupExperienceSoftwareResponse) Error() error { return r.Err }

func getSetupExperienceSoftware(ctx context.Context, request interface{}, svc fleet.Service) (fleet.Errorer, error) {
	req := request.(*getSetupExperienceSoftwareRequest)
	platform := transformPlatformForSetupExperience(req.Platform)

	titles, count, meta, err := svc.ListSetupExperienceSoftware(ctx, platform, req.TeamID, req.ListOptions)
	if err != nil {
		return &getSetupExperienceSoftwareResponse{Err: err}, nil
	}

	return &getSetupExperienceSoftwareResponse{SoftwareTitles: titles, Count: count, Meta: meta}, nil
}

func (svc *Service) ListSetupExperienceSoftware(ctx context.Context, platform string, teamID uint, opts fleet.ListOptions) ([]fleet.SoftwareTitleListResult, int, *fleet.PaginationMetadata, error) {
	// skipauth: No authorization check needed due to implementation returning
	// only license error.
	svc.authz.SkipAuthorization(ctx)

	return nil, 0, nil, fleet.ErrMissingLicense
}

type getSetupExperienceScriptRequest struct {
	TeamID *uint  `query:"team_id,optional"`
	Alt    string `query:"alt,optional"`
}

type getSetupExperienceScriptResponse struct {
	*fleet.Script
	Err error `json:"error,omitempty"`
}

func (r getSetupExperienceScriptResponse) Error() error { return r.Err }

func getSetupExperienceScriptEndpoint(ctx context.Context, request interface{}, svc fleet.Service) (fleet.Errorer, error) {
	req := request.(*getSetupExperienceScriptRequest)
	downloadRequested := req.Alt == "media"
	// // TODO: do we want to allow end users to specify team_id=0? if so, we'll need to convert it to nil here so that we can
	// // use it in the auth layer where team_id=0 is not allowed?
	script, content, err := svc.GetSetupExperienceScript(ctx, req.TeamID, downloadRequested)
	if err != nil {
		return getSetupExperienceScriptResponse{Err: err}, nil
	}

	if downloadRequested {
		return downloadFileResponse{
			content:  content,
			filename: fmt.Sprintf("%s %s", time.Now().Format(time.DateOnly), script.Name),
		}, nil
	}

	return getSetupExperienceScriptResponse{Script: script}, nil
}

func (svc *Service) GetSetupExperienceScript(ctx context.Context, teamID *uint, withContent bool) (*fleet.Script, []byte, error) {
	// skipauth: No authorization check needed due to implementation returning
	// only license error.
	svc.authz.SkipAuthorization(ctx)

	return nil, nil, fleet.ErrMissingLicense
}

type setSetupExperienceScriptRequest struct {
	TeamID *uint
	Script *multipart.FileHeader
}

func (setSetupExperienceScriptRequest) DecodeRequest(ctx context.Context, r *http.Request) (interface{}, error) {
	var decoded setSetupExperienceScriptRequest

	err := r.ParseMultipartForm(512 * units.MiB) // same in-memory size as for other multipart requests we have
	if err != nil {
		return nil, &fleet.BadRequestError{
			Message:     "failed to parse multipart form",
			InternalErr: err,
		}
	}

	val := r.MultipartForm.Value["team_id"]
	if len(val) > 0 {
		teamID, err := strconv.ParseUint(val[0], 10, 64)
		if err != nil {
			return nil, &fleet.BadRequestError{Message: fmt.Sprintf("failed to decode team_id in multipart form: %s", err.Error())}
		}
		// // TODO: do we want to allow end users to specify team_id=0? if so, we'll need to convert it to nil here so that we can
		// // use it in the auth layer where team_id=0 is not allowed?
		decoded.TeamID = ptr.Uint(uint(teamID))
	}

	fhs, ok := r.MultipartForm.File["script"]
	if !ok || len(fhs) < 1 {
		return nil, &fleet.BadRequestError{Message: "no file headers for script"}
	}
	decoded.Script = fhs[0]

	return &decoded, nil
}

type setSetupExperienceScriptResponse struct {
	Err error `json:"error,omitempty"`
}

func (r setSetupExperienceScriptResponse) Error() error { return r.Err }

func setSetupExperienceScriptEndpoint(ctx context.Context, request interface{}, svc fleet.Service) (fleet.Errorer, error) {
	req := request.(*setSetupExperienceScriptRequest)

	scriptFile, err := req.Script.Open()
	if err != nil {
		return setSetupExperienceScriptResponse{Err: err}, nil
	}
	defer scriptFile.Close()

	if err := svc.SetSetupExperienceScript(ctx, req.TeamID, filepath.Base(req.Script.Filename), scriptFile); err != nil {
		return setSetupExperienceScriptResponse{Err: err}, nil
	}

	return setSetupExperienceScriptResponse{}, nil
}

func (svc *Service) SetSetupExperienceScript(ctx context.Context, teamID *uint, name string, r io.Reader) error {
	// skipauth: No authorization check needed due to implementation returning
	// only license error.
	svc.authz.SkipAuthorization(ctx)

	return fleet.ErrMissingLicense
}

type deleteSetupExperienceScriptRequest struct {
	TeamID *uint `query:"team_id,optional"`
}

type deleteSetupExperienceScriptResponse struct {
	Err error `json:"error,omitempty"`
}

func (r deleteSetupExperienceScriptResponse) Error() error { return r.Err }

func deleteSetupExperienceScriptEndpoint(ctx context.Context, request interface{}, svc fleet.Service) (fleet.Errorer, error) {
	req := request.(*deleteSetupExperienceScriptRequest)
	// // TODO: do we want to allow end users to specify team_id=0? if so, we'll need to convert it to nil here so that we can
	// // use it in the auth layer where team_id=0 is not allowed?
	if err := svc.DeleteSetupExperienceScript(ctx, req.TeamID); err != nil {
		return deleteSetupExperienceScriptResponse{Err: err}, nil
	}

	return deleteSetupExperienceScriptResponse{}, nil
}

func (svc *Service) DeleteSetupExperienceScript(ctx context.Context, teamID *uint) error {
	// skipauth: No authorization check needed due to implementation returning
	// only license error.
	svc.authz.SkipAuthorization(ctx)

	return fleet.ErrMissingLicense
}

func (svc *Service) SetupExperienceNextStep(ctx context.Context, host *fleet.Host) (bool, error) {
	// skipauth: No authorization check needed due to implementation returning
	// only license error.
	svc.authz.SkipAuthorization(ctx)

	return false, fleet.ErrMissingLicense
}

func (svc *Service) IsAllSetupExperienceSoftwareRequired(ctx context.Context, host *fleet.Host) (bool, error) {
	return isAllSetupExperienceSoftwareRequired(ctx, svc.ds, host)
}

func isAllSetupExperienceSoftwareRequired(ctx context.Context, ds fleet.Datastore, host *fleet.Host) (bool, error) {
	teamID := host.TeamID
	requireAllSoftware := false
	if teamID == nil || *teamID == 0 {
		ac, err := ds.AppConfig(ctx)
		if err != nil {
			return false, ctxerr.Wrap(ctx, err, "getting app config")
		}
		requireAllSoftware = ac.MDM.MacOSSetup.RequireAllSoftware
	} else {
		team, err := ds.Team(ctx, *teamID)
		if err != nil {
			return false, ctxerr.Wrap(ctx, err, "load team")
		}
		requireAllSoftware = team.Config.MDM.MacOSSetup.RequireAllSoftware
	}
	return requireAllSoftware, nil
}

func (svc *Service) MaybeCancelPendingSetupExperienceSteps(ctx context.Context, host *fleet.Host) error {
	return maybeCancelPendingSetupExperienceSteps(ctx, svc.ds, host)
}

func maybeCancelPendingSetupExperienceSteps(ctx context.Context, ds fleet.Datastore, host *fleet.Host) error {
	// If the host is not macOS, we do nothing.
	if host.Platform != "darwin" {
		return nil
	}

	requireAllSoftware, err := isAllSetupExperienceSoftwareRequired(ctx, ds, host)
	if err != nil {
		return ctxerr.Wrap(ctx, err, "checking if all software is required")
	}
	if !requireAllSoftware {
		return nil
	}

	hostUUID, err := fleet.HostUUIDForSetupExperience(host)
	if err != nil {
		return ctxerr.Wrap(ctx, err, "failed to get host's UUID for the setup experience")
	}

	statuses, err := ds.ListSetupExperienceResultsByHostUUID(ctx, hostUUID)
	if err != nil {
		return ctxerr.Wrap(ctx, err, "retrieving setup experience status results for next step")
	}

	for _, status := range statuses {
		if err := status.IsValid(); err != nil {
			return ctxerr.Wrap(ctx, err, "invalid row")
		}
		if status.Status != fleet.SetupExperienceStatusPending && status.Status != fleet.SetupExperienceStatusRunning {
			continue
		}
		// Cancel any upcoming software installs, VPP installs, or script runs.
		var executionID string
		switch {
		case status.HostSoftwareInstallsExecutionID != nil:
			executionID = *status.HostSoftwareInstallsExecutionID
		case status.NanoCommandUUID != nil:
			executionID = *status.NanoCommandUUID
		case status.ScriptExecutionID != nil:
			executionID = *status.ScriptExecutionID
		default:
			continue
		}
		if _, err := ds.CancelHostUpcomingActivity(ctx, host.ID, executionID); err != nil {
			return ctxerr.Wrap(ctx, err, "cancelling upcoming setup experience activity")
		}
	}

	// Cancel any pending setup experience steps for the host in the database.
	if err := ds.CancelPendingSetupExperienceSteps(ctx, hostUUID); err != nil {
		return ctxerr.Wrap(ctx, err, "cancelling pending setup experience steps")
	}

	return nil
}

// maybeUpdateSetupExperienceStatus attempts to update the status of a setup experience result in
// the database. If the given result is of a supported type (namely SetupExperienceScriptResult,
// SetupExperienceSoftwareInstallResult, and SetupExperienceVPPInstallResult), it returns a boolean
// indicating whether the datastore was updated and an error if one occurred. If the result is not of a
// supported type, it returns false and an error indicating that the type is not supported.
// If the requireTerminalStatus parameter is true, the datastore will only be updated if the given
// result status is terminal. If a software or VPP install result is updated to a failure status,
// it also calls maybeCancelPendingSetupExperienceSteps, which halts the rest of the host's setup
// experience if the host is macOS and all setup experience software is required.
func maybeUpdateSetupExperienceStatus(ctx context.Context, ds fleet.Datastore, result interface{}, requireTerminalStatus bool) (bool, error) {
	var updated bool
	var err error
	var status fleet.SetupExperienceStatusResultStatus
	var hostUUID string

	switch v := result.(type) {
	case fleet.SetupExperienceScriptResult:
		status = v.SetupExperienceStatus()
		if !status.IsValid() {
			return false, fmt.Errorf("invalid status: %s", status)
		} else if requireTerminalStatus && !status.IsTerminalStatus() {
			return false, nil
		}
		return ds.MaybeUpdateSetupExperienceScriptStatus(ctx, v.HostUUID, v.ExecutionID, status)

	case fleet.SetupExperienceSoftwareInstallResult:
		status = v.SetupExperienceStatus()
		hostUUID = v.HostUUID
		if !status.IsValid() {
			return false, fmt.Errorf("invalid status: %s", status)
		} else if requireTerminalStatus && !status.IsTerminalStatus() {
			return false, nil
		}
		updated, err = ds.MaybeUpdateSetupExperienceSoftwareInstallStatus(ctx, v.HostUUID, v.ExecutionID, status)

	case fleet.SetupExperienceVPPInstallResult:
		// NOTE: this case is also implemented in the CommandAndReportResults method of
		// MDMAppleCheckinAndCommandService
		status = v.SetupExperienceStatus()
		hostUUID = v.HostUUID
		if !status.IsValid() {
			return false, fmt.Errorf("invalid status: %s", status)
		} else if requireTerminalStatus && !status.IsTerminalStatus() {
			return false, nil
		}
		updated, err = ds.MaybeUpdateSetupExperienceVPPStatus(ctx, v.HostUUID, v.CommandUUID, status)

	default:
		return false, fmt.Errorf("unsupported result type: %T", result)
	}

	// For software / VPP installs, if we updated the status to failure and the host is macOS,
	// we may need to cancel the rest of the setup experience.
	if updated && err == nil && status == fleet.SetupExperienceStatusFailure {
		// Look up the host by UUID to get its platform and team.
		host, getHostUUIDErr := ds.HostByIdentifier(ctx, hostUUID)
		if getHostUUIDErr != nil {
			return updated, fmt.Errorf("getting host by UUID: %w", getHostUUIDErr)
		}
		cancelErr := maybeCancelPendingSetupExperienceSteps(ctx, ds, host)
		if cancelErr != nil {
			return updated, fmt.Errorf("cancel setup experience after macos software install failure: %w", cancelErr)
		}
	}

	return updated, err
}

func validateSetupExperiencePlatform(platform string) error {
	if platform != "" && platform != "macos" && platform != "ios" && platform != "ipados" && platform != "windows" && platform != "linux" {
		return badRequestf("platform %q unsupported, platform must be \"macos\", \"ios\", \"ipados\", \"windows\", or \"linux\"", platform)
	}
	return nil
}

func transformPlatformForSetupExperience(platform string) string {
	if platform == "" || platform == "macos" {
		return "darwin"
	}
	return platform
}