The scheduled test run
https://github.com/fleetdm/fleet/actions/runs/7764392848 failed with a
panic because `TestWindowsMDMEnrollmentPrevented` timed out:
```
2024-02-03T05:05:26.3041218Z === RUN TestWindowsMDMEnrollmentPrevented
2024-02-03T05:05:26.3044251Z === RUN TestWindowsMDMEnrollmentPrevented/{RenewEnrollmentProfile:false_RotateDiskEncryptionKey:false_NeedsMDMMigration:false_NeedsProgrammaticWindowsMDMEnrollment:true_WindowsMDMDiscoveryEndpoint:http://example.com/_NeedsProgrammaticWindowsMDMUnenrollment:false_PendingScriptExecutionIDs:[]_EnforceBitLockerEncryption:false}
2024-02-03T05:05:26.3047208Z coverage: 2.5% of statements in github.com/fleetdm/fleet/v4/...
2024-02-03T05:05:26.3047963Z panic: test timed out after 1h0m0s
2024-02-03T05:05:26.3048482Z running tests:
2024-02-03T05:05:26.3049005Z TestWindowsMDMEnrollmentPrevented (59m52s)
2024-02-03T05:05:26.3052172Z TestWindowsMDMEnrollmentPrevented/{RenewEnrollmentProfile:false_RotateDiskEncryptionKey:false_NeedsMDMMigration:false_NeedsProgrammaticWindowsMDMEnrollment:true_WindowsMDMDiscoveryEndpoint:http://example.com/_NeedsProgrammaticWindowsMDMUnenrollment:false_PendingScriptExecutionIDs:[]_EnforceBitLockerEncryption:false} (59m52s)
[...]
2024-02-03T05:05:26.3068624Z goroutine 69 [chan receive]:
2024-02-03T05:05:26.3069997Z github.com/fleetdm/fleet/v4/orbit/pkg/update.TestWindowsMDMEnrollmentPrevented.func2.1({{0xe3ada3, 0x12}, {0x0, 0x0}, {0xe37311, 0xc}})
2024-02-03T05:05:26.3072376Z /home/runner/work/fleet/fleet/orbit/pkg/update/notifications_test.go:295 +0x65
2024-02-03T05:05:26.3074514Z github.com/fleetdm/fleet/v4/orbit/pkg/update.(*windowsMDMEnrollmentConfigFetcher).attemptEnrollment(0xc0000f8cf0, {0x0, 0x0, 0x0, 0x1, {0xe3ada3, 0x12}, 0x0, {0x0, 0x0, ...}, ...})
```
I was able to reproduce locally 1/4th of the times, after putting the
following print statements:
```diff
if cfg.NeedsProgrammaticWindowsMDMEnrollment {
fetcher.execEnrollFn = func(args WindowsMDMEnrollmentArgs) error {
- <-chProceed // will be unblocked only when allowed
+ fmt.Println("fetcher.execEnrollFn A: ", apiCallCount)
+ <-chProceed // will be unblocked only when allowed
+ fmt.Println("fetcher.execEnrollFn B: ", apiCallCount)
apiCallCount++ // no need for sync, single-threaded call of this func is guaranteed by the fetcher's mutex
return apiErr
}
@@ -301,7 +303,9 @@ func TestWindowsMDMEnrollmentPrevented(t *testing.T) {
}
} else {
fetcher.execUnenrollFn = func(args WindowsMDMEnrollmentArgs) error {
- <-chProceed // will be unblocked only when allowed
+ fmt.Println("fetcher.execUnenrollFn A: ", apiCallCount)
+ <-chProceed // will be unblocked only when allowed
+ fmt.Println("fetcher.execUnenrollFn B: ", apiCallCount)
apiCallCount++ // no need for sync, single-threaded call of this func is guaranteed by the fetcher's mutex
return apiErr
}
@@ -317,23 +321,33 @@ func TestWindowsMDMEnrollmentPrevented(t *testing.T) {
started := make(chan struct{})
go func() {
+ fmt.Println("before close started")
close(started)
+ fmt.Println("aftre close started")
// the first call will block in enroll/unenroll func
+ fmt.Println("before inner fetchergetconfig")
cfg, err := fetcher.GetConfig()
+ fmt.Println("after inner fetchergetconfig")
assertResult(cfg, err)
}()
+ fmt.Println("before started")
<-started
+ fmt.Println("after started")
// this call will happen while the first call is blocked in
// enroll/unenrollfn, so it won't call the API (won't be able to lock the
// mutex). However it will still complete successfully without being
// blocked by the other call in progress.
+ fmt.Println("before first fetchergetconfig")
cfg, err := fetcher.GetConfig()
+ fmt.Println("before first fetchergetconfig")
assertResult(cfg, err)
// unblock the first call and wait for it to complete
+ fmt.Println("before close chProceed 1")
close(chProceed)
+ fmt.Println("after close chProceed 2")
time.Sleep(100 * time.Millisecond)
```
This is the output I've got every time the test hung:
```
before started
before close started
aftre close started
after started
before first fetchergetconfig
before inner fetchergetconfig
after inner fetchergetconfig
fetcher.execEnrollFn A: 0
```
And this is the output when the tests passed
```
before started
before close started
aftre close started
before inner fetchergetconfig
fetcher.execUnenrollFn A: 0
after started
before first fetchergetconfig
before first fetchergetconfig
before close chProceed 1
after close chProceed 2
fetcher.execUnenrollFn B: 0
after inner fetchergetconfig
fetcher.execUnenrollFn A: 1
fetcher.execUnenrollFn B: 1
```
Note how the deadlock occurs when `GetConfig` is called first outside of
the goroutine. I added some logic to prevent this, but I'm confident
there must be a better way to accomplish the same. cc: @mna you're the
king of concurrency, do you have any ideas?
for #15916, explanation of the rationale in the description of
`isMisreportedDecryptionError` and in the issue comments.
I refactored the code a little bit, trying to make it easier to follow
even with the added complexity.
This also paves the road for #15711
related to #12847
This changes the authentication method for windows mdm enrollment. We
were using `HostByIndentifier ` method but have changed to
`LoadHostByOrbitNodeKey`.
- [x] Changes file added for user-visible changes in `changes/` or
`orbit/changes/`.
See [Changes
files](https://fleetdm.com/docs/contributing/committing-changes#changes-files)
for more information.
- [x] Added/updated tests
- [x] Manual QA for all new/changed functionality
- For Orbit and Fleet Desktop changes:
- [ ] Manual QA must be performed in the three main OSs, macOS, Windows
and Linux.
- [ ] Auto-update manual QA, from released version of component to new
version (see [tools/tuf/test](../tools/tuf/test/README.md)).