#22363
The TestRenewEnrollmentProfilePrevented test timed out once in recent
history. I did not find any obvious issues with this test, but I added
logging which may help if it times out again in the future.
Also, I sped the test up.
New interface for adding periodic jobs that rely on notifications/config
changes in Orbit.
Previously if we wanted to have recurring checks in Orbit, we would add
them into a chain of `GetConfig` calls. This call chain would be run
periodically by one of the runners registered with the cli application
framework.
The new method to register `OrbitConfigReceivers` with the
`OrbitClient`, and then register the orbit client itself with the
application framework.
Instead of having giving each fetcher an internal reference to the
previous fetcher that it must call, the receiver is registered with the
client and the new config is passed to the receiver.
This is the old `GetConfig()` interface:
```go
type OrbitConfigFetcher interface {
GetConfig() (*fleet.OrbitConfig, error)
}
```
This is the new `OrbitConfigReceiver` interface:
```go
type OrbitConfigReceiver interface {
Run(*OrbitConfig) error
}
```
To register a new receiver, you call the `RegisterConfigReceiver` method
on the client.
```go
orbitClient.RegisterConfigReceiver(extRunner)
```
Downsides of the old method:
- Spaghetti call chain setup
- Cascading failure, of one fails, all after it fail
- Run in series, one long function call holds up the rest
- Anything that wants to restart orbit is added as a Runner to the
application, meaning there could be several timers calling `GetConfig`
and running the chain
Benefits of the new method:
- Clean `RegisterConfigReceiver` api, no call chaining required
- Config receivers can be added at runtime
- Isolated receivers, one failing call don't effect others
- All calls are run in parallel in goroutines, no calls can hold up the
rest
- No more need for multiple runners, using a context cancel, any
receiver can queue a call to restart orbit
- Single point to handle errors and logging for all receivers
- Panic recovery to stop orbit from crashing
- Easier to test, configs are passed in and do not require a call chain
This branch contains a little bit of code from the installer method I
was working on because I branched it off of that. (oops)
Not all code comments surrounding old `GetConfig()` methods have been
fully updated yet
Possible changes:
- Update the interface to take a context, so we can let receivers know
to exit early. I can imagine two cases for this:
- The application is about to restart
- We can set a timeout for how long receivers are allowed to take
Closes#12662
---------
Co-authored-by: Martin Angers <martin.n.angers@gmail.com>
Co-authored-by: Roberto Dip <dip.jesusr@gmail.com>
The scheduled test run
https://github.com/fleetdm/fleet/actions/runs/7764392848 failed with a
panic because `TestWindowsMDMEnrollmentPrevented` timed out:
```
2024-02-03T05:05:26.3041218Z === RUN TestWindowsMDMEnrollmentPrevented
2024-02-03T05:05:26.3044251Z === RUN TestWindowsMDMEnrollmentPrevented/{RenewEnrollmentProfile:false_RotateDiskEncryptionKey:false_NeedsMDMMigration:false_NeedsProgrammaticWindowsMDMEnrollment:true_WindowsMDMDiscoveryEndpoint:http://example.com/_NeedsProgrammaticWindowsMDMUnenrollment:false_PendingScriptExecutionIDs:[]_EnforceBitLockerEncryption:false}
2024-02-03T05:05:26.3047208Z coverage: 2.5% of statements in github.com/fleetdm/fleet/v4/...
2024-02-03T05:05:26.3047963Z panic: test timed out after 1h0m0s
2024-02-03T05:05:26.3048482Z running tests:
2024-02-03T05:05:26.3049005Z TestWindowsMDMEnrollmentPrevented (59m52s)
2024-02-03T05:05:26.3052172Z TestWindowsMDMEnrollmentPrevented/{RenewEnrollmentProfile:false_RotateDiskEncryptionKey:false_NeedsMDMMigration:false_NeedsProgrammaticWindowsMDMEnrollment:true_WindowsMDMDiscoveryEndpoint:http://example.com/_NeedsProgrammaticWindowsMDMUnenrollment:false_PendingScriptExecutionIDs:[]_EnforceBitLockerEncryption:false} (59m52s)
[...]
2024-02-03T05:05:26.3068624Z goroutine 69 [chan receive]:
2024-02-03T05:05:26.3069997Z github.com/fleetdm/fleet/v4/orbit/pkg/update.TestWindowsMDMEnrollmentPrevented.func2.1({{0xe3ada3, 0x12}, {0x0, 0x0}, {0xe37311, 0xc}})
2024-02-03T05:05:26.3072376Z /home/runner/work/fleet/fleet/orbit/pkg/update/notifications_test.go:295 +0x65
2024-02-03T05:05:26.3074514Z github.com/fleetdm/fleet/v4/orbit/pkg/update.(*windowsMDMEnrollmentConfigFetcher).attemptEnrollment(0xc0000f8cf0, {0x0, 0x0, 0x0, 0x1, {0xe3ada3, 0x12}, 0x0, {0x0, 0x0, ...}, ...})
```
I was able to reproduce locally 1/4th of the times, after putting the
following print statements:
```diff
if cfg.NeedsProgrammaticWindowsMDMEnrollment {
fetcher.execEnrollFn = func(args WindowsMDMEnrollmentArgs) error {
- <-chProceed // will be unblocked only when allowed
+ fmt.Println("fetcher.execEnrollFn A: ", apiCallCount)
+ <-chProceed // will be unblocked only when allowed
+ fmt.Println("fetcher.execEnrollFn B: ", apiCallCount)
apiCallCount++ // no need for sync, single-threaded call of this func is guaranteed by the fetcher's mutex
return apiErr
}
@@ -301,7 +303,9 @@ func TestWindowsMDMEnrollmentPrevented(t *testing.T) {
}
} else {
fetcher.execUnenrollFn = func(args WindowsMDMEnrollmentArgs) error {
- <-chProceed // will be unblocked only when allowed
+ fmt.Println("fetcher.execUnenrollFn A: ", apiCallCount)
+ <-chProceed // will be unblocked only when allowed
+ fmt.Println("fetcher.execUnenrollFn B: ", apiCallCount)
apiCallCount++ // no need for sync, single-threaded call of this func is guaranteed by the fetcher's mutex
return apiErr
}
@@ -317,23 +321,33 @@ func TestWindowsMDMEnrollmentPrevented(t *testing.T) {
started := make(chan struct{})
go func() {
+ fmt.Println("before close started")
close(started)
+ fmt.Println("aftre close started")
// the first call will block in enroll/unenroll func
+ fmt.Println("before inner fetchergetconfig")
cfg, err := fetcher.GetConfig()
+ fmt.Println("after inner fetchergetconfig")
assertResult(cfg, err)
}()
+ fmt.Println("before started")
<-started
+ fmt.Println("after started")
// this call will happen while the first call is blocked in
// enroll/unenrollfn, so it won't call the API (won't be able to lock the
// mutex). However it will still complete successfully without being
// blocked by the other call in progress.
+ fmt.Println("before first fetchergetconfig")
cfg, err := fetcher.GetConfig()
+ fmt.Println("before first fetchergetconfig")
assertResult(cfg, err)
// unblock the first call and wait for it to complete
+ fmt.Println("before close chProceed 1")
close(chProceed)
+ fmt.Println("after close chProceed 2")
time.Sleep(100 * time.Millisecond)
```
This is the output I've got every time the test hung:
```
before started
before close started
aftre close started
after started
before first fetchergetconfig
before inner fetchergetconfig
after inner fetchergetconfig
fetcher.execEnrollFn A: 0
```
And this is the output when the tests passed
```
before started
before close started
aftre close started
before inner fetchergetconfig
fetcher.execUnenrollFn A: 0
after started
before first fetchergetconfig
before first fetchergetconfig
before close chProceed 1
after close chProceed 2
fetcher.execUnenrollFn B: 0
after inner fetchergetconfig
fetcher.execUnenrollFn A: 1
fetcher.execUnenrollFn B: 1
```
Note how the deadlock occurs when `GetConfig` is called first outside of
the goroutine. I added some logic to prevent this, but I'm confident
there must be a better way to accomplish the same. cc: @mna you're the
king of concurrency, do you have any ideas?
for #15916, explanation of the rationale in the description of
`isMisreportedDecryptionError` and in the issue comments.
I refactored the code a little bit, trying to make it easier to follow
even with the added complexity.
This also paves the road for #15711
related to #12847
This changes the authentication method for windows mdm enrollment. We
were using `HostByIndentifier ` method but have changed to
`LoadHostByOrbitNodeKey`.
- [x] Changes file added for user-visible changes in `changes/` or
`orbit/changes/`.
See [Changes
files](https://fleetdm.com/docs/contributing/committing-changes#changes-files)
for more information.
- [x] Added/updated tests
- [x] Manual QA for all new/changed functionality
- For Orbit and Fleet Desktop changes:
- [ ] Manual QA must be performed in the three main OSs, macOS, Windows
and Linux.
- [ ] Auto-update manual QA, from released version of component to new
version (see [tools/tuf/test](../tools/tuf/test/README.md)).