mirror of https://github.com/fleetdm/fleet synced 2026-05-24 09:28:54 +00:00

Improvements for Redis stress testing tool (#44411)

Refresh `tools/redis-stress` with cluster-aware modes.

2026-04-29 15:52:59 -04:00

5.2 KiB


redis-stress

Cluster-aware Redis stress tool with two modes.

Both modes use Fleet's own redis.NewPool (server/datastore/redis), so cluster topology, redirection handling, and connection routing match what the real Fleet server does in production.

Modes

`write` — steady SET-only load

Fill a Redis instance (standalone or cluster) at a configurable rate. Each worker writes keys on its own ticker. This is useful for keeping the cluster busy with ongoing writes while you observe something else, or for seeding a dataset.

This mode is the cluster-aware successor to the original `tools/redis-stress` tool. The old subcommand-less invocation (`redis-stress -addr=X -wait=10m`) still works: when the first argument starts with `-`, the dispatcher routes to `write`. The legacy flags `-wait`, `-debug`, and `-index-start` are kept for backward compatibility.

go run ./tools/redis-stress write \
  -addr 127.0.0.1:7001 \
  -workers 5 \
  -rate 100 \
  -duration 1m
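
The legacy fallback described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual dispatcher: `resolveMode` is a hypothetical name, and returning an error for an empty argument list or an unknown mode is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// resolveMode is a hypothetical helper: it picks the subcommand and returns
// the arguments that should be handed to that subcommand's flag parser.
func resolveMode(args []string) (string, []string, error) {
	if len(args) == 0 {
		// Assumption: the real tool may print usage or pick a default here.
		return "", nil, errors.New("expected a mode (write or race) or legacy flags")
	}
	first := args[0]
	switch {
	case strings.HasPrefix(first, "-"):
		// Legacy invocation: no subcommand, so every argument is a write flag.
		return "write", args, nil
	case first == "write" || first == "race":
		return first, args[1:], nil
	default:
		return "", nil, fmt.Errorf("unknown mode %q", first)
	}
}

func main() {
	for _, args := range [][]string{
		{"-addr=X", "-wait=10m"},
		{"race", "-workers", "50"},
	} {
		mode, rest, err := resolveMode(args)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("mode=%s flags=%v\n", mode, rest)
	}
}
```

Keeping the legacy arguments intact (rather than dropping the first one) lets the same flag set parse both the old and the new invocation.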

| Flag | Default | Purpose |
| --- | --- | --- |
| `-addr` | `127.0.0.1:6379` | Redis address (cluster startup node OK; cluster auto-detected) |
| `-workers` | `1` | Concurrent SET workers |
| `-rate` | `1` | SETs per worker per second (fractional OK) |
| `-duration` | `10m` | Total run time |
| `-key-prefix` | `stress_write_` | Key prefix |
| `-key-ttl` | `10m` | Per-key expiration |
| `-index-start` | `0` | Starting value of each worker's per-key counter (legacy) |
| `-debug` | `false` | Log every SET (legacy) |
| `-wait` | (none) | Alias for `-duration` (legacy) |
| `-cluster-follow-redirects` | `true` | `ClusterFollowRedirections` (cluster only) |
| `-cluster-read-from-replica` | `false` | `ClusterReadFromReplica` (cluster only) |

`race` — SET-then-GET race detection

Each worker repeatedly does, on fresh pool connections:

conn1 := pool.Get(); conn1.Do("SET", k, v, "PX", ttl); conn1.Close()
conn2 := pool.Get(); conn2.Do("GET", k);                conn2.Close()

and counts any GET that returns nil immediately after a successful SET on the same key. This mirrors how Fleet's server/service/redis_key_value.RedisKeyValue does its Set / Get — fresh connection per call.
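
The shape of one iteration can be sketched with stand-in types. The `pool` and `conn` interfaces and the in-memory `memPool` below are hypothetical and only imitate the get-use-close pattern; they are not Fleet's or the Redis client's real types. Because the in-memory store is strongly consistent, it never reports a miss.

```go
package main

import "fmt"

// conn and pool are hypothetical stand-ins for a pooled Redis client.
type conn interface {
	Set(key, value string) error
	Get(key string) (value string, found bool, err error)
	Close() error
}

type pool interface{ Get() conn }

// memPool hands out connections backed by one shared map, so a GET always
// sees the SET that preceded it.
type memPool struct{ data map[string]string }

type memConn struct{ data map[string]string }

func (p *memPool) Get() conn { return &memConn{data: p.data} }

func (c *memConn) Set(k, v string) error { c.data[k] = v; return nil }

func (c *memConn) Get(k string) (string, bool, error) {
	v, ok := c.data[k]
	return v, ok, nil
}

func (c *memConn) Close() error { return nil }

// setThenGet runs one iteration: SET on one connection, GET on a second one.
// It reports true when the GET found nothing after a successful SET.
func setThenGet(p pool, key, value string) (bool, error) {
	c1 := p.Get()
	err := c1.Set(key, value)
	c1.Close()
	if err != nil {
		return false, err
	}

	c2 := p.Get()
	_, found, err := c2.Get(key)
	c2.Close()
	if err != nil {
		return false, err
	}
	return !found, nil
}

func main() {
	p := &memPool{data: map[string]string{}}
	misses := 0
	for i := 0; i < 1000; i++ {
		// Hypothetical key layout; the real tool's key format may differ.
		key := fmt.Sprintf("stress_race_%d", i)
		miss, err := setThenGet(p, key, "v")
		if err != nil {
			panic(err)
		}
		if miss {
			misses++
		}
	}
	fmt.Println("nil-after-set:", misses)
}
```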

# default — cluster mode, reads through ConfigureDoer (i.e., not explicitly
# routed to replicas; same path Fleet uses today)
go run ./tools/redis-stress race -addr 127.0.0.1:7001 -workers 50 -iterations 2000

# explicit replica reads — useful for testing what a deployment with an
# external read-from-replica router (proxy, ElastiCache reader endpoint,
# Redis-Enterprise R/W split) would expose
go run ./tools/redis-stress race -addr 127.0.0.1:7001 \
  -cluster-read-from-replica \
  -explicit-readonly

| Flag | Default | Purpose |
| --- | --- | --- |
| `-addr` | `127.0.0.1:7001` | Redis cluster startup node |
| `-workers` | `50` | Concurrent SET-then-GET workers |
| `-iterations` | `1000` | Iterations per worker |
| `-ttl` | `4m` | PX expiration on SET |
| `-key-prefix` | `stress_race_` | Key prefix |
| `-explicit-readonly` | `false` | Wrap the GET conn with `redis.ReadOnlyConn` so it's routed to a replica when the pool has `ClusterReadFromReplica=true`. Without this flag, both SET and GET go to primary. |
| `-cluster-follow-redirects` | `true` | `ClusterFollowRedirections` |
| `-cluster-read-from-replica` | `true` | `ClusterReadFromReplica` |

Output

================ summary ================
elapsed:           11.2s
sets:              100000 (errors 0)
gets:              100000 (errors 0)
nil-after-set:     0  ← the bug
stale-after-set:   0
ops/sec:           17858.2

nil-after-set is the metric to watch. Any non-zero value means a GET returned nil for a key that the same worker had just SET and received an OK acknowledgement for. The tool exits with status 1 when this happens.

stale-after-set should be zero in normal operation (the key namespace is worker-and-iteration-specific) and is included as a defense against unexpected key collisions.
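
The two counters can be thought of as a classification of each GET result. The sketch below is illustrative only: `classify` and `exitStatus` are hypothetical names, and reading "stale" as "the GET returned a value different from the one just SET" is an assumption about what the tool measures.

```go
package main

import "fmt"

type outcome int

const (
	ok outcome = iota
	nilAfterSet
	staleAfterSet
)

// classify compares what a GET returned against the value that was just SET.
func classify(found bool, got, want string) outcome {
	switch {
	case !found:
		return nilAfterSet
	case got != want:
		// Assumption: "stale" means a different value came back.
		return staleAfterSet
	default:
		return ok
	}
}

// exitStatus mirrors the documented behavior: status 1 on any nil-after-set.
func exitStatus(nilCount int) int {
	if nilCount > 0 {
		return 1
	}
	return 0
}

func main() {
	// Three simulated GET results for a key that was just SET to "v1".
	results := []struct {
		found bool
		got   string
	}{
		{true, "v1"}, {false, ""}, {true, "v0"},
	}
	var nils, stales int
	for _, r := range results {
		switch classify(r.found, r.got, "v1") {
		case nilAfterSet:
			nils++
		case staleAfterSet:
			stales++
		}
	}
	fmt.Printf("nil-after-set: %d\nstale-after-set: %d\nexit status: %d\n",
		nils, stales, exitStatus(nils))
}
```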

Bringing up a local Redis cluster

The repo includes a 6-node Redis Cluster compose file. From the repo root:

docker compose -f docker-compose.yml -f docker-compose-redis-cluster.yml up -d \
  redis-cluster-1 redis-cluster-2 redis-cluster-3 \
  redis-cluster-4 redis-cluster-5 redis-cluster-6 \
  redis-cluster-setup

Verify the cluster came up:

docker exec fleet-redis-cluster-1-1 redis-cli -p 7001 cluster info | grep cluster_state
# cluster_state:ok

On macOS, host-to-cluster networking requires docker-mac-net-connect:

brew install chipmk/tap/docker-mac-net-connect
sudo brew services start chipmk/tap/docker-mac-net-connect

Without it, only port-forwarded nodes are reachable and any cluster redirect times out.

Forcing a replica to lag

To verify the race detector works mechanically (or to simulate a customer deployment where reads can fall behind writes), pause one of the replica nodes mid-test:

docker pause fleet-redis-cluster-4-1
go run ./tools/redis-stress race \
  -addr 127.0.0.1:7001 -explicit-readonly -workers 20 -iterations 500
docker unpause fleet-redis-cluster-4-1

With -explicit-readonly and a paused replica, you should see non-zero nil-after-set events for keys whose hash slot is served by the paused replica.
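
To reason about which keys map to which slot, the Redis Cluster key-to-slot rule (CRC16/XMODEM of the key, or of its `{hash tag}` if present, modulo 16384) can be reproduced locally. The sketch below follows that published rule. The function names are hypothetical, and `stress_race_0_0` is only an illustrative key, not necessarily the tool's real key layout.

```go
package main

import (
	"fmt"
	"strings"
)

const clusterSlots = 16384

// crc16 implements CRC16/XMODEM (polynomial 0x1021, initial value 0).
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = (crc << 1) ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// hashTag returns the part of the key that is hashed: the text between the
// first '{' and the next '}' when that text is non-empty, else the whole key.
func hashTag(key string) string {
	start := strings.IndexByte(key, '{')
	if start < 0 {
		return key
	}
	end := strings.IndexByte(key[start+1:], '}')
	if end <= 0 {
		return key // no closing brace, or an empty "{}" tag
	}
	return key[start+1 : start+1+end]
}

// keySlot returns the cluster hash slot for a key.
func keySlot(key string) int {
	return int(crc16([]byte(hashTag(key)))) % clusterSlots
}

func main() {
	for _, k := range []string{"foo", "stress_race_0_0", "{user1}:a", "{user1}:b"} {
		fmt.Printf("%-18s slot %d\n", k, keySlot(k))
	}
}
```

Running `CLUSTER KEYSLOT <key>` against any node gives the authoritative slot, and `CLUSTER NODES` shows which primary and replica serve each slot range.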
