## Summary
- Split CI workflow to stop re-running validation when PRs are merged to
main
- Create dedicated `release.yml` workflow for Docker Hub releases on
main branch
- Keep full CI validation for PRs and `perm-*` branches
## Motivation
Since the repository is configured to:
1. Require PRs to be up-to-date with main before merging
2. Require all CI checks to pass
Re-running the full CI suite (~12 jobs) on main after merge is redundant
and wastes CI runner time that could be used for other tasks.
## Changes
| Workflow | Before | After |
|----------|--------|-------|
| `CI.yml` | Triggers on push to `main`, `perm-*`, and PRs to `main` |
Triggers on push to `perm-*` and PRs to `main` only |
| `release.yml` | N/A (new) | Triggers on push to `main`, runs only
`docker-build-release` |
## Impact
| Event | Before | After | Savings |
|-------|--------|-------|---------|
| PR to main | 13 jobs | 12 jobs | 1 job |
| Merge to main | 13 jobs | 1 job | 12 jobs |
| Push to perm-* | 13 jobs | 12 jobs | 1 job |
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Summary
Improve CI performance through better caching and simplified workflows:
- Add a new `warm-sccache` job that runs before all Rust CI jobs to
pre-populate the sccache cache
- Cache locally installed tools (`~/.local/bin`, `~/.local/lib`,
`~/.local/include`) with a version-based hash key
- Simplify Rust tests by removing matrix partitioning and switching from
cargo-nextest to cargo test
## Problem
**Poor sccache hit rates:**
- `rust-lint`, `unit-tests`, and `build-operator` ran in parallel with
cold caches
- Each job compiled dependencies independently
- Cache was only saved at job completion (too late for parallel jobs to
benefit)
**Redundant tool downloads:**
- Mold, LLVM/Clang, protoc, and libpq (~500MB+) were downloaded fresh on
each job
- No caching of locally installed tools
**Overcomplicated test setup:**
- 2-partition matrix for tests added complexity without significant
benefit
- cargo-nextest required installation step (~30s overhead)
- Separate result-checker job wasn't necessary
## Solution
### 1. sccache warm-up job (`task-warm-sccache.yml`)
- Runs first (Tier 0) before all Rust jobs
- Compiles with release mode + all features (`fast-runtime`,
`try-runtime`, `runtime-benchmarks`)
- Compiles with debug mode to cover test builds
- Uses `SKIP_WASM_BUILD=1` to minimize warm-up time
### 2. Local tools caching (`setup-env/action.yml`)
- Define tool versions as env vars (`MOLD_VERSION`, `LLVM_VERSION`,
`PROTOC_VERSION`, `LIBPQ_VERSION`)
- Generate SHA256 hash from versions for cache key
- Cache `~/.local/bin`, `~/.local/lib`, `~/.local/include` (not all of
`~/.local` to avoid container storage)
- Set up PATH and env vars immediately after cache restore
### 3. Simplified Rust tests (`task-rust-tests.yml`)
- Remove 2-partition matrix strategy
- Replace cargo-nextest with `cargo test --locked`
- Remove separate tests-result-checker job
## CI Flow
```
┌─ build-operator (warm sccache + cached tools)
│
CI Start → warm-sccache ─┼─ rust-lint (warm sccache + cached tools)
│
└─ unit-tests (warm sccache + cached tools)
```
## Test plan
- [x] CI workflow runs successfully
- [x] Warm-sccache job completes and shows cache stats
- [x] Local tools cache restores correctly (no permission errors)
- [x] Downstream Rust jobs show improved cache hit rates
- [x] Rust tests pass with simplified single-job setup
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Ahmad Kaouk <56095276+ahmadkaouk@users.noreply.github.com>
Co-authored-by: undercover-cactus <lola@moonsonglabs.com>
## Summary
- Fix Docker Hub rate limit errors in E2E CI job by adding
authentication
- Pass existing `DOCKERHUB_USERNAME` and `DOCKERHUB_TOKEN` secrets to
the E2E workflow
## Problem
The E2E CI job pulls `datahavenxyz/snowbridge-relay:latest` from Docker
Hub without authentication, causing rate limit errors (10 pulls/hour for
unauthenticated requests):
```
Error: initializing source docker://datahavenxyz/snowbridge-relay:latest: reading manifest latest in docker.io/datahavenxyz/snowbridge-relay: toomanyrequests: You have reached your unauthenticated pull rate limit.
```
## Solution
Reuse the Docker Hub secrets already configured for
`docker-build-release` by:
1. Passing secrets from `CI.yml` to the E2E workflow
2. Adding optional secrets declaration in `task-e2e.yml`
3. Adding Docker Hub login step before pulling `snowbridge-relay`
---------
Co-authored-by: Steve Degosserie <723552+stiiifff@users.noreply.github.com>
## Summary
Fixes the CI failure introduced by #349 where reusable workflows
couldn't use the permissions they declared.
## Root Cause
When using `workflow_call` (reusable workflows), the **called workflow's
permissions are constrained by the caller**. A called workflow cannot
request more permissions than the calling workflow grants.
PR #349 added explicit permissions to individual workflows (e.g.,
`actions: write` in task-build-operator.yml), but removed them from
CI.yml. This caused failures because:
```
CI.yml (contents: read only)
└── task-build-operator.yml (requests actions: write)
└── FAILS: caller doesn't grant actions: write
```
## Fix
Grant the necessary permissions in CI.yml so called workflows can use
them:
```yaml
permissions:
contents: read
actions: write # For artifact upload/download
packages: write # For ghcr.io push
```
## Why the individual workflow permissions still matter
The explicit permissions in called workflows are still valuable for:
1. **Documentation** - Makes the intent clear
2. **Direct invocation** - Works when called via `workflow_dispatch`
3. **Defense in depth** - If CI.yml grants more than needed, called
workflows still request only what they need
Co-authored-by: Claude <noreply@anthropic.com>
This PR remove the `cargo chef` step used to build the docker image used
in deployment. We noticed that `cargo chef` was adding more time to the
build and that removing it was saving us 40min.
Also in this PR, we removed the base image from parity which was really
heavy and was filling the rest of the disk space. This broke the build.
After some investigation it doesn't seem to add a lot to the build. It
has been replace with the official rust image as a base to build our
node.
The image used to run the image has been replaced with
`debian:trixie-slim`.
In the end those changed **should not** break any of the current
behavior and makes save a bit of CI time.
### Description
This PR introduces the **Moonwall** end-to-end (E2E) testing framework.
The primary motivation for this is to enable the porting of existing
Mobeam tests into the `DataHaven` repository.
### Key Changes
* **Node Manual Sealing:**
* Introduced a `--sealing=manual` flag for the `datahaven-node`. When
enabled, blocks are only produced on demand via an RPC call. This is the
core mechanism that allows for deterministic tests.
* **Moonwall Framework Integration:**
* Added `@moonwall/cli` and `@moonwall/util` dependencies to the
`test/package.json`.
* A new `test/moonwall.config.json` file configures the test
environment, defining how Moonwall should launch the `datahaven-node`
with the manual sealing flag.
* Added a `moonwall:test` script to `package.json` for running the
tests.
* **CI Workflow:**
* A new reusable workflow, `.github/workflows/task-moonwall-tests.yml`,
has been created to handle the setup, execution, and reporting of
Moonwall tests.
* The main `CI.yml` now includes a `moonwall-tests` job that runs after
the `build-operator` job, ensuring it always tests the correct,
freshly-built binary.
* **Example Test Suite:**
* A new test suite, `test/datahaven/suites/dev/test-block.ts`, had been
copied from moonbeam.
### How to Run Locally
1. Navigate to the `test` directory.
2. Install dependencies: `bun install`
3. Run the tests: `bun run moonwall:test`
---------
Co-authored-by: undercover-cactus <lola@moonsonglabs.com>
## Summary
This PR resolves all CI failures following the migration to self-hosted
GitHub runners (`DH-Testing` group) by eliminating sudo dependencies and
fixing Docker connectivity issues.
## Key Changes
### 🔧 **Eliminated sudo requirements across all workflows**
- **Setup Environment**: Installed mold linker and system dependencies
in userspace without sudo
- **Tool Installation**: Replaced apt/system package installations with
direct binary downloads:
- Kurtosis: Direct binary download from GitHub releases (v1.10.3)
- Taplo: Direct binary installation for Cargo.toml formatting
- cargo-nextest: Using `cargo install` instead of GitHub action
(v0.9.100)
- **Runner Cleanup**: Skipped cleanup-runner action entirely on
self-hosted runners (bare-metal manages disk space externally)
### 🐳 **Fixed Docker connectivity for E2E tests**
- **Enhanced dockerode configuration** with robust fallback logic for
different socket locations
- **Added DOCKER_HOST environment variable** to E2E workflow for
consistent Docker daemon access
- **Implemented connection testing** with detailed error diagnostics for
troubleshooting
- **Resolves FailedToOpenSocket errors** by supporting multiple socket
paths and connection methods
### 🏷️ **Workflow optimizations**
- **Label-based targeting**: All heavy workloads (Rust builds, E2E
tests) now run on `DH-Testing` runners
- **Dependency management**: Used `install-deps: false` flag instead of
hardcoded runner detection
- **Permission fixes**: Corrected Docker build permissions and GHCR
organization names
---------
Co-authored-by: Claude <noreply@anthropic.com>
Add CI check for Polkadot-API metadata freshness
This PR adds a new CI workflow that ensures the Polkadot-API metadata
file (`test/.papi/metadata/datahaven.scale`)
is kept up-to-date when runtime changes are made.
Changes:
- Added task-check-metadata.yml workflow that:
- Reuses the WASM artifact from the build-operator job (no duplicate
compilation)
- Runs `bun x papi add` to regenerate metadata
- Fails if the metadata file has uncommitted changes
- Integrated the check into `CI.yml` as a second-tier job alongside
`docker-build`
Why:
- Prevents TypeScript type definitions from becoming out of sync with
the runtime
- Reminds developers to run `bun generate:types:fast` when making
runtime changes
- Ensures consistent type safety across the codebase
The check provides clear error messages with instructions when metadata
is outdated.
## Changes
- New CI file for making Docker Prod images
- Changed E2E tests use an image built from a local dockerfile
- Some cargo build options to make it quicker
- Fix the cache hit rate
- added `tsgo` preview to the project 😎
- Can be invoked with `bun tsgo` to typecheck
- Install in IDE
[VSCode](https://code.visualstudio.com/docs/configure/extensions/extension-marketplace)
& [Zed](https://github.com/zed-extensions/tsgo) for super-fast inline
typechecking (as you type basically)
## Context
This PR attempts to make the frankly unacceptable CI times better. This
achieves that aim by making a crappy image for day-to-day usage and let
the prod issue take ages since that will be infrequently used. The
reason why the original design didn't work for us is because: 1) we are
using the free GH runners 2) when we goto baremetal runners we'll lose
our rapid caching abilities which make using docker cheap.
Also, we add `tsgo` support to improve devex. The improvement is
astounding.
```sh
hyperfine -n tsc "bun tsc --incremental false --extendedDiagnostics" -n tsgo "bun tsgo --incremental false --extendedDiagnostics"
Benchmark 1: tsc
Time (mean ± σ): 5.500 s ± 0.221 s [User: 8.939 s, System: 0.400 s]
Range (min … max): 5.196 s … 5.845 s 10 runs
Benchmark 2: tsgo
Time (mean ± σ): 99.1 ms ± 8.4 ms [User: 392.8 ms, System: 54.1 ms]
Range (min … max): 88.3 ms … 116.0 ms 29 runs
Summary
tsgo ran
55.48 ± 5.22 times faster than tsc
```
## Changes
- New option: `--kurtosis-enclave-name` to allow you to specify a new
ethereum network with a different enclave name. Neccesary step for
setting up local testing with multiple enclaves running at once
- Refactor: all single dash options (e.g. `-i`) have been renamed to
have double dashes `--` for consistency others
- sad times: CLI now must be invoked `bun cli launch` since we have
multiple fns now
- Package Update
- New Function: `stop` which allows you to kill all components (eth, dh,
relayers) or only a single part
- Gonza's mac target build fix
- Misc fixes to ci like caching and rate limits
## Additional Comment
The CLI needs multiple commands and this PR adds the first new one
`stop`.
This syntax is faily extensible so @ffarall 's k8 command can follow
suite.
Originally we were going to have an `exec` command to expose internal
fns to be callable by command line, i.e. generate beacon checkpoint -
however the need for it has since been superceded by the deploy to k8
command that's going to be added. I've left the code in place though so
when another usecase comes up we can just plug that in.
---------
Co-authored-by: Facundo Farall <37149322+ffarall@users.noreply.github.com>
Eventually our CI will be required to run two private blockchains
locally plus associated relayers.
This PR is to prepare for this fate by improving run times and
refactoring our existing CIs so they are a bit easier to reason about.
### Refactors
- **_We now run ALL CIs on every PR!_** This is so that we decomplexify
the logic around conditional builds and fetching built binaries from
another source. This reduces the surface area of code we have to
maintain at the cost of execution time
- This penalty is ameliorated by a layered caching system. At best, it
will be less than a minute to complete a build since everything will be
cached. On GH runners this is about 6 minutes sadly.
- We will no longer be at risk of important CIs being skipped
erroneously which hide true failures.
- Caching is a low-risk approach because at worst it has to build from
scratch. A bad cache hit will never imply the wrong thing gets build
since cargo is smart enough to just throw away any inappropriate build
artefacts.
- `setup-rust` action created so we have a unified way of setting up
runner and unifying our approach to caching
- Use a unique caching key for different activities and it will fallback
to shared cache if no matches
- we are using `mainnet` kurtosis config so that it works with relayer
assumptions
### Additions
- We can specify the ethereum block time via a new cli arg `--slot-time
<seconds>`
- We can specify arbitrary network_param args which get passed into the
generated yaml
- e.g. giving `bun cli --kurtosis-network-args="pet=cat food=fish" will
add:
```yml
network_params:
# existing params...
pet: cat
food: fish
```
- We now have the ability to programmatically modify the yaml
- This means we are back down to a single `minimal.yml` kurtosis config
so we dont have to maintain changes between them
- Flow is: `add new cli arg` -> `add if() block which mutates yaml` ->
`profit`
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Facundo Farall <37149322+ffarall@users.noreply.github.com>