Commit graph

74 commits

Author SHA1 Message Date
Johnny Greco
1fa29ad940
Merge branch 'main' into andreatgretel/docs/remove-code-reference-docs 2026-05-21 15:02:43 -04:00
Andre Manoel
ff5277088d
fix(ci): trust generated Agentic CI PRs (#643)
* fix(ci): trust generated agentic CI PRs

Signed-off-by: Andre Manoel <amanoel@nvidia.com>

* fix(ci): authorize generated PR checks

Signed-off-by: Andre Manoel <amanoel@nvidia.com>

* fix(ci): pin authorized agentic checks

Signed-off-by: Andre Manoel <amanoel@nvidia.com>

* fix(ci): narrow agentic CI trust

* fix(ci): reject stale agentic authorizations

* fix(ci): serialize agentic authorization

---------

Signed-off-by: Andre Manoel <amanoel@nvidia.com>
2026-05-20 09:27:04 -03:00
Andre Manoel
08ccf3412d
docs: remove docs code reference 2026-05-18 21:34:15 +00:00
dependabot[bot]
387be6f07d
ci: bump the all-actions group across 1 directory with 2 updates (#664)
Bumps the all-actions group with 2 updates in the / directory: [cloudflare/wrangler-action](https://github.com/cloudflare/wrangler-action) and [NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml](https://github.com/nvidia-nemo/fw-ci-templates).


Updates `cloudflare/wrangler-action` from 3.15.0 to 4.0.0
- [Release notes](https://github.com/cloudflare/wrangler-action/releases)
- [Changelog](https://github.com/cloudflare/wrangler-action/blob/main/CHANGELOG.md)
- [Commits](9acf94ace1...ebbaa15849)

Updates `NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml` from 1.1.0 to 1.2.0
- [Release notes](https://github.com/nvidia-nemo/fw-ci-templates/releases)
- [Changelog](https://github.com/NVIDIA-NeMo/FW-CI-templates/blob/main/CHANGELOG.md)
- [Commits](2dee428461...e58924ea30)

---
updated-dependencies:
- dependency-name: cloudflare/wrangler-action
  dependency-version: 4.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: all-actions
- dependency-name: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml
  dependency-version: 1.2.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: all-actions
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-18 11:45:27 -03:00
Andre Manoel
cd604a57a4
ci: fix Fern devnotes artifact lookup (#667)
2026-05-15 17:45:51 -03:00
Andre Manoel
765fccfcb0
docs: fix Fern versioned publishing (#656)
* docs: fix Fern versioned plugin docs

* docs: guard Fern release version content

* docs: dedupe latest Fern release pages

* ci: require latest Fern nav on release

* docs: document Fern release prep

* ci: automate Fern release sync

* ci: publish Fern snapshots from docs branch

* docs: keep Fern archive on docs branch

* docs: harden Fern docs branch publishing

* ci: preview Fern docs from archive branch

* docs: include utility modules in Fern API reference

* ci: harden Fern devnotes publishing

* docs: keep Fern latest label stable

* docs: normalize Fern latest preview label

* docs: align Fern code reference nav

* docs: sync Fern code reference across versions

* docs: materialize Fern version pages

* ci: record Fern publish provenance

* docs: fix Fern generated API MDX

* docs: escape generated Fern API example

* ci: use stable Fern preview URL

* docs: flatten Fern API nav roots

* docs: use generated API overview pages

* ci: allow branch-dispatched Fern publish tests

* docs: update Fern CLI pin

* docs: dedupe release nav validation paths

* docs: address Fern review nits
2026-05-15 17:09:59 -03:00
Andre Manoel
1d203b1dda
feat(agentic-ci): decision-ready triage and daily PR fixes (#600)
* feat(agentic-ci): decision-ready triage and daily PR fixes

Reorganize the weekly issue-triage report around recommended actions
(close as resolved, close as duplicate, needs maintainer decision,
ready for assignment, stuck PR, duplicate PRs, stale) so each flagged
item carries action + evidence + rationale and can be resolved without
opening it. Multi-comment split with i/N markers and orphan
reconciliation when the report grows or shrinks.

Flip the four daily audit suites with mechanical fix categories from
read-only reports to opening one PR per run:

- docs-and-references: broken-link, docstring-drift, arch-ref-rename
- structure: missing-future, lazy-import
- dependencies: transitive-gap, unused
- code-quality: bare-except (draft until landing rate proven)

test-health stays report-only (all candidates require inferring intent).

The shared procedure - fix_backlog selection, finding-hash spec for
stable cross-run identification, attempted_fixes lifecycle with
two-strike escalation, allowlists, ranking, branch/PR conventions -
lives in .agents/recipes/_fix-policy.md. Each suite recipe declares
only its eligible categories, branch types, and test requirements.

Workflow runs claude twice per suite (audit, then conditionally fix),
each capped at the existing --max-turns 50. Fix call is gated on
non-empty fix_backlog and skipped entirely for test-health.

* fix(agentic-ci): address review findings before merge

- Map per-package test targets explicitly in _fix-policy.md (Makefile
  exposes test-config/test-engine/test-interface, not test-<package>).
- Use github-actions[bot] noreply identity for commits the recipes
  produce.
- Refresh fix_backlog.data when an id already exists so the fix phase
  cannot drive a PR from stale data after the underlying file changed.
- Stop time-pruning closed/abandoned attempted_fixes entries — pruning
  before the two-strike threshold erases the history needed to
  escalate. Single-strike entries now age out only via the 200-entry
  cap.
- Disambiguate bare-except findings within the same function by
  including a try-body hash in the finding id.
- Audit grep for code-quality now matches both `except:` and
  `except BaseException:`, in parity with the fix eligibility.
- Restrict transitive-gap fix eligibility to cases where a sibling
  package already declares the dep (avoids inventing version
  specifiers from scratch).
- Issue-triage workflow handles multi-part reports in both the fallback
  post step and the job summary; recipe always writes numbered parts.

* fix(agentic-ci): close residuals from review pass 2

- Replace remaining `make test-<package>` references with pointers to
  the mapping table; only the table itself uses that placeholder now.
- Fix `gh api --paginate | jq | length` returning per-page counts: slurp
  with `jq -s 'add // 0'` to get a single total.
- Compare posted-comment count to expected part count so a partial post
  (agent posted part 1 but not 2/3) triggers the fallback instead of
  being silently treated as success.
- Add `shell: bash` to triage steps using `shopt`/`mapfile` so they're
  not at the mercy of the runner's default shell.
- Disambiguate bare-except findings whose try-body hashes collide by
  adding a per-function ordinal to the canonical_key.
- Tie the 200-entry attempted_fixes cap eviction to `attempts[0].at`
  (the schema has no `first_seen` field).
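
The per-page count problem above can be sketched in Python. This is an illustration only: it assumes the paginated command printed one integer per page, and `total_count` is a hypothetical helper that plays the role of `jq -s 'add // 0'`.

```python
def total_count(per_page_output: str) -> int:
    """Collapse one-count-per-page output into a single total.

    An empty string (no pages at all) yields 0, which is the case the
    `// 0` fallback covers in the jq expression.
    """
    counts = [int(token) for token in per_page_output.split()]
    return sum(counts)


# Hypothetical output of a three-page listing: 30, 30 and 7 items.
print(total_count("30\n30\n7\n"))
```

Comparing the raw three-line output against an integer is the kind of check that fails; comparing the single total does not.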

* fix(agentic-ci): identity-based partial-post detection in triage fallback

Replace the count-only POSTED_COUNT >= EXPECTED_PARTS check with an
identity-based check that extracts every i/N marker seen in
today-dated bot comments and verifies each expected i is present.
A duplicate post of one part can no longer mask a missing other.
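
A minimal sketch of the identity-based check described above, written in Python rather than the workflow's shell and jq. The marker shape (`i/N` as plain digits) and the helper name `missing_parts` are assumptions made for illustration.

```python
import re

# Assumed marker shape: "<i>/<N>" somewhere in the comment body, e.g. "part 2/3".
PART_MARKER = re.compile(r"\b(\d+)/(\d+)\b")


def missing_parts(comment_bodies, expected_parts):
    """Return the part indices that no comment carries a marker for."""
    seen = set()
    for body in comment_bodies:
        match = PART_MARKER.search(body)
        if match is None:
            continue  # unrelated comment: skip it instead of failing
        index, total = int(match.group(1)), int(match.group(2))
        if total == expected_parts:
            seen.add(index)
    return [i for i in range(1, expected_parts + 1) if i not in seen]


bodies = [
    "Triage report 1/3",
    "Triage report 1/3",  # duplicate post of part 1
    "Unrelated bot comment",
    "Triage report 3/3",
]
print(missing_parts(bodies, 3))
```

A count-only check would see three marker-bearing comments for three expected parts and pass, while the identity check reports part 2 as missing. Posting only the returned indices also avoids re-posting parts that already exist.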

* fix(agentic-ci): close remaining bot-review findings

- Exempt two-strike attempted_fixes entries from the 200-entry cap
  eviction. Cap now evicts non-two-strike oldest-first by
  attempts[0].at; two-strike entries are silently forgotten only in
  the pathological all-200-are-two-strike case (itself a signal).
- Specify the attempted_fixes PR-marker reconciliation algorithm:
  scan open PR bodies for the `<!-- agentic-ci finding=<id> -->`
  marker and back-fill missing entries.
- Tighten the daily workflow conditionals to gate on explicit step
  outcomes (steps.audit.outcome == 'success' rather than success())
  so a future pre-audit gate cannot accidentally trip the fix step.
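
The cap eviction rule in the first item above could look roughly like this. The entry shape (`id`, plus an `attempts` list whose items carry an ISO-8601 `at` string in a single timezone) and the reading of "two-strike" as "two or more attempts" are assumptions; the real schema lives in `_fix-policy.md` and is not reproduced here.

```python
def evict_to_cap(entries, cap=200):
    """Trim entries to `cap`, evicting single-strike entries oldest-first.

    Two-strike entries are only touched once every single-strike entry
    has already been evicted.
    """
    overflow = len(entries) - cap
    if overflow <= 0:
        return list(entries)
    # Assumes every entry has at least one attempt.
    oldest_first = sorted(entries, key=lambda e: e["attempts"][0]["at"])
    single = [e for e in oldest_first if len(e["attempts"]) < 2]
    two_strike = [e for e in oldest_first if len(e["attempts"]) >= 2]
    victims = (single + two_strike)[:overflow]
    evicted_ids = {e["id"] for e in victims}
    return [e for e in entries if e["id"] not in evicted_ids]


entries = [
    {"id": "a", "attempts": [{"at": "2026-05-01T00:00:00Z"}]},
    {"id": "b", "attempts": [{"at": "2026-04-01T00:00:00Z"},
                             {"at": "2026-04-08T00:00:00Z"}]},
    {"id": "c", "attempts": [{"at": "2026-05-03T00:00:00Z"}]},
]
print([e["id"] for e in evict_to_cap(entries, cap=2)])
```

With a cap of 2, entry `b` is the oldest but survives because it is two-strike; the oldest single-strike entry `a` is evicted instead.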

* fix(agentic-ci): close Greptile pass-2 findings (timeout, re-verify wording)

- Bump daily-suite job timeout from 20 to 40 minutes. The split into
  two sequential `claude --max-turns 50` invocations can saturate a
  20-minute budget; a mid-fix SIGTERM would leave an orphaned branch
  and inconsistent runner-state.
- Disambiguate the `_phase-fix.md` "do NOT re-scan" rule. It forbids
  rebuilding fix_backlog from scratch but does NOT override the
  per-candidate re-verification step required by _fix-policy.md
  step 4.1 (re-grep / re-read the specific file the candidate points
  at). Single-candidate re-verification is required; whole-codebase
  re-scanning is forbidden.

* fix(agentic-ci): close Greptile pass-3 P1s in triage fallback

- Guard `jq capture()` with a `test()` select. `capture()` errors on
  non-match instead of returning empty, which would truncate
  SEEN_PARTS if any unrelated today-dated bot comment lacks the
  triage marker (e.g. from a sibling workflow). Adding the test()
  guard ensures capture() only runs on bodies that already match.
- Iterate the MISSING[] array when posting fallback parts, not the
  full PARTS[] array. Posting all parts when only some were missing
  was creating duplicate comments for the parts the agent already
  successfully posted.

* fix(agentic-ci): close johnnygreco review-pass warnings

Address the five Warnings from the 2026-05-07 review focused on the
trust boundary for autonomous PR generation. Five workflow/policy
adjustments shrink the surface where agent compliance is load-bearing:

- Workflow-level scope gate. After the fix step, re-derive the diff
  against `origin/main` and validate against the per-suite path
  allowlist (regex mirrored from `_fix-policy.md`), the 50-LOC cap, and
  the 3-file cap. On violation, close the PR with `--delete-branch`
  and flip the `attempted_fixes` entry from `open` to `abandoned` so
  two-strike logic still sees the failure. The recipe alone could not
  bind the agent's path choices; the workflow now does.
- Dependencies install-dev verification. For the dependencies suite
  only, re-run `make install-dev` after the scope gate so the agent's
  pyproject edit is exercised against the lockfile resolver. Closes
  the PR if `install-dev` fails — catches the failure mode where the
  per-package test target passed against the old cached lockfile.
- Flip matrix-job `cancel-in-progress` from true to false. A
  cancellation between the agent's git push and `gh pr create` would
  leave an orphaned branch with no `attempted_fixes` record;
  reconciliation only covers PRs that were opened. Queueing a
  duplicate run is the lesser evil. `_fix-policy.md` Atomicity
  section now documents the trade-off.
- Allow `/tmp/audit-{{suite}}.md` in `_phase-audit.md`'s "do not
  modify outside `{{memory_path}}/`" directive. A literal-minded
  agent could refuse to write the report file, which would break the
  job summary, artifact upload, and the fix phase's audit context.
- Always upload the agent log artifact (was `if: failure()` only) and
  include `runner-state.json`. For autonomous mode, the most
  interesting failure is "the workflow succeeded but the PR was
  wrong"; the stream-json log is the only way to look back days
  later.
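
The workflow-level scope gate from the first item above can be sketched as a pure function over a diff summary. The caps (3 files, 50 LOC) come from the commit message; the allowlist regex, the tuple shape, and counting LOC as added plus deleted lines are assumptions for illustration.

```python
import re

MAX_FILES = 3  # file cap named in the commit message
MAX_LOC = 50   # LOC cap named in the commit message


def scope_violations(numstat, allowlist_pattern):
    """Return human-readable violations for a diff summary.

    `numstat` is a list of (path, added, deleted) tuples, the shape that
    `git diff --numstat` output would parse into.
    """
    allowed = re.compile(allowlist_pattern)
    violations = []
    if len(numstat) > MAX_FILES:
        violations.append(f"{len(numstat)} files changed (cap is {MAX_FILES})")
    changed = sum(added + deleted for _path, added, deleted in numstat)
    if changed > MAX_LOC:
        violations.append(f"{changed} lines changed (cap is {MAX_LOC})")
    for path, _added, _deleted in numstat:
        if allowed.fullmatch(path) is None:
            violations.append(f"path outside allowlist: {path}")
    return violations


# Hypothetical allowlist for a docs-only suite.
DOCS_ALLOWLIST = r"docs/.*\.md"
print(scope_violations([("docs/a.md", 40, 20), ("src/x.py", 1, 0)], DOCS_ALLOWLIST))
```

An empty result means the PR stays open; any violation would lead to the close-and-abandon path described above.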

Also takes johnnygreco's Suggestion 2: spell out in the policy doc
that the `draft_until_proven` flip is the sole human-gated
promotion step in the fix policy and must not be automated.

Greptile and the github-actions auto-reviewer's findings were
already closed in the prior pass-2/pass-3 commits; no action needed
on those.

* fix(agentic-ci): close Codex review-pass-2 findings on workflow gates

Codex flagged five issues in the prior commit's scope/lockfile gates.
This commit closes all five:

- HIGH: Wrong-PR targeting. Both gates selected the last globally-open
  attempted_fixes entry, which could match a stale orphan from a
  prior crashed run rather than the PR opened by *this* run. Adds a
  pre-fix snapshot step that captures `(id, attempts-length)` pairs
  before the fix runs, and changes the post-fix selectors to require
  that the entry's attempts count grew during this run.
- HIGH: Docstring-only enforcement gap on the docs-and-references
  suite. The .py path allowlist was at workflow level but the
  docstring-only caveat was still policy-only. Adds an AST-based
  check: for each .py file changed, parse the post-change tree,
  collect docstring line ranges (module/class/function), then verify
  every added line in the diff is either inside a docstring, a
  comment, or whitespace. Verified locally with both pass and fail
  fixtures.
- MEDIUM: Diff-ref mismatch. Gates diffed `origin/main...HEAD` rather
  than `origin/main...origin/$BRANCH`, so a misbehaving agent that
  left HEAD pointing elsewhere would have validated the wrong tree.
  Now fetches `origin/$BRANCH` first and prefers that ref. Falls
  back to HEAD only if fetch fails (with a warning).
- MEDIUM: FILE_COUNT bug. `grep -c '.' || echo 0` produced "0\n0" on
  empty diff, breaking the downstream integer comparison. Replaces
  with `mapfile -t FILE_ARR` + `${#FILE_ARR[@]}`, which is correct
  for any input including empty.
- LOW: Non-atomic JSON writes. The runner-state mutations could leave
  the file half-written if the workflow was cancelled mid-write.
  Switches both gates to the temp-file + os.replace pattern.
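
For reference, the temp-file + `os.replace` pattern named in the last item generally looks like the sketch below. The helper name `write_json_atomic` is hypothetical; the workflow's actual code is not shown in this commit message.

```python
import json
import os
import tempfile


def write_json_atomic(path, data):
    """Write JSON so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # otherwise os.replace cannot swap it in atomically.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    finally:
        # After a successful replace the temp path is gone; this only
        # cleans up when the write or the replace failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
```

A cancellation in the middle of `json.dump` leaves only the stray temp file behind; the target keeps its previous, complete contents.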

Also: dependencies-lockfile gate now does an explicit
`git checkout --detach origin/$BRANCH` before re-running install-dev,
so verification runs against what was actually pushed rather than
relying on local working-tree state.
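
The AST-based docstring check from the list above can be approximated as follows. This is a sketch under assumptions: the function names are hypothetical, the added line numbers are taken as given (in the workflow they would come from the diff), and a comment is recognised only when the whole line is a comment.

```python
import ast

_DOC_OWNERS = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def docstring_lines(source):
    """Return the 1-based line numbers covered by docstrings."""
    lines = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, _DOC_OWNERS) or not node.body:
            continue
        first = node.body[0]
        if (
            isinstance(first, ast.Expr)
            and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)
        ):
            lines.update(range(first.lineno, first.end_lineno + 1))
    return lines


def non_docstring_additions(source, added_line_numbers):
    """Return added lines that are not docstring, comment, or whitespace."""
    allowed = docstring_lines(source)
    source_lines = source.splitlines()
    offending = []
    for number in added_line_numbers:
        text = source_lines[number - 1].strip()
        if number in allowed or not text or text.startswith("#"):
            continue
        offending.append(number)
    return offending


SOURCE = (
    "def f(x):\n"
    '    """Return x.\n'
    "\n"
    "    More detail.\n"
    '    """\n'
    "    # comment\n"
    "    return x + 1\n"
)
print(non_docstring_additions(SOURCE, [4, 6, 7]))
```

Line 4 sits inside the docstring and line 6 is a comment, so only line 7 (the `return` statement) is reported.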

* fix(agentic-ci): gate fix + scope_gate steps on snapshot.outcome

Greptile review on 872d5617 flagged that the fix step's custom `if:`
expression bypasses GitHub Actions' implicit success() check. Without
explicitly referencing steps.snapshot.outcome, a snapshot failure
(corrupt runner-state, disk error) would let the fix step run anyway.
The scope gate's `jq --slurpfile prior /tmp/prior-attempted-fixes.json`
would then exit non-zero on the missing file, leave OPEN empty, and
hit the "nothing to validate" early-exit — silently approving whatever
the agent pushed.

Adds steps.snapshot.outcome == 'success' to both the fix step's
condition (the actual fix) and the scope_gate step's condition
(belt-and-suspenders against future refactors).

* fix(agentic-ci): harden daily fix gates

Signed-off-by: Andre Manoel <amanoel@nvidia.com>

* fix(agentic-ci): validate all grown fix attempts

* fix(agentic-ci): harden post-fix gates

---------

Signed-off-by: Andre Manoel <amanoel@nvidia.com>
2026-05-12 18:54:01 -03:00
Andre Manoel
46dc8b232a
docs: prepare Fern docs workflow (#622)
* docs: prepare fern generated artifacts

* docs: update fern migration artifacts

* docs: leave colab notebooks unchanged

* docs: add VLM recipe cards to Fern

* docs: trim Dev Notes sidebar

* docs: collapse older Dev Notes in sidebar

* docs: add Fern publishing workflows

* docs: gate Fern publishing on check

* docs: restrict hosted previews for fork PRs

* docs: clean Fern preview URL

* docs: cancel stale preview runs

* docs: clarify devnotes notebook reuse

* docs: clean older versions route

* docs: document Fern versioning conventions

* docs: add Fern release version guard

* docs: harden Fern release tag handling

* ci: let docs preview continue after fern failure

* ci: split docs preview deploy

* docs: clarify fern make commands

* ci: harden fern deploy workflows

* docs: render preview notebooks without outputs

* ci: keep docs preview deploy inline

* docs: align notebook code highlighting

* docs: show notebook snippet scrollbars

* docs: isolate fern preview check failures

* ci: align fern release docs behavior
2026-05-12 18:18:26 -03:00
dependabot[bot]
eb0b9d3226
ci: bump NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml (#621)
Bumps the all-actions group with 1 update: [NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml](https://github.com/nvidia-nemo/fw-ci-templates).


Updates `NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml` from 0.94.1 to 1.1.0
- [Release notes](https://github.com/nvidia-nemo/fw-ci-templates/releases)
- [Changelog](https://github.com/NVIDIA-NeMo/FW-CI-templates/blob/main/CHANGELOG.md)
- [Commits](211c302d64...2dee428461)

---
updated-dependencies:
- dependency-name: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml
  dependency-version: 1.1.0
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: all-actions
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-08 17:10:38 -03:00
Andre Manoel
b502dd3575
ci: add graphify structural impact analysis to PR review and structure audit (#567)
* ci: add graphify structural impact analysis to PR review and structure audit

Add a graphify-based AST analysis tool that builds a directed graph of the
codebase (~2s, no LLM calls) to detect architectural impact. Integrates
into both the PR review workflow (pre-computed before claude runs) and the
Wednesday structure audit (with week-over-week diff).

PR review: extracts changed files against the full codebase graph, reports
risk level (LOW/MEDIUM/HIGH), god nodes affected, import direction
violations, and cross-package dependencies. Output saved to /tmp and read
by the review agent.

Structure audit: produces god node rankings, cross-package edge summary
table, import violation detection, and graph diff against previous week's
cached graph. Baselines saved for runner memory trend tracking.

* fix: harden graphify integration - security, correctness, and CI weight

- Fix KeyError: god_nodes() returns 'degree' not 'edges' (3 call sites)
- Fix deduped vs raw violation count inconsistency in baselines.json
- Security: run structural_impact.py from base-branch checkout so fork
  PRs cannot inject code that executes with GH_TOKEN in scope
- Add --repo-root flag so the tool resolves package paths correctly when
  invoked from a different checkout directory
- Replace make install-dev + .venv with lightweight /tmp/graphify-venv
  (only graphifyy needed, saves ~2min CI per PR review)
- Add graphify-out/ to .gitignore (9MB graph cache is CI-only)

* fix: pin graphifyy version and fix dedup truncation

Pin graphifyy==0.4.23 in both CI workflows to prevent
breaking changes from unpinned installs. Fix _dedup()
label truncation at 30 chars that could merge distinct
entities sharing a common prefix.
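
To illustrate the truncation problem: the sketch below is not the project's `_dedup()`, just a hypothetical `dedup_labels` helper with made-up labels, showing how a 30-character key merges two distinct entities.

```python
def dedup_labels(labels, max_len=None):
    """Deduplicate labels, optionally keying on a truncated prefix."""
    unique = {}
    for label in labels:
        key = label if max_len is None else label[:max_len]
        unique.setdefault(key, label)  # first label wins for each key
    return list(unique.values())


# Hypothetical labels that share their first 35 characters.
labels = [
    "engine.processing.generators.ColumnTextGenerator",
    "engine.processing.generators.ColumnCodeGenerator",
]
print(len(dedup_labels(labels, max_len=30)))  # truncated keys collide
print(len(dedup_labels(labels)))              # full-label keys stay distinct
```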

* fix(ci): use array expansion for changed-files arg to handle special filenames

Replace unquoted $CHANGED_PY word-split with mapfile + array
expansion to prevent glob expansion and correctly handle
filenames containing spaces or special characters.

* fix: derive changed nodes from graph and improve MEDIUM risk reason

Derive changed_node_ids from the already-built graph by matching
source_file paths instead of running a separate extraction pass.
Removes implicit dependency on graphify ID stability across
independent extractions.

Fix MEDIUM risk reason to reflect the actual trigger (cluster
spread vs high-connectivity entity) instead of always reporting
cluster count.

* fix: address Codex review findings - security, edge coverage, dedup, stale artifacts

Split the workflow step to isolate GH_TOKEN from graphifyy execution,
preventing a compromised package release from exfiltrating write-scoped
tokens.

Scan both edge directions in _cross_package_edges so inbound dependents
and violations where the changed node is the target are visible. Detect
deleted files and report them as a risk signal.

Include relation type in dedup key so distinct edge types between the
same labels are not collapsed.
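
A small sketch of keying deduplication on the relation type as well as the labels. The edge dictionaries and the `dedup_edges` name are assumptions; the analyzer's real data structures are not shown in this message.

```python
def dedup_edges(edges):
    """Drop exact repeats while keeping distinct relation types apart."""
    unique = {}
    for edge in edges:
        key = (edge["source"], edge["target"], edge["relation"])
        unique.setdefault(key, edge)
    return list(unique.values())


edges = [
    {"source": "A", "target": "B", "relation": "imports"},
    {"source": "A", "target": "B", "relation": "calls"},
    {"source": "A", "target": "B", "relation": "imports"},  # true duplicate
]
print([e["relation"] for e in dedup_edges(edges)])
```

Keying on `(source, target)` alone would have collapsed the `imports` and `calls` edges into one.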

Clean stale /tmp artifacts before running analysis to prevent reruns
from reading old reports.

* fix: address review feedback - type annotations, hoist imports, narrow except, isolate daily graphify

- structural_impact.py:
  - replace bare _build_graph dict return with frozen _Analysis dataclass
  - add G: Any annotation on _cross_package_edges (STYLEGUIDE: all params typed)
  - hoist `from graphify.export import to_json` and
    `from networkx.readwrite import json_graph` to module top
    (no perf justification for deferred import)
  - narrow `except Exception` in graph-diff fallback to
    (JSONDecodeError, KeyError, TypeError, OSError)
- agentic-ci-daily.yml: install graphifyy into /tmp/graphify-venv instead of
  the project .venv, matching agentic-ci-pr-review.yml. Keeps graphify's
  transitives (networkx) out of the project venv permanently.
- structure/recipe.md: invoke the tool via /tmp/graphify-venv/bin/python
  to match the workflow change.

* feat(ci): warn when changed files touch unknown packages

A new package under packages/ that isn't in _PACKAGE_SUBDIRS is silently
absent from the graph - the analyzer would falsely report LOW risk with
0 entities. Add a _Note line in the changed-files report when any changed
or deleted file lives under packages/<unknown>/, so the failure mode the
analyzer is supposed to surface isn't itself silent.

_KNOWN_PACKAGE_DIRS is derived from _PACKAGE_SUBDIRS so future additions
stay in sync without a second source of truth.
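
The unknown-package warning might be derived along these lines. The contents and shape of `_PACKAGE_SUBDIRS` below are assumed for illustration (the commit message does not show them), and `unknown_packages` is a hypothetical helper name.

```python
from pathlib import PurePosixPath

# Assumed shape: one source directory per known package.
_PACKAGE_SUBDIRS = (
    "packages/config/src",
    "packages/engine/src",
    "packages/interface/src",
)

# Derived, so adding a package to _PACKAGE_SUBDIRS updates both.
_KNOWN_PACKAGE_DIRS = frozenset(
    PurePosixPath(subdir).parts[1] for subdir in _PACKAGE_SUBDIRS
)


def unknown_packages(changed_files):
    """Return package directories under packages/ that the graph ignores."""
    unknown = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Require packages/<name>/<something> so that a file directly
        # under packages/ is not mistaken for a package directory.
        if (
            len(parts) >= 3
            and parts[0] == "packages"
            and parts[1] not in _KNOWN_PACKAGE_DIRS
        ):
            unknown.add(parts[1])
    return sorted(unknown)


print(unknown_packages([
    "packages/engine/src/a.py",
    "packages/newpkg/src/b.py",
    "docs/index.md",
]))
```

A non-empty result is what would trigger the note in the changed-files report instead of a silent LOW risk rating.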
2026-05-05 14:47:52 -03:00
dependabot[bot]
1feb57ec03
ci: bump NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml (#596)
Bumps the all-actions group with 1 update: [NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml](https://github.com/nvidia-nemo/fw-ci-templates).


Updates `NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml` from 0.93.0 to 0.94.1
- [Release notes](https://github.com/nvidia-nemo/fw-ci-templates/releases)
- [Changelog](https://github.com/NVIDIA-NeMo/FW-CI-templates/blob/main/CHANGELOG.md)
- [Commits](38cee3a372...211c302d64)

---
updated-dependencies:
- dependency-name: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml
  dependency-version: 0.94.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: all-actions
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Andre Manoel <165937436+andreatgretel@users.noreply.github.com>
2026-05-04 11:07:46 -03:00
Andre Manoel
482ab5a224
ci: raise agent audit turn limit and preserve logs (#571)
* ci: raise agent audit turn limit and preserve logs

The Friday test-health audit hit the 30-turn cap on its first-ever run
(2026-04-24) and the agent log was discarded with the self-hosted
runner. Heavier recipes need more room, and the next failure should be
diagnosable.

- Raise --max-turns from 30 to 50
- Switch --output-format from text to stream-json so events are emitted
  during the run instead of only at process exit; prefix with
  stdbuf -oL -eL to line-buffer the pipe
- Upload /tmp/claude-audit-log.txt and /tmp/audit-<suite>.md as an
  artifact (if: always(), 14-day retention) using the upload-artifact
  SHA already pinned in build-notebooks.yml

Signed-off-by: Andre Manoel <amanoel@nvidia.com>

* ci: disambiguate audit artifact name across run attempts

actions/upload-artifact@v4+ rejects duplicate names within a workflow,
and re-running a failed run reuses the same github.run_id. Append
github.run_attempt so re-runs upload successfully instead of failing at
the exact moment the artifact is most useful.

Found by Codex review of #571.

Signed-off-by: Andre Manoel <amanoel@nvidia.com>

* ci: only upload agent log on failure

Raise the bar for persisting the full verbose stream-json event log:
we only need it when we're actually debugging a failure, and the audit
report itself still lands in the step summary on success. Shrinks the
window where tool inputs, read file contents, or other verbose-stream
detail could end up in a 14-day artifact.

Addresses the minor privacy finding from Codex review of #571.

Signed-off-by: Andre Manoel <amanoel@nvidia.com>

* ci: drop raw agent log from job summary

With --output-format stream-json the previous tail -100 of the agent
log emitted raw NDJSON into the GH Actions UI summary, which is
unreadable. The audit report itself (/tmp/audit-<suite>.md) already
carries the human-readable payload, and the full event stream is
available as an on-failure artifact, so the raw tail was redundant and
worse than nothing for the summary surface.

Also rewords the fallback message to point at the artifact when no
report lands (typically a failure).

Signed-off-by: Andre Manoel <amanoel@nvidia.com>

---------

Signed-off-by: Andre Manoel <amanoel@nvidia.com>
2026-04-28 15:53:48 -03:00
dependabot[bot]
8266eb79a9
ci: bump the all-actions group across 1 directory with 5 updates (#558)
* ci: bump the all-actions group with 5 updates

Bumps the all-actions group with 5 updates:

| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `4` | `6` |
| [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) | `7.6.0` | `8.1.0` |
| [actions/cache](https://github.com/actions/cache) | `5.0.4` | `5.0.5` |
| [cloudflare/wrangler-action](https://github.com/cloudflare/wrangler-action) | `3.14.1` | `3.15.0` |
| [NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml](https://github.com/nvidia-nemo/fw-ci-templates) | `0.88.1` | `0.93.0` |


Updates `actions/checkout` from 4 to 6
- [Release notes](https://github.com/actions/checkout/releases)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)

Updates `astral-sh/setup-uv` from 7.6.0 to 8.1.0
- [Release notes](https://github.com/astral-sh/setup-uv/releases)
- [Commits](https://github.com/astral-sh/setup-uv/compare/v7.6...08807647e7069bb48b6ef5acd8ec9567f424441b)

Updates `actions/cache` from 5.0.4 to 5.0.5
- [Release notes](https://github.com/actions/cache/releases)
- [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md)
- [Commits](668228422a...27d5ce7f10)

Updates `cloudflare/wrangler-action` from 3.14.1 to 3.15.0
- [Release notes](https://github.com/cloudflare/wrangler-action/releases)
- [Changelog](https://github.com/cloudflare/wrangler-action/blob/main/CHANGELOG.md)
- [Commits](da0e0dfe58...9acf94ace1)

Updates `NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml` from 0.88.1 to 0.93.0
- [Release notes](https://github.com/nvidia-nemo/fw-ci-templates/releases)
- [Changelog](https://github.com/NVIDIA-NeMo/FW-CI-templates/blob/main/CHANGELOG.md)
- [Commits](2a49420d5a...38cee3a372)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: all-actions
- dependency-name: astral-sh/setup-uv
  dependency-version: 8.1.0
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: all-actions
- dependency-name: actions/cache
  dependency-version: 5.0.5
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: all-actions
- dependency-name: cloudflare/wrangler-action
  dependency-version: 3.15.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: all-actions
- dependency-name: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml
  dependency-version: 0.93.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: all-actions
...

Signed-off-by: dependabot[bot] <support@github.com>

* ci: pin actions/checkout to SHA in agentic-ci-issue-triage

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Andre Manoel <amanoel@nvidia.com>
2026-04-21 15:21:19 -03:00
Andre Manoel
addece9828
fix(ci): grant permissions to reusable workflow calls in build-docs and pack-tutorials (#561)
The top-level `permissions: {}` added in #517 restricts all jobs to zero
permissions by default. The `build-notebooks` jobs that call the reusable
workflow did not override this, so GitHub Actions refused to start them
(startup_failure). Add the required `actions: read` and `contents: write`
permissions to both calling jobs.

Fixes the v0.5.7 release docs build failure.
2026-04-21 12:48:29 -03:00
Andre Manoel
b220f3697b
ci: add daily audit suites with 5 rotating recipes and scheduled workflow (#543)
* ci: add daily audit suites with 5 recipes and scheduled workflow

Add the daily maintenance infrastructure (Phase 2+3 of the agentic CI
plan). A new workflow runs one audit suite per weekday via day-of-week
rotation, with runner memory persisted via actions/cache.

Recipes: docs-and-references (Mon), dependencies (Tue), structure (Wed),
code-quality (Thu), test-health (Fri). Each targets gaps that CI and ruff
don't cover: cross-reference validation, transitive dep analysis, lazy
import compliance, complexity trends, and test-to-source mapping.

Reports go to the Actions step summary. Code changes use /create-pr.
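
A minimal sketch of the day-of-week rotation described above, assuming the workflow simply maps the weekday to a suite name. The function name `pick_suite` and the weekend behaviour (no suite) are illustrative assumptions, not taken from the workflow itself.

```python
from datetime import date
from typing import Optional

# Weekday-to-suite mapping as listed in the commit message.
# date.weekday() returns 0 for Monday through 6 for Sunday.
SUITE_BY_WEEKDAY = {
    0: "docs-and-references",
    1: "dependencies",
    2: "structure",
    3: "code-quality",
    4: "test-health",
}


def pick_suite(day: date) -> Optional[str]:
    """Return the audit suite for a date, or None on weekends (assumed)."""
    return SUITE_BY_WEEKDAY.get(day.weekday())
```

For example, `pick_suite(date(2026, 4, 24))` falls on a Friday and selects `test-health`.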

* ci: add executable smoke checks and harden runner memory

Add executable smoke checks to test-health and code-quality recipes
that exercise real code paths (config build, validate, import timing,
registry completeness, error hierarchy, input rejection) without
needing an LLM provider. Checks are split into fixed canaries (same
every run) and creative checks (agent varies inputs each run).

Harden runner memory: define JSON schema in _runner.md with TTL and
size rules, validate state file after agent runs, only update
last_run on success, drop unused audit-log.md. Add make install-dev
workflow step so recipes can run Python against the installed packages.

* ci: fix codex review findings - test paths, provider check, step gating

Fix issues found by Codex review:
- Fix test paths: tests/ does not exist at repo root, use
  packages/*/tests/ and packages/data-designer/tests/test_import_perf.py
- Remove DataDesigner(model_providers=[]) from smoke checks - raises
  NoModelProvidersError; keep config-layer checks only
- Fix audit step gating: remove continue-on-error, use step outcome
  to gate runner memory update (|| true + continue-on-error made the
  step always "succeed", defeating the success() condition)

* ci: fix review findings - heredoc, state validation, lazy import wording

Fix heredoc with indented EOF terminator that never terminates - replace
with printf. Run state validation on all outcomes (not just success) so
corrupted state from a failed audit is caught before caching. Only stamp
last_run when audit succeeds. Align test-health lazy import section with
its own Constraints (report count only, don't duplicate structure audit).

Also fixes datetime.utcnow() deprecation and shell variable injection
in Python string by using os.environ instead.
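
The two fixes in the last paragraph can be sketched roughly as follows. This is an illustration only: the `AUDIT_SUITE` variable name and the `suite` key are hypothetical, and the real state file layout is not shown in this log.

```python
import os
from datetime import datetime, timezone


def stamp_last_run(state: dict) -> dict:
    """Return a copy of state with last_run set from environment-provided values."""
    # Read the value from the environment instead of letting the shell splice
    # it into Python source. AUDIT_SUITE is a hypothetical variable name.
    suite = os.environ.get("AUDIT_SUITE", "unknown")
    # Timezone-aware replacement for the deprecated datetime.utcnow().
    now = datetime.now(timezone.utc)
    updated = dict(state)
    updated["suite"] = suite  # hypothetical key
    updated["last_run"] = now.isoformat()
    return updated
```

Because the value arrives through `os.environ`, quotes or other shell metacharacters in it are treated as plain data rather than as part of the Python program.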
2026-04-17 14:48:55 -03:00
Andre Manoel
6ef49538a4
fix: use pull_request_target for agentic CI on fork PRs (#541)
* fix: use pull_request_target for agentic CI on fork PRs

* fix: read recipe files from base branch to prevent prompt injection

Recipe files define the agent's prompt. When using pull_request_target,
the fork's HEAD is checked out, so a malicious fork could craft recipe
files to exfiltrate API secrets via prompt injection. Fix by adding a
second sparse checkout from the base branch for .agents/recipes/ and
reading prompts from there instead of the fork tree.

* fix: align actions/checkout version for base-recipes checkout

Match the base-branch recipe checkout to v6.0.2 (same SHA as the PR
branch checkout) for consistency.

* fix: move expression interpolations to env vars in gate and review jobs

Replace direct ${{ }} interpolation in run: blocks with env vars.
Most values are GitHub-controlled, but github.event.label.name can
contain arbitrary characters and could break shell quoting. Moving
everything to env: is consistent with the injection-hardening pattern
applied in the rest of the workflow.
2026-04-15 19:11:29 -03:00
Andre Manoel
f267e19a60
fix(ci): replace yq with Python nav patching in publish-devnotes (#548)
The yq JSON roundtrip was mangling the entire mkdocs.yml file
(indentation, quoting, comments), causing mike deploy to fail.

Extract a Python script that surgically replaces only the Dev Notes
nav block, leaving all other content byte-identical.
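
One way such a surgical replacement could look, as a sketch and not the script from this commit: treat mkdocs.yml as plain text, find the nav header line, and swap only the more deeply indented lines below it. The function name and the assumption that the block is delimited by indentation are illustrative.

```python
def replace_nav_block(text: str, header: str, new_block: list[str]) -> str:
    """Swap the indented block under `header` for `new_block`; other lines stay untouched."""
    lines = text.splitlines(keepends=True)
    # Locate the header line, e.g. "  - Dev Notes:".
    start = next((i for i, line in enumerate(lines) if line.strip() == header), None)
    if start is None:
        raise ValueError(f"nav header not found: {header!r}")
    indent = len(lines[start]) - len(lines[start].lstrip())
    # The block ends at the first later non-blank line that is not indented deeper.
    end = start + 1
    while end < len(lines):
        line = lines[end]
        if line.strip() and len(line) - len(line.lstrip()) <= indent:
            break
        end += 1
    # Keep blank lines that trail the block so spacing after it is preserved.
    while end > start + 1 and not lines[end - 1].strip():
        end -= 1
    return "".join(lines[: start + 1] + new_block + lines[end:])
```

Since no YAML parser is involved, comments, quoting, and indentation outside the replaced block come through unchanged.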
2026-04-14 16:03:49 -03:00
Andre Manoel
1a237d95d0
fix: text-to-sql devnote date, images, and publish-devnotes nav (#546)
- Update post date from 2026-03-11 to 2026-04-14 so it appears as the
  newest post on the devnotes page.
- Replace raw <img> tags with markdown image syntax so mkdocs rewrites
  relative paths correctly for the blog plugin's slug-based URLs.
- Overlay mkdocs.yml from HEAD in publish-devnotes workflow so new nav
  entries are included in devnotes-only rebuilds.
2026-04-14 15:48:23 -03:00
dependabot[bot]
abe5c2d177
ci: bump the all-actions group with 5 updates (#539)
* ci: bump the all-actions group with 5 updates

Bumps the all-actions group with 5 updates:

| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `4.3.1` | `6.0.2` |
| [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) | `7.6.0` | `8.0.0` |
| [actions/download-artifact](https://github.com/actions/download-artifact) | `7.0.0` | `8.0.1` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `6.0.0` | `7.0.1` |
| [NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml](https://github.com/nvidia-nemo/fw-ci-templates) | `0.65.12` | `0.88.1` |


Updates `actions/checkout` from 4.3.1 to 6.0.2
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4.3.1...de0fac2e4500dabe0009e67214ff5f5447ce83dd)

Updates `astral-sh/setup-uv` from 7.6.0 to 8.0.0
- [Release notes](https://github.com/astral-sh/setup-uv/releases)
- [Commits](37802adc94...cec208311d)

Updates `actions/download-artifact` from 7.0.0 to 8.0.1
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](37930b1c2a...3e5f45b2cf)

Updates `actions/upload-artifact` from 6.0.0 to 7.0.1
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](b7c566a772...043fb46d1a)

Updates `NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml` from 0.65.12 to 0.88.1
- [Release notes](https://github.com/nvidia-nemo/fw-ci-templates/releases)
- [Changelog](https://github.com/NVIDIA-NeMo/FW-CI-templates/blob/main/CHANGELOG.md)
- [Commits](21f18ae8b6...2a49420d5a)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: 6.0.2
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: all-actions
- dependency-name: astral-sh/setup-uv
  dependency-version: 8.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: all-actions
- dependency-name: actions/download-artifact
  dependency-version: 8.0.1
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: all-actions
- dependency-name: actions/upload-artifact
  dependency-version: 7.0.1
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: all-actions
- dependency-name: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml
  dependency-version: 0.88.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: all-actions
...

Signed-off-by: dependabot[bot] <support@github.com>

* ci: skip docs preview deploy for Dependabot PRs

GitHub does not expose repository secrets to Dependabot PRs, so the
Cloudflare Pages deploy always fails with a missing API token. Skip the
entire job when the actor is dependabot[bot].

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Andre Manoel <amanoel@nvidia.com>
Co-authored-by: Andre Manoel <165937436+andreatgretel@users.noreply.github.com>
2026-04-13 20:28:38 -03:00
Andre Manoel
82c1a69739
ci: add PR hygiene automation (linked issue check + stale PR cleanup) (#521)
* ci: add PR hygiene automation (linked issue check + stale PR cleanup)

Add two workflows to enforce contribution quality and clean up abandoned PRs:

- pr-linked-issue.yml: required status check that validates external PRs
  reference a triaged issue. Collaborators bypass. Re-triggers automatically
  when a maintainer adds the `triaged` label to the linked issue.

- pr-stale.yml: daily cron that reminds authors of failing checks after 7/14
  days of inactivity and auto-closes after 14/28 days (external/collaborator).
  Respects `keep-open` label.
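
The stale-PR thresholds above could be expressed as the following sketch. It reads "7/14" and "14/28" as external/collaborator pairs measured in total days of inactivity, which is an assumption; it also ignores check status, and the names `StalePolicy` and `stale_action` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StalePolicy:
    remind_after_days: int
    close_after_days: int


# Thresholds as read from the commit message (external / collaborator).
EXTERNAL = StalePolicy(remind_after_days=7, close_after_days=14)
COLLABORATOR = StalePolicy(remind_after_days=14, close_after_days=28)


def stale_action(days_inactive: int, is_collaborator: bool, labels: set) -> str:
    """Return "close", "remind", or "none" for a PR under the assumed policy."""
    # The keep-open label exempts a PR from reminders and auto-close.
    if "keep-open" in labels:
        return "none"
    policy = COLLABORATOR if is_collaborator else EXTERNAL
    if days_inactive >= policy.close_after_days:
        return "close"
    if days_inactive >= policy.remind_after_days:
        return "remind"
    return "none"
```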

New labels created: `triaged`, `task`, `keep-open`.

Closes #518
Signed-off-by: Andrea Manoel <amanoel@nvidia.com>

* ci: add agentic repository triage workflow

Add a weekly scheduled workflow that uses Claude to triage all open issues
and PRs, producing a combined dashboard report on a pinned tracking issue.

- New recipe (.agents/recipes/issue-triage/) classifies issues, checks
  staleness, cross-references merged PRs, detects duplicates, and flags
  PR health problems (missing linked issues, failing checks, orphaned PRs)
- New workflow (.github/workflows/agentic-ci-issue-triage.yml) runs every
  Monday 10:00 UTC on the agentic-ci runner, with manual dispatch support
- pr-stale.yml now adds needs-attention label to linked issues when a PR
  is auto-closed, bridging the two workflows via labels

* docs: document stale PR policy and auto-retrigger in CONTRIBUTING.md

* fix: address review findings in PR hygiene workflows

- pr-linked-issue: fix comment gate so failure comments are posted
- pr-stale: upgrade issues permission to write for labeling
- pr-stale: compare reminder timestamp against last activity so
  push/comment actually resets the stale timer

* fix: use --body-file in retrigger job to avoid shell quoting issues

PR bodies with backticks or unmatched quotes would break the
gh pr edit --body "$NEW_BODY" call. Write to a temp file and
use --body-file instead.

* fix: retrigger job drops PRs after the first

jq outputs newline-separated numbers but GITHUB_OUTPUT only
preserves the first line. Convert to space-separated so the
for loop processes all matching PRs.

* fix: harden workflows against shell injection

- Move attacker-influenced values (${{ user.login }}, step outputs)
  from expression interpolation in run: blocks to env vars
- Replace echo "$PR_BODY" | grep with write-to-file + grep-file
  to avoid shell expansion of untrusted PR body content
- Same treatment for PR body handling in retrigger and stale jobs

* refactor: replace peter-evans actions with gh api calls

Remove peter-evans/find-comment and peter-evans/create-or-update-comment
third-party action dependencies. Replace with gh api calls for finding,
creating, updating, and deleting bot comments. Eliminates supply chain
risk from unpinned third-party actions.

* docs: add pull_request_target security comment

---------

Signed-off-by: Andrea Manoel <amanoel@nvidia.com>
2026-04-13 20:26:02 -03:00
Andre Manoel
aee3d3ff90
ci: publish devnotes independently of releases (#536)
* ci: add workflow to publish devnotes independently of releases

Adds a GitHub Actions workflow that rebuilds the `latest` docs alias
when devnotes change on main, so blog posts go live without cutting
a package release.

* ci: pin actions to commit SHAs and restrict default permissions

Address Greptile review findings:
- Pin checkout, setup-uv, and download-artifact to commit SHAs
  matching the pattern from #517
- Add top-level permissions: {} to restrict default token scope

* ci: build devnotes from last deployed state, not main

Instead of building the full site from main (which could include
unreleased docs), checkout the commit that latest was last built
from (tracked in gh-pages commit messages) and overlay only
docs/devnotes/ from main. Download notebooks from the last
successful build-docs run instead of rebuilding them.

* ci: add actions:read permission for notebook download

The gh run list/download calls need actions:read on GITHUB_TOKEN,
which is denied by the top-level permissions: {} block.
2026-04-13 14:39:11 -03:00
Andre Manoel
47be28c799
fix: tune Dependabot config and fix DCO assistant bugs (#534)
* fix: restrict Dependabot pip updates to security-only

The Dependabot config added in #517 included weekly version-bump PRs for
all three pip packages. This would generate noisy PRs for routine dep
updates we don't need. Set open-pull-requests-limit: 0 on the pip
ecosystems so only CVE-triggered security updates open PRs.

GitHub Actions weekly bumps are kept as-is to keep SHA pins current.

* fix: group Dependabot Actions PRs and fix DCO allowlist

- Add a Dependabot group to bundle all GitHub Actions updates into a
  single weekly PR instead of one per action
- Fix DCO allowlist: dependabot -> dependabot[bot] to match the actual
  GitHub username (the old value never matched, but there were no
  Dependabot PRs before #517 to expose the bug)

* fix: align DCO assistant if-condition with custom sign-off text

The step's if-condition checked for the default sign-off text but
custom-pr-sign-comment uses different wording. This meant the
issue_comment trigger was always skipped - sign-offs only worked
by accident when a subsequent push re-triggered the action via
pull_request_target.
2026-04-13 12:12:26 -03:00
Andre Manoel
54d51bdf89
chore: harden CI supply chain (#517)
* ci: harden CI supply chain

Pin all GitHub Actions to commit SHAs to prevent tag-based supply chain
attacks (same class as CVE-2025-30066). Replace softprops/action-gh-release
(single-maintainer, no security policy) with gh CLI. Add top-level
permissions: {} to all workflows that lacked it, enforcing least-privilege
by default. Enable Dependabot for GitHub Actions and pip dependencies.

Closes #471

* fix: add dependabot pip entries for each sub-package

The root directory has no pyproject.toml; the actual packages live under
packages/data-designer-config, packages/data-designer-engine, and
packages/data-designer.
2026-04-13 10:34:26 -03:00
Andre Manoel
13cd6879bb
fix: narrow docs-preview workflow path filter (#515)
The docs-preview workflow triggered on all source code changes due to
the broad `packages/*/src/data_designer/**` path glob. This caused
unnecessary Cloudflare Pages deployments on code-only PRs like #505.

Remove the source code path filter so the workflow only triggers on
actual docs content changes (docs/**, mkdocs.yml, and the workflow
file itself).
2026-04-09 16:51:10 -03:00
Andre Manoel
0e90ea644b
docs: add async engine dev note (#490)
* fix: address review feedback on async engine dev note

- Fix wall-clock claim: 41% -> 22% to match benchmark table
- Fix dual-model speedup rounding: 1.7x -> 1.6x (10.0/6.1 = 1.64)
- Fix run_config API: use dd.set_run_config() instead of passing to create()

* docs: add async engine dev note

Add "Async All the Way Down" dev note covering the async task-queue
scheduler built across PRs #356, #378, #404, #429, #456. Includes
benchmark results, architecture diagrams, and DAG shape illustrations.

* feat: add docs preview workflow for PRs

Build MkDocs site on PRs that touch docs and deploy to Cloudflare
Pages. Each PR gets a browseable preview URL posted as a comment.
Notebook tutorials use placeholder stubs since they require API
keys to execute.

Requires CLOUDFLARE_API_TOKEN and CLOUDFLARE_ACCOUNT_ID repo secrets.

* fix: update speedup chart alt text from 1.7x to 1.6x

* docs: improve timeline figure context and labeling

Add DAG subtitle to sync-vs-async timeline figure and bridge the
surrounding text to explain which workload shape is being shown.

* edits+additions to async-all-the-way-down dev notes

* clarify two semaphore dance

* remove dead link

* replace hero image

* docs: update scale figures with nginx-accurate data and adjust sizing

Regenerate scale-model-timeline and scale-boxplot from nginx access
logs (column_progress.csv, sync/summary.json) instead of buffered
execution logs. Optimize both PNGs to palette mode. Adjust figure
widths and update model timeline commentary.

* add link from owning-the-model-stack to async-dev-note

* docs: address review feedback on async blog post

- Tighten intro to a concise abstract, move pipeline narrative into
  "The Bottleneck Was Structural" section
- Remove multi-column generators / seed readers paragraph (TMI)
- Clarify sync engine ran columns sequentially within each batch

---------

Co-authored-by: Nabin Mulepati <nmulepati@nvidia.com>
2026-04-08 15:51:04 -03:00
Andre Manoel
6b92351682
ci: add PR review workflow and recipe for agentic CI (#498)
* ci: add PR review workflow and recipe

Add the remaining Phase 1 deliverables for the agentic CI plan:

- PR review recipe that composes the existing review-code skill
- PR review workflow with collaborator-only gate, auth detection,
  pre-flight checks, and re-review label support
- Mark Phase 1 items complete in the plan (except docs)

* fix: use explicit draft == false instead of ! operator in workflow if

The ! operator in a >- YAML block may cause parsing issues. Use
explicit comparison instead.

* fix: address review feedback + simplify if condition for debugging

- Fix: only re-review label triggers on labeled events (greptile)
- Fix: use printf instead of echo -e for prompt assembly (greptile)
- Debug: simplify if condition to isolate why job is skipping

* debug: set if to true to test runner connectivity

* debug: add job to dump event context for if-condition debugging

* fix: use collaborator API for permission check instead of author_association

author_association in webhook payloads reports NONE when org membership
is private, causing the job to skip even for members. Replace with a
gate job that checks collaborator permissions via the API, which works
regardless of org visibility settings.

* fix: disable prompt caching and skip posting on review failure

- Set DISABLE_PROMPT_CACHING=1 for Bedrock-backed endpoints that don't
  support cache_control parameters
- Don't post a comment when the review file isn't produced, just emit
  a warning annotation on the workflow run

* fix: rename label to agent-review, remove synchronize trigger

- Rename re-review -> agent-review for clarity
- Remove synchronize from trigger types so reviews are opt-in on
  subsequent pushes (use the agent-review label to retrigger)
- Reviews still auto-run on PR open and draft -> ready transitions

* fix: validate PR number input and remove unused auth mode step

* fix: address review feedback - quoting, checkout ordering, stale docs

- Pass all step outputs through env vars instead of direct expression
  injection in shell (PR number, model name)
- Resolve head SHA before checkout so dispatch doesn't clone at wrong ref
- Use set -o pipefail + continue-on-error instead of || true
- Remove stale synchronize references from plan doc

* fix: add specific review guidance for plan docs

* fix: check labeler permission for agent-review on external PRs

For labeled events, check the sender (who added the label) instead of
the PR author. This lets maintainers authorize agent reviews on PRs
from external contributors by adding the agent-review label.
2026-04-07 21:47:42 -03:00
Nabin Mulepati
4768a3671d
chore: plan 427, PR 2 of agent-first development plan (#478)
* save progress

* undo review-code skill change

* delete status file

* small tweaks

* Fix 429 info

* update working on skill info

* updates

* Update architecture/overview.md

Co-authored-by: Johnny Greco <jogreco@nvidia.com>

* fix: correct symbol names and CLI commands in architecture docs

Address review comments:
- models.md: describe clients as native httpx adapters, not SDK wrappers
- agent-introspection.md: use actual family keys (columns, samplers, etc.) not column-types
- cli.md: use correct command `data-designer config models`
- plugins.md: SEED_READER not SEED_SOURCE, inject_into_processor_config_type_union

Made-with: Cursor

---------

Co-authored-by: Johnny Greco <jogreco@nvidia.com>
2026-04-06 15:26:33 -06:00
Andre Manoel
0d80858b60
fix: use --bare and --tools in health probe CLI check (#489)
The "Verify Claude CLI" step fails on the CI runner because Claude
Code tries to initialize keychain, LSP, plugins, and CLAUDE.md
discovery before making the API call. On a bare runner these
resources don't exist, causing exit code 1.

- Add --bare to skip all initialization and force ANTHROPIC_API_KEY auth
- Add --tools "" to disable tool definitions (health check doesn't need
  them, and this avoids sending a large payload to the gateway)
2026-04-02 13:48:32 -03:00
Andre Manoel
5265745335
ci: add agentic CI plan, health probe workflow, and recipe scaffold (#473)
* docs: add agentic CI plan for automated PR reviews and daily maintenance

Closes #472

* docs: add API configuration and auth modes to agentic CI plan

* docs: add PoC lessons and operational details to agentic CI plan

* docs: add runner label targeting to agentic CI plan

* docs: add re-review label and workflow_dispatch triggers to PR review

* docs: rename runner label to agentic-ci

* docs: add check run as gate for PR review, output stays as comment

* ci: add agentic CI health probe workflow and recipe scaffold

- Health probe: pings inference API, checks latency, verifies Claude CLI
- Runs every 6h on self-hosted agentic-ci runner, plus manual dispatch
- Dual auth mode: custom endpoint (secret) or OAuth fallback
- Recipe scaffold: _runner.md shared context, health-probe recipe
- Update .agents/README.md to include recipes directory

* docs: address Greptile review feedback on agentic CI plan

- Add checks: write to recipe frontmatter example
- Add concurrency group to daily maintenance workflow spec
- Clarify fork PRs are out of scope (pull_request event only)
- Document workflow_dispatch callers as trusted (accepted risk)

* fix: skip API curl in OAuth mode, add branch protection note

- Health probe: skip the direct API ping step in OAuth mode (no API
  key available for curl; Claude CLI step is the sole health signal)
- Guard latency threshold check on custom auth mode
- Plan: note that contents:write on daily suites requires branch
  protection rules to prevent agent self-merging

* fix: address Nabin's second review feedback

- Health probe: fix latency threshold string comparison with fromJSON()
- Health probe: add permissions: contents: read
- Health probe: fail fast if AGENTIC_CI_MODEL variable is not set
- Runner context: add prompt-injection defense and output sanitization
- Plan: update Phase 2 deliverable to match cache-based memory approach
- Plan: reference STYLEGUIDE.md in code-quality suite
- README: note that recipes don't need a .claude/ symlink

* docs: sync plan with implementation decisions

- Health probe uses workflow failure, not issue open/close
- Pre-flight checks should fail fast on missing config
- Add GHA string comparison gotcha to PoC lessons
- Add explicit permissions block recommendation to PoC lessons
- Bump max_turns from 20 to 30 in recipe example

* docs: address PR review feedback on agentic CI plan

- Review docs PRs with lighter recipe instead of skipping by file type
- Switch runner memory from committed branch to GH Actions cache
- Add import perf check to test-health suite
- Add nuance on dependency pinning strictness vs DX
- Add Follow-up: Weekend Agents section (perf, AI-QA, repo triage)
- Add cost guardrails open question
- Add status field to frontmatter
2026-04-01 16:43:31 -03:00
oliver könig
8ca8e2447b
ci: upgrade GitHub Actions for Node.js 24 compatibility (#450)
* ci: upgrade GitHub Actions for Node.js 24 compatibility

Upgrades actions to versions compatible with the Node.js 24 runtime:
- actions/checkout: → v6
- actions/upload-artifact: → v6
- actions/download-artifact: → v7
- actions/github-script: → v8
- actions/setup-python: → v6

Mirrors: 1d5e68b074
Signed-off-by: oliver könig <okoenig@nvidia.com>

* ci: also upgrade actions/cache and astral-sh/setup-uv to node24-compatible versions

- actions/cache: v4 → v5 in build-notebooks.yml
- astral-sh/setup-uv: v5/v6 → v7 in ci.yml, check-colab-notebooks.yml, health-checks.yml, build-docs.yml, build-notebooks.yml

Addresses: https://github.com/NVIDIA-NeMo/DataDesigner/pull/450#issuecomment-4154872141

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Andre Manoel <165937436+andreatgretel@users.noreply.github.com>
2026-03-30 17:39:05 -03:00
Andre Manoel
2564834a47
fix: cache notebook builds to avoid flaky upstream model failures (#370)
* fix: cache notebook builds to avoid failures from flaky upstream models

The build-notebooks CI executes all tutorial notebooks on every run.
When an upstream model (e.g. black-forest-labs/flux.2-pro) is down, the
entire docs build fails even if no notebooks changed.

Add per-notebook caching based on source file SHA-256 hashes. Unchanged
notebooks are served from cache, and only modified ones are re-executed.
On the first CI run (empty cache), the workflow seeds the cache from the
last successful build artifact.
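
A minimal sketch of the per-notebook hash check described above, assuming each notebook source is a `.py` file and that the cache keeps one `<name>.sha256` stamp per source. The helper names (`sha256_of`, `notebooks_to_rebuild`, `record_hash`) and the directory layout are hypothetical and are not taken from the workflow itself.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def notebooks_to_rebuild(source_dir: Path, cache_dir: Path) -> list[Path]:
    """Return sources whose stored hash is missing or differs from the current one."""
    stale = []
    for source in sorted(source_dir.glob("*.py")):  # assumed source extension
        stamp = cache_dir / f"{source.name}.sha256"
        if not stamp.exists() or stamp.read_text().strip() != sha256_of(source):
            stale.append(source)
    return stale


def record_hash(source: Path, cache_dir: Path) -> None:
    """Store the current hash after a notebook has been executed successfully."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / f"{source.name}.sha256").write_text(sha256_of(source))
```

Only the paths returned by `notebooks_to_rebuild` would be re-executed; everything else would be served from the cache.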

Also add a minimal test script (test_flux_image_gen.py) to reproduce the
flux.2-pro health check failure locally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review comments on notebook caching

- Don't write .sha256 during seeding so changed notebooks are detected
- Rename TMPDIR to SEED_TMPDIR to avoid shadowing the POSIX env var
- Use portable sha256 helper (sha256sum with shasum fallback)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: only seed cache when truly empty, restore hash writing

Skip artifact seeding when a partial cache was restored (it already has
correct per-file hashes). Only seed + write current hashes when the
cache dir is completely empty (true bootstrapping).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: restrict artifact seed lookup to main branch

Prevents seeding from feature branch runs that may have different
notebook sources.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add actions:read permission for artifact seeding

The seed step uses gh run list and gh run download which require
actions:read. Without it, these calls silently fail and the cold-start
cache bootstrapping never executes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: only use notebook cache when called from build-docs

Scheduled Monday runs and manual workflow_dispatch should execute all
notebooks to catch regressions (e.g. library changes that break a
notebook). Caching is only used via workflow_call (from build-docs)
where the goal is fast, resilient doc deployment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use jq // empty to avoid "null" string on empty run list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add use_cache input flag to notebook and docs workflows

Replace event_name-based cache logic with an explicit use_cache boolean
input. Defaults:
- build-notebooks: workflow_call=true, dispatch=false, schedule=false
- build-docs: dispatch=true (toggleable), release=false

This gives full control over caching from the GitHub Actions UI.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-05 12:30:14 -03:00
Andre Manoel
46358461ee
fix: repair notebook CI (dead model, missing API key, pyarrow type bug) (#348)
* fix: repair notebook CI by replacing dead vision model and adding missing API key

- Replace `meta/llama-4-scout-17b-16e-instruct` (no longer serving on
  build.nvidia.com) with `nvidia/nemotron-nano-12b-v2-vl` (project default)
  in tutorial notebook 4
- Add `OPENROUTER_API_KEY` to the `build-notebooks` workflow so notebooks
  5 and 6 (which use OpenRouter for image generation) can authenticate
- Regenerate colab notebooks to reflect the model change

* fix: handle pyarrow list types in notebook 6 display_image

When image columns are loaded from parquet with pyarrow backend,
list values are pyarrow ListScalars, not Python lists. The
isinstance(x, list) check fails, causing the whole ListScalar to be
treated as a single path string (producing filenames ending in
`png')]`). Use isinstance(x, str) instead to correctly handle any
iterable type.
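
The difference between the two checks can be sketched as follows. This is an illustration, not the notebook's actual `display_image` code: `as_path_list` is a hypothetical helper, and a plain tuple stands in for any non-`list` sequence type.

```python
def as_path_list(value):
    """Return image paths as a list, given either one path or a sequence of paths."""
    # Testing for str (the single-path case) instead of list means every other
    # iterable, whatever its concrete type, is treated as a sequence of paths.
    if isinstance(value, str):
        return [value]
    return list(value)


def as_path_list_buggy(value):
    """The failing pattern: anything that is not exactly a list becomes one path string."""
    if isinstance(value, list):
        return value
    return [str(value)]
```

With a tuple input such as `("a.png", "b.png")`, the first helper returns both paths, while the second stringifies the whole container into a single bogus filename.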
2026-02-23 13:27:47 -03:00
Andre Manoel
58734d09f0
test: add provider health checks script and CI workflow (#301)
* test: add e2e health checks for default provider models

Add parametrized tests that verify model connectivity for all
default providers (nvidia, openai, openrouter). Tests check API
key availability and skip when not configured.
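
A rough sketch of the skip-when-unconfigured behaviour, written as a plain function rather than a pytest test. `OPENROUTER_API_KEY` appears elsewhere in this log; the other two variable names, the `PROVIDER_KEYS` table, and the `check` callback are assumptions made for illustration.

```python
import os
from typing import Callable, Mapping

# Provider -> environment variable holding its API key.
PROVIDER_KEYS = {
    "nvidia": "NVIDIA_API_KEY",  # assumed variable name
    "openai": "OPENAI_API_KEY",  # assumed variable name
    "openrouter": "OPENROUTER_API_KEY",
}


def run_health_checks(
    check: Callable[[str, str], None],
    env: Mapping[str, str] = os.environ,
) -> dict[str, str]:
    """Run `check(provider, api_key)` per provider; skip providers without a key."""
    results = {}
    for provider, key_name in PROVIDER_KEYS.items():
        api_key = env.get(key_name)
        if not api_key:
            results[provider] = "skipped"
            continue
        try:
            check(provider, api_key)
            results[provider] = "ok"
        except Exception as exc:  # report the failure and keep checking the rest
            results[provider] = f"failed: {exc}"
    return results
```

Passing the environment as a parameter keeps the function easy to exercise without real credentials.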

* chore: move health checks out of e2e tests

- Convert pytest test to standalone script at scripts/health_checks.py
- Add `make health-checks` target
- Add CI workflow (weekly + on release + manual dispatch)
- Remove test_health_checks.py from tests_e2e/

* chore: make health checks non-blocking in CI

* fix: print traceback to stdout to avoid interleaving

* chore: add all provider API keys to health checks CI

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore: remove temporary push trigger from health checks

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-06 15:18:35 -03:00
Andre Manoel
2630201a37
chore: add CODEOWNERS for automatic PR review assignment (#251)
Assigns @NVIDIA-NeMo/data_designer_reviewers as default reviewers
for all pull requests.
2026-01-28 11:04:37 -03:00
Johnny Greco
c19f35639f
chore: add publish script and update license headers (#253) 2026-01-28 08:47:34 -05:00
Johnny Greco
ae0665fa16
refactor: slim package refactor into three subpackages (#240)
* remove old structure

* major shuffle

* streamline project configs

* update make commands

* updates to make commands

* remove essentials

* initialize logger in interface

* uv lock

* ignore notepad

* update workflows

* fix e2e project config

* generate colab notebooks

* resolve default model settings in interface

* fix build commands

* update perf import make command

* cleaning up some slop

* update recipes

* move conftest files to tests/

* update subpackage readmes

* streamline config_logging

* use exports

* update perf import usage pattern

* update for IDE behavior with ruff

* remove engine's fixtures file

* add note about lazy imports

* update dependencies

* update docs

* doc fixes

* uv lock

* updates to catch up with main

* clean up makefile

* remove package gitignores

* define deps only once

* isolate tests

* add test for protection rule

* create temp dirs for isolated tests

* catch up to main

* update headers

* re apply changes

* better result summaries for isolated tests

* move exports into top-level init

* fix client importlib version syntax

* catch up with main
2026-01-27 13:53:20 -05:00
Johnny Greco
1ea824c692
chore: minor issue template tweaks (#198)
* tweaks

* update placeholder
2026-01-12 15:34:10 -05:00
Johnny Greco
738b183bfd
add templates (#197) 2026-01-12 13:18:05 -05:00
Mike Knepper
2300230346
chore: Relax rich upper bound to allow 14.x series (#196)
* Bump rich to 14.x series

* Disable uv cache in CI e2e tests

* Accept rich 13
2026-01-12 09:44:46 -06:00
Johnny Greco
f8c201e085
chore: update header script to check for diffs (#195)
* update script

* update headers

* refactor a bit and add test script

* update headers

* update for edge case

* update headers

* add step to get file creation date

* use git history to get copyright year

* generation type is printed with inference parameters

* fix unit test
2026-01-09 17:10:58 -05:00
Mike Knepper
2cfff52581
feat: Seed reader plugins (#191) 2026-01-09 13:50:47 -06:00
Johnny Greco
82fbbf1d45
force py11 (#170) 2026-01-05 16:57:02 -05:00
Andre Manoel
7fa9a413ac
docs: add option to open notebook directly in Colab (#126) 2025-12-12 15:15:26 -03:00
Andre Manoel
9547b6854a
fix: add git user/email and allow manual trigger for docs pipeline (#105)
* fix: add git user/email and allow manual trigger for docs pipeline

* add push as trigger temporarily

* fetching branch

* removing push from trigger
2025-12-08 13:52:37 -03:00
Andre Manoel
275bbbf646
docs: add versioning using mike (#102)
* initial changes

* fix to override, adapting ci
2025-12-08 11:06:24 -03:00
Andre Manoel
fa86be1eae
fix: allow docs CI to be manually triggered, better download button (#99) 2025-12-04 14:48:16 -03:00
Andre Manoel
279299f2dc
fix: update Python version to 3.11 on build notebooks CI (#96) 2025-12-04 09:45:18 -03:00
Andre Manoel
5d4ad10b11
chore: moving notebooks to jupytext and cleaning up workflows (#91)
* adding basic jupytext structure

Co-authored-by: Johnny Greco <jogreco@nvidia.com>

* few fixes

* first test for ci

* adding error intentionally to check workflow behavior

* test calling from other workflows

* typo

* trying as job instead

* couple of fixes

* checking path

* trying to fix path

* wrapping up

---------

Co-authored-by: Johnny Greco <jogreco@nvidia.com>
2025-12-03 17:29:07 -03:00
Andre Manoel
ce0fc0805a
docs: streamlining tutorials (#61)
* first attempt

* typo

* it works! cleaning up

* adding trigger again just to run once

* cleanup

* typo
2025-11-21 16:14:48 -03:00
Johnny Greco
dbe165723e
chore: add 3.10 to ci (#39)
* add 3.10 to ci

* strenum update for 3.10

* update type hint for 3.10

* import Self from typing_extensions for 3.10
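
For context, `typing.Self` only exists from Python 3.11, so code that must also run on 3.10 typically falls back to the `typing_extensions` backport. A minimal sketch of that pattern, using a hypothetical `QueryBuilder` class that is not part of this repository:

```python
try:
    from typing import Self  # available on Python 3.11+
except ImportError:
    from typing_extensions import Self  # backport needed on Python 3.10


class QueryBuilder:
    """Hypothetical fluent builder whose methods return the instance itself."""

    def __init__(self) -> None:
        self.filters: list[str] = []

    def where(self, clause: str) -> Self:
        self.filters.append(clause)
        return self
```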
2025-11-17 10:44:04 -05:00