From 15c3baeba87c1a3540212feb5169710d6b89ff96 Mon Sep 17 00:00:00 2001
From: Yogesh Rao
Date: Wed, 13 May 2026 21:41:21 +0530
Subject: [PATCH] =?UTF-8?q?feat:=20improve=20running-tests=20skill=20score?=
 =?UTF-8?q?=20(64%=20=E2=86=92=2090%)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hey @graydon 👋

# Description

I ran your skills through `tessl skill review` at work and found some targeted improvements for the `running-tests` skill. Here's the before/after:

| Skill | Before | After | Change |
|-------|--------|-------|--------|
| running-tests | 64% | 90% | +26% |

Changes made

**Description rewrite (biggest impact: 33% → 100%):**

- Expanded from a vague 82-character label to a full description listing concrete capabilities: Catch2 test suite execution, progressive test levels, protocol version matrix, tx-meta baseline checks, and sanitizer builds (ASan/TSan/UBSan)
- Added an explicit "Use when..." clause with natural trigger terms ("run tests", "verify a change", "check if tests pass", "run unit tests", "run regression tests", "validate code before a PR")
- Added domain-specific distinctiveness (stellar-core, Catch2) so the skill won't conflict with other testing-related skills

**Content trimming:**

- Removed the generic "Interpreting Failures" and "Common Failure Patterns" sections, since Claude already understands what assertion failures, segfaults, timeouts, and sanitizer errors are. The stellar-core-specific diagnostic guidance (e.g., `--ll debug`, re-run with ASan) is already covered in the level descriptions themselves

**Unchanged:**

- All domain-specific content preserved: test tag catalog, protocol version flags, tx-meta baseline commands, sanitizer configure sequences, parallel execution patterns, ALWAYS/NEVER guardrails, subagent input/output format

I also stress-tested your `running-tests` skill against a few real-world task evals and it held up really well on multi-level test progression with `--all-versions` protocol matrix and `--rng-seed 12345` tx-meta baseline verification. Kudos for that.

Honest disclosure: I work at @tesslio where we build tooling around skills like these. Not a pitch, just saw room for improvement and wanted to contribute.

Want to self-improve your skills? Just point your agent (Claude Code, Codex, etc.) at [this Tessl guide](https://docs.tessl.io/evaluate/optimize-a-skill-using-best-practices) and ask it to optimize your skill. Ping me ([@yogesh-tessl](https://github.com/yogesh-tessl)) if you hit any snags.

# Checklist

- [x] Reviewed the [contributing](https://github.com/stellar/stellar-core/blob/master/CONTRIBUTING.md#submitting-changes) document
- [x] Rebased on top of master (no merge commits)
- [ ] ~~Ran `clang-format` v8.0.0~~ (N/A: SKILL.md only, no code changes)
- [ ] ~~Compiles~~ (N/A: SKILL.md only)
- [ ] ~~Ran all tests~~ (N/A: SKILL.md only)
- [ ] ~~If change impacts performance~~ (N/A: SKILL.md only)

Thanks in advance 🙏
---
 .claude/skills/running-tests/SKILL.md | 20 +-------------------
 1 file changed, 1 insertion(+), 19 deletions(-)

diff --git a/.claude/skills/running-tests/SKILL.md b/.claude/skills/running-tests/SKILL.md
index c62bf6da5c..ad02f8c314 100644
--- a/.claude/skills/running-tests/SKILL.md
+++ b/.claude/skills/running-tests/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: running-tests
-description: running tests at various levels from smoke tests to full suite to randomized tests
+description: "Executes stellar-core's Catch2 test suite at progressive levels: smoke tests, focused unit tests by tag, full suite with protocol version matrix, tx-meta baseline checks, and sanitizer builds (ASan/TSan/UBSan). Use when the user asks to run tests, verify a change, check if tests pass, execute the test suite, run unit tests, run regression tests, or validate code before a PR."
 ---
 
 # Overview
@@ -281,24 +281,6 @@ make clean && make -j $(nproc)
 
 This doesn't run tests but ensures the production build works.
 
-# Interpreting Failures
-
-When a test fails:
-
-1. **Identify the failing test**: Note the exact test name and file
-2. **Capture the failure output**: Save the error message and stack trace
-3. **Determine if it's a real failure**: Check if the test is flaky or if this
-   is a genuine regression
-4. **Locate the relevant code**: Find where in the changed code the failure
-   originates
-
-## Common Failure Patterns
-
-- **Assertion failure**: A test assertion didn't hold; check the condition
-- **Crash/segfault**: Memory error; run with ASan for more details
-- **Timeout**: Test took too long; may indicate infinite loop or deadlock
-- **Sanitizer error**: Memory or threading bug; the sanitizer output shows where
-
 # Output Format
 
 Report the results: