From 15c3baeba87c1a3540212feb5169710d6b89ff96 Mon Sep 17 00:00:00 2001
From: Yogesh Rao <yogesh-tessl@users.noreply.github.com>
Date: Wed, 13 May 2026 21:41:21 +0530
Subject: [PATCH] =?UTF-8?q?feat:=20improve=20running-tests=20skill=20score?=
 =?UTF-8?q?=20(64%=20=E2=86=92=2090%)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hey @graydon 👋

# Description

I ran your skills through `tessl skill review` at work and found some targeted improvements for the `running-tests` skill.

Here's the before/after:

| Skill | Before | After | Change |
|-------|--------|-------|--------|
| running-tests | 64% | 90% | +26 pp |

<details>
<summary>Changes made</summary>

**Description rewrite (biggest impact — 33% → 100%):**
- Expanded from a vague 82-character label to a full description listing concrete capabilities: Catch2 test suite execution, progressive test levels, protocol version matrix, tx-meta baseline checks, and sanitizer builds (ASan/TSan/UBSan)
- Added explicit "Use when..." clause with natural trigger terms ("run tests", "verify a change", "check if tests pass", "run unit tests", "run regression tests", "validate code before a PR")
- Added domain-specific distinctiveness (stellar-core, Catch2) so the skill won't conflict with other testing-related skills
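
A minimal sketch of how the description properties above could be spot-checked locally. The three checks are illustrative assumptions, not the rubric `tessl skill review` applies, and `read_description` / `check_description` are hypothetical helper names.

```python
import re


def read_description(skill_md: str) -> str:
    """Return the `description` value from a SKILL.md front-matter block."""
    front = re.match(r"\A---\n(.*?)\n---\n", skill_md, re.DOTALL)
    if front is None:
        raise ValueError("no front matter found")
    for line in front.group(1).splitlines():
        if line.startswith("description:"):
            # Drop the key, surrounding whitespace, and optional quotes.
            return line[len("description:"):].strip().strip('"')
    raise ValueError("no description field")


def check_description(desc: str) -> dict:
    """Assumed heuristics only: length, trigger clause, domain terms."""
    return {
        "length": len(desc),
        "has_use_when": "use when" in desc.lower(),
        "names_domain": all(term in desc for term in ("stellar-core", "Catch2")),
    }


# Front matter before and after this patch, copied from the diff.
OLD = (
    "---\nname: running-tests\n"
    "description: running tests at various levels from smoke tests "
    "to full suite to randomized tests\n---\n"
)
NEW = (
    "---\nname: running-tests\n"
    'description: "Executes stellar-core\'s Catch2 test suite at progressive '
    "levels: smoke tests, focused unit tests by tag, full suite with protocol "
    "version matrix, tx-meta baseline checks, and sanitizer builds "
    "(ASan/TSan/UBSan). Use when the user asks to run tests, verify a change, "
    "check if tests pass, execute the test suite, run unit tests, run "
    'regression tests, or validate code before a PR."\n---\n'
)

old_report = check_description(read_description(OLD))  # length is 82
new_report = check_description(read_description(NEW))
print(old_report)
print(new_report)
```

The old description fails both the trigger-clause and domain-term checks; the new one passes both.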

**Content trimming:**
- Removed the generic "Interpreting Failures" and "Common Failure Patterns" sections: Claude already understands what assertion failures, segfaults, timeouts, and sanitizer errors are, and the stellar-core-specific diagnostic guidance (e.g., `--ll debug`, re-running with ASan) is already covered in the level descriptions themselves.

**Unchanged:**
- All domain-specific content preserved: test tag catalog, protocol version flags, tx-meta baseline commands, sanitizer configure sequences, parallel execution patterns, ALWAYS/NEVER guardrails, subagent input/output format

</details>

I also stress-tested your `running-tests` skill against a few real-world task evals, and it held up really well on multi-level test progression with the `--all-versions` protocol matrix and on `--rng-seed 12345` tx-meta baseline verification. Kudos for that.

Honest disclosure — I work at @tesslio where we build tooling around skills like these. Not a pitch — just saw room for improvement and wanted to contribute.

Want to self-improve your skills? Just point your agent (Claude Code, Codex, etc.) at [this Tessl guide](https://docs.tessl.io/evaluate/optimize-a-skill-using-best-practices) and ask it to optimize your skill. Ping me — [@yogesh-tessl](https://github.com/yogesh-tessl) — if you hit any snags.

# Checklist
- [x] Reviewed the [contributing](https://github.com/stellar/stellar-core/blob/master/CONTRIBUTING.md#submitting-changes) document
- [x] Rebased on top of master (no merge commits)
- [ ] ~~Ran `clang-format` v8.0.0~~ (N/A — SKILL.md only, no code changes)
- [ ] ~~Compiles~~ (N/A — SKILL.md only)
- [ ] ~~Ran all tests~~ (N/A — SKILL.md only)
- [ ] ~~If change impacts performance~~ (N/A — SKILL.md only)

Thanks in advance 🙏
---
 .claude/skills/running-tests/SKILL.md | 20 +-------------------
 1 file changed, 1 insertion(+), 19 deletions(-)
diff --git a/.claude/skills/running-tests/SKILL.md b/.claude/skills/running-tests/SKILL.md
index c62bf6da5c..ad02f8c314 100644
--- a/.claude/skills/running-tests/SKILL.md
+++ b/.claude/skills/running-tests/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: running-tests
-description: running tests at various levels from smoke tests to full suite to randomized tests
+description: "Executes stellar-core's Catch2 test suite at progressive levels: smoke tests, focused unit tests by tag, full suite with protocol version matrix, tx-meta baseline checks, and sanitizer builds (ASan/TSan/UBSan). Use when the user asks to run tests, verify a change, check if tests pass, execute the test suite, run unit tests, run regression tests, or validate code before a PR."
 ---
 
 # Overview
@@ -281,24 +281,6 @@ make clean && make -j $(nproc)
 
 This doesn't run tests but ensures the production build works.
 
-# Interpreting Failures
-
-When a test fails:
-
-1. **Identify the failing test**: Note the exact test name and file
-2. **Capture the failure output**: Save the error message and stack trace
-3. **Determine if it's a real failure**: Check if the test is flaky or if this
-   is a genuine regression
-4. **Locate the relevant code**: Find where in the changed code the failure
-   originates
-
-## Common Failure Patterns
-
-- **Assertion failure**: A test assertion didn't hold; check the condition
-- **Crash/segfault**: Memory error; run with ASan for more details
-- **Timeout**: Test took too long; may indicate infinite loop or deadlock
-- **Sanitizer error**: Memory or threading bug; the sanitizer output shows where
-
 # Output Format
 
 Report the results: